dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Working without SSD #240

Closed OgnjenMilicevic closed 3 years ago

OgnjenMilicevic commented 3 years ago

Your performance consideration article states that due to a large number of intensive sorting operations one should use local SSD.

Can you clarify this further, preferably in the official documentation:

  1. Is using a regular local HDD bad because it would slow down the operation somewhat, or would it cause wear of the HDD head, or would it affect it in some other way?
  2. Does that mean that network-mounted input files are OK as long as the output files are on local SSD?

Thanks!

mlin commented 3 years ago

Thanks for the question -- I clarified the doc as follows:

The working directory, used intensively as temporary space for external data sorting and scanning, should be on a local SSD to minimize I/O starvation of the available CPUs.

We're recommending local SSDs since that's what we deploy, but the real point is just that the node's storage subsystem should be fast enough to avoid leaving a lot of CPU cycles on the table due to I/O saturation. The I/O pattern of the individual threads isn't super random, it's just that there can be a lot of them at once.

Each GVCF input file is read sequentially, so it should be fine to read them from network storage (albeit many threads read different files at once). The output BCF is also streamed sequentially. It's the intermediate phase (managed by RocksDB) that hits the working directory storage pretty hard.
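
For anyone who wants a rough sanity check of a candidate working directory, the sketch below times several threads writing sequential files into it at once and reports the aggregate throughput. It is not part of GLnexus and does not reproduce RocksDB's actual access pattern; the function names, thread count, and file sizes are hypothetical and chosen only for illustration. Comparing the figure for a local SSD path against an HDD or network path gives a crude sense of how much concurrent I/O each can absorb.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_one(directory, size_bytes, block_bytes=1 << 20):
    """Write size_bytes sequentially to a temp file in directory,
    fsync it, delete it, and return the number of bytes written."""
    block = os.urandom(block_bytes)
    written = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            while written < size_bytes:
                chunk = block[: min(block_bytes, size_bytes - written)]
                f.write(chunk)
                written += len(chunk)
            f.flush()
            os.fsync(f.fileno())  # push data to the device, not just the page cache
    finally:
        os.remove(path)
    return written


def concurrent_write_throughput(directory, threads=8, size_bytes=64 << 20):
    """Return aggregate write throughput in MB/s when `threads` writers
    each stream `size_bytes` into `directory` at the same time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        totals = list(
            pool.map(lambda _: write_one(directory, size_bytes), range(threads))
        )
    elapsed = time.perf_counter() - start
    return sum(totals) / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical usage: probe the current directory with the default settings.
    print(f"{concurrent_write_throughput('.'):.1f} MB/s aggregate")
```

This only exercises writes; a storage system that keeps up here can still fall behind under a mixed read/write load, so treat the number as a lower bar rather than a guarantee.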

OgnjenMilicevic commented 3 years ago

Thank you for a prompt and thorough answer.

Closing this!