dnanexus-archive / viral-ngs

investigate metagenomics parallelism #51

Closed mlin closed 7 years ago

mlin commented 8 years ago

The metagenomics step doesn't seem to make good use of the provisioned threads, even on HiSeq runs. Is the thread-count setting not getting through? Otherwise, try GNU parallel by sample or similar.

dpark01 commented 8 years ago

I have some thoughts on this. I don't think the issue is kraken itself, which appears to execute in seconds according to the job logs. About 20 minutes goes to staging down the database, but the remaining time (which can be 1-2 hours) goes to all the other bits and pieces that could probably be improved or parallelized: staging the input bams and the Picard SamToFastq conversions.

On the Broad's end, we can expose a mechanism to run kraken on fastqs instead of bams, and then the DNAnexus wrapper can parallelize the download-bam piped to SamToFastq operations for all inputs simultaneously. That'd probably best utilize the available resources. We can't parallelize the kraken call itself, because each one will ask for >100GB RAM.
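
A rough sketch of what "download-bam piped to SamToFastq for all inputs simultaneously" could look like in the wrapper. This is not the actual viral-ngs or DNAnexus applet code. The helper names (`run_pipeline`, `stage_all`, `bam_to_fastq_commands`), the file IDs, the output naming, and the `picard.jar` path are all hypothetical, and the command line assumes `dx cat` for streaming plus Picard's `KEY=value` argument style with `INPUT=/dev/stdin`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer` without a shell; return both exit codes."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early, instead of blocking forever.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc


def stage_all(pipelines, max_workers=4):
    """Run several (producer_cmd, consumer_cmd) pairs concurrently.

    Threads are enough here: each one only waits on child processes.
    Results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: run_pipeline(*pair), pipelines))


def bam_to_fastq_commands(bam_file_id, out_prefix, picard_jar="picard.jar"):
    """Hypothetical command pair: stream a bam and convert it to fastq.

    Assumes `dx cat <file-id>` writes the file to stdout and that Picard
    SamToFastq can read from /dev/stdin; adjust for the real environment.
    """
    producer = ["dx", "cat", bam_file_id]
    consumer = [
        "java", "-jar", picard_jar, "SamToFastq",
        "INPUT=/dev/stdin",
        f"FASTQ={out_prefix}.1.fastq",
        f"SECOND_END_FASTQ={out_prefix}.2.fastq",
    ]
    return producer, consumer
```

With something like this, the wrapper would build one command pair per input bam, pass the list to `stage_all`, and check that every returned exit code is zero before calling kraken on the fastqs. `max_workers` would need tuning against the instance's cores and network bandwidth.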

mlin commented 7 years ago

It may actually be possible to parallelize kraken, because it does mmap() on the database rather than loading it into its heap. The OS should be smart enough to share the mapped pages between processes. But you're right, Kraken is so fast that the orchestration bits are the bigger burn. At least it'll be simpler to parallelize those if we can also parallelize Kraken.
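
To illustrate the mmap() point in isolation, here is a toy sketch (not Kraken itself) in which several independent processes map the same file read-only. On a typical OS, read-only file-backed mappings of one file are served from the same page cache, so the data should not be duplicated per process. The worker script, the function name `checksum_in_parallel`, and the use of a CRC as stand-in "work" are all made up for the example.

```python
import subprocess
import sys

# Script run by each child process: map the file read-only and checksum it.
# Nothing is copied into the child's heap up front; pages are faulted in
# on access and can be shared with other processes mapping the same file.
WORKER = """
import mmap, sys, zlib
with open(sys.argv[1], "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
        print(zlib.crc32(view))
"""


def checksum_in_parallel(path, n_procs=3):
    """Start n_procs separate processes that each mmap `path` read-only."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, path],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n_procs)
    ]
    # All children are already running; collect their outputs in order.
    return [int(proc.communicate()[0]) for proc in procs]
```

Whether concurrent kraken processes actually stay within the instance's RAM would still need to be measured on a real database, since this only demonstrates the mapping pattern.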