fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

How to reduce memory consumption during population calling? #448

tnguyengel opened this issue 10 months ago

We would like to reduce memory consumption during population calling. Is it possible to split SNF files by chromosome or genomic region?

Alternatively, should we supply smaller BAMs to Sniffles2 by splitting them so that each BAM only contains the reads that align to a single chromosome or genomic region?
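For context, the kind of per-chromosome BAM splitting we have in mind would look roughly like this with samtools (file and contig names are placeholders):

```bash
# Extract the reads aligned to one contig into a smaller BAM
# (the input BAM must be coordinate-sorted and indexed).
samtools view -b input.bam chr1 > input.chr1.bam
samtools index input.chr1.bam
```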

Related to https://github.com/fritzsedlazeck/Sniffles/issues/282.

fritzsedlazeck commented 10 months ago

There will be a new release coming very soon (days away) that reduces memory consumption and allows splitting. @hermannromanek is on it :) Thanks Fritz

tnguyengel commented 8 months ago

Has the feature to split up SNF files by chromosome already been released? If so, where can we find the new binaries?

hermannromanek commented 8 months ago

Hi,

Sorry for the delay - we encountered some issues which had to be fixed first and are in the process of re-testing.

I just pushed the current release candidate, feel free to give it a try. Bear in mind that it is not yet fully tested; there is one known open bug that causes Sniffles to report the same SVs twice. Please share any other issues you encounter.

To enable the improved population calling, please also make sure the library psutil is installed.
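In case it helps, psutil is a regular PyPI package, so for example:

```bash
# Install psutil into the Python environment that runs sniffles
pip install psutil
```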

Thanks, Hermann

tnguyengel commented 6 months ago

I noticed that there is a new release: https://github.com/fritzsedlazeck/Sniffles/releases/tag/v2.3.2. Does this happen to solve the issue of large RAM usage for many samples? (We estimated Sniffles v2.2 would use ~500-600 GB of RAM to do multisample calling on 5000 human ONT samples, with no way to parallelize the effort across multiple machines to reduce the RAM consumption.) If so, how does Sniffles v2.3+ handle many samples? Does it automatically throttle memory usage when it detects that usage is becoming too high? We can't seem to find a way to tell Sniffles v2.3+ to process the SNF files by chromosome (thereby increasing parallelism and reducing RAM usage on a single machine).

fritzsedlazeck commented 6 months ago

Hey @tnguyengel, as you can imagine it's a bit tricky :) What @hermannromanek implemented is a windowed approach that lets you scale with multithreading and memory. Tight control of the memory is tricky, but Hermann can explain how to run it. Thanks Fritz

hermannromanek commented 6 months ago

Hi @tnguyengel

Yes, Sniffles 2.3 should not use as much memory for merging as 2.2 did. It does so by monitoring RAM usage and freeing up memory once the footprint exceeds 2 GB per thread/worker process (a limit that will be hit quite soon when processing 5000 samples). Also, while in 2.2 each thread worked on its own chromosome, in 2.3 threads work on the same chromosome in parallel, so you get better parallelization even when processing only one chromosome.

To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG), where CONTIG is the contig name you want to process.
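For example, a minimal sketch (the .snf file names here are placeholders):

```bash
# Merge a set of SNF files, restricted to one contig (chr1 here);
# the resulting VCF covers only that contig.
sniffles --input sample1.snf sample2.snf sample3.snf \
    --vcf merged.chr1.vcf --contig chr1
```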

What's the command you've been trying to run sniffles with?

Thanks for your feedback, Hermann

tnguyengel commented 6 months ago

> What's the command you've been trying to run sniffles with?

For both Sniffles v2.3.2 and Sniffles v2.2, we were running

```bash
sniffles -t ${threads} --allow-overwrite --input "${snf_list}" --vcf "${out_merged_vcf}"
```

> To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG), where CONTIG is the contig name you want to process.

Facepalm! I missed that. My apologies. We'll try scaling tests again with the --contig option.
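For the record, the per-contig runs we plan to test look roughly like this (the contig list and ${out_dir} are placeholders from our setup; each job is independent, so they could also be distributed across machines):

```bash
# One merging job per contig; jobs are independent, so they can run
# sequentially on one machine or be farmed out across several.
for contig in chr{1..22} chrX chrY; do
    sniffles -t "${threads}" --allow-overwrite \
        --input "${snf_list}" \
        --vcf "${out_dir}/merged.${contig}.vcf" \
        --contig "${contig}"
done
```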

lfpaulin commented 5 months ago

Dear tnguyengel, did you manage to run the 5000 samples? We just released a new version (2.3.3) that addresses some issues, and we are improving merging for large datasets. Your feedback is much appreciated.

tnguyengel commented 5 months ago

We don't have the full 5000 samples to run yet, but that will be the final set we eventually run with. We will rerun the scaling tests with v2.3.3 and report the results here.

fritzsedlazeck commented 5 months ago

Cool. We keep testing and optimizing. Keep us posted and we will push forward. Thanks Fritz

tnguyengel commented 5 months ago

> Dear tnguyengel, did you manage to run the 5000 samples? We just released a new version (2.3.3) that addresses some issues, and we are improving merging for large datasets. Your feedback is much appreciated.

FYI, initial scaling tests with up to 35 samples indicate v2.3.3 would use roughly ~100 GB of RAM to aggregate a single contig across a 5000-sample cohort. Much more reasonable in terms of resource usage. I'll report more detailed results as we go along.

hermannromanek commented 2 days ago

While there are more improvements to come, v2.5 should further improve multisample calling on larger datasets significantly. Merging 35 samples should stay well below 10 GB of RAM.