Open tnguyengel opened 10 months ago
A new release is coming very soon (days away) that reduces this and allows splitting. @hermannromanek is on it :) Thanks, Fritz
Has the feature to split up SNF files by chromosome already been released? If so, where can we find the new binaries?
Hi,
Sorry for the delay - we encountered some issues which had to be fixed first and are in the process of re-testing.
I just pushed the current release candidate; feel free to give it a try. Bear in mind that it is not yet fully tested: there is one open bug we know of that causes sniffles to report the same SVs twice. Please share any other issues you encounter with us.
To enable the improved population calling, please also make sure the library psutil is installed.
Thanks, Hermann
I noticed that there is a new release: https://github.com/fritzsedlazeck/Sniffles/releases/tag/v2.3.2. Does this happen to solve the issue of large RAM usage for many samples? (We estimated that Sniffles v2.2 would use ~500-600 GB of RAM to do multisample calling on 5000 human ONT samples, with no way to parallelize the effort across multiple machines to reduce the RAM consumption.) If so, how does Sniffles v2.3+ handle many samples? Does it automatically throttle memory usage when it detects that usage is becoming too high? We can't seem to find a way to tell Sniffles v2.3+ to process the SNF files by chromosome (thereby increasing parallelism and reducing RAM usage on a single machine).
Hey @tnguyengel, as you can imagine it's a bit tricky :) What @hermannromanek implemented is a window approach that lets you scale with multithreading and memory. Tight control of the memory is tricky, but Hermann can explain how to run it. Thanks, Fritz
Hi @tnguyengel
Yes, sniffles 2.3 should not use as much memory for merging as 2.2 did. It does so by monitoring RAM usage and freeing up memory once the memory footprint exceeds 2 GB per thread/worker process (a limit that will be hit quite soon when processing 5000 samples). Also, while in 2.2 each thread worked on one chromosome, in 2.3 the threads work on the same chromosome in parallel, so you get better parallelization when processing only one chromosome.
To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG) with CONTIG being the contig name you want to process.
What's the command you've been trying to run sniffles with?
Thanks for your feedback, Hermann
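The throttling idea described above (watch the worker's memory footprint, free memory once it passes a per-worker limit) can be illustrated with a minimal sketch. This is not Sniffles' actual implementation: `ThrottledCache`, `rss_fn`, and `flush_fn` are hypothetical names chosen for illustration, and the only pieces taken from the thread are the use of psutil and the 2 GB per-worker threshold.

```python
import os


def current_rss_bytes():
    """Resident set size of this process in bytes, read via psutil."""
    import psutil  # third-party; imported lazily so the sketch loads without it
    return psutil.Process(os.getpid()).memory_info().rss


class ThrottledCache:
    """Hypothetical per-worker cache that is flushed once memory passes a limit."""

    def __init__(self, limit_bytes=2 * 1024**3, rss_fn=current_rss_bytes, flush_fn=None):
        self.limit_bytes = limit_bytes  # 2 GB per worker, as mentioned in the thread
        self.rss_fn = rss_fn            # how the memory footprint is measured
        self.flush_fn = flush_fn or (lambda items: None)  # e.g. write results to disk
        self.items = {}
        self.flushes = 0

    def add(self, key, value):
        """Store an item, then flush everything if the footprint exceeds the limit."""
        self.items[key] = value
        if self.rss_fn() > self.limit_bytes:
            self.flush()

    def flush(self):
        """Hand the cached items to flush_fn and drop the references to them."""
        self.flush_fn(self.items)
        self.items = {}
        self.flushes += 1
```

Making the memory probe injectable keeps the threshold logic testable without allocating gigabytes; in a real worker the default psutil-based probe would be used.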
> What's the command you've been trying to run sniffles with?
For both Sniffles v2.3.2 and Sniffles v2.2, we were running
sniffles -t ${threads} --allow-overwrite --input "${snf_list}" --vcf "${out_merged_vcf}"
> To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG) with CONTIG being the contig name you want to process.
Facepalm! I missed that. My apologies. We'll try scaling tests again with the --contig option.
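For anyone wanting to script this, here is a minimal sketch of running the merge once per contig, so that each run can be scheduled separately. It only combines the flags already shown in this thread (`-t`, `--allow-overwrite`, `--input`, `--vcf`, `--contig`). The contig names, the output file naming, and the helper names `build_contig_commands` and `run_all` are assumptions made for illustration.

```python
import shlex
import subprocess


def build_contig_commands(snf_list, contigs, out_dir, threads=4):
    """Build one sniffles merge command per contig (nothing is executed here)."""
    commands = []
    for contig in contigs:
        # Hypothetical naming scheme: one merged VCF per contig.
        out_vcf = f"{out_dir}/merged.{contig}.vcf"
        commands.append([
            "sniffles", "-t", str(threads), "--allow-overwrite",
            "--input", snf_list, "--vcf", out_vcf, "--contig", contig,
        ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, run it and stop on failure."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all(build_contig_commands(snf_list, ["chr1", "chr2"], "out"))` prints the commands without running them; passing `dry_run=False` executes them one after another. The contig names must match those in the reference used for alignment.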
Dear tnguyengel, did you manage to run the 5000 samples? We just released a new version (2.3.3) that helps with some issues, and we are improving the merging of large datasets. Your feedback is much appreciated.
We don't have the full 5000 samples to run yet, but that will be the final set that we eventually run with. We will rerun scaling tests with v2.3.3, and report the results here.
Cool. We keep testing and optimizing. Keep us posted and we will push forward. Thanks Fritz
> Dear tnguyengel, did you manage to run the 5000 samples? We just released a new version (2.3.3) that helps with some issues, and we are improving the merging of large datasets. Your feedback is much appreciated.
FYI, initial scaling tests with up to 35 samples indicate that v2.3.3 would theoretically use ~100 GB of RAM to aggregate a contig across a 5000-sample cohort. That is much more reasonable in terms of resource usage. I'll report more detailed results as we go along.
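To put the estimates quoted in this thread on a per-sample footing, the arithmetic can be sketched as below. It assumes memory grows linearly with the number of samples, which the thread does not establish, and it uses only the figures already mentioned (~500-600 GB for v2.2, ~100 GB per contig for v2.3.3, 5000 samples). The two versions are not directly comparable, since the v2.3.3 figure is for a single contig.

```python
def per_sample_mb(total_gb, n_samples):
    """Average memory per sample in MB, assuming linear scaling (an assumption)."""
    return total_gb * 1024 / n_samples


SAMPLES = 5000

# v2.2 estimate from the thread: ~500-600 GB for the whole multisample run.
v22_low = per_sample_mb(500, SAMPLES)
v22_high = per_sample_mb(600, SAMPLES)

# v2.3.3 estimate from the thread: ~100 GB to aggregate one contig.
v233_contig = per_sample_mb(100, SAMPLES)

print(f"v2.2:   {v22_low:.1f}-{v22_high:.1f} MB per sample (whole run)")
print(f"v2.3.3: {v233_contig:.1f} MB per sample (single contig)")
```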
While there are more improvements to come, v2.5 should significantly improve multisample calling on larger data sets. Merging 35 samples should stay well below 10 GB of RAM.
We would like to reduce memory consumption during population calling. Is it possible to split SNF files by chromosome or genomic region?
Alternatively, should we supply smaller BAMs to Sniffles2 by splitting them so that each BAM contains only the reads that align to one chromosome/genomic region?
Related to https://github.com/fritzsedlazeck/Sniffles/issues/282.