fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License

details of merge options #144

Open edg1983 opened 3 years ago

edg1983 commented 3 years ago

Hi,

I want to use SURVIVOR merge to merge calls from several VCFs. I've downloaded the latest release version, and the help message for the merge command lists these two options (among others) that are not clear to me:

1. max distance between breakpoints (0-1 percent of length, 1- number of bp)
2. estimate distance based on the SV size (1==yes, else no)

If I set option 1 to 1000, for example, I expect that SV events whose start and end breakpoints are both closer than 1000 bp will be merged. Conversely, if I set option 1 to 0.1, I expect that two SVs are merged if their start and end points are within a distance equal to 10% of their respective lengths. What happens in the case of SVs with different lengths? Is the maximum allowed distance determined by the shorter event, since its fraction will be the smaller one?

Given that option 1 already seems to take SV length into account, what is the effect of option 2?

Thanks!

fritzsedlazeck commented 3 years ago

So point 2 is currently inactivated. I left it in there to not mess with existing pipelines.

We usually use option 1 with 1000 bp. This works very well across different datasets. The output holds the information about which SVs were merged.
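For reference, a minimal sketch of such a call, following the parameter order of the commands shown later in this thread (filelist.txt is a hypothetical file listing one input VCF per line, and the flag annotations are my reading of the help text):

```bash
# Sketch: 1000 bp max breakpoint distance, >=1 supporting input, agree on SV type,
# ignore strand, size-based distance estimation off, minimum SV size 50 bp.
SURVIVOR merge filelist.txt 1000 1 1 0 0 50 merged.vcf

# Peek at the INFO column of the first merged records; this is where SURVIVOR
# stores the merge support information (e.g. which inputs supported each call).
grep -v '^#' merged.vcf | cut -f 8 | head -n 3
```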

thanks Fritz

edg1983 commented 3 years ago

Thanks! This clarifies the conflicting options. Additional question: in your experience, what is the maximum number of VCFs that SURVIVOR can merge? I would possibly like to scale to ~50k samples. Any advice for merging such a large cohort?

fritzsedlazeck commented 3 years ago

Not a huge problem. We have run it on TOPMed...

You might want to split by chromosome... memory consumption is not bad, but the output file can get large.

Also please make sure input VCF files are sorted. That will ease the job. Cheers Fritz
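As a concrete sketch of both suggestions (splitting per chromosome and sorting), assuming bcftools is available and each input VCF is bgzipped; the file and chromosome names below are placeholders:

```bash
CHR=chr22
for vcf in sample_*.vcf.gz; do
    sample=$(basename "$vcf" .vcf.gz)
    # sort each input, index it, then extract a single chromosome
    # as an uncompressed VCF that SURVIVOR can read
    bcftools sort -Oz -o "${sample}.sorted.vcf.gz" "$vcf"
    bcftools index -t "${sample}.sorted.vcf.gz"
    bcftools view -r "$CHR" -Ov -o "${sample}.${CHR}.vcf" "${sample}.sorted.vcf.gz"
done
ls *."${CHR}".vcf > "filelist_${CHR}.txt"
```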

fritzsedlazeck commented 3 years ago

and @edg1983 just ping me if you encounter an issue. 50k gets you in my gold member club ;)

edg1983 commented 3 years ago

I'm now trying to merge the big dataset. A few questions / suggestions:

  1. How long can it take to merge a ~50k-sample dataset without splitting by chromosome? My medium queue runs for 24h max; is this enough, or would you advise allowing a longer run time?
  2. Can the tool read compressed VCFs? It does not seem to work when the input files are bgzip-compressed.
  3. It would be great to have a region option to restrict the merge by chromosome. As you suggested, a large cohort can be merged chromosome by chromosome, but that is painful if I have to prepare per-chromosome VCFs from ~50k individual VCFs.

I'll keep you updated! Thanks!

fritzsedlazeck commented 3 years ago

@1: That's hard to predict, as it also depends on the number of SVs per sample. @2: Sadly no... I still have not included that. @3: Agreed. I previously just did that by splitting the original VCF files.
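For point 2, a simple workaround sketch, assuming the inputs are bgzipped: decompress each file to plain-text VCF before listing it (file names and thresholds here are only placeholders):

```bash
# SURVIVOR reads plain-text VCF, so decompress the bgzipped inputs first
# (zcat leaves the compressed originals untouched).
for vcf in *.vcf.gz; do
    zcat "$vcf" > "${vcf%.gz}"
done
ls *.vcf > filelist.txt
SURVIVOR merge filelist.txt 1000 1 1 0 0 50 merged.vcf
```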

thx Fritz

fritzsedlazeck commented 3 years ago

Just to clarify: the longest part is actually writing out the VCF file. :)

edg1983 commented 3 years ago

Just an update on this. After several tries, I was not able to merge the whole cohort (79,530 samples) in a single run: after ~1 week of running time it crashed with a bad_alloc error. What I've noticed is that SURVIVOR merge gets slower as it progresses, and this effect becomes dramatic above ~25,000 samples (at least in my setting). When I tried splitting the cohort into smaller chunks, sample sizes above 30k still failed (I tried 50k, 40k, and 30k). The strange thing there is that, according to the log, the process completed successfully, but the output VCF file was never produced. In the end, I managed to process the cohort in 3 chunks of ~26k samples each. Each chunk took ~22h and ~160 GB of peak memory.
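For anyone hitting the same limits, a rough sketch of the chunked run described above (the chunk size and merge parameters mirror the ones used here; the chunk-level outputs still have to be handled separately downstream):

```bash
# Split the master file list into chunks of ~26k samples and merge each chunk
# independently, mirroring the approach that eventually worked.
split -l 26000 filelist.txt chunk_
for list in chunk_*; do
    SURVIVOR merge "$list" 1000 1 1 1 0 50 "merged_${list}.vcf"
done
```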

fritzsedlazeck commented 3 years ago

Strange... we have merged 80k WGS samples. We split it up per chromosome, but I guess you have done the same? And you are saying that memory is the issue, not the writing out? Can you confirm the version you are using?

edg1983 commented 3 years ago

Unfortunately, I cannot split by chromosome, since this would generate more files than my inode quota on the system allows. So for each sample I merged PASS calls across the whole genome. Memory usage is apparently not linear, since I didn't see a large difference between processing 25-30k samples and 50-70k samples, for example (it goes up to 180 GB max). I used SURVIVOR version 1.0.7 with the following command: `survivor merge filelist.txt 1000 1 1 1 0 50 output_vcf.vcf`

fritzsedlazeck commented 3 years ago

What typically takes a long time is the writing out.

Sorry for the issues. Can you do it iteratively? E.g. take chr22 and test it there?

I usually run it without the strand check, as this flag is sometimes hard to parse: `survivor merge filelist.txt 1000 1 1 0 0 50 output_vcf.vcf`
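For clarity, here is my reading of the positional parameters in that call, based on the help text quoted at the top of this thread (treat the annotations as a best guess rather than official documentation):

```bash
# filelist.txt    file listing the input VCFs, one path per line
# 1000            max distance between breakpoints (here in bp; values between 0 and 1
#                 would be interpreted as a fraction of the SV length, per the help text)
# 1               minimum number of supporting inputs
# 1               require agreement on SV type (1 = yes)
# 0               require agreement on strand (0 = no, as suggested above)
# 0               estimate distance from SV size (currently inactivated, per the first reply)
# 50              minimum SV size in bp
# output_vcf.vcf  merged output VCF
survivor merge filelist.txt 1000 1 1 0 0 50 output_vcf.vcf
```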

thx Fritz

edg1983 commented 3 years ago

I will do some more tests to find the best approach for merging, but right now I'm satisfied with 3 chunks. I will now test whether the resulting output VCF is compatible with the downstream tools we use for SV analysis. SURVIVOR was the only tool (among the few I tested) able to merge such a large cohort, so it's great! Adding the ability to read compressed/indexed VCFs, and thus to process just a region of interest, would make it perfect!

fritzsedlazeck commented 3 years ago

Ok great. Otherwise we would have needed to get creative :)