Open edg1983 opened 3 years ago
So point 2 is currently inactivated. I left it in there to not mess with existing pipelines.
We usually use option 1 with 1000 bp. This works very well for different data sets. The output holds the information about which SVs were merged.
thanks Fritz
Thanks! This clarifies the conflicting options. Additional question: in your experience, what is the maximum number of VCFs that can be merged by SURVIVOR? I would possibly like to scale to ~50k samples. Any advice on merging such a large cohort?
Not a huge problem. We have run it on TOPMed...
You might want to split by chromosome... memory consumption is not bad, but the output file can get large.
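The per-chromosome split suggested here can be done with plain awk for uncompressed VCFs; for bgzipped, indexed files, `bcftools view` with a region argument is the more usual route. This is a minimal sketch, and the helper name is illustrative:

```shell
# Split an uncompressed VCF into one file per chromosome.
# Header lines (starting with '#') are copied into every per-chromosome file.
split_vcf_by_chr() {
  local vcf=$1 outdir=$2
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    /^#/ { header = header $0 "\n"; next }
    {
      out = outdir "/" $1 ".vcf"
      if (!(out in seen)) { seen[out] = 1; printf "%s", header > out }
      print >> out
    }' "$vcf"
}
# e.g. split_vcf_by_chr calls.vcf per_chr/  ->  per_chr/chr1.vcf, per_chr/chr2.vcf, ...
```

Note that with very many contigs some awk implementations may run out of open file handles, in which case the files need to be closed as you go.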
Also please make sure input VCF files are sorted. That will ease the job. Cheers Fritz
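Sorting the input VCFs is normally a job for `bcftools sort`; as a dependency-free alternative, a coordinate sort can be sketched with coreutils alone (the function name is illustrative):

```shell
# Coordinate-sort an uncompressed VCF: emit header lines first, then sort
# records by chromosome (version sort) and position (numeric).
sort_vcf() {
  local vcf=$1
  grep '^#' "$vcf"
  grep -v '^#' "$vcf" | sort -k1,1V -k2,2n
}
# e.g. sort_vcf unsorted.vcf > sorted.vcf
```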
and @edg1983 just ping me if you encounter an issue. 50k gets you in my gold member club ;)
I'm trying now to merge the big dataset. A few questions / suggestions:
I'll keep you updated! Thanks!
That's hard to predict, as it also depends on the number of SVs per sample. @2: No, sadly... I have still not included that. @3: agree. I previously just did that by splitting the original VCF files.
thx Fritz
Just to clarify the longest part is actually the writing out of the VCF file. :)
Just an update on this. After several tries, I was not able to merge the whole cohort (79530 samples) in a single run: after ~1 week of running time it crashed with a bad_alloc error. What I've noticed is that survivor merge gets slower as it progresses, and this effect becomes dramatic above ~25000 samples (at least in my setting).
When I tried splitting the cohort into smaller chunks, chunks above 30k samples failed (I tried 50k, 40k and 30k). The strange thing here is that, according to the log, the process completed successfully, but the output VCF file was not produced.
In the end, I managed to process the cohort in 3 chunks, each containing ~26k samples. Each chunk took ~22h and ~160Gb peak memory.
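The chunked approach described here can be scripted with coreutils `split`; the chunk size of 26000 comes from the report above, and the function and file names are illustrative:

```shell
# Split a SURVIVOR sample list (one input VCF path per line) into
# fixed-size chunks named chunk_aa, chunk_ab, ...; each chunk is then
# merged independently.
make_chunks() {
  local filelist=$1 chunk_size=$2
  split -l "$chunk_size" "$filelist" chunk_
}
# e.g. make_chunks filelist.txt 26000
# then: for f in chunk_*; do SURVIVOR merge "$f" 1000 1 1 1 0 50 "merged_$f.vcf"; done
```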
Strange... we have merged 80k WGS samples. We split it up per chromosome, but I guess that you have done the same? And you are saying that memory is the issue, not the writing out? Can you confirm the version you are using?
Unfortunately, I cannot split by chromosome, since this would generate more files than my inode quota on the system allows. So I merged PASS calls across the whole genome for each sample. Memory usage is apparently not linear, since I didn't see a large difference between processing 25-30k samples and 50-70k, for example (it goes up to 180Gb max).
I used survivor version 1.0.7 with the following command:
survivor merge filelist.txt 1000 1 1 1 0 50 output_vcf.vcf
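For readers unfamiliar with the positional arguments, the command above breaks down as follows; the annotations paraphrase SURVIVOR's help text and should be checked against the version in use:

```shell
# SURVIVOR merge <file_list> <max_dist> <min_support> <use_type> \
#                <use_strand> <estimate_dist_by_size> <min_sv_size> <output>
#
#   filelist.txt    file listing one input VCF per line
#   1000            max breakpoint distance in bp (a value < 1 is read as a
#                   fraction of the SV length instead)
#   1               minimum number of supporting callers/samples
#   1               require matching SV type (1 = yes)
#   1               require matching strands (1 = yes)
#   0               estimate distance from SV size (the inactivated option)
#   50              minimum SV size in bp to consider
survivor merge filelist.txt 1000 1 1 1 0 50 output_vcf.vcf
```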
What typically takes a long time is the writing out..
Sorry for the issues. Can you do it iteratively? E.g. take chr22 and test it there?
I usually run it without the strand check, as this flag is sometimes hard to parse:
survivor merge filelist.txt 1000 1 1 0 0 50 output_vcf.vcf
thx Fritz
I will do some more tests to find the best approach for merging, but right now I'm satisfied with 3 chunks. I will now test whether the output VCF is compatible with the downstream tools we use for SV analysis. I found SURVIVOR to be the only tool (among the few I tested) able to merge such a large cohort, so it's great! Adding the ability to read compressed/indexed VCFs, and thus process just a region of interest, would make it perfect!
Ok great. Otherwise we would have needed to get creative :)
Hi,
I want to use SURVIVOR merge to merge calls from several VCFs. I've downloaded the latest release version, and the help message for the merge command reports these 2 options (among others) that are not clear to me:
If I set 1000 for example for option 1, I expect that SV events with both start and end breakpoints closer than 1000 bp will be merged. Instead, if I set 0.1 for option 1, then I expect that 2 SVs are merged if their start and end points are within a distance equal to 10% of their respective lengths. What happens in the case of SVs with different lengths? Is the maximum allowed distance determined by the shorter event, since its fraction will be the smallest?
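The interpretation in the question can be made concrete with a small sketch. This is not SURVIVOR's actual code, and the function name is hypothetical; it only illustrates the reading where a value >= 1 is a fixed distance in bp, a value < 1 is a fraction of each SV's own length, and the shorter event then effectively decides whether a pair can merge:

```python
# Hypothetical sketch of the breakpoint-distance check described above.
def within_merge_distance(sv_a, sv_b, max_dist):
    """sv_a, sv_b: (start, end) tuples; max_dist: e.g. 1000 or 0.1."""
    def allowed(sv):
        if max_dist >= 1:            # fixed distance in bp
            return max_dist
        length = sv[1] - sv[0]       # fraction of this SV's own length
        return max_dist * length

    # the stricter (shorter) event determines the effective limit,
    # and both the start and the end breakpoints must fall within it
    limit = min(allowed(sv_a), allowed(sv_b))
    return (abs(sv_a[0] - sv_b[0]) <= limit and
            abs(sv_a[1] - sv_b[1]) <= limit)

# e.g. with max_dist=0.1, a 10 kb and a 2 kb event give limit = 200 bp
```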
Given the above, which already seems to take SV length into account, what is the effect of option 2?
Thanks!