EichlerLab / smrtsv2

Structural variant caller
MIT License

Best way to run SMRT-SV Genotyper on large number of samples? #52

Closed jjfarrell closed 4 years ago

jjfarrell commented 4 years ago

When using SMRT-SV Genotyper to genotype a large number of samples (e.g. 5000 CRAMs), is there a practical limit on the number to run at one time? Or should each sample be run individually and the results merged afterwards? What is the best strategy?

paudano commented 4 years ago

The biggest limitation of the genotyper is that it has to completely remap every sample to an augmented reference, and mapping a sample to this reference takes longer than mapping it to a standard reference. If this is for human genomes, it's likely going to be prohibitively expensive to run that many samples. Other methods, such as Paragraph (https://github.com/Illumina/paragraph), might be better for this kind of scale.

The genotyper runs one sample/CRAM at a time. You can set up a job with 5,000 CRAMs and run it, but it's going to take a very long time.

If you are splitting the work over multiple machines or clusters, the easiest thing to do is run each batch and then merge with bcftools. The VCF records through INFO will be the same for each batch, so it's trivial to merge those VCFs into one by concatenating the columns. After the merge, vcffixup (https://github.com/vcflib/vcflib) should take care of correcting allele count/frequency in INFO.
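As a rough sketch of that merge step, the Python helper below only builds the shell commands as strings and does not run anything. The function name and the file names are hypothetical, and it assumes each batch VCF is bgzipped and tabix-indexed, that sample names are unique across batches, and that bcftools, vcffixup, bgzip, and tabix are on the PATH.

```python
import shlex
from typing import List


def build_merge_pipeline(batch_vcfs: List[str], out_vcf: str) -> List[str]:
    """Return shell command strings for merging per-batch genotype VCFs.

    Hypothetical helper: it only assembles the commands, it does not run them.
    Assumes every input is bgzipped and tabix-indexed (bcftools merge needs that).
    """
    if len(batch_vcfs) < 2:
        raise ValueError("need at least two batch VCFs to merge")

    inputs = " ".join(shlex.quote(v) for v in batch_vcfs)
    out = shlex.quote(out_vcf)

    return [
        # Merge sample columns across batches, recompute AC/AF/NS in INFO
        # with vcffixup, then compress the result.
        f"bcftools merge {inputs} | vcffixup - | bgzip -c > {out}",
        # Index the merged, compressed VCF.
        f"tabix -p vcf {out}",
    ]


if __name__ == "__main__":
    for cmd in build_merge_pipeline(
        ["batch1.vcf.gz", "batch2.vcf.gz"], "merged.vcf.gz"
    ):
        print(cmd)
```

The printed commands can be reviewed and pasted into a job script, which keeps the merge easy to rerun if one batch has to be regenerated.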

Alternatively, you could run each batch with the --nt parameter (preserve temp files), copy the per-sample subdirectories into one directory, and then restart the genotyper so that it does the merge and runs vcffixup itself. All it does is take the SV calls from sv_calls/sv_calls.vcf.gz, read them into a table, append a column for each sample, run vcffixup, and finish with bgzip/tabix.
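To illustrate what "append a column for each sample" amounts to, here is a minimal sketch in plain Python. It is not the genotyper's actual code: the function name, the in-memory list-of-lines input, and the GT-only FORMAT field are all assumptions made for the example.

```python
from typing import Dict, List


def append_sample_columns(
    site_lines: List[str],
    genotypes: Dict[str, List[str]],
    fmt: str = "GT",
) -> List[str]:
    """Append FORMAT plus one column per sample to site-only VCF lines.

    Hypothetical illustration: `site_lines` are VCF lines that stop at INFO,
    and `genotypes` maps a sample name to one call per record, in record order.
    """
    n_records = sum(1 for line in site_lines if not line.startswith("#"))
    for name, calls in genotypes.items():
        if len(calls) != n_records:
            raise ValueError(f"{name}: expected {n_records} calls, got {len(calls)}")

    out = []
    record_index = 0
    for line in site_lines:
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            out.append(line)
        elif line.startswith("#CHROM"):
            # Column header: add FORMAT and the sample names.
            out.append("\t".join([line, "FORMAT", *genotypes]))
        else:
            # Record line: add the FORMAT key and each sample's call.
            calls = [sample_calls[record_index] for sample_calls in genotypes.values()]
            out.append("\t".join([line, fmt, *calls]))
            record_index += 1
    return out
```

Because every batch shares the same records through INFO, the sample columns line up row for row. The allele counts and frequencies in INFO are stale after this step, which is why vcffixup runs afterwards.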

Does this help?

jjfarrell commented 4 years ago

Thanks! This is helpful, and I appreciate the suggestions. I am benchmarking a few callers (SMRT-SV, Paragraph, BayesTyper, and SVTyper) to understand their compute requirements. The project has lots of cluster capacity. SMRT-SV does read CRAMs, which is a plus.