brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.

Run smoove for single sample? #217

Closed by karoliinas 1 year ago

karoliinas commented 1 year ago

Hi,

Many thanks for the many awesome tools! I'm currently running structural variant validation for the GIAB HG002 sample, which was sequenced alongside our cohort of samples. My plan is to detect SVs with various tools (e.g. smoove, Manta, CNVnator, GATK gCNV) and merge the results. So far, recall is very low with every tool; for example, smoove gives about 20% recall overall. Is this because I ran it on the HG002 sample alone? Is smoove only meant for cohort analysis?

The data was sequenced on a NovaSeq 6000 at 50X coverage and aligned to hg38 with bwa-mem. I have validated the SNVs (called with DeepVariant) with very good precision and recall, so I doubt there's a problem with the data.

I lifted the HG002_SVs_Tier1_v0.6.vcf.gz to hg38 using UCSC liftover, and I'm using Truvari to validate the calls against the tier 1 regions BED (also lifted over). The liftover is my next suspect for the poor sensitivity, since all tools seem to perform equally badly.
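One way to probe the liftover suspicion is to look for deletion records whose coordinates no longer agree with their stated length. The sketch below is only an illustration: it assumes the lifted-over VCF carries `SVTYPE`, `END` and `SVLEN` in the INFO column, and the function names and the 10 bp tolerance are made up for this example.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flags map to an empty string)."""
    out = {}
    for field in info.split(";"):
        key, _, value = field.partition("=")
        out[key] = value
    return out


def suspicious_records(vcf_lines, tolerance=10):
    """Flag DEL records whose END/POS span looks broken after liftover.

    `tolerance` (in bp) is an arbitrary, hypothetical threshold.
    """
    flagged = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, info = cols[0], int(cols[1]), parse_info(cols[7])
        if info.get("SVTYPE") != "DEL" or "END" not in info:
            continue
        span = int(info["END"]) - pos
        if span < 0:
            flagged.append((chrom, pos, "END before POS"))
        elif "SVLEN" in info:
            # SVLEN may be a comma-separated list; use the first value.
            svlen = abs(int(info["SVLEN"].split(",")[0]))
            if abs(span - svlen) > tolerance:
                flagged.append((chrom, pos, "span differs from SVLEN"))
    return flagged


# Made-up records for illustration only.
lifted = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500;SVLEN=-500\n",
    "chr22\t3000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2900;SVLEN=-400\n",
    "chr22\t7000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=9000;SVLEN=-300\n",
]
flagged = suspicious_records(lifted)
for record in flagged:
    print(record)
```

A large share of flagged records would suggest the liftover altered the truth set. An empty list does not prove the liftover is clean.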

Sorry for rambling a bit. My main question is:

Can I run smoove for a single sample, or is it only applicable in cohort mode?

Best, Karoliina

brentp commented 1 year ago

Hi, are you evaluating deletions only? lumpy, which is used by smoove, does not call insertions.
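A quick way to see how much this matters is to tally the SV types in both the truth set and the call set. The snippet below is a minimal sketch that assumes each record stores its type as `SVTYPE=` in the INFO column; the function name and the example records are invented for illustration.

```python
from collections import Counter


def svtype_counts(vcf_lines):
    """Count records per INFO/SVTYPE value in an iterable of VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        info = line.rstrip("\n").split("\t")[7]
        svtype = "NA"  # used when a record has no SVTYPE field
        for field in info.split(";"):
            if field.startswith("SVTYPE="):
                svtype = field.split("=", 1)[1]
                break
        counts[svtype] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500\n",
    "chr22\t4000\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=4000\n",
    "chr22\t8000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=8300\n",
]
counts = svtype_counts(example)
print(counts)
```

For a compressed file, the lines could come from `gzip.open(path, "rt")`. If the truth set is rich in insertions while the call set has none, overall recall will be capped well below 100% no matter how well deletions are called.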

karoliinas commented 1 year ago

Well, that explains a lot, thanks for the tip! I should've read the docs first. I had noticed that most of the calls were indeed DEL, though there are a small number of DUPs in there too. Good to know this is expected behaviour. I expect smoove's sensitivity will be much higher when limiting the validation to deletions only, similar to what you reported.
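Restricting the validation to deletions could be done by filtering both the truth VCF and the call VCF before comparison. This is a rough sketch, not a recommended pipeline: it assumes the type is recorded as `SVTYPE=DEL` in INFO, and `keep_svtype` is a hypothetical helper name.

```python
def keep_svtype(vcf_lines, wanted="DEL"):
    """Yield header lines plus records whose INFO/SVTYPE equals `wanted`."""
    target = "SVTYPE=" + wanted
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the header so the output is still a VCF
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # skip malformed or empty lines
        if target in cols[7].split(";"):
            yield line


# Made-up records for illustration only.
records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500\n",
    "chr22\t4000\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=4000\n",
    "chr22\t6000\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=6900\n",
]
deletions_only = list(keep_svtype(records))
print("".join(deletions_only), end="")
```

The filtered output would probably need to be compressed and indexed again before a benchmarking tool can read it.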

Combining different callers improves things a bit, but sensitivity still only gets slightly above 50%, although I've only tested this on chr22 (which might not be the best choice).
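To make the effect of combining callers concrete, here is a toy recall calculation over the union of two call sets. It is a simplified sketch: it matches deletions by 50% reciprocal overlap only, whereas Truvari applies its own, richer matching criteria, and all intervals below are invented.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap fraction of two (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def recall(truth, calls, min_ro=0.5):
    """Fraction of truth intervals matched by at least one call."""
    if not truth:
        return 0.0
    hits = sum(
        1 for t in truth
        if any(reciprocal_overlap(t, c) >= min_ro for c in calls)
    )
    return hits / len(truth)


# Made-up deletions for illustration only.
truth = [
    ("chr22", 1000, 2000),
    ("chr22", 5000, 5600),
    ("chr22", 9000, 9400),
    ("chr22", 12000, 13000),
]
caller_a = [("chr22", 1010, 1990)]
caller_b = [("chr22", 5050, 5620), ("chr22", 20000, 21000)]

print(recall(truth, caller_a))             # 0.25
print(recall(truth, caller_b))             # 0.25
print(recall(truth, caller_a + caller_b))  # 0.5
```

Taking the union raises recall only where the callers find different events, and it also pools their false positives, so precision should be checked alongside it.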

Thanks for your swift help, I'll keep digging to find a suitable method for better retrieving the insertions.

brentp commented 1 year ago

On short-read data, the recall for insertions will be quite low, unfortunately, especially for smaller insertions. I think manta, dysgu, and gridss2 are the best choices, with manta being easy to run, widely used, and quite fast.

karoliinas commented 1 year ago

Thanks for pointing me in the right direction; dysgu and gridss2 had completely flown under my radar. I'll be sure to give them a go in this quest to compile a suitable SV-caller concoction to make the most of the data. And yes, it seems that short-read data is not optimal for this task.