brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0
231 stars 21 forks source link

smoove annotate not adding exonic overlaps #171

Closed fakedrtom closed 2 years ago

fakedrtom commented 3 years ago

Hey Brent,

I've noticed for a while now that when I try to use smoove annotate on a Manta VCF it runs completely and adds smoove_gene annotations, but never exonic overlaps. It will add gene, downstream, and upstream, but I never see exon. I can compare SV calls from the same samples generated with Smoove and even when there are seemingly the same event called by Smoove and Manta, only the Smoove call will include "exon" in the smoove_gene annotation (if there is indeed an exonic overlap). Any idea why Manta calls would have this problem? I am running smoove annotate like so:

smoove annotate --gff $GFF manta.vcf.gz | bgzip -c > annotated.vcf.gz

And I'm using the same reference GRCh38 GFF file that I also use when running Smoove:

Homo_sapiens.chr.GRCh38.97.gff3.gz

Thanks

brentp commented 3 years ago

Hi Tom, can you show the variant line from Manta for a case like this? It might be something with the parsing in brentp/vcfgo

fakedrtom commented 3 years ago

Here's an example from the original Manta VCF:

chr19   605528  MantaDUP:TANDEM:1:122044:122045:1:0:0   C   <DUP:TANDEM>    999 PASS    END=608623;SVTYPE=DUP;SVLEN=3095

And this is how it looks after I tried running smoove annotate (I added the PE, SR, SU, and CIPOS/CIEND annotations before running the smoove annotate with a script modeled after svtools approach for doctoring Manta VCFs):

chr19   605528  MantaDUP:TANDEM:1:122044:122045:1:0:0   N   <DUP>   999 PASS    END=608623;SVTYPE=DUP;SVLEN=3095;PE=16;SR=17;SU=33;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;PRPOS=1;PREND=1;STRANDS=-+:33;MSHQ=0;smoove_gene=HCN2|gene:1:3096;GCF=0.597545

For comparison, here is the same call from Smoove with the expected exon annotation in the smoove_gene:

chr19   605528  6502    N   <DUP>   304.5   .   SVTYPE=DUP;SVLEN=3094;END=608622;STRANDS=-+:22;IMPRECISE;CIPOS=-293,49;CIEND=-33,365;CIPOS95=-18,18;CIEND95=-33,54;SU=22;PE=22;SR=0;SNAME=1024-02:5083,1024-01:5477;ALG=PROD;GCF=0.597415;AN=6;AC=2;MSHQ=4;smoove_gene=HCN2|gene:1:3095,HCN2|exon:1:218

Is smoove annotate using the values from the CIPOS/CIEND? In this case, I believe these genomic coordinates should have the DUP fully overlapping an exon regardless of the CI intervals.

brentp commented 3 years ago

I can't see any problem in the code. And I put that variant into vcfgo and it's reporting the correct End. Are you sure you annoated with the same GFF in both cases (for smoove and manta)?

fakedrtom commented 3 years ago

Pretty sure. Both Smoove and Manta are being run by our core, but I then try to add the same annotations to the Manta VCF to be consistent. I checked and made sure I am using the same GFF that the core is using when running Smoove (Homo_sapiens.chr.GRCh38.97.gff3.gz). Looks like smoove annotate expects CIPOS/CIEND so I add those to the Manta VCF with values of 0,0 and that seems to enable smoove annotate to run, but in the end, none of the exons get added. Gene, upstream, and downstream are added, but not exon. I also do not see five_prime_UTR or three_prime_UTR being added either. Could the false CIPOS/CIEND have anything to do with this?

brentp commented 3 years ago

It's not the CIPOS/CIEND, as far as I know, or at least you shoudl get a warning if something is off with those. What did you add to the header for them?

You can also try amwenger/svpack for sv annotation, that's what i've been using.

fakedrtom commented 3 years ago

So Manta VCFs have a CIPOS and CIEND, but not all variants include those annotations so I add them (with 0,0) to those that are missing, but they are listed in the header as:

##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">

Here's all new headers for the new INFOs that I add to the Manta VCFs:

##INFO=<ID=STRANDS,Number=.,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+)">
##INFO=<ID=PRPOS,Number=1,Type=String,Description="Breakpoint probability dist">
##INFO=<ID=PREND,Number=1,Type=String,Description="Breakpoint probability dist">
##INFO=<ID=SU,Number=.,Type=Integer,Description="Number of pieces of evidence supporting the variant across all samples">
##INFO=<ID=PE,Number=.,Type=Integer,Description="Number of paired-end reads supporting the variant across all samples">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Number of split reads supporting the variant across all samples">
##INFO=<ID=INSLEN_ORIG,Number=.,Type=Integer,Description="Original insertion length">
##INFO=<ID=CIPOS95,Number=2,Type=Integer,Description="Confidence interval (95%) around POS for imprecise variants">
##INFO=<ID=CIEND95,Number=2,Type=Integer,Description="Confidence interval (95%) around END for imprecise variants">
##INFO=<ID=SECONDARY,Number=0,Type=Flag,Description="Secondary breakend in a multi-line variant">

I can't image those are having any effect.

I'll look into svpack. Right now I am just trying to be consistent with my Smoove and Manta VCFs. Thanks!

fakedrtom commented 2 years ago

Hey, just to follow up, looks like this was just a versioning problem on my end. Updated smoove and everything looks good now. Sorry for the bother, but I think this can be closed now. Thanks!

brentp commented 2 years ago

Good news. Thanks for following up!