PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0

This package only takes BND notation VCF? #34

Closed yangyxt closed 3 years ago

yangyxt commented 3 years ago

I tried to convert VCF records to GRanges objects and then use the breakpointGRangesToVCF function to normalise symbolic records to BND VCF records.

However, I found that this does not work, because symbolic records are stored as records with IRanges width > 1 in the GRanges object. There is an assertion in .toVcfBreakendNotationAlt, all(width(gr)==1), so records in a GRanges object derived from symbolic VCF records will always fail it.

I tested this with a simple DELLY-generated SV VCF file. Here is a screenshot of the GRanges object returned by breakpointRanges(vcf): [image]

So, generally speaking, StructuralVariantAnnotation cannot normalise the format of SV records in VCF files from different callers? Would it be better to do the normalisation myself, i.e. convert all symbolic records to BND notation and then load the VCF into StructuralVariantAnnotation?

hsiaoyi0504 commented 3 years ago

I have a similar question here. Probably also related to #33. What's the acceptable notation of structural variant calls for StructuralVariantAnnotation? Does it really support both notations of structural variants? Thank you.

d-cameron commented 3 years ago

@yangyxt sorry for the late reply. Can you post which version of DELLY you're using, and a VCF with a few entries in it?

> symbolic records will be stored as records with irange width > 1 in GRange Object

That is actually possible for IMPRECISE events. Without the input VCF, I'm not sure whether that is the case here, or a bug in SVA.
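To illustrate why an IMPRECISE call can end up with a range wider than 1, here is a minimal Python sketch. It assumes the range simply spans the VCF `CIPOS` confidence interval around `POS`; the thread does not spell out how SVA computes the range, and `breakend_interval` is a hypothetical helper, not part of the package.

```python
from typing import Tuple


def breakend_interval(pos: int, cipos: Tuple[int, int] = (0, 0)) -> Tuple[int, int, int]:
    """Return (start, end, width) of a breakend position.

    Assumption: the interval is POS widened by the CIPOS offsets
    (CIPOS=lo,hi with lo <= 0 <= hi, as defined in the VCF spec).
    A precise call has CIPOS=(0, 0) and therefore width 1.
    """
    lo, hi = cipos
    start = pos + lo
    end = pos + hi
    return start, end, end - start + 1


# A precise breakend occupies a single base.
precise = breakend_interval(100)
# An IMPRECISE breakend with CIPOS=-10,10 covers a wider range.
imprecise = breakend_interval(100, (-10, 10))
```

Under this assumption, a width > 1 by itself does not show that a record was parsed incorrectly; it may only reflect positional uncertainty reported by the caller.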

d-cameron commented 3 years ago

Sorry for the long delay - I'm currently updating the documentation to better describe the design of StructuralVariantAnnotation.

> use breakpointGRangesToVCF

breakpointGRangesToVCF is only partially implemented and not officially released.

> I tried to convert vcf records to grange objects to normalise symbolic records to BND vcf records.

StructuralVariantAnnotation already does this in breakpointRanges(). I have test cases for VCFs produced by crest, delly, gridss, manta, pindel, tigra, lumpy, and others.

> What's the acceptable notation of structural variant calls for StructuralVariantAnnotation? Does it really support both notations of structural variants? Thank you.

Any spec-compliant VCF representation (plus a few caller-specific ones I have special-case code for). That is, sequence, symbolic, breakpoint, and breakend notations are all supported. For example, StructuralVariantAnnotation can correctly parse the following VCF:

##fileformat=VCFv4.2
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##ALT=<ID=DEL,Description="Deletion">
##contig=<ID=chr,length=18,sequence="CGTGTtgtagtaCCGTAA">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr 5   sequence    TTGTAGTA    T   .   .   
chr 5   symbolic    T   <DEL>   .   .   SVTYPE=DEL;SVLEN=-7;END=12
chr 5   breakpoint1 T   T[chr:13[   .   .   SVTYPE=BND;MATEID=breakpoint2
chr 13  breakpoint2 C   ]chr:5]C    .   .   SVTYPE=BND;MATEID=breakpoint1
chr 5   breakend    T   T.  .   .   SVTYPE=BND

d-cameron commented 3 years ago

What is not immediately clear from the docs is that SVA turns everything into breakpoint notation. In the DELLY example from the OP, SVA converts DUP000000000 into breakpoint notation, which is why the output includes DUP000000000_bp1 and DUP000000000_bp2, and why INV00026615_bp4 exists (an inversion has 2 breakpoints = 4 breakends).
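As a rough illustration of what "turning everything into breakpoint notation" means, the Python sketch below expands a symbolic DEL, DUP, or INV record into paired breakends using the ALT bracket syntax from the VCF specification. It is not SVA's implementation: `symbolic_to_breakends` and `Breakend` are hypothetical names, the reference base is left as the placeholder `N`, a DUP is assumed to be a tandem duplication, and the `_bpN` suffixes merely mimic the IDs mentioned in this thread.

```python
from typing import List, NamedTuple


class Breakend(NamedTuple):
    chrom: str
    pos: int
    id: str
    alt: str
    mate_id: str


# ALT templates from the VCF spec: the bracket direction encodes orientation.
JOIN_AFTER_RIGHT = "{base}[{chrom}:{mate}["   # sequence continues rightwards from mate
JOIN_BEFORE_LEFT = "]{chrom}:{mate}]{base}"   # sequence ending at mate is joined before
JOIN_AFTER_LEFT = "{base}]{chrom}:{mate}]"    # reverse-complemented piece ending at mate
JOIN_BEFORE_RIGHT = "[{chrom}:{mate}[{base}"  # reverse-complemented piece starting at mate


def symbolic_to_breakends(chrom: str, pos: int, end: int, sv_id: str,
                          svtype: str, base: str = "N") -> List[Breakend]:
    """Expand one symbolic record (POS..END) into mate-paired breakends."""
    if svtype == "DEL":
        # One breakpoint: base before the deletion joins the base after it.
        joins = [(pos, end + 1, JOIN_AFTER_RIGHT, JOIN_BEFORE_LEFT)]
    elif svtype == "DUP":
        # Assumed tandem duplication: end of the copy joins back to its start.
        joins = [(end, pos + 1, JOIN_AFTER_RIGHT, JOIN_BEFORE_LEFT)]
    elif svtype == "INV":
        # Two breakpoints, one at each edge of the inverted segment.
        joins = [(pos, end, JOIN_AFTER_LEFT, JOIN_AFTER_LEFT),
                 (pos + 1, end + 1, JOIN_BEFORE_RIGHT, JOIN_BEFORE_RIGHT)]
    else:
        raise ValueError(f"unsupported SVTYPE: {svtype}")

    out: List[Breakend] = []
    for first, second, alt_first, alt_second in joins:
        n = len(out)
        id_a, id_b = f"{sv_id}_bp{n + 1}", f"{sv_id}_bp{n + 2}"
        out.append(Breakend(chrom, first, id_a,
                            alt_first.format(base=base, chrom=chrom, mate=second), id_b))
        out.append(Breakend(chrom, second, id_b,
                            alt_second.format(base=base, chrom=chrom, mate=first), id_a))
    return out


# The symbolic deletion from the example VCF (chr, POS=5, END=12).
deletion = symbolic_to_breakends("chr", 5, 12, "symbolic", "DEL")
# An inversion yields two breakpoints, i.e. four breakends.
inversion = symbolic_to_breakends("chr", 100, 200, "INV1", "INV")
```

For the deletion, the two breakends land at positions 5 and 13 with the same bracket structure as the `breakpoint1`/`breakpoint2` records in the example VCF above, apart from the placeholder base.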