brentp / cyvcf2

cython + htslib == fast VCF and BCF processing
MIT License
372 stars 72 forks source link

is_indel treats "<NON_REF>" ALT as an indel #217

Open vsbuffalo opened 3 years ago

vsbuffalo commented 3 years ago

The GVCF format uses <NON_REF> as a placeholder for a possible non-reference allele in the ALT column. Because cyvcf2 compares REF to ALT length to determine if a variant is an indel these ALT entries are incorrectly classified as indels.

Given that this labeling is not defined in the VCF standard (AFAIK), and very much a part of the GATK HaplotypeCaller way of doing things, this may not be a bug necessarily, but I wanted to document this for others stumped by this behavior.

brentp commented 3 years ago

ouch. it's been a long time since I looked at the is_* functions/properties. I'll see if I can et a fix for this case. Likely there are similar oversights in the other is_ functions.

brentp commented 3 years ago

I pushed a fix for this exact case and made the is_indel handling more general. But yeah, likely other problems persist. Will fix as I see them. thanks for reporting.

blaiseli commented 7 months ago

I'm rather a beginner with variant calling, so there may be things that I don't understand well, but I noticed while using cyvcf2 to parse some freebayes + snpEff generated .vcf that there were variants that had at the same time is_mnp and is_indel set to True.

The vcf line (excluding individual sample's info) looks as follows:

KV38_Basile    94882   .       AC      AA      0.0     .       AB=0;ABP=0;AC=0;AF=0;AN=156;AO=1824;CIGAR=1M1X;DP=35043;D
PB=35196.5;DPRA=0;EPP=38.2301;EPPR=226.923;GTI=0;LEN=1;MEANALT=1.65385;MQM=44.6255;MQMR=46.4073;NS=78;NUMALT=1;ODDS=270.0
26;PAIRED=1;PAIREDR=1;PAO=155;PQA=5691;PQR=5580;PRO=152;QA=27536;QR=1112793;RO=33155;RPL=1004;RPP=43.3159;RPPR=19.8456;RP
R=820;RUN=1;SAF=12;SAP=3860.23;SAR=1812;SRF=17251;SRP=121.844;SRR=15904;TYPE=snp;technology.Illumina=1;ANN=AA|intergenic_
region|MODIFIER|FNLLGLLA_00047-FNLLGLLA_00051|FNLLGLLA_00047-FNLLGLLA_00051|intergenic_region|FNLLGLLA_00047-FNLLGLLA_000
51|||n.94883C>A||||||      GT:DP:AD:RO:QR:AO:QA:GL 0/0:524:474,46:474:15578:46:771:0,-87.3916,-1302.1

This record, explored from within an ipython session:

In [43]: rec.is_snp
Out[43]: False
In [44]: rec.is_mnp
Out[44]: True
In [45]: rec.REF
Out[45]: 'AC'
In [46]: rec.ALT
Out[46]: ['AA']
In [47]: rec.is_indel
Out[47]: True
In [48]: rec.POS
Out[48]: 94882

I also notice a TYPE=snp item in the vcf line above.

So I'm lost: Should this be considered an indel, a snp or an mnp?

The following line of code in property is_indel suggests me that any variant that is not is_sv and has a REF longer than 1 will be is_indel == True:

https://github.com/brentp/cyvcf2/blob/main/cyvcf2/cyvcf2.pyx#L1928

if len(self.REF) > 1: return True

This would make any mnp an indel also. Intuitively, this doesn't sound correct to me.