andyrimmer / Platypus

Platypus Variant Caller
GNU General Public License v3.0
105 stars 38 forks source link

Interpretation of FORMAT/NR and FORMAT/NV for multiallelic sites #61

Open holtgrewe opened 7 years ago

holtgrewe commented 7 years ago

Dear Andy,

How do I interpret the NR and NV fields in the presence of multi-allelic alleles?

Consider the following line, for example.

hs37d5 35158918 . TGCGGC CGCGGC,CGCGGT 1553 PASS FR=0.25,0.25;MMLQ=33.0;TCR=357;HP=1;WE=35158931;Source=Platypus;WS=35158881;PP=1090.0,1553.0;TR=241,117;NF=88,46;TCF=230;NR=153,71;TC=587;MGOF=40;SbPval=0.56;MQ=58.9;QD=4.667100467;SC=GCCCTGGAGATGCGGCCCCCA;BRF=0.1;HapScore=3 GT:GL:GOF:GQ:NR:NV 1/0:-1.0,-1.0,-1.0:35.0:99:235,235:99,34 2/0:-1.0,-1.0,-1.0:29.0:99:89,89:35,33 2/0:-1.0,-1.0,-1.0:37.0:99:183,183:73,33 1/0:-1.0,-1.0,-1.0:40.0:99:80,80:34,17

What is given here?

andyrimmer commented 7 years ago

The NR and NV values are given for each allele, in the order which the alleles are listed in the ALT column. So the NR values will always be the same (number of reads supporting the reference), and the NV values will be different for each variant allele. These numbers are harder to interpret when you have e.g. repetitive indels because some reads could support either variant allele.

baraaorabi commented 7 years ago

Hey Andy,

I ran Platypus with a single sample, and I got this multi-allelic variant record:

16 3294154 . GTT G,GT 1188 badReads;QD BRF=0.73;FR=0.5000,0.5000;HP=11;HapScore=2;MGOF=17;MMLQ=38;MQ=60.0;NF=0,0;NR=129,129;PP=695,1188;QD=5.45736434109;SC=CAGGTGTTTGGTTTTTTTTTT;SbPval=1.0;Source=Platypus;TC=135;TCF=0;TCR=135;TR=129,129;WE=3294164;WS=3294144 GT:GL:GOF:GQ:NR:NV 1/2:-1,-1,-1:17:99:135,135:129,129

I am confused by NV and NR numbers from FORMAT field here. In INFO field, TC ("Total coverage at this locus") is 135. If as you are saying in this thread, NR of FORMAT is "number of reads supporting the reference", wouldn't this mean that all reads (aka TC) are reads supporting the reference?

Additionally, is the case of this VCF record a case in which all 129 reads support both variants simultaneously? Is there an arithmetic way that I can separate reads into two separate variants directly from the VCF file?

Thanks a ton!

GloverLab commented 14 hours ago

Just reawakening this thread, as I also don't understand it. I have two questions.

First: the documentation says NR # Number of reads covering variant position in this sample NV # Number of reads at variant position which support the called variant in this sample But you above say that NR is 'number of reads supporting the reference'. Which is it?

Second, how to interpret the results? In baraaorabi's comment above, he gives an example of a multi-allelic variant record. If NR is 135,135 and NV is 129, 129, what's going on? Are there 135 reads at this position? If so how can 129 of them support one variant and 129 of them support the other variant? (That adds up to more than 135!) So I don't know how to calculate the number of reads supporting the reference allele here - it can't be 135 - 129 - 129, because that's a negative number of reads!

If you're able to shed any light on this I'd greatly appreciate it.

GloverLab commented 10 hours ago

Update to the first point: it's definitely the number of reads covering the variant position, not the number of reads supporting the reference; variantcaller.pyx line 865 contains NR=[theBuffer.reads.windowEnd - theBuffer.reads.windowStart]. I'm struggling to read the code to properly answer the second, though: the only way that I think it can happen within the code is for a function called variantSupportedByRead to return TRUE for more than one allele.