chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure
66 stars 15 forks source link

problem with normalization #20

Closed parlar closed 9 years ago

parlar commented 9 years ago

I notice that bcbio.variation does not decompose complex variant calls fully (example below). Additionally, I notice that the allele-specific information under the sample columns is not transferred to to the new records. For example, in the AO field (and other fields), there are still five alleles after normalization while there should be only one or two.

I'm not sure if this is by intention?

Prior to normalization: 1 1460791 . AGCTTTTTTTTTTTTTTTTTTTGAGAAG TGCTTTTTTTTTTTTTTTTTTGAGAAG,AGCTTTTTTTTTTTTTTTTTTGAGAAG,AGCTTTTTTTTTTTTTTTTTGAGAAG,AGCTTTTTTTTTTTTTTTTGAGAAG,AGCTTTTTTTTTTTTTTTTTTTTGAGAAG 206.833 . AB=0.0588235,0.156863,0.212121,0.111111,0.137255;ABP=89.2305,55.1682,26.7649,26.6552,61.2994;AC=0,1,1,0,1;AF=0,0.25,0.25,0,0.25;AN=4;AO=3,8,7,2,7;CIGAR=1X2M1D24M,3M1D24M,3M2D23M,3M3D22M,3M1I25M;DP=51;DPB=52.1786;DPRA=0,0,0,0,0;EPP=9.52472,4.09604,3.32051,3.0103,10.7656;EPPR=6.91895;GTI=1;LEN=27,1,2,3,1;MEANALT=13.5,13.5,14,13,13.5;MQM=60,60,60,60,60;MQMR=60;NS=2;NUMALT=5;ODDS=0.134799;PAIRED=1,1,1,1,1;PAIREDR=1;PAO=0,0.783333,0.45,0.2,0.783333;PQA=0,26.8167,15.15,7.4,26.8167;PQR=26.8167;PRO=0.783333;QA=42,233,181,55,222;QR=172;RO=5;RPP=9.52472,20.3821,18.2106,7.35324,10.7656;RPPR=13.8677;RUN=1,1,1,1,1;SAF=0,5,3,1,5;SAP=9.52472,4.09604,3.32051,3.0103,5.80219;SAR=3,3,4,1,2;SRF=4;SRP=6.91895;SRR=1;TYPE=complex,del,del,del,ins;technology.illumina=1,1,1,1,1 GT:GQ:DP:RO:QR:AO:QA:GL 0/5:29.6356:18:3:103:1,1,0,2,2:14,37,0,55,78:-5.71672,-7.72676,-10,-3.44929,-10,-10,-7.35789,-10,-10,-10,-2.90839,-10,-8.35979,-10,-10,0,-9.72887,-5.45141,-9.36,-4.9105,-7.99216 2/3:2.63685:33:2:69:2,7,7,0,5:28,196,181,0,144:-10,-10,-10,-8.75853,-10,-10,-10,-10,0,-10,-10,-10,-10,-10,-10,-10,-10,-2.80037,-4.15467,-10,-10

After normalization: 1 1460791 . A T 206.83 . AB=0.0588235,0.156863,0.212121,0.111111,0.137255;ABP=89.2305,55.1682,26.7649,26.6552,61.2994;AC=0,1,1,0,1;AF=0,0.25,0.25,0,0.25;AN=4;AO=3,8,7,2,7;CIGAR=1X2M1D24M,3M1D24M,3M2D23M,3M3D22M,3M1I25M;DP=51;DPB=52.1786;DPRA=0,0,0,0,0;EPP=9.52472,4.09604,3.32051,3.0103,10.7656;EPPR=6.91895;GTI=1;LEN=27,1,2,3,1;MEANALT=13.5,13.5,14,13,13.5;MQM=60,60,60,60,60;MQMR=60;NS=2;NUMALT=5;ODDS=0.134799;PAIRED=1,1,1,1,1;PAIREDR=1;PAO=0,0.783333,0.45,0.2,0.783333;PQA=0,26.8167,15.15,7.4,26.8167;PQR=26.8167;PRO=0.783333;QA=42,233,181,55,222;QR=172;RO=5;RPP=9.52472,20.3821,18.2106,7.35324,10.7656;RPPR=13.8677;RUN=1,1,1,1,1;SAF=0,5,3,1,5;SAP=9.52472,4.09604,3.32051,3.0103,5.80219;SAR=3,3,4,1,2;SRF=4;SRP=6.91895;SRR=1;TYPE=complex,del,del,del,ins;technology.illumina=1,1,1,1,1 GT:AO:DP:GQ:PL:QA:QR:RO 0/0:1,1,0,2,2:18:30:57,77,100,34,100,100,74,100,100,100,29,100,84,100,100,0,97,55,94,49,80:14,37,0,55,78:103:3 0/0:2,7,7,0,5:33:3:100,100,100,88,100,100,100,100,0,100,100,100,100,100,100,100,100,28,42,100,100:28,196,181,0,144:69:2 1 1460793 . CTTT C,CTTTT 206.83 . AB=0.0588235,0.156863,0.212121,0.111111,0.137255;ABP=89.2305,55.1682,26.7649,26.6552,61.2994;AC=0,1,1,0,1;AF=0,0.25,0.25,0,0.25;AN=4;AO=3,8,7,2,7;CIGAR=1X2M1D24M,3M1D24M,3M2D23M,3M3D22M,3M1I25M;DP=51;DPB=52.1786;DPRA=0,0,0,0,0;EPP=9.52472,4.09604,3.32051,3.0103,10.7656;EPPR=6.91895;GTI=1;LEN=27,1,2,3,1;MEANALT=13.5,13.5,14,13,13.5;MQM=60,60,60,60,60;MQMR=60;NS=2;NUMALT=5;ODDS=0.134799;PAIRED=1,1,1,1,1;PAIREDR=1;PAO=0,0.783333,0.45,0.2,0.783333;PQA=0,26.8167,15.15,7.4,26.8167;PQR=26.8167;PRO=0.783333;QA=42,233,181,55,222;QR=172;RO=5;RPP=9.52472,20.3821,18.2106,7.35324,10.7656;RPPR=13.8677;RUN=1,1,1,1,1;SAF=0,5,3,1,5;SAP=9.52472,4.09604,3.32051,3.0103,5.80219;SAR=3,3,4,1,2;SRF=4;SRP=6.91895;SRR=1;TYPE=complex,del,del,del,ins;technology.illumina=1,1,1,1,1 GT:AO:DP:GQ:PL:QA:QR:RO 0|2:1,1,0,2,2:18:30:57,77,100,34,100,100,74,100,100,100,29,100,84,100,100,0,97,55,94,49,80:14,37,0,55,78:103:3 0|0:2,7,7,0,5:33:3:100,100,100,88,100,100,100,100,0,100,100,100,100,100,100,100,100,28,42,100,100:28,196,181,0,144:69:2

chapmanb commented 9 years ago

Pär; Thanks for the report. bcbio.variation isn't going to do great in these very complex cases and the right thing to do is to pre-process this upstream with other third party tools. We try to do this with freebayes and I checked in some additional tools in that workflow to remove and cleanup duplicates. After running through those tools it will look like:

1   1460791 .   A   T   206.833 .   AB=0.0588235;ABP=89.2305;AC=0;AF=0;AN=4;AO=3;CIGAR=1X2M1D24M;DP=51;DPB=52.1786;DPRA=0;EPP=9.52472;EPPR=6.91895;GTI=1;LEN=1;MEANALT=13.5;MQM=60;MQMR=60;NS=2;NUMALT=5;ODDS=0.134799;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=26.8167;PRO=0.783333;QA=42;QR=172;RO=5;RPP=9.52472;RPPR=13.8677;RUN=1;SAF=0;SAP=9.52472;SAR=3;SRF=4;SRP=6.91895;SRR=1;TYPE=snp;technology.illumina=1  GT:GQ:DP:RO:QR:AO:QA:GL 0|0:29:63685:3:103:1,1,0,2,2:14,37,0,55,78:-5.71672,-7.72676,-10,-3.44929,-10,-10,-7.35789,-10,-10,-10,-2.90839,-10,-8.35979,-10,-10,0,-9.72887,-5.45141,-9.36,-4.9105,-7.99216 0|0:2:33:2:69:2,7,7,0,5:28,196,181,0,144:-10,-10,-10,-8.75853,-10,-10,-10,-10,0,-10,-10,-10,-10,-10,-10,-10,-10,-2.80037,-4.15467,-10,-10
1   1460793 .   CTTT    CTTTT,CTT,CT,C  206.833 .   AB=0.137255,0.156863,0.212121,0.111111;ABP=61.2994,55.1682,26.7649,26.6552;AC=1,1,1,0;AF=0.25,0.25,0.25,0;AN=4;AO=7,8,7,2;CIGAR=3M1I25M,3M1D24M,3M2D23M,3M3D22M;DP=51,51,51,51;DPB=52.1786,52.1786,52.1786,52.1786;DPRA=0,0,0,0;EPP=10.7656,4.09604,3.32051,3.0103;EPPR=6.91895,6.91895,6.91895,6.91895;GTI=1,1,1,1;LEN=1,1,2,3;MEANALT=13.5,13.5,14,13;MQM=60,60,60,60;MQMR=60,60,60,60;NS=2;NUMALT=5,5,5,5;ODDS=0.134799,0.134799,0.134799,0.134799;PAIRED=1,1,1,1;PAIREDR=1,1,1,1;PAO=0.783333,0.783333,0.45,0.2;PQA=26.8167,26.8167,15.15,7.4;PQR=26.8167,26.8167,26.8167,26.8167;PRO=0.783333,0.783333,0.783333,0.783333;QA=222,233,181,55;QR=172,172,172,172;RO=5,5,5,5;RPP=10.7656,20.3821,18.2106,7.35324;RPPR=13.8677,13.8677,13.8677,13.8677;RUN=1,1,1,1;SAF=5,5,3,1;SAP=5.80219,4.09604,3.32051,3.0103;SAR=2,3,4,1;SRF=4,4,4,4;SRP=6.91895,6.91895,6.91895,6.91895;SRR=1,1,1,1;TYPE=ins,del,del,del;technology.illumina=1,1,1,1  GT:GQ:DP:RO:QR:AO:QA:GL 0|1:29:63685:3:103:1,1,0,2,2:14,37,0,55,78:-5.71672,-7.72676,-10,-3.44929,-10,-10,-7.35789,-10,-10,-10,-2.90839,-10,-8.35979,-10,-10,0,-9.72887,-5.45141,-9.36,-4.9105,-7.99216 2|3:2:33:2:69:2,7,7,0,5:28,196,181,0,144:-10,-10,-10,-8.75853,-10,-10,-10,-10,0,-10,-10,-10,-10,-10,-10,-10,-10,-2.80037,-4.15467,-10,-10

So it's correct in terms of splitting compared to bcbio.variation but does still retain the extra AO parts. The tradeoff in using vcfallellicprimitives --keep-geno --keep-info is that annotations like this can get out of sorts. The alternative is dropping them, which I think is less useful since you lose the information entirely, or keeping things as MNPs, which makes validation, comparison and downstream processing difficult. So there is no great answer or tool for these especially difficult cases and we've tried to make the best choice out of imperfect options.

If you know any better normalization tools that resolve these type of issue we'd be happy to plug them into bcbio. Thanks again for the feedback and sorry to not have a better answer.