WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

table_annovar.pl : Output file inconsistencies if --onetranscript is not used #3

Closed viv-1 closed 9 years ago

viv-1 commented 9 years ago

Hello,

I found that with the default parameters for HGVS annotation, some lines have more fields than others. When a variant falls on multiple NM transcripts, the multiple HGVS annotations are separated by tabs. This is a problem for scripts that parse ANNOVAR output.

Best regards

kaichop commented 9 years ago

Can you please provide details, including examples?

viv-1 commented 9 years ago

If I run this command:

table_annovar.pl myfile.var /data/annotations/Galaxy/Human/hg19/annovar/ -buildver hg19 -protocol refGene -operation g -nastring NA -outfile ./test -otherinfo

I can get, for example, a line like this (header first, then the data line):

Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Otherinfo

chr2 215593261 215593261 -0 T UTR3 BARD1 NM_000465:c.*138_*139insA NM_001282549:c.*138_*139insA NM_001282548:c.*138_*139insA NM_001282543:c.*138_*139insA NM_001282545:c.*138_*139insA NA NA chr2 215593261 . C CT . PASS ADP=497;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:0:847:497:130:247:33.56%:9.8E-1:36:23:108:22:136:111

You can see that the HGVS annotations for multiple NM accessions are separated by tabs. This causes a shift: in this line, the entries do not correspond to what is described in the header.

I also found the same thing today when multiple gene names / accessions are reported for variants tagged as upstream. Example:

chr2 215674436 215674436 C T upstream AC072062.1 BARD1 NA NA NA chr2 215674436 . C T . PASS ADP=171;WT=0;HET=0;HOM=1;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 1/1:0:171:171:0:171:100%:9.8E-1:0:38:0:0:0:171

I used the latest version of ANNOVAR, which you published 2 days ago.
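The column shift described above can be checked for mechanically. The following is a minimal sketch, not part of ANNOVAR: the helper name `field_counts` and the sample rows are hypothetical, and it assumes a tab-delimited file with one header line. It groups data lines by their number of tab-separated fields. Data rows are compared with each other rather than with the header, on the assumption that with -otherinfo the trailing columns may not map one-to-one onto header names.

```python
from collections import defaultdict
import io


def field_counts(handle, delimiter="\t", skip_header=True):
    """Map each observed field count to the line numbers that have it."""
    counts = defaultdict(list)
    for line_number, line in enumerate(handle, start=1):
        if skip_header and line_number == 1:
            continue  # the header is not compared (see assumption above)
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        counts[len(line.split(delimiter))].append(line_number)
    return dict(counts)


# Hypothetical, shortened rows: line 3 has the gene names split by a tab.
sample = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n"
    "chr2\t215674436\t215674436\tC\tT\tupstream\tBARD1,LOC101928103\n"
    "chr2\t215674436\t215674436\tC\tT\tupstream\tAC072062.1\tBARD1\n"
)

result = field_counts(io.StringIO(sample))
for width, lines in sorted(result.items()):
    print(f"{width} fields: lines {lines}")
```

If every data row reports the same field count, the file itself is not shifted and the problem is more likely in how it is being opened or parsed.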

kaichop commented 9 years ago

I am unable to reproduce this.

The first line is invalid input (chr2 215593261 215593261 -0 T), and it should not even generate a result.

The second input (chr2 215674436 215674436 C T) is valid, and my result looks correct:

chr2 215674436 215674436 C T upstream BARD1,LOC101928103 NA NA NA

For the first input, if I change it to a valid input (chr2 215593261 215593261 0 T), I get the correct result:

chr2 215593261 215593261 0 T UTR3 BARD1 NM_000465:c.*1390>A,NM_001282543:c.*1390>A,NM_001282549:c.*1390>A,NM_001282548:c.*1390>A,NM_001282545:c.*1390>A NA NA

My guess is that you have used Microsoft Excel to open the TXT files, and you have selected both "separate by tab" and "separate by comma", so that your columns are shifted. Please examine the result file in a text editor to confirm.
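As an alternative to a text editor, the raw delimiters can be inspected with a short script. This is a rough sketch under stated assumptions: the helper `describe_line` is hypothetical, and the row is the upstream result quoted above with tabs assumed between columns. It shows that splitting on tabs alone keeps the comma-joined gene names in one field, while splitting on both tabs and commas, as a spreadsheet import with both separators enabled would, yields an extra column.

```python
import re


def describe_line(line):
    """Summarise the delimiters found in one output line."""
    line = line.rstrip("\r\n")
    return {
        "tabs": line.count("\t"),
        "commas": line.count(","),
        "tab_fields": line.split("\t"),
    }


# Assumed raw form of the upstream result quoted above (tabs between columns).
row = "chr2\t215674436\t215674436\tC\tT\tupstream\tBARD1,LOC101928103\tNA\tNA\tNA\n"

info = describe_line(row)
print(repr(row))  # repr() makes tabs visible as \t
print("tab-separated fields:", len(info["tab_fields"]))

# Splitting on tabs AND commas mimics an import with both separators enabled.
both_separators = re.split(r"[\t,]", row.rstrip("\r\n"))
print("fields when commas also split:", len(both_separators))
```

A parser for this output would therefore split on tabs only, and split on commas afterwards inside the columns that can hold several values.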