WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

wrong column number of abraom annotation #121

Open hurleyLi opened 3 years ago

hurleyLi commented 3 years ago

Hi, when I use the abraom annotation, most variants get 3 columns, but for some variants the abraom annotation returns only two columns in the .txt file. The .vcf output is also truncated (it stops at the offending variant), with an error message such as:

prefield not defined (X 69504078 69504078 T C 0.000000 VQSRTrancheSNP99.00to99.90 . 60 . X 69504078 . T C 60 . .) with field=19 and prefield=18 at /users/hl7/analysis/annotation/annovar/table_annovar.pl line 186, <MANNO> line 4.
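The message suggests that a row in the intermediate tab-delimited output has a different number of fields than expected. A minimal sketch for locating such rows, assuming the file is tab-separated and that its first line has the expected column count; the function name `report_ragged_rows` is hypothetical and is not part of ANNOVAR:

```shell
#!/bin/bash
# Hypothetical helper (not part of ANNOVAR): print "line_number<TAB>field_count"
# for every row whose tab-separated field count differs from that of the
# first line of the file.
report_ragged_rows() {
    awk -F'\t' '
        NR == 1 { expected = NF; next }     # first line sets the expected count
        NF != expected { print NR "\t" NF } # report any row that deviates
    ' "$1"
}
```

It could be called on the .txt file that table_annovar.pl writes, for example `report_ragged_rows test.hg19_multianno.txt`, where the file name is an assumption based on `-out test` and `-buildver hg19`.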

Here's a file you can use to replicate the error:

##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
X   69504078    .   T   G   60  .   .
X   69504078    .   T   C   60  .   .
X   69504079    .   T   G   60  .   .

Here's my command

/users/hl7/analysis/annotation/annovar/table_annovar.pl test.vcf /users/hl7/analysis/annotation/annovar/humandb/ -buildver hg19 -out test -remove -nastring . -polish -protocol abraom -operation f -vcfinput

Thanks!

mizzlefeng commented 4 months ago

I have same error. the following error message for example: NOTICE: VCF output is written to /home/mizzle/wes1106/align/6.VCF_File/annovar/wes.hg38_multianno.vcf prefield not defined (chr1 930284 930284 C T exonic SAMD11 . nonsynonymous_SNV SAMD11:NM_001385640:exon3:c.C739T:p.R247W,SAMD11:NM_001385641:exon3:c.C739T:p.R247W,SAMD11:NM_152486:exon3:c.C202T:p.R68W exonic ENST00000616016.5\x3bENST00000616125.5\x3bENST00000617307.5\x3bENST00000618181.5\x3bENST00000618323.5\x3bENST00000618779.5\x3bENST00000622503.5\x3bSAMD11 . nonsynonymous_SNV ENST00000616125.5:ENST00000616125.5:exon2:c.C202T:p.R68W,ENST00000617307.5:ENST00000617307.5:exon2:c.C202T:p.R68W,ENST00000618181.5:ENST00000618181.5:exon2:c.C202T:p.R68W,ENST00000618779.5:ENST00000618779.5:exon2:c.C202T:p.R68W,ENST00000622503.5:ENST00000622503.5:exon2:c.C202T:p.R68W,SAMD11:ENST00000342066.8:exon3:c.C202T:p.R68W,SAMD11:ENST00000437963.5:exon3:c.C202T:p.R68W,ENST00000616016.5:ENST00000616016.5:exon3:c.C739T:p.R247W,ENST00000618323.5:ENST00000618323.5:exon3:c.C739T:p.R247W 1500694 not_provided MedGen:C3661900 criteria_provided,_single_submitter Uncertain_significance 0.0001 0 0 0.0013 0 5.191e-05 0 0 5.15e-05 5.267e-05 5.871e-05 4.421e-05 0.0002 4.208e-05 3.838e-05 8.235e-05 5.808e-05 0 0 3.844e-05 0.0002 0 0 5.496e-05 4.989e-05 3.525e-05 7.7e-05 rs199655347 0.091 0.396 T 0.102 0.486 T 0.01 0.155 B 0.003 0.087 B 0.107 0.196 N 0.989 0.243 N 2.075 0.570 M . . . -4.63 0.792 D 0.25 0.424 -1.040 0.172 T 0.062 0.258 T 0.047 0.039 T 0.018 0.393 T 0.052 0.150 . . 0.312 0.308 0.008 0.007 0.387 0.233 T 0.028 0.225 T -0.350 0.047 T -0.406 0.327 T 0.170 0.186 T 0.798 0.481 T .\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b .\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b 1.385 0.202 14.98 0.991 0.537 0.606 0.312 D . . . -0.844 0.122 -0.769 0.153 0.337 0.195 0.598 0.345 0 . . 4.39 2.45 0.290 2.979 0.490 -0.364 0.067 0.141 0.235 0.001 0.051 7.752 0.281 .\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b.\x3b. . . . . . 
0.01818 9394.65 106 chr1 930284 rs199655347 C T 9394.65 PASS AC=2;AF=0.018;AN=110;BaseQRankSum=2.35;DB;DP=19375;ExcessHet=0.0400;FS=1.758;InbreedingCoeff=-0.0185;MLEAC=2;MLEAF=0.018;MQ=60.00;MQRankSum=0.00;QD=11.61;ReadPosRankSum=-4.720e-01;SOR=0.576;VQSLOD=5.89;culprit=FS GT:AD:DP:GQ:PL 0/0:151,0:151:99:0,120,1800 0/0:519,0:519:99:0,120,1800 0/0:576,0:576:99:0,120,1800 0/0:612,0:612:99:0,120,1800 0/0:464,0:464:99:0,120,1800 0/0:495,0:495:99:0,120,1800 0/0:476,0:476:99:0,120,1800 0/0:281,0:281:99:0,120,1800 0/0:212,0:212:99:0,120,1800 0/0:374,0:374:99:0,120,1800 0/0:523,0:523:99:0,120,1800 0/0:514,0:514:99:0,120,1800 0/0:530,0:530:99:0,120,1800 0/0:751,0:751:99:0,120,1800 0/0:633,0:633:99:0,120,1800 0/0:919,0:919:99:0,120,1800 0/0:74,0:74:99:0,120,1800 0/0:584,0:584:99:0,120,1800 0/0:376,0:376:99:0,120,1800 0/0:99,0:99:99:0,120,1800 0/0:639,0:639:99:0,120,1800 0/0:663,0:663:99:0,120,1800 0/0:650,0:650:99:0,120,1800 0/0:555,0:555:99:0,120,1800 0/0:563,0:563:99:0,120,1800 0/1:305,264:569:99:6828,0,7820 0/0:708,0:708:99:0,120,1800 0/0:511,0:511:99:0,120,1800 0/0:644,0:644:99:0,120,1800 0/0:503,0:503:99:0,120,1800 0/0:572,0:572:99:0,120,1800 0/0:478,0:478:99:0,120,1800 0/0:93,0:93:99:0,120,1800 0/0:125,0:125:99:0,120,1800 0/0:125,0:125:99:0,120,1800 0/0:124,0:124:99:0,120,1800 0/0:150,0:150:99:0,120,1800 0/0:150,0:150:99:0,120,1800 0/0:142,0:142:99:0,120,1800 0/0:166,0:166:99:0,120,1800 0/0:60,0:60:99:0,120,1800 0/0:182,0:182:99:0,120,1800 0/0:117,0:117:99:0,120,1800 0/1:127,113:240:99:2596,0,2826 0/0:119,0:119:99:0,120,1800 0/0:132,0:132:99:0,120,1800 0/0:136,0:136:99:0,120,1800 0/0:169,0:169:99:0,120,1800 0/0:142,0:142:99:0,120,1800 0/0:137,0:137:99:0,120,1800 0/0:142,0:142:99:0,120,1800 0/0:163,0:163:99:0,120,1800 0/0:125,0:125:99:0,120,1800 0/0:86,0:86:99:0,120,1800 0/0:106,0:106:99:0,120,1800) with field=227 and prefield=225 at /home/mizzle/software/annovar/table_annovar.pl line 186, <MANNO> line 7.

Here is my code:

table_annovar.pl $vqsr_file $humandb \
    -buildver hg38 --thread 12 \
    -out ${outfile_dir}/wes \
    -remove \
    -protocol refGene,knownGene,clinvar_20240502,exac03,gnomad40_exome,esp6500siv2_all,avsnp150,dbnsfp42a,dbscsnv11,cosmic99 \
    -operation g,g,f,f,f,f,f,f,f,f \
    -nastring . \
    -vcfinput

fengwei-li commented 3 months ago

This is caused by duplicate variant annotations in your annotation library file. For example, my cosmic database contains:

1 65797 65797 T C ID=COSV58737189;OCCURENCE=1(COSO27984905)
1 65797 65797 T C ID=COSV58737189;OCCURENCE=1(COSO27984905)
1 66041 66041 A G ID=COSV58737025;OCCURENCE=1(COSO28864826)
1 66041 66041 A G ID=COSV58737025;OCCURENCE=1(COSO28864826)
1 66131 66131 C G ID=COSV58737120;OCCURENCE=1(COSO32664862)
1 66131 66131 C G ID=COSV58737120;OCCURENCE=1(COSO32664862)
1 66161 66162 TA - ID=COSV58736766;OCCURENCE=1(COSO32054826)
1 66161 66162 TA - ID=COSV58736766;OCCURENCE=1(COSO32054826)

You need to remove the duplicate annotations from your annotation library.

#!/bin/bash

# Define the input and output files
input_file="hg38_cosmic100.txt"
output_file="hg38_cosmic100_unique.txt"

# Create a temporary file to store the de-duplication results
temp_file=$(mktemp)

# Use awk to de-duplicate: for each Chr/Start/End/Ref/Alt key, keep the line
# whose first OCCURENCE count (in column 6) is the highest
awk '
BEGIN { FS = OFS = "\t" }
{
    key = $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5
    # Extract the number that follows "OCCURENCE=" (0 if absent)
    occ = 0
    if (match($6, /OCCURENCE=[0-9]+/)) {
        occ = substr($6, RSTART + 10, RLENGTH - 10) + 0
    }
    if (!(key in seen) || occ > seen[key]) {
        seen[key] = occ
        max_line[key] = $0
    }
}
# "for (k in ...)" iterates in no particular order; re-sort the output if order matters
END { for (k in max_line) print max_line[k] }
' "$input_file" > "$temp_file"

# Move the temporary file to the output file
mv "$temp_file" "$output_file"

echo "De-duplication completed, results are saved in $output_file"
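To confirm whether a database file contains duplicated variants at all (before or after de-duplicating), the keys can be listed directly. This is a minimal sketch, assuming a tab-separated file whose first five columns are Chr, Start, End, Ref and Alt and whose header lines start with `#`; the function name `list_duplicate_keys` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper (not part of ANNOVAR): print each Chr/Start/End/Ref/Alt
# key that appears on more than one line of a tab-separated database file.
# Lines starting with "#" are treated as headers and skipped.
list_duplicate_keys() {
    grep -v '^#' "$1" | cut -f1-5 | LC_ALL=C sort | uniq -d
}
```

Empty output means every key is unique. Because the work is done by an external `sort` rather than an in-memory awk array, this approach should cope better with files of many millions of lines.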

kaichop commented 3 months ago

Thank you Fengwei. Do you see this only in the custom-made cosmic database, or do you see it in other databases too? If only cosmic has this issue, I wonder whether a new prepare_annovar_user.pl could address the duplicated records in the annotation library.

mizzlefeng commented 3 months ago

I have only seen this problem in the cosmic database. Thanks to fengwei-li; I also found this problem later myself, but I still appreciate your solution. @kaichop @fengwei-li

kaichop commented 3 months ago

May I ask which version of cosmic? I suppose it is a custom-made cosmic annotation database? I can try to find a solution for this.

mizzlefeng commented 3 months ago

cosmic99. Below is the source data I downloaded: [image]

fengwei-li commented 3 months ago

cosmic100_hg38 and cosmic99_hg38

kaichop commented 1 month ago

I want to provide an update on this issue. I was not able to reproduce the problem in either cosmic99 or cosmic100. The commands for hg38_cosmic100_coding are shown below. As you can see, COSV58737189 occurs only once. I suspect that when you ran prepare_annovar_user.pl, you processed the same set of files twice (as opposed to processing the coding and noncoding files separately), which results in duplicated lines in the annotation database.

[wangk@biocluster cosmic100]$ tar xvf Cosmic_GenomeScreensMutant_Vcf_v100_GRCh38.tar
[wangk@biocluster cosmic100]$ tar xvf Cosmic_GenomeScreensMutant_Tsv_v100_GRCh38.tar
[wangk@biocluster cosmic100]$ gunzip Cosmic_GenomeScreensMutant_v100_GRCh38.vcf.gz
[wangk@biocluster cosmic100]$ gunzip Cosmic_GenomeScreensMutant_v100_GRCh38.tsv.gz
[wangk@biocluster cosmic100]$ echo -e '#Chr\tStart\tEnd\tRef\tAlt\tCOSMIC100' > hg38_cosmic100_coding.txt
[wangk@biocluster cosmic100]$ prepare_annovar_user.pl -dbtype cosmic Cosmic_GenomeScreensMutant_v100_GRCh38.tsv -vcf Cosmic_GenomeScreensMutant_v100_GRCh38.vcf >> hg38_cosmic100_coding.txt
[wangk@dragon cosmic100]$ fgrep COSV58737189 hg38_cosmic100_coding.txt
1     65797 65797 T     C     ID=COSV58737189;OCCURENCE=1(COSO27984905)

[wangk@dragon cosmic100]$ sort hg38_cosmic100_coding.txt | uniq > hg38_cosmic100_coding_unique.txt
[wangk@dragon cosmic100]$ wc -l hg38_cosmic100_coding.txt hg38_cosmic100_coding_unique.txt
13402472 hg38_cosmic100_coding.txt
13402472 hg38_cosmic100_coding_unique.txt
26804944 total