WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

convert2annovar.pl create empty file #183

Open CLAIRE-cuhk opened 2 years ago

CLAIRE-cuhk commented 2 years ago

Hi Developers,

I am trying to convert my VCF file (created by cuteSV) into avinput using convert2annovar.pl. Here is my script:

    perl /home/ucnvlw0/Scratch/bioinfo_tools/Annovar/annovar/convert2annovar.pl --format vcf4 /home/ucnvlw0/Scratch/TALL_Project/results_wly/ont_annotation/D0257_cuteSV_SV_modified.vcf --outfile D0257_cuteSV.avinput

The program finished immediately and created an empty result file. The messages were as follows:

    NOTICE: Finished reading 40182 lines from VCF file
    NOTICE: A total of 39959 locus in VCF file passed QC threshold, representing 0 SNPs (0 transitions and 0 transversions) and 39126 indels/substitutions
    NOTICE: Finished writing 0 SNP genotypes (0 transitions and 0 transversions) and 0 indels/substitutions for 1 sample
    WARNING: 750 invalid alternative alleles found in input file
    WARNING: 833 invalid reference alleles found in input file

I found issue #76 (https://github.com/WGLab/doc-ANNOVAR/issues/76, posted on 22 Oct 2019), which describes a similar problem, and followed the suggestion there to skip the conversion step and annotate the VCF file directly. I tried the following script and ended up with everything in .invalid_input.

Script:

    perl /home/ucnvlw0/Scratch/bioinfo_tools/Annovar/annovar/annotate_variation.pl -geneanno -out D0257_annovar -buildver hg38 ./D0257.cuteSV.SV.vcf /home/ucnvlw0/Scratch/bioinfo_tools/Annovar/annovar/humandb/hg38

Results:

    0 Apr 5 23:56 D0257_annovar.exonic_variant_function
    18868595 Apr 5 23:56 D0257_annovar.invalid_input
    907 Apr 5 23:56 D0257_annovar.log
    0 Apr 5 23:56 D0257_annovar.variant_function

Log file:

    NOTICE: Output files are written to D0257_annovar.variant_function, D0257_annovar.exonic_variant_function
    NOTICE: Reading gene annotation from /home/ucnvlw0/Scratch/bioinfo_tools/Annovar/annovar/humandb/hg38/hg38_refGene.txt ... Done with 88819 transcripts (including 21511 without coding sequence annotation) for 28307 unique genes
    NOTICE: Variants with invalid input format are written to D0257_annovar.invalid_input

Here is what my VCF file looks like:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NULL
chr1 35143 cuteSV.BND.0 N N[chr20:60003[ . PASS PRECISE;SVTYPE=BND;RE=11;RNAMES=NULL GT:DR:DV:PL:GQ ./.:.:11:.,.,.:.
chr1 66241 cuteSV.DEL.0 TATATTATATAATATATAATATAAATATAATATAAATTAT T . PASS PRECISE;SVTYPE=DEL;SVLEN=-39;END=66280;CIPOS=-5,5;CILEN=-1,1;RE=22;RNAMES=NULL;STRAND=+- GT:DR:DV:PL:GQ ./.:.:22:.,.,.:.
chr1 90275 cuteSV.INS.0 C CTGGAGGAAGACAGTCCTCAGTCCCTCTTGCTTGCCAACCAGTTAACCTGCTGCTTCC . PASS PRECISE;SVTYPE=INS;SVLEN=57;END=90275;CIPOS=-33,33;CILEN=-1,1;RE=19;RNAMES=NULL GT:DR:DV:PL:GQ ./.:.:19:.,.,.:.
chr1 136924 cuteSV.INS.1 C CGGCTGACCCTCAGTGTGGGAGGGGCCGGTGTGAGGCAAGGGGCTCACGCTGGACCTCTGTCCGCGTGGGAGGGGCCGGTGTGAGACAGTACCGGGCTGACCTCTCTCAGCGTGGGAGGGGCCGGTGTGAGGCAAGGGGCCCGGGCTGACCTCTCAGCGTGGGAGGGGGCCAGTGTGAGGGCAAGGGCTCACACTGACCCTCTCAGCATGGGAGGGGCCGGCAGAGACAAGGGGCC . PASS PRECISE;SVTYPE=INS;SVLEN=235;END=136924;CIPOS=-77,77;CILEN=-3,3;RE=5;RNAMES=NULL GT:DR:DV:PL:GQ ./.:.:5:.,.,.:.

Please could you help out?

Many thanks.

kaichop commented 2 years ago

If you have a set of SV calls, it is best that you write a simple script to create "chr start end 0 0 info" lines, one for each SV call, then annotate it. There is no standard VCF format for SV calls, so different software handles this differently: some callers spell out the reference allele, in which case you can calculate the length of the SV call, but many others do not. Your log shows "invalid alternative allele" and "invalid reference allele" warnings, which is why no output is written to the avinput file. Additionally, the cuteSV call has no GT information (it is "./." meaning there is no SV at this location).
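A minimal Python sketch of the kind of script suggested above, assuming the tab-separated cuteSV layout shown in the original post and that the END key in INFO gives the end coordinate. The function names (`parse_info`, `sv_to_region_line`, `convert`) and the choice to fall back to POS when END is absent (as for BND records) are illustrative assumptions, not part of ANNOVAR.

```python
def parse_info(info):
    """Turn 'PRECISE;SVTYPE=DEL;END=66280' into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def sv_to_region_line(vcf_line):
    """Build one 'chr start end 0 0 info' line from one VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, sv_id = cols[0], int(cols[1]), cols[2]
    info = parse_info(cols[7])
    # Assumption: records without END (e.g. BND) are reported as a single position.
    end = int(info.get("END", pos))
    # Keep start <= end even if a caller reports END before POS.
    start, end = min(pos, end), max(pos, end)
    comment = f"{sv_id};SVTYPE={info.get('SVTYPE', '.')}"
    return "\t".join([chrom, str(start), str(end), "0", "0", comment])


def convert(vcf_path, out_path):
    """Write one region line per non-header, non-empty VCF record."""
    with open(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#") or not line.strip():
                continue
            out.write(sv_to_region_line(line) + "\n")
```

The output file is meant to be used as the query file for annotation in place of the raw VCF. Only the coordinates and a short comment are carried over; the allele columns are the "0 0" placeholders from the suggestion above.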


CLAIRE-cuhk commented 2 years ago

Hi developer,

Thank you for your reply!

> Additionally, the cuteSV call has no GT information (it is "./." meaning there is no SV at this location).

So is it actually the missing GT information (./.) that leads to the invalid alternative and reference allele warnings?

> If you have a set of SV calls, it is best that you write a simple script to create "chr start end 0 0 info" lines, one for each SV call, then annotate it.

Thank you for the suggestion. I am trying to prepare the input file with a simple script. I know that deletions, insertions and block substitutions in your avinput example look like the following:

1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution

I can extract the start and end points from the VCF file, but I am wondering about the proper way to represent the reference allele and alternative allele columns for inversions, duplications and BNDs (and some SVs are quite long). My cuteSV VCF records for INV, DUP, and BND look like this:

chr1 43593622 cuteSV.INV.0 T INV . PASS PRECISE;SVTYPE=INV;SVLEN=594;END=43594216;RE=14;STRAND=++;RNAMES=NULL
chr1 875845 cuteSV.DUP.0 T DUP . PASS PRECISE;SVTYPE=DUP;SVLEN=536;END=876381;RE=12;STRAND=-+;RNAMES=NULL
chr1 883242 cuteSV.BND.0 N N]chr20:29789174] . PASS PRECISE;SVTYPE=BND;RE=12;RNAMES=NULL
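The INV and DUP records carry an END value in INFO, so a start/end region can be read off directly. BND records instead keep their partner locus inside the ALT string. Below is a small Python sketch that extracts the mate chromosome and position so they can be kept in the comment column of a "chr start end 0 0 info" line. The regular expression only covers the bracket forms shown in this thread and assumes contig names without colons; `parse_bnd_alt` and `bnd_to_region_line` are hypothetical helper names, and the `MATE=` comment tag is an illustrative choice.

```python
import re

# Matches the mate locus inside breakend ALT strings such as 'N]chr20:29789174]'
# or 'N[chr20:60003[' (the bracket direction encodes orientation).
_BND_MATE = re.compile(r"[\[\]]([^\[\]:]+):(\d+)[\[\]]")


def parse_bnd_alt(alt):
    """Return (mate_chrom, mate_pos) from a breakend ALT, or None if not a breakend."""
    match = _BND_MATE.search(alt)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def bnd_to_region_line(chrom, pos, sv_id, alt):
    """Build a single-position region line with the mate locus in the comment."""
    mate = parse_bnd_alt(alt)
    mate_text = f"{mate[0]}:{mate[1]}" if mate else "unknown"
    comment = f"{sv_id};MATE={mate_text}"
    return "\t".join([chrom, str(pos), str(pos), "0", "0", comment])
```

This describes only the side of the breakend at POS. Annotating the partner locus as well would need a second line built from the mate coordinates.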