Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 150 forks source link

VCF FORMAT field output in --tab #1688

Open NotAPoetButACriminal opened 4 weeks ago

NotAPoetButACriminal commented 4 weeks ago

The --tab output is a great way to visualize annotated variants, especially due to multiple lines used to represent overlapping transcripts, genes, features, etc, and the many options to control this with --pick. This is one of VEP's strongest features, and the --vcf option is lacking in comparison, shoehorning annotations in a format that's hard to parse.

The issue I have is retaining info from the input VCF when using --tab output.

The current way to achieve this is with the --custom flag and specifying the input VCF as a custom annotation file as well, allowing one to keep the info from the INFO and FILTER fields. While this works well, it seems a bit arbitrary to allow only these 2 fields. VCFs made using GATK will have site-level info (AC,AF,DP..) in the INFO column, but the sample-level info such as the AD field goes in the FILTER column.

One way to retain all the info would be to use the --vcf flag of course, but the --tab output is preferable for the previously stated reasons.

Is there a way to achieve inclusion of other fields from the vcf in the --tab output, FORMAT for example but maybe also QUAL and potential others. Perhaps it would be nice to have a flag like the --show_ref_allele that simply retains all or selected fields from the input vcf?

likhitha-surapaneni commented 3 weeks ago

Hi @NotAPoetButACriminal, is there an input file that you can provide us with?

NotAPoetButACriminal commented 3 weeks ago

I could give you an example, sure, I'm just using a whole exome vcf produced by Illumina's DRAGEN pipeline. Annotating the vcf as usual works fine, the issue here though is that the option I'm looking for does not really exist, which is to maintain the AD from the vcf FORMAT field in the --tab output.

Here's what the vcf looks like (without the header):

chr1 69270 . A G 131.79 PASS AC=2;AF=1;AN=2;DP=49;FS=0;MQ=12.86;MQRankSum=1.662;QD=2.69;ReadPosRankSum=0.955;SOR=6.399;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:1,48:0.9796:49:1,28:0,20:106:170,109,0:131.79,105.79,1.1481e-10:0,34.77,37.77:1,0,48,0:1,0,28,20 chr1 69511 . A G 519.64 PASS AC=2;AF=1;AN=2;DP=161;FS=0;MQ=30.47;QD=3.23;SOR=1.017;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,161:1:161:0,92:0,69:481:557,484,0:519.64,480.64,0:0,34.77,37.77:0,0,92,69:0,0,81,80 chr1 69761 . A T 16.87 PASS AC=1;AF=0.5;AN=2;DP=31;FS=0;MQ=17.25;MQRankSum=1.066;QD=0.54;ReadPosRankSum=2.024;SOR=0.791;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:22,9:0.2903:31:11,6:11,3:17:52,0,29:16.87,0.092953,32.093:0,34.77,37.77:6,16,2,7:7,15,3,6 chr1 69897 . T C 34.05 PASS AC=2;AF=1;AN=2;DP=23;FS=0;MQ=15.38;MQRankSum=1.176;QD=1.48;ReadPosRankSum=1.176;SOR=1.022;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:4,19:0.8261:23:2,10:2,9:7:71,9,0:34.052,7.0523,0.95572:0,34.77,37.77:2,2,11,8:2,2,10,9 chr1 924024 . C G 59.55 PASS AC=2;AF=1;AN=2;DP=4;FS=0;MQ=240.58;QD=14.89;SOR=0.693;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,4:1:4:0,3:0,1:10:97,12,0:59.55,9.5495,0.51064:0,34.77,37.77:0,0,2,2:0,0,1,3 chr1 924321 . C G 62.31 PASS AC=2;AF=1;AN=2;DP=5;FS=0;MQ=250;QD=12.46;SOR=1.981;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,5:1:5:0,0:0,5:12:100,15,0:62.31,12.31,0.26296:0,34.77,37.77:0,0,4,1:0,0,1,4 chr1 924533 . A G 12.68 PASS AC=2;AF=1;AN=2;DP=2;FS=0;MQ=250;QD=6.34;SOR=2.303;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,2:1:2:0,2:0,0:4:48,6,0:12.679,4.7174,2.157:0,34.77,37.77:0,0,0,2:0,0,0,2 chr1 942451 . T C 92.13 PASS AC=2;AF=1;AN=2;DP=15;FS=0;MQ=250;QD=6.14;SOR=1.911;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,15:1:15:0,6:0,9:42:130,45,0:92.132,42.132,0.0002658:0,34.77,37.78:0,0,4,11:0,0,9,6 chr1 946247 . G A 50 PASS AC=1;AF=0.5;AN=2;DP=42;FS=19.958;MQ=240.49;MQRankSum=2.581;QD=1.19;ReadPosRankSum=3.161;SOR=2.703;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:20,22:0.5238:42:7,13:13,9:48:85,0,50:50,6.6419e-05,52.762:0,34.77,37.77:12,8,4,18:11,9,9,13 chr1 946538 . G A 48.25 PASS AC=1;AF=0.5;AN=2;DP=20;FS=0;MQ=210.68;MQRankSum=1.119;QD=2.41;ReadPosRankSum=1.813;SOR=0.986;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:12,8:0.4:20:5,6:7,2:47:83,0,50:48.252,8.671e-05,53:0,34.77,37.77:4,8,2,6:7,5,4,4

As you can see the sample level info is in the FORMAT (GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB) and SAMPLE (0/1:12,8:0.4:20:5,6:7,2:47:83,0,50:48.252,8.671e-05,53:0,34.77,37.77:4,8,2,6:7,5,4,4) columns, which is common for all GATK produced VCFs. So if I would like to retain the AD field info in the --tab output, I can't, cause the --custom flag only allows the extraction of the INFO and FILTER fields, and not FORMAT info or for example the ID field.

likhitha-surapaneni commented 3 weeks ago

Hi @NotAPoetButACriminal ,

Thank you for providing us the details. I am able to replicate this. We will get back to you once we have an update.

dvg-p4 commented 3 weeks ago

Looks like this request is somewhat related to mine here: https://github.com/Ensembl/ensembl-vep/issues/1565#issuecomment-2153112084

schreyers commented 2 weeks ago

I came up with a work around for this if it helps. I basically combine the FORMAT column into the INFO column using gawk.

The script is:

'gawk \'BEGIN { OFS="\t" } { if ($0 ~ /^#/) { print; } else { info = $8; gsub(",", "#", info); format_field = $9; format_data = $10; gsub(",", "#", format_data); $8 = info ";FORMAT=" format_field "|" format_data; NF = 8; print; } }\'

The gsub() functions are just to fix commas, so you might be able to remove those lines for you.

NotAPoetButACriminal commented 1 week ago

thanks @schreyers, that's probably better than my oneliner. I had basically the same idea, transpose the AD field to the INFO column: cat <(bcftools view -h sample.vcf.gz | sed '20i##INFO=<ID=AD,Number=.,Type=String,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">') <(paste <(paste -d ";" <(bcftools view -H sample.vcf.gz | cut -f 1-8) <(bcftools view -H sample.vcf.gz | cut -f 10 | cut -d : -f 2 | sed 's/^/AD=/g' | sed 's#,#\/#g')) <(bcftools view -H sample.vcf.gz | cut -f 9,10)) | bgzip > sample.ADinINFO.vcf.gz