brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
357 stars 55 forks source link

CADD score not visible in the vcf annotated (similar to #137 ) #147

Closed MikeHala closed 2 years ago

MikeHala commented 2 years ago

Hi,

I am trying to annotate my VCF file (biallelic SNPs only, edited to remove the chr prefix of the chromosome name for each variant to match the CADD naming convention) with CADD v1.6 and running to problems as in issue #137 , i.e. CADD_phred is included in the header but no annotations were added to the variant INFO field.

I have re-indexed the CADD file with tabix (as suggested in #137), but still no luck.

My CADD.conf file:

[[annotation]]
file="/full/path/to/whole_genome_SNVs.tsv.gz"
names=["CADD_phred"]
ops=["mean"]
columns=[6]

I have also tried replacing columns=[6] with fields=["PHRED"].

Here the first lines of the CADD file : whole_genome_SNVs.tsv.gz

## CADD GRCh38-v1.6 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health 2013-2020. All rights reserved.
#Chrom  Pos     Ref     Alt     RawScore        PHRED
1       10001   T       A       0.702541        8.478
1       10001   T       C       0.750954        8.921
1       10001   T       G       0.719549        8.634
1       10002   A       C       0.713993        8.583
1       10002   A       G       0.743661        8.854
1       10002   A       T       0.700507        8.460
1       10003   A       C       0.714485        8.588
1       10003   A       G       0.744152        8.859

Here is an example variants from the query file 1 34165788 chr1_34165788_C_G C G 46 PASS CSQ=G|missense_variant&NMD_transcript_variant|MODERATE|CSMD2|ENSG00000121904|Transcript|ENST00000241312|nonsense_mediated_decay|1/70||||55|26|9|G/A|gGc/gCc|||-1||HGNC|HGNC:19290|||||,G|upstream_gene_variant|MODIFIER|C1orf94|ENSG00000142698|Transcript|ENST00000373374|protein_coding|||||||||||1095|1||HGNC|HGNC:28250|||||,G|upstream_gene_variant|MODIFIER|CSMD2|ENSG00000121904|Transcript|ENST00000373381|protein_coding|||||||||||514|-1||HGNC|HGNC:19290|YES||||,G|missense_variant|MODERATE|CSMD2|ENSG00000121904|Transcript|ENST00000373388|protein_coding|1/70||||55|26|9|G/A|gGc/gCc|||-1||HGNC|HGNC:19290|||||,G|missense_variant|MODERATE|CSMD2|ENSG00000121904|Transcript|ENST00000619121|protein_coding|1/71||||55|26|9|G/A|gGc/gCc|||-1||HGNC|HGNC:19290|||||;AF=0.00119988;AN=20002;AC=24;<20x AN_subpop values>;<20x AF_subpop values>;<20x AC_subpop values> GT 0/0 .....

Command I used: time vcfanno_linux64 -p 2 CADD.conf query.vcf > query.CADD.vcf

Output of the tool:

=============================================
vcfanno version 0.3.3 [built with go1.16.5]
see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:116: found 1 sources from 1 files
vcfanno.go:157: falling back to non-bgzip
vcfanno.go:250: annotated 28 variants in 0.79 seconds (35.5 / second)

Output of tabix /full/path/to/whole_genome_SNVs.tsv.gz 1:34165788-34165789 is

1       34165788        C       A       0.675786        8.231
1       34165788        C       G       0.693038        8.391
1       34165788        C       T       0.735310        8.778
1       34165789        C       A       0.053260        1.705
1       34165789        C       G       0.070512        1.840
1       34165789        C       T       0.112784        2.208

Any suggestions what I am doing wrong would be greatly appreciated!

Best, Mike

brentp commented 2 years ago

I can't recreate this below is what I do. I suspect your file is index incorrectly. Note how I index below:

wget -qO - https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz | zcat - | head -2000 | bgzip -c > whole_genome_SNVs.tsv.gz
tabix -f -b 2 -e 2 -s 1 whole_genome_SNVs.tsv.gz

echo -e "##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO
1 10110 id  A G 46  PASS  ." > x.vcf

echo '
[[annotation]]
file="whole_genome_SNVs.tsv.gz"
names=["CADD_phred"]
ops=["mean"]
columns=[6] ' > bug.conf

vcfanno bug.conf x.vcf | grep -v ^#

if you edit this in such a way that you can demonstrate the bug, then we can track it down.

BTW, there is no need to add/remove "chr" as vcfanno will match "chr1" with "1", for example.

brentp commented 2 years ago

here it is in a file so you don't need to worry about the tabs bug.sh.gz

MikeHala commented 2 years ago

Thanks for the quick reply, Brent!

It seems to work now (note for future readers - the toy VCF above should be tab, not space separated)

=============================================
vcfanno version 0.3.3 [built with go1.16.5]
see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:116: found 1 sources from 1 files
vcfanno.go:157: falling back to non-bgzip
vcfanno.go:250: annotated 1 variants in 0.00 seconds (366.7 / second)
1       10110   id      A       G       46.0    PASS    AC=1;AF=0.5;AN=2;CADD_phred=8.752

Will re-index the CADD file as you do and will let you know how it goes!

Thanks again, Mike

MikeHala commented 2 years ago

Happy to confirm that after re-compressing the CADD v1.6 scores file with bgzip and tabix indexing it following your recipe it works as expected!

Thanks again for your help! Mike

brentp commented 2 years ago

great! thanks for following up.