brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
356 stars 55 forks source link

Collecting GRCh38 annotations #26

Closed komalsrathi closed 8 years ago

komalsrathi commented 8 years ago

Hi,

As far as I understand, if I were to use GRCh38 reference for my VCF, the corresponding hg38 annotations would have to be downloaded and probably saved in the /path/to/gemini/data/gemini_data/ which is path to all default hg19 annotations. Then I would use it with vcfanno, like this:

# download hg38 clinvar annotation
export GEMINI_ANNO=/path/to/gemini_data/hg38_annotations
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20160531.vcf.gz -O $GEMINI_ANNO/clinvar_20160531.vcf.gz

# run vcfanno
export VCF=/path/to/snpeff-annotated-sample.vcf
export lua=/path/to/test.lua
export conf=/path/to/conf
vcfanno -p 4 -base-path $GEMINI_ANNO -lua $lua $conf $VCF | bgzip -c > anno.vcf.gz

Is this correct or is there any other (faster & better) way the annotations can be collected and used in vcfanno? When I google, I keep getting references to Annovar but I don't know how to go about it. To get the hg38 clinvar annotation, I tried the command below but I am kind of stuck beyond this:

gemini_conda install -c ggd-alpha hg38-clinvar=20150330.2

This installed a package in /path/to/gemini/data/anaconda. This location is different than the location of default hg19 annotations i.e. /path/to/gemini/data/gemini_data.

# this is the directory structure of the package:
hg38-clinvar-20150330.2-2
    ├── bin
    └── info
        ├── files
        ├── index.json
        ├── recipe
        └── recipe.json

Somewhat similar to this discussion, I noticed that the default hg19 annotations which are obtained with gemini installation have names such as clinvar_20160203.tidy.vcf.gz, cosmic-v68-GRCh37.tidy.vcf.gz and ExAC.r0.3.sites.vep.tidy.vcf.gz. Does tidy mean they have been processed in some way and that the user has to do the same when they download annotations other than hg19?

Thank you for your patience in responding to my questions!

brentp commented 8 years ago

So, answers in arbitrary order:

For now, I'd recommend that you just wget the few hg38 files that you are interested in, then vt decompose and vt normalize those and use them as your annotations.

Does that make sense.

komalsrathi commented 8 years ago

Yes. That was going to be my default approach so sounds good.

But I am little confused now. This is my entire workflow:

# input - sample.vcf
# Decompose and normalize sample.vcf using vt
vt decompose -s -o sample.de.vcf sample.vcf
vt normalize -r GRCh38.fa -o sample.de.nm.vcf sample.de.vcf

# Annotate using snpEff
snpEff -Xmx32g -Xms16g -Djava.io.tmpdir=/tmp -c snpEff.GRCh38.config -ud 10 -classic GRCh38.82 sample.de.nm.vcf > sample.de.nm.snpeff.vcf

# Run vcfanno.
export VCF=sample.de.nm.snpeff.vcf
vcfanno -p 4 -base-path $GEMINI_ANNO -lua test.lua test.conf $VCF | bgzip -c > anno.vcf.gz

# Run gemini 
gemini_python vcf2db.py --legacy-compression anno.vcf.gz $PED my.db

As you can see, I decompose and normalize my sample VCF. Are you saying we need to do it for both the annotation vcfs as well as the sample vcfs?

Thanks!

brentp commented 8 years ago

Yes, I would do:

vt decompose -s $VCF \
    | vt normalize -r $REF - \
    | bgzip -c > $NEW_VCF
tabix $NEW_VCF

for the query and all annotation VCFs. Note that for your query VCF from GATK, you'll likely want to do:

zcat $VCF \
    | sed 's/ID=AD,Number=./ID=AD,Number=R/' \
    | vt ...

so that the alternate allele counts are handled correctly in vt.

From there, what you have looks right to me: snpeff on just the query, then vcfanno, then vcf2db.

komalsrathi commented 8 years ago

Oh yes, I do that too for the sample VCF - just wanted to be brief. Thanks a lot!!