Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

VEP runs but resulting VCF contains no VEP annotated records when trying to annotate variants using T2T assembly #1757

Closed indapa closed 1 month ago

indapa commented 2 months ago

VEP runs successfully with no error, but output VCF only contains VCF header of input VCF

I was generally trying to follow the example here using VEP to annotate records from the T2T assembly.

System

Full VEP command line

vep --cache --cache_version 112 --compress_output bgzip --database 0 --dir_cache vep_cache --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa --force_overwrite --format vcf --input_file clinvar_20240624_GCA_009914755.4:chr20.vcf.gz --offline --output_file clinvar_20240624_GCA_009914755_annotated.vcf.gz --species homo_sapiens_gca009914755v4 --symbol --vcf

Data files (if applicable)

I thought maybe VEP couldn't find my cache directory. The vep_cache is set up as shown below, but maybe the 107_T2T dirname should be homo_sapiens_gca009914755v4?

Any advice is appreciated

vep_cache/
└── homo_sapiens_gca009914755v4
    └── 112_homo_sapiens_gca009914755v4
        └── 107_T2T-CHM13v2.0 

Output only contains header and no VEP annotated records:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status of germline classification for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=ONCDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB">
##INFO=<ID=ONCDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL">
##INFO=<ID=ONCDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=ONCDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=ONC,Number=.,Type=String,Description="Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ONCINCL,Number=.,Type=String,Description="Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=ONCREVSTAT,Number=.,Type=String,Description="ClinVar review status of oncogenicity classification for the Variation ID">
##INFO=<ID=ONCCONF,Number=.,Type=String,Description="Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=SCIDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB">
##INFO=<ID=SCIDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL">
##INFO=<ID=SCIDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for somatic clinial impact classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=SCIDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=SCIREVSTAT,Number=.,Type=String,Description="ClinVar review status of somatic clinical impact for the Variation ID">
##INFO=<ID=SCI,Number=.,Type=String,Description="Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=SCIINCL,Number=.,Type=String,Description="Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##contig=<ID=chr1,length=248387328,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr10,length=134758134,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr11,length=135127769,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr12,length=133324548,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr13,length=113566686,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr14,length=101161492,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr15,length=99753195,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr16,length=96330374,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr17,length=84276897,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr18,length=80542538,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr19,length=61707364,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr2,length=242696752,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr20,length=66210255,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr21,length=45090682,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr22,length=51324926,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr3,length=201105948,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr4,length=193574945,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr5,length=182045439,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr6,length=172126628,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr7,length=160567428,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr8,length=146259331,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chr9,length=150617247,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chrX,length=154259566,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##contig=<ID=chrY,length=62460029,assembly=Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz>
##liftOverProgram=CrossMap,version=0.6.6,website=https://crossmap.readthedocs.io/en/latest/
##liftOverChainFile=chain/grch38-T2T_CHM13_v2.chain
##originalFile=clinvar_20240624.vcf.gz
##targetRefGenome=fasta/Homo_sapiens_gca009914755v4.T2T_CHM13_v2.dna.primary_assembly.fa.gz
##liftOverDate=July26,2024
##contig=<ID=1>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=2>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=X>
##contig=<ID=Y>
##bcftools_viewVersion=1.20+htslib-1.20
##bcftools_viewCommand=view clinvar_20240624_GCA_009914755.4.vcf.gz 20; Date=Wed Sep 18 17:10:37 2024
##VEP="v112.0" API="v112" time="2024-09-18 17:12:51" cache="vep_cache/homo_sapiens_gca009914755v4/112_homo_sapiens_gca009914755v4" ensembl=112.7104005 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
##VEP-command-line='vep --cache --cache_version 112 --compress_output bgzip --database 0 --dir_cache vep_cache --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa --force_overwrite --format vcf --input_file clinvar_20240624_GCA_009914755.4:chr20.vcf.gz --offline --output_file clinvar_20240624_GCA_009914755_annotated.vcf.gz --species homo_sapiens_gca009914755v4 --symbol --vcf'
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

likhitha-surapaneni commented 1 month ago

Hi @indapa, sorry to hear that you are facing issues. When I replicate the steps here to download the cache, the directory hierarchy looks like homo_sapiens_gca009914755v4/107_T2T-CHM13v2.0/. With this structure, the command may have to be modified to use --cache_version 107. Please let us know if you still face the same issue after re-downloading the cache following the documentation.
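
As a rough way to check which value --cache_version should take, the version prefix can be read off the directory names under the species folder. The sketch below assumes the cache directories are named <version>_<assembly> (as in 107_T2T-CHM13v2.0 above); list_cache_versions is a hypothetical helper, not part of VEP.

```shell
# Hypothetical helper (not a VEP command): print the version prefix and the
# full name of each cache directory installed for one species.
# Usage: list_cache_versions <dir_cache> <species>
list_cache_versions() (
    dir_cache="$1"
    species="$2"
    for d in "$dir_cache/$species"/*/; do
        # Skip the unexpanded pattern when the species folder is empty or missing.
        [ -d "$d" ] || continue
        name="$(basename "$d")"
        # Assumption: directory names look like <version>_<assembly>,
        # so everything before the first underscore is the cache version.
        printf '%s\t%s\n' "${name%%_*}" "$name"
    done
)
```

For example, `list_cache_versions vep_cache homo_sapiens_gca009914755v4` prints one line per installed cache directory; the first column is the number to compare against --cache_version.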

indapa commented 1 month ago

Thanks @likhitha-surapaneni, I didn't realize I had the wrong cache_version. Also, I had added an extra folder to my VEP cache. The corrected layout is:

vep_cache/
└── homo_sapiens_gca009914755v4
    └── 107_T2T-CHM13v2.0

This code works great now, thank you!

vep --input_file ${vcf} \
        --output_file ${vcf.simpleName}_annotated.vcf.gz \
        --format vcf \
        --vcf \
        --compress_output bgzip \
        --cache_version 107 \
        --dir_cache ${vep_cache} \
        --fasta ${reference} \
        --offline \
        --symbol \
        --species homo_sapiens_gca009914755v4 \
        --force_overwrite
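
Since the original failure was silent (a header-only VCF and no error), a quick record count on the output can catch it. This is a minimal sketch that assumes the output is gzip-compatible (bgzip output is) and uses a hypothetical helper name, count_vcf_records.

```shell
# Hypothetical helper: count the data records (non-header lines) in a
# gzip- or bgzip-compressed VCF.
# Usage: count_vcf_records <file.vcf.gz>
count_vcf_records() {
    # grep -c still prints 0 when nothing matches but exits non-zero,
    # so "|| true" keeps the function's exit status clean.
    gzip -dc "$1" | grep -vc '^#' || true
}
```

Running `count_vcf_records clinvar_20240624_GCA_009914755_annotated.vcf.gz` on the header-only output from the first post would report 0, which signals that no variants were written.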