Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
453 stars 152 forks source link

All variants are intergenic, upstream, or downstream with NCBI GFF3 #1003

Closed emjbishop closed 2 years ago

emjbishop commented 3 years ago

Hello, I am trying to annotate a custom VCF using NCBI's GFF3 and FASTA for the bacteria Mycobacterium tuberculosis (https://www.ncbi.nlm.nih.gov/genome/?term=Mycobacterium+tuberculosis+H37Rv) and I find that even variants within genes are being labelled "intergenic_variant."

Eventually I will use my own GFF3 but for now I'm just trying to get any GFF3 cache working, and I could only find help for GFF/GTF issues.

Thank you, Emma

System

Full VEP command line

(/home/ebishop/anaconda) [06/03 16:52:57] ebishop@valafar18:~/workspace/ensembl-vep$ ./vep --fasta /home/ebishop/projects/database/vep/GCF_000195955.2_ASM19595v2_genomic.fna.gz --gff /home/ebishop/projects/database/vep/GCF_000195955.gff.gz -i /grp/valafar/data/depot/arrow/4-0019.vcf.gz -o /home/ebishop/projects/database/vep/arrow/vep_output.txt --synonyms /home/ebishop/projects/database/vep/synonyms.txt

Full error message

WARNING: Unable to determine biotype of rna-RVnc0008
WARNING: Unable to determine biotype of rna-RVnc0002
WARNING: Unable to determine biotype of rna-RVnc0005
WARNING: Unable to determine biotype of rna-RVnc0010
WARNING: Unable to determine biotype of rna-RVnc0012
WARNING: Unable to determine biotype of rna-RVnc0036
WARNING: Unable to determine biotype of rna-RVnc0035
WARNING: Unable to determine biotype of rna-RVnc0021
WARNING: Unable to determine biotype of rna-RVnc0047
WARNING: Unable to determine biotype of rna-RVnc0034
WARNING: Unable to determine biotype of rna-RVnc0015
WARNING: Unable to determine biotype of rna-RVnc0013
WARNING: Unable to determine biotype of rna-RVnc0018
WARNING: Unable to determine biotype of rna-RVnc0001
WARNING: Unable to determine biotype of rna-RVnc0024
WARNING: Unable to determine biotype of rna-RVnc0003
WARNING: Unable to determine biotype of rna-RVnc0004
WARNING: Unable to determine biotype of rna-RVnc0006
WARNING: Unable to determine biotype of rna-RVnc0036a
WARNING: Unable to determine biotype of rna-RVnc0040

Data files

A sample of the GFF3 after:

grep -v "#" GCF_000195955.2_ASM19595v2_genomic.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > GCF_000195955.gff.gz 
tabix -p gff GCF_000195955.gff.gz
NC_000962.3     RefSeq  CDS     1       1524    .       +       0       ID=cds-NP_214515.1;Parent=gene-Rv0001;Dbxref=Genbank:NP_214515.1,GeneID:885041;Name=NP_214515.1;experiment=DESCRIPTION:Mutation analysis%2C gene expression[PMID: 10375628];gbkey=CDS;gene=dnaA;inference=protein motif:PROSITE:PS01008;locus_tag=Rv0001;product=chromosomal replication initiator protein DnaA;protein_id=NP_214515.1;transl_table=11
NC_000962.3     RefSeq  gene    1       1524    .       +       .       ID=gene-Rv0001;Dbxref=GeneID:885041;Name=dnaA;experiment=DESCRIPTION:Mutation analysis%2C gene expression[PMID: 10375628];gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=Rv0001
NC_000962.3     RefSeq  region  1       4411532 .       +       .       ID=NC_000962.3:1..4411532;Dbxref=taxon:83332;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=H37Rv;type-material=type strain of Mycobacterium tuberculosis
NC_000962.3     RefSeq  CDS     2052    3260    .       +       0       ID=cds-NP_214516.1;Parent=gene-Rv0002;Dbxref=Genbank:NP_214516.1,GeneID:887092;Name=NP_214516.1;experiment=EXISTENCE:Mass spectrometry[PMID:20429878];gbkey=CDS;gene=dnaN;locus_tag=Rv0002;product=DNA polymerase III subunit beta;protein_id=NP_214516.1;transl_table=11
NC_000962.3     RefSeq  gene    2052    3260    .       +       .       ID=gene-Rv0002;Dbxref=GeneID:887092;Name=dnaN;gbkey=Gene;gene=dnaN;gene_biotype=protein_coding;locus_tag=Rv0002
NC_000962.3     RefSeq  CDS     3280    4437    .       +       0       ID=cds-NP_214517.1;Parent=gene-Rv0003;Dbxref=Genbank:NP_214517.1,GeneID:887089;Name=NP_214517.1;Note=single-stranded DNA-binding protein;experiment=EXISTENCE:Mass spectrometry[PMID:15525680];gbkey=CDS;gene=recF;inference=protein motif:PROSITE:PS00618;locus_tag=Rv0003;product=DNA replication/repair protein RecF;protein_id=NP_214517.1;transl_table=11
NC_000962.3     RefSeq  gene    3280    4437    .       +       .       ID=gene-Rv0003;Dbxref=GeneID:887089;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=Rv0003
NC_000962.3     RefSeq  CDS     4434    4997    .       +       0       ID=cds-NP_214518.1;Parent=gene-Rv0004;Dbxref=Genbank:NP_214518.1,GeneID:887088;Name=NP_214518.1;gbkey=CDS;locus_tag=Rv0004;product=hypothetical protein;protein_id=NP_214518.1;transl_table=11

A sample of the compressed VCF:

##fileformat=VCFv4.2
##fileDate=20200904
##source=GenomicConsensusV2.3.3
##reference=file:///grp/valafar/resources/H37Rv-NC_000962.3.fasta
##contig=<ID=1,length=4411532>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FILTER=<ID=q40,Description="Quality below 40">
##FILTER=<ID=c5,Description="Coverage below 5">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  4-0019
1       1701    .       T       C       93      PASS    DP=131  GT      1
1       1821    .       G       T       93      PASS    DP=133  GT      1
1       1977    .       A       G       93      PASS    DP=134  GT      1
1       4013    .       T       C       93      PASS    DP=122  GT      1
1       7362    .       G       C       93      PASS    DP=111  GT      1

I believe the SNP at position 4013 should not be labelled as an intergenic variant:

## ENSEMBL VARIANT EFFECT PREDICTOR v104.3
## Output produced at 2021-06-03 16:53:05
## Using API version 104, DB version ?
## ensembl-funcgen version 104
## ensembl version 104
## ensembl-io version 104
## ensembl-variation version 104
## Column descriptions:
## Uploaded_variation : Identifier of uploaded variant
## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
## Allele : The variant allele used to calculate the consequence
## Gene : Stable ID of affected gene
## Feature : Stable ID of feature
## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
## Consequence : Consequence type
## cDNA_position : Relative position of base pair in cDNA sequence
## CDS_position : Relative position of base pair in coding sequence
## Protein_position : Relative position of amino acid in protein
## Amino_acids : Reference and variant amino acids
## Codons : Reference and variant codon sequence
## Existing_variation : Identifier(s) of co-located known variants
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## SOURCE : Source of transcript
## GCF_000195955.gff.gz : /home/ebishop/projects/database/vep/GCF_000195955.gff.gz (overlap)
#Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      Extra
1_1701_T/C      1:1701  C       -       -       -       intergenic_variant      -       -       -       -       -       -       IMPACT=MODIFIER
1_1821_G/T      1:1821  T       -       -       -       intergenic_variant      -       -       -       -       -       -       IMPACT=MODIFIER
1_1977_A/G      1:1977  G       -       -       -       intergenic_variant      -       -       -       -       -       -       IMPACT=MODIFIER
1_4013_T/C      1:4013  C       -       -       -       intergenic_variant      -       -       -       -       -       -       IMPACT=MODIFIER
1_7362_G/C      1:7362  C       2700464 rna-Rvnt01      Transcript      upstream_gene_variant   -       -       -       -       -       -       IMPACT=MODIFIER;DISTANCE=3525;STRAND=1;SOURCE=GCF_000195955.gff.gz
dglemos commented 3 years ago

Hello @bishemma1, Thank you for explaining the issue with all the examples. Some of the features of the GFF file don't have a known transcript, example:

NC_000962.3     RefSeq  CDS     3280    4437    .       +       0       ID=cds-NP_214517.1;Parent=gene-Rv0003;Dbxref=Genbank:NP_214517.1,GeneID:887089;Name=NP_214517.1;Note=single-stranded DNA-binding protein;experiment=EXISTENCE:Mass spectrometry[PMID:15525680];gbkey=CDS;gene=recF;inference=protein motif:PROSITE:PS00618;locus_tag=Rv0003;product=DNA replication/repair protein RecF;protein_id=NP_214517.1;transl_table=11
NC_000962.3     RefSeq  gene    3280    4437    .       +       .       ID=gene-Rv0003;Dbxref=GeneID:887089;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=Rv0003

CDS is not attached to any transcript, this could be the reason why some variants are annotated as intergenic. We will have another look at your example and will let you know more.

Best wishes, Diana

dglemos commented 2 years ago

Hello @bishemma1, The reason why some variants are being annotated as intergenic even when they overlap a gene is that bacterial genomes in NCBI are annotated with only CDS. In the GFF file the CDS is directly attached to the gene without any transcript and exons. A workaround is to edit the file to link the gene to the transcript and the exons.

emjbishop commented 2 years ago

Hi @dglemos, thank you for your reply. I did attempt converting using some of the tools listed here and through manual edits but it doesn't seem like an easy fix. I think it could be done though given enough trial and error. Might I suggest not listing GFF3 files as supported at this time, or at least not for genomic data? Thank you again.