Hello @bishemma1, Thank you for explaining the issue with all the examples. Some of the features in the GFF file don't have a known transcript, for example:
NC_000962.3 RefSeq CDS 3280 4437 . + 0 ID=cds-NP_214517.1;Parent=gene-Rv0003;Dbxref=Genbank:NP_214517.1,GeneID:887089;Name=NP_214517.1;Note=single-stranded DNA-binding protein;experiment=EXISTENCE:Mass spectrometry[PMID:15525680];gbkey=CDS;gene=recF;inference=protein motif:PROSITE:PS00618;locus_tag=Rv0003;product=DNA replication/repair protein RecF;protein_id=NP_214517.1;transl_table=11
NC_000962.3 RefSeq gene 3280 4437 . + . ID=gene-Rv0003;Dbxref=GeneID:887089;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=Rv0003
The CDS is not attached to any transcript, which could be the reason why some variants are annotated as intergenic. We will have another look at your example and let you know more.
Best wishes, Diana
Hello @bishemma1, The reason why some variants are being annotated as intergenic even when they overlap a gene is that bacterial genomes in NCBI are annotated with CDS features only. In the GFF file, each CDS is attached directly to its gene, with no transcript or exon features in between. A workaround is to edit the file so that each gene has a transcript, and the exons and CDS are linked to that transcript.
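The workaround described above could be scripted roughly as follows. This is only a minimal sketch, not an official VEP tool, and it rests on several assumptions: the input is tab-separated GFF3, each gene has a single contiguous CDS (so the CDS coordinates can be reused for the transcript and exon), gene IDs start with `gene-` as in the RefSeq example, and VEP needs a `biotype` attribute on the transcript. The function name `add_transcripts` and the `rna-` and `exon-` ID prefixes are hypothetical choices made for illustration.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def format_attributes(attrs):
    """Serialise a dict back into a GFF3 attribute column."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def add_transcripts(lines):
    """Yield GFF3 lines, inserting an mRNA and an exon before each CDS
    whose Parent is a gene, and re-parenting that CDS to the new mRNA."""
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, malformed lines and non-CDS features.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("Parent", "")
        # Leave CDS features alone if they are not attached directly to a gene.
        if not gene_id.startswith("gene-"):
            yield line
            continue
        locus = gene_id[len("gene-"):]
        rna_id = f"rna-{locus}"  # hypothetical ID scheme
        # Assumption: VEP expects a biotype on the transcript feature.
        mrna_attrs = {"ID": rna_id, "Parent": gene_id, "biotype": "protein_coding"}
        exon_attrs = {"ID": f"exon-{locus}", "Parent": rna_id}
        for ftype, fattrs in (("mRNA", mrna_attrs), ("exon", exon_attrs)):
            # Reuse the CDS coordinates and strand; phase is "." for non-CDS rows.
            yield "\t".join(
                cols[:2] + [ftype] + cols[3:5] + [".", cols[6], ".", format_attributes(fattrs)]
            )
        attrs["Parent"] = rna_id
        yield "\t".join(cols[:8] + [format_attributes(attrs)])
```

To use it, one could open the original GFF3 as a text file, pass the file object to `add_transcripts`, and write each yielded line to a new file. The result would still need to be sorted, bgzip-compressed and tabix-indexed before being passed to VEP. Genes with multi-segment CDS features would need extra handling that this sketch does not attempt.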
Hi @dglemos, thank you for your reply. I did attempt converting using some of the tools listed here and through manual edits but it doesn't seem like an easy fix. I think it could be done though given enough trial and error. Might I suggest not listing GFF3 files as supported at this time, or at least not for genomic data? Thank you again.
Hello, I am trying to annotate a custom VCF using NCBI's GFF3 and FASTA for the bacterium Mycobacterium tuberculosis (https://www.ncbi.nlm.nih.gov/genome/?term=Mycobacterium+tuberculosis+H37Rv), and I find that even variants within genes are being labelled "intergenic_variant".
Eventually I will use my own GFF3 but for now I'm just trying to get any GFF3 cache working, and I could only find help for GFF/GTF issues.
Thank you, Emma
System
Full VEP command line
Full error message
Data files
A sample of the GFF3 after:
A sample of the compressed VCF:
I believe the SNP at position 4013 should not be labelled as an intergenic variant: