dzc0104 commented 7 months ago

Hi, I am attempting to annotate a custom VCF file using NCBI's GFF and FASTA (.fna) files for the Newcastle disease virus (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_004786615.1/). However, all the variants are being classified as intergenic, which does not match what I see when I view them in IGV.

System

VEP version: 104.3
VEP Cache version: N/A
Perl version: N/A
OS: Linux
tabix installed

Script

# To install the bgzip and tabix (I did it in my local terminal)
# Download htslib-1.19.1.tar.gz
tar -zxvf htslib-1.19.1.tar.gz
cd htslib-1.19.1

# removing header line of gff as vep does not work with files having header line (local terminal)
grep -v '^#' genomic.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip > genomic.gff.gz
tabix -p gff genomic.gff.gz

# for compressing fasta file (local terminal and transfer all the files in super computer later)
bgzip -c GCF_004786615.1_ASM478661v1_genomic.fna > GCF_004786615.1_ASM478661v1_genomic.fna.gz

# for indexing fasta file
samtools faidx GCF_004786615.1_ASM478661v1_genomic.fna.gz

# creating a synonyms file that maps the chromosome names used in your VCF to those used in your GFF file
zcat iso1_filtered.snp.vcf.gz | grep -v '^#' | sort -k1,1 -o sorted_iso1.vcf
cut -f1 sorted_iso10.vcf > 1snpsynonyms.txt
zcat genomic.gff.gz | grep -v '^#' | sort -k1,1 -o sorted.gff

# variants annotation for snp using ASM4786615.1
vep -i iso1_filtered.snp.vcf.gz --gff /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz --fasta /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/GCF_004786615.1_ASM478661v1_genomic.fna.gz --synonyms 1snpsynonyms.txt --species avian_orthoavulavirus

Full error message

I did not get any warning message when the script ran, but every variant in the output file was annotated as intergenic.

Data files

A sample of the GFF (after removing the header lines)

NC_075404.1 RefSeq region 1 15186 . + . ID=NC_075404.1:1..15186;Dbxref=taxon:2560319;country=United Kingdom: N. Ireland;gbkey=Src;genome=genomic;isolate=chicken/N. Ireland/Ulster/67;mol_type=genomic RNA;old-name=Newcastle disease virus
NC_075404.1 RefSeq gene 56 1801 . + . ID=gene-QKC91_gp1;Dbxref=GeneID:80527638;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=QKC91_gp1
NC_075404.1 RefSeq CDS 122 1591 . + 0 ID=cds-YP_010790286.1;Parent=gene-QKC91_gp1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1 RefSeq gene 1804 3254 . + . ID=gene-QKC91_gp2;Dbxref=GeneID:80527633;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding;locus_tag=QKC91_gp2
.....

A sample of the compressed VCF

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT iso1
NODE_1_length_6008_cov_909.877255 980 . T C 12078.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.924;DP=624;ExcessHet=0.0000;FS=1.120;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=19.87;ReadPosRankSum=0.149;SOR=0.728 GT:AD:DP:GQ:PL 0/1:236,372:608:99:12086,0,6929
NODE_1_length_6008_cov_909.877255 3666 . C T 15573.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.079;DP=770;ExcessHet=0.0000;FS=7.765;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.88;ReadPosRankSum=0.795;SOR=0.362 GT:AD:DP:GQ:PL 0/1:235,511:746:99:15581,0,5829
NODE_1_length_6008_cov_909.877255 3812 . A G 534.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=1.096;DP=826;ExcessHet=0.0000;FS=15.515;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=0.66;ReadPosRankSum=-12.298;SOR=2.487 GT:AD:DP:GQ:PL 0/1:722,85:807:99:542,0,23105
NODE_1_length_6008_cov_909.877255 4631 . T C 1817.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=-3.725;DP=846;ExcessHet=0.0000;FS=22.208;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.24;ReadPosRankSum=-13.945;SOR=1.685 GT:AD:DP:GQ:PL 0/1:680,133:813:99:1825,0,21905
NODE_2_length_2668_cov_848.858356 289 . G A 924.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.811;DP=720;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.97;MQRankSum=0.000;QD=1.50;ReadPosRankSum=-5.861;SOR=0.631 GT:AD:DP:GQ:PL 0/1:531,87:618:99:932,0,16256
.....

Synonyms text file format

NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
.....

VEP output

## ENSEMBL VARIANT EFFECT PREDICTOR v104.3
## Output produced at 2024-02-09 19:23:53
## Using API version 104, DB version ?
## ensembl-funcgen version 104.f1c7762
## ensembl-io version 104.1d3bb6e
## ensembl version 104.1af1dce
## ensembl-variation version 104.20f5335
## Column descriptions:
## Uploaded_variation : Identifier of uploaded variant
## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
## Allele : The variant allele used to calculate the consequence
## Gene : Stable ID of affected gene
## Feature : Stable ID of feature
## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
## Consequence : Consequence type
## cDNA_position : Relative position of base pair in cDNA sequence
## CDS_position : Relative position of base pair in coding sequence
## Protein_position : Relative position of amino acid in protein
## Amino_acids : Reference and variant amino acids
## Codons : Reference and variant codon sequence
## Existing_variation : Identifier(s) of co-located known variants
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## SOURCE : Source of transcript
## genomic.gff.gz : /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz (overlap)
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
NODE_1_length_6008_cov_909.877255_980_T/C NODE_1_length_6008_cov_909.877255:980 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3666_C/T NODE_1_length_6008_cov_909.877255:3666 T - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3812_A/G NODE_1_length_6008_cov_909.877255:3812 G - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_4631_T/C NODE_1_length_6008_cov_909.877255:4631 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
....

nuno-agostinho commented 7 months ago

Hey @dzc0104,

Thank you for your question. The problem is related to using the NCBI GTF/GFF annotation for microorganisms: we currently require the GTF/GFF annotation to explicitly describe the transcript and its exons.

For your use case, you could use the following modified annotation:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM478661v1
#!genome-build-accession NCBI_Assembly:GCF_004786615.1
##sequence-region NC_075404.1 1 15186
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2560319
NC_075404.1 RefSeq  region  1   15186   .   +   .   ID=NC_075404.1:1..15186;Dbxref=taxon:2560319;country=United Kingdom: N. Ireland;gbkey=Src;genome=genomic;isolate=chicken/N. Ireland/Ulster/67;mol_type=genomic RNA;old-name=Newcastle disease virus
NC_075404.1 RefSeq  gene    56  1801    .   +   .   ID=gene-QKC91_gp1;Dbxref=GeneID:80527638;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=QKC91_gp1
NC_075404.1 RefSeq  transcript  122 1591    .   +   0   ID=transcript-YP_010790286.1;Parent=gene-QKC91_gp1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1 RefSeq  exon    122 1591    .   +   0   ID=exon-YP_010790286.1;Parent=transcript-YP_010790286.1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1 RefSeq  gene    1804    3254    .   +   .   ID=gene-QKC91_gp2;Dbxref=GeneID:80527633;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding;locus_tag=QKC91_gp2
NC_075404.1 RefSeq  transcript  1887    3074    .   +   0   ID=transcript-YP_010790287.1;Parent=gene-QKC91_gp2;Dbxref=GenBank:YP_010790287.1,GeneID:80527633;Name=YP_010790287.1;gbkey=CDS;gene=P;locus_tag=QKC91_gp2;product=phosphoprotein;protein_id=YP_010790287.1
NC_075404.1 RefSeq  exon    1887    3074    .   +   0   ID=exon-YP_010790287.1;Parent=transcript-YP_010790287.1;Dbxref=GenBank:YP_010790287.1,GeneID:80527633;Name=YP_010790287.1;gbkey=CDS;gene=P;locus_tag=QKC91_gp2;product=phosphoprotein;protein_id=YP_010790287.1
NC_075404.1 RefSeq  gene    3256    4496    .   +   .   ID=gene-QKC91_gp3;Dbxref=GeneID:80527634;Name=M;gbkey=Gene;gene=M;gene_biotype=protein_coding;locus_tag=QKC91_gp3
NC_075404.1 RefSeq  transcript  3290    4384    .   +   0   ID=transcript-YP_010790288.1;Parent=gene-QKC91_gp3;Dbxref=GenBank:YP_010790288.1,GeneID:80527634;Name=YP_010790288.1;gbkey=CDS;gene=M;locus_tag=QKC91_gp3;product=matrix protein;protein_id=YP_010790288.1
NC_075404.1 RefSeq  exon    3290    4384    .   +   0   ID=exon-YP_010790288.1;Parent=transcript-YP_010790288.1;Dbxref=GenBank:YP_010790288.1,GeneID:80527634;Name=YP_010790288.1;gbkey=CDS;gene=M;locus_tag=QKC91_gp3;product=matrix protein;protein_id=YP_010790288.1
NC_075404.1 RefSeq  gene    4498    6289    .   +   .   ID=gene-QKC91_gp4;Dbxref=GeneID:80527635;Name=F;gbkey=Gene;gene=F;gene_biotype=protein_coding;locus_tag=QKC91_gp4
NC_075404.1 RefSeq  transcript  4544    6205    .   +   0   ID=transcript-YP_010790289.1;Parent=gene-QKC91_gp4;Dbxref=GenBank:YP_010790289.1,GeneID:80527635;Name=YP_010790289.1;gbkey=CDS;gene=F;locus_tag=QKC91_gp4;product=fusion protein;protein_id=YP_010790289.1
NC_075404.1 RefSeq  exon    4544    6205    .   +   0   ID=exon-YP_010790289.1;Parent=transcript-YP_010790289.1;Dbxref=GenBank:YP_010790289.1,GeneID:80527635;Name=YP_010790289.1;gbkey=CDS;gene=F;locus_tag=QKC91_gp4;product=fusion protein;protein_id=YP_010790289.1
NC_075404.1 RefSeq  gene    6321    8322    .   +   .   ID=gene-QKC91_gp5;Dbxref=GeneID:80527636;Name=HN;gbkey=Gene;gene=HN;gene_biotype=protein_coding;locus_tag=QKC91_gp5
NC_075404.1 RefSeq  transcript  6412    8262    .   +   0   ID=transcript-YP_010790290.1;Parent=gene-QKC91_gp5;Dbxref=GenBank:YP_010790290.1,GeneID:80527636;Name=YP_010790290.1;gbkey=CDS;gene=HN;locus_tag=QKC91_gp5;product=hemagglutinin-neuraminidase;protein_id=YP_010790290.1
NC_075404.1 RefSeq  exon    6412    8262    .   +   0   ID=exon-YP_010790290.1;Parent=transcript-YP_010790290.1;Dbxref=GenBank:YP_010790290.1,GeneID:80527636;Name=YP_010790290.1;gbkey=CDS;gene=HN;locus_tag=QKC91_gp5;product=hemagglutinin-neuraminidase;protein_id=YP_010790290.1
NC_075404.1 RefSeq  gene    8370    15072   .   +   .   ID=gene-QKC91_gp6;Dbxref=GeneID:80527637;Name=L;gbkey=Gene;gene=L;gene_biotype=protein_coding;locus_tag=QKC91_gp6
NC_075404.1 RefSeq  transcript  8381    14995   .   +   0   ID=transcript-YP_010790291.1;Parent=gene-QKC91_gp6;Dbxref=GenBank:YP_010790291.1,GeneID:80527637;Name=YP_010790291.1;gbkey=CDS;gene=L;locus_tag=QKC91_gp6;product=RNA-dependent RNA polymerase;protein_id=YP_010790291.1
NC_075404.1 RefSeq  exon    8381    14995   .   +   0   ID=exon-YP_010790291.1;Parent=transcript-YP_010790291.1;Dbxref=GenBank:YP_010790291.1,GeneID:80527637;Name=YP_010790291.1;gbkey=CDS;gene=L;locus_tag=QKC91_gp6;product=RNA-dependent RNA polymerase;protein_id=YP_010790291.1

As this is not the first time we have received this question (see https://github.com/Ensembl/ensembl-vep/issues/1074), I am going to talk with the team about the possibility of supporting these NCBI GTF/GFF annotation files for microorganisms. Maybe we can consider each CDS as a single-exon transcript. I will keep you updated on this.

Best regards, Nuno

dzc0104 commented 7 months ago

Thank you for the response, @nuno-agostinho. It worked for that reference. I have a question: did you edit the GFF file manually? I have two other references: 1) https://www.ncbi.nlm.nih.gov/nuccore/NC_039223.1 2) https://www.ncbi.nlm.nih.gov/nuccore/AF077761 - this one has gff3 files and I tried to convert it into gff and even gtf but could not. Gff3 did not even bgzipped and tabixed.

nuno-agostinho commented 7 months ago

Hi @dzc0104,

I manually created the file by basically:

Duplicating the CDS lines
Changing the feature to transcript and exon
Changing their IDs to something unique
Changing their Parent IDs:
- Put the gene ID as the parent ID of the transcript
- Put the transcript ID as the parent ID of the exon

Tell me if you need further instructions.
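
For anyone who would rather script these steps than edit by hand, here is a minimal Python sketch of the same idea. It is not part of VEP. The function names (`parse_attributes`, `format_attributes`, `expand_cds`) are made up for illustration, and it assumes each CDS line has an `ID` and a `Parent` pointing at its gene, as in the NCBI file above. It emits a transcript line and an exon line in place of each CDS line, mirroring the modified annotation posted earlier. A CDS split over several lines with the same ID would produce duplicate IDs and is not handled.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column into a dict of key -> value (order kept)."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key] = value
    return attrs


def format_attributes(attrs):
    """Join a dict of attributes back into a GFF3 attribute column."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def expand_cds(line):
    """Return [transcript_line, exon_line] for a CDS line, else [line] unchanged."""
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    if stripped.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
        return [stripped]

    attrs = parse_attributes(cols[8])
    base = attrs["ID"]
    if base.startswith("cds-"):
        base = base[len("cds-"):]
    transcript_id = "transcript-" + base
    exon_id = "exon-" + base

    # The transcript keeps the CDS's Parent (the gene ID) and gets a new unique ID.
    transcript_attrs = dict(attrs, ID=transcript_id)
    # The exon gets its own unique ID and the transcript as its Parent.
    exon_attrs = dict(attrs, ID=exon_id, Parent=transcript_id)

    transcript = cols[:2] + ["transcript"] + cols[3:8] + [format_attributes(transcript_attrs)]
    exon = cols[:2] + ["exon"] + cols[3:8] + [format_attributes(exon_attrs)]
    return ["\t".join(transcript), "\t".join(exon)]


if __name__ == "__main__":
    example = ("NC_075404.1\tRefSeq\tCDS\t122\t1591\t.\t+\t0\t"
               "ID=cds-YP_010790286.1;Parent=gene-QKC91_gp1;gene=N")
    for out_line in expand_cds(example):
        print(out_line)
```

Running `expand_cds` over every line of the uncompressed GFF and writing the results to a new file gives an annotation that can then be sorted, bgzipped and tabix-indexed as before.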

> this one has gff3 files and I tried to convert it into gff and even gtf but could not. Gff3 did not even bgzipped and tabixed.

If you downloaded the GFF3 annotation via the Send to form in the top right corner of the record, you need to remove the last empty lines of the file before running bgzip and tabix. Tell me if this worked.
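
If it helps, the trailing empty lines can be removed with a few lines of Python before compressing. This is only a sketch; the function name `strip_trailing_blank_lines` is made up here, and it leaves any blank lines in the middle of the file untouched.

```python
def strip_trailing_blank_lines(text):
    """Remove empty or whitespace-only lines from the end of a GFF3 text."""
    lines = text.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    # Keep exactly one newline at the end of the last real line.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    sample = "##gff-version 3\nAF077761.1\tGenbank\tregion\t1\t100\t.\t+\t.\tID=r1\n\n\n"
    print(repr(strip_trailing_blank_lines(sample)))
```

The cleaned text can be written to a new file, which is then passed to bgzip and tabix as usual.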

Cheers, Nuno

dzc0104 commented 6 months ago

@nuno-agostinho Yay! It worked. Thank you very much, Nuno.

Regards, Deepa

dzc0104 commented 6 months ago

@nuno-agostinho I still have a question. How can position 77 be associated with multiple types of genes, namely F, M, NP, and P? During my analysis, I observed that genomic position 77 is annotated with gene symbols F, M, NP, and P across various transcripts like this Iso7- Vep.xlsx

I got this information from a dataset https://www.ncbi.nlm.nih.gov/nuccore/AF077761 that includes details about gene symbols and transcript types. But I'm not sure what it means biologically to have different gene types at the same position.

nuno-agostinho commented 6 months ago

Hi @dzc0104,

The only results associated with genes F and M are upstream_gene_variant or downstream_gene_variant. Marking variants as upstream or downstream of a gene is useful for understanding variants that may affect those genes (for instance, because they fall in regulatory regions).

However, the default distance between a variant and a transcript used by VEP to annotate up/downstream variants is 5 000 bp (optimised for vertebrates) and the genome you mentioned is small (15 186 bp). Please try decreasing the --distance parameter to a value that makes more sense for your use case.
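
To illustrate why a single position picks up several genes, here is a simplified Python sketch of a distance-based upstream/downstream check. It is not VEP's implementation and its exact boundary rules may differ. The names `TRANSCRIPTS`, `flank_label` and `genes_in_range` are made up for the example, and the coordinates are the transcript ranges from the NC_075404.1 annotation posted earlier in this thread, not from AF077761.

```python
# Transcript ranges (start, end, strand) from the modified NC_075404.1 annotation above.
TRANSCRIPTS = {
    "N": (122, 1591, "+"),
    "P": (1887, 3074, "+"),
    "M": (3290, 4384, "+"),
    "F": (4544, 6205, "+"),
    "HN": (6412, 8262, "+"),
    "L": (8381, 14995, "+"),
}


def flank_label(pos, start, end, strand, distance):
    """Label a position relative to one transcript, or None if too far away."""
    if start <= pos <= end:
        return "overlaps_transcript"
    if pos < start:
        gap = start - pos
        label = "upstream_gene_variant" if strand == "+" else "downstream_gene_variant"
    else:
        gap = pos - end
        label = "downstream_gene_variant" if strand == "+" else "upstream_gene_variant"
    return label if gap <= distance else None


def genes_in_range(pos, distance=5000):
    """Return {gene: label} for every transcript within `distance` bp of `pos`."""
    hits = {}
    for gene, (start, end, strand) in TRANSCRIPTS.items():
        label = flank_label(pos, start, end, strand, distance)
        if label is not None:
            hits[gene] = label
    return hits


if __name__ == "__main__":
    print(genes_in_range(77))                # default 5000 bp window
    print(genes_in_range(77, distance=200))  # much smaller window
```

With these coordinates, position 77 lies within 5 000 bp of the N, P, M and F transcripts, so all four are reported as upstream hits, whereas a 200 bp window leaves only N.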

Hope this makes it clear.

Cheers, Nuno

dzc0104 commented 4 months ago

Hi @nuno-agostinho,

Thank you for your assistance.

As part of my data analysis, I've identified synonymous variants and now I'm exploring their potential impacts at the amino acid level. While synonymous variants traditionally aren't thought to have functional impacts on protein structure, they can affect RNA stability, protein folding, evolutionary conservation, splicing regulation, and regulatory elements.

I've utilized Variant Effect Predictor (VEP) with the SIFT option (--sift b), but unfortunately, I didn't receive any relevant data in the output. Does this lack of prediction indicate that there are no available predictions for my variants?

Here's the command I used:

vep -i iso1p1_filtered.snp.vcf.gz \
  --gff /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ref/AF077761/sequence.gff3.gz \
  --fasta /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ref/AF077761/AF077761.fasta.gz \
  --species avian_orthoavulavirus \
  --sift b

Additionally, I'm seeking recommendations for other tools to analyze the functional impacts of synonymous variants, particularly those focusing on RNA-level effects, splicing regulation, and non-protein-coding impacts.

Thank you for your guidance! 😊

I have attached hereby the link to the VCF file.

iso1p1_filtered.snp.vcf.gz

Best regards, Deepa

nuno-agostinho commented 4 months ago

Hi @dzc0104,

VEP only returns pre-computed SIFT results stored in Ensembl databases in --database or --cache modes. However, we don't have SIFT results for avian orthoavulavirus. You may want to consider installing and running SIFT on your data, as per https://sift.bii.a-star.edu.sg.

Regarding additional tools to help predict variant consequences, some articles list such tools:

Hope this information was useful.

Cheers, Nuno

Joshua-Macleod commented 1 month ago

Hi @nuno-agostinho,

I have a similar issue as the one originally reported by @dzc0104 regarding intergenic variant calling.

I've built .gff3 files using both prokka and bakta for reference genomes against which I'm looking to find variants. Here's an excerpt of a bakta .gff3 below:

contig00001     Prodigal        CDS     265     723     .       +       0       ID=KAHBKG_00010;Name=Transcriptional regulator CtsR;locus_tag=KAHBKG_00010;product=Transcriptional regulator CtsR;Dbxref=COG:COG4463,COG:K,RefSeq:WP_003760062.1,SO:0001217,UniParc:UPI00000CC18E,UniRef:UniRef100_H1GA27,UniRef:UniRef50_A0A143YMT3,UniRef:UniRef90_G2ZA06;gene=ctsR
contig00001     Prodigal        CDS     736     1254    .       +       0       ID=KAHBKG_00015;Name=Protein-arginine kinase activator protein McsA;locus_tag=KAHBKG_00015;product=Protein-arginine kinase activator protein McsA;Dbxref=COG:COG3880,COG:O,RefSeq:WP_003760064.1,SO:0001217,UniParc:UPI0001EB894E,UniRef:UniRef100_A0A823H5C3,UniRef:UniRef50_H1GA28,UniRef:UniRef90_H1GA28;gene=mcsA
contig00001     Prodigal        CDS     1251    2273    .       +       0       ID=KAHBKG_00020;Name=protein arginine kinase;locus_tag=KAHBKG_00020;product=protein arginine kinase;Dbxref=COG:COG3869,COG:O,EC:2.7.14.1,GO:0004111,GO:0004672,GO:0005524,GO:0016310,GO:0046314,RefSeq:WP_010990301.1,SO:0001217,UniParc:UPI000013952D,UniRef:UniRef100_Q92F44,UniRef:UniRef50_Q48759,UniRef:UniRef90_Q48759;gene=mcsB
contig00001     Prodigal        CDS     2302    4764    .       +       0       ID=KAHBKG_00025;Name=endopeptidase Clp ATP-binding chain C;locus_tag=KAHBKG_00025;product=endopeptidase Clp ATP-binding chain C;Dbxref=COG:COG0542,COG:O,RefSeq:WP_003770116.1,SO:0001217,UniParc:UPI00000CC190,UniRef:UniRef100_A0A3H2VSB6,UniRef:UniRef50_A0A0F7N4K2,UniRef:UniRef90_A0A097B1Z0,VFDB:VFC0282,VFDB:VFG000079;gene=clpC

I've tried to make use of your method here:

Duplicating the CDS lines

Changing the feature to transcript and exon

Changing their IDs to something unique

Changing their Parent IDs:

Put the gene ID as the parent ID of the transcript

Put the transcript ID as the parent ID of the exon

and even changing CDS to gene in the .gff3 file and including a biotype to remedy the warning (just on the off chance...):

contig00001     Prodigal        gene    265     723     .       +       .       ID=gene-KAHBKG_00010;locus_tag=KAHBKG_00010;gene_biotype=protein_coding
contig00001     Prodigal        transcript      265     723     .       +       .       ID=KAHBKG_00010_t1000;Parent=gene-KAHBKG_00010;locus_tag=KAHBKG_00010
contig00001     Prodigal        exon    265     723     .       +       0       ID=KAHBKG_00010_e1000;Parent=KAHBKG_00010_t1000;locus_tag=KAHBKG_00010

However, I still receive warnings (WARNING: Unable to determine biotype of KAHBKG_01390) for approx. 30 IDs/locus_tags per .gff3 and variants are still called as intergenic even if the locations fall within a CDS.

Any recommendations here, or if you'd like me to provide test data, do let me know.

Cheers, Joshua

nuno-agostinho commented 1 month ago

Hi @Joshua-Macleod,

Based on that warning, I would say that those lines have no field indicating their biotype, so VEP can't determine whether they are part of a protein_coding transcript or not.

Could you show me the lines in your GFF3 file relative to KAHBKG_01390?
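
In case it is useful for pulling those lines out, here is a small Python sketch that lists every feature carrying a given locus_tag together with any attribute keys that mention a biotype. The function name `biotype_report` is made up, and matching on the substring "biotype" is an assumption for illustration only; it does not reflect the exact attribute names VEP looks for.

```python
def biotype_report(gff_lines, locus_tag):
    """Return (feature_type, ID, biotype_keys) for each feature with this locus_tag."""
    report = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # Parse the attribute column into key -> value pairs.
        attrs = dict(part.split("=", 1) for part in cols[8].split(";") if "=" in part)
        if attrs.get("locus_tag") != locus_tag:
            continue
        # Assumption: any key containing "biotype" counts as a biotype field.
        keys = sorted(key for key in attrs if "biotype" in key.lower())
        report.append((cols[2], attrs.get("ID"), keys))
    return report


if __name__ == "__main__":
    lines = [
        "contig00001\tProdigal\tgene\t265\t723\t.\t+\t.\t"
        "ID=gene-KAHBKG_00010;locus_tag=KAHBKG_00010;gene_biotype=protein_coding",
        "contig00001\tProdigal\ttranscript\t265\t723\t.\t+\t.\t"
        "ID=KAHBKG_00010_t1000;Parent=gene-KAHBKG_00010;locus_tag=KAHBKG_00010",
    ]
    for row in biotype_report(lines, "KAHBKG_00010"):
        print(row)
```

A feature reported with an empty key list has no biotype-like attribute on its own line, which is the kind of line worth posting here.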

Best, Nuno

Joshua-Macleod commented 1 month ago

Hi @nuno-agostinho,

Thanks for getting back to me.

Here are the lines:

contig00001     Prodigal        gene    270089  271192  .       +       .       ID=gene-KAHBKG_01390;locus_tag=KAHBKG_01390;gene_biotype=protein_coding;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN
contig00001     Prodigal        transcript      270089  271192  .       +       .       ID=KAHBKG_01390_t1272;Parent=gene-KAHBKG_01390;locus_tag=KAHBKG_01390;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN
contig00001     Prodigal        exon    270089  271192  .       +       0       ID=KAHBKG_01390_e1272;Parent=KAHBKG_01390_t1272;locus_tag=KAHBKG_01390;Name=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;product=23S rRNA (adenine(2503)-C(2))-methyltransferase RlmN;Dbxref=COG:COG0820,COG:J,EC:2.1.1.192,GO:0000049,GO:0002935,GO:0005737,GO:0008757,GO:0016433,GO:0019843,GO:0031167,GO:0046872,GO:0051539,GO:0070040,GO:0070475,RefSeq:WP_003725208.1,SO:0001217,UniParc:UPI00000CC251,UniRef:UniRef100_Q92EH6,UniRef:UniRef50_Q8Y9P2,UniRef:UniRef90_Q8Y9P2;gene=rlmN

Worth noting, these aren't loci outputted by vep (edit: presumably wouldn't be for the same reason they're noted in the warnings - I didn't put two and two together).

Cheers, Joshua

Ensembl / ensembl-vep

All variants are intergenic with NCBI GFF #1620
