connor-lab / ncov2019-artic-nf

A Nextflow pipeline for running the ARTIC network's fieldbioinformatics tools (https://github.com/artic-network/fieldbioinformatics), with a focus on ncov2019
GNU Affero General Public License v3.0

GFF included in repo doesn't include ORF1ab frameshift #110

Open tnguyensanger opened 3 years ago

tnguyensanger commented 3 years ago

Is the GFF included in the repo intended for production use or is it only for unit testing?

https://github.com/connor-lab/ncov2019-artic-nf/blob/9ac3119a875d75c49de65848a3587e6fcec22d1c/typing/MN908947.3.gff

The GFF included in the repo seems to use coordinates for the ORF1ab gene that do not take into account the 1 bp frameshift caused by ribosomal slippage. ORF1ab is annotated as a single continuous CDS from 266 to 21555:

MN908947.3  ensembl gene    266 13483   .   +   .   ID=gene:ENSSASG00005000003;Name=ORF1ab;biotype=protein_coding;description=ORF1a polyprotein%3BORF1ab polyprotein [Source:NCBI gene (formerly Entrezgene)%3BAcc:43740578];gene_id=ENSSASG00005000003;logic_name=ensembl_covid;version=1
MN908947.3  ensembl mRNA    266 13483   .   +   .   ID=transcript:ENSSAST00005000003;Parent=gene:ENSSASG00005000003;Name=ORF1a;biotype=protein_coding;transcript_id=ENSSAST00005000003;version=1
MN908947.3  ensembl exon    266 13483   .   +   .   Parent=transcript:ENSSAST00005000003;Name=ENSSASE00005000003;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSSASE00005000003;rank=1;version=1
MN908947.3  ensembl CDS 266 13483   .   +   0   ID=CDS:ENSSASP00005000003;Parent=transcript:ENSSAST00005000003;protein_id=ENSSASP00005000003
####
MN908947.3  ensembl gene    266 21555   .   +   .   ID=gene:ENSSASG00005000002;Name=ORF1ab;biotype=protein_coding;description=ORF1a polyprotein%3BORF1ab polyprotein [Source:NCBI gene (formerly Entrezgene)%3BAcc:43740578];gene_id=ENSSASG00005000002;logic_name=ensembl_covid;version=1
MN908947.3  ensembl mRNA    266 21555   .   +   .   ID=transcript:ENSSAST00005000002;Parent=gene:ENSSASG00005000002;Name=ORF1ab;biotype=protein_coding;transcript_id=ENSSAST00005000002;version=1
MN908947.3  ensembl exon    266 21555   .   +   .   Parent=transcript:ENSSAST00005000002;Name=ENSSASE00005000002;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSSASE00005000002;rank=1;version=1
MN908947.3  ensembl CDS 266 21555   .   +   0   ID=CDS:ENSSASP00005000002;Parent=transcript:ENSSAST00005000002;protein_id=ENSSASP00005000002
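
One quick way to see the problem with the rows above is to check whether each CDS span is a whole number of codons. The following is a minimal sketch, not part of the pipeline: the coordinates are copied from the CDS rows quoted above, and the helper name `cds_length` is made up for illustration.

```python
def cds_length(start: int, end: int) -> int:
    """Length of a GFF interval; GFF coordinates are 1-based and inclusive."""
    return end - start + 1


# (start, end) copied from the two Ensembl CDS rows quoted above.
ensembl_cds = {"ORF1a": (266, 13483), "ORF1ab": (266, 21555)}

# A CDS read in a single frame should have a length divisible by 3.
remainders = {name: cds_length(s, e) % 3 for name, (s, e) in ensembl_cds.items()}

for name, (s, e) in ensembl_cds.items():
    print(f"{name}: length {cds_length(s, e)}, remainder mod 3 = {remainders[name]}")
# ORF1a: length 13218, remainder mod 3 = 0
# ORF1ab: length 21290, remainder mod 3 = 2
```

The ORF1a span is a multiple of three, but the single-interval ORF1ab span is not, so translating 266..21555 in one frame cannot yield the full pp1ab polyprotein.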

This frameshift is represented in the latest NCBI GFF, where the ORF1ab CDS is split into two segments that share position 13468: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz

NC_045512.2     RefSeq  gene    266     21555   .       +       .       ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512.2     RefSeq  CDS     266     13468   .       +       0       ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2     RefSeq  CDS     13468   21555   .       +       0       ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
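
For anyone who wants to check a GFF for this programmatically, here is a rough sketch that groups CDS rows by their `ID` attribute and reports the offset at each junction between consecutive segments. It assumes tab-separated GFF3 with nine columns. The function names `cds_segments` and `junction_shifts` are hypothetical, and the sample rows are the two NCBI CDS rows above with the attribute column shortened to `ID` and `Parent` for readability.

```python
from collections import defaultdict


def cds_segments(gff_text):
    """Group CDS intervals by ID (GFF3 reuses one ID for a multi-part feature)."""
    segments = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        segments[attrs.get("ID", "")].append((int(cols[3]), int(cols[4])))
    return {cds_id: sorted(parts) for cds_id, parts in segments.items()}


def junction_shifts(parts):
    """Offset between consecutive segments: 0 = contiguous, -1 = one base re-read."""
    return [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(parts, parts[1:])]


# Abbreviated copies of the two RefSeq CDS rows (attributes shortened).
sample = "\n".join([
    "NC_045512.2\tRefSeq\tCDS\t266\t13468\t.\t+\t0\tID=cds-YP_009724389.1;Parent=gene-GU280_gp01",
    "NC_045512.2\tRefSeq\tCDS\t13468\t21555\t.\t+\t0\tID=cds-YP_009724389.1;Parent=gene-GU280_gp01",
])

parts = cds_segments(sample)["cds-YP_009724389.1"]
shifts = junction_shifts(parts)
total = sum(end - start + 1 for start, end in parts)

print(parts)      # [(266, 13468), (13468, 21555)]
print(shifts)     # [-1]
print(total % 3)  # 0
```

The `-1` at the junction is the slippage: base 13468 is counted in both segments, and the summed length (21291) is a whole number of codons. Running the same check on the GFF in the repo would return a single segment per CDS ID, so there is no junction to report.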