SACGF / cdot

Transcript versions for HGVS libraries
MIT License
29 stars 5 forks source link

RefSeq missing MT transcripts #79

Closed davmlaw closed 3 months ago

davmlaw commented 7 months ago

RefSeq GRCh37/GRCh38 joint historical do not have any contigs of "NC_012920.1"

There are 37 entries for contig NC_012920.1 in the latest GFF file:

zgrep NC_012920.1 GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.gz | egrep "\sgene\s"
516000;Name=ND1;gbkey=Gene;gene=ND1;gene_biotype=protein_coding;gene_synonym=MTND1
NC_012920.1 RefSeq  gene    4263    4331    .   +   .   ID=gene-TRNI;Dbxref=GeneID:4565,HGNC:HGNC:7488,MIM:590045;Name=TRNI;gbkey=Gene;gene=TRNI;gene_biotype=tRNA;gene_synonym=MTTI
NC_012920.1 RefSeq  gene    4329    4400    .   -   .   ID=gene-TRNQ;Dbxref=GeneID:4572,HGNC:HGNC:7495,MIM:590030;Name=TRNQ;gbkey=Gene;gene=TRNQ;gene_biotype=tRNA;gene_synonym=MTTQ
NC_012920.1 RefSeq  gene    4470    5511    .   +   .   ID=gene-ND2;Dbxref=GeneID:4536,HGNC:HGNC:7456,MIM:516001;Name=ND2;gbkey=Gene;gene=ND2;gene_biotype=protein_coding;gene_synonym=MTND2

We must have been not reading these in correctly

davmlaw commented 7 months ago

From what I can tell, RefSeq do not provide MT transcripts

There are not NM_ records against NC012920.1 and All of the proteins in MT have the "YP" prefix - which means there is no associated NM_ transcript

holtgrewe commented 6 months ago

@davmlaw hm... Could we add them with MT-(geneid) as pseudo transcripts as a special case?

davmlaw commented 6 months ago

@holtgrewe - I am not an expert on mitochondria - but apparently sometimes these genes are transcribed as large polycistronic units... and as it's different that's why there's no standard transcripts given

So, the "correct" thing to do would be to insert the genes, but w/o transcripts, but that's not very useful as we store start/end etc in transcripts

So perhaps we could insert transcripts with IDs like "fake-transcript-YP_XXX" that have start/end and coding start/end taken from the gene start/end - is that what you mean?

holtgrewe commented 6 months ago

@davmlaw yes that's what I mean to make refseq symmetric to Ensembl

davmlaw commented 3 months ago

@holtgrewe - see https://github.com/SACGF/cdot/releases/tag/data_v0.2.26

import gzip
import json
filename = "/data/annotation/cdot/refseq/GRCh38/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz"
data = json.load(gzip.open(filename))
for t, td in data["transcripts"].items():
    contig = td["genome_builds"]["GRCh38"]["contig"]
    if contig == "NC_012920.1":
        print(t)

Output

fake-rna-ATP6
fake-rna-ATP8
fake-rna-COX1
fake-rna-COX2
fake-rna-COX3
fake-rna-CYTB
fake-rna-ND1
fake-rna-ND2
fake-rna-ND3
fake-rna-ND4
fake-rna-ND4L
fake-rna-ND5
fake-rna-ND6
In [10]: data["transcripts"]["fake-rna-COX3"]
Out[10]: 
{'biotype': ['mRNA'],
 'gene_name': 'COX3',
 'gene_version': '4514',
 'genome_builds': {'GRCh38': {'cds_end': 9990,
   'cds_start': 9206,
   'contig': 'NC_012920.1',
   'exons': [[9206, 9990, 0, 1, 784, None]],
   'note': "TAA stop codon is completed by the addition of 3' A residues to the mRNA",
   'strand': '+',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_10/GCF_000001405.40_GRCh38.p14_genomic.gff.gz'}},
 'hgnc': '7422',
 'id': 'fake-rna-COX3',
 'protein': 'YP_003024032.1',
 'start_codon': 0,
 'stop_codon': 784}
holtgrewe commented 3 months ago

Cc @tedil