Closed davmlaw closed 3 months ago
From what I can tell, RefSeq do not provide MT transcripts
There are no NM_ records against NC_012920.1, and all of the proteins in MT have the "YP_" prefix - which means there is no associated NM_ transcript
@davmlaw hm... Could we add them with MT-(geneid) as pseudo transcripts as a special case?
@holtgrewe - I am not an expert on mitochondria - but apparently these genes are sometimes transcribed as large polycistronic units... and because that's different, no standard transcripts are given
So, the "correct" thing to do would be to insert the genes w/o transcripts, but that's not very useful as we store start/end etc in transcripts
So perhaps we could insert transcripts with IDs like "fake-transcript-YP_XXX" that have start/end and coding start/end taken from the gene start/end - is that what you mean?
@davmlaw yes, that's what I mean, to make RefSeq symmetric to Ensembl
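The proposal above can be sketched roughly as follows. This is only an illustration, not cdot's actual code: the helper name, the field names and the example gene values are all hypothetical, and the only idea taken from the thread is that the placeholder transcript copies its start/end and coding start/end from the gene.

```python
def make_placeholder_transcript(protein_id, gene_name, contig, start, end, strand):
    """Build a single-exon placeholder transcript that spans the whole gene.

    Hypothetical helper for illustration only; not part of cdot's API.
    """
    return {
        # ID scheme as floated in the discussion ("fake-transcript-YP_XXX")
        "id": f"fake-transcript-{protein_id}",
        "gene_name": gene_name,
        "protein": protein_id,
        "contig": contig,
        "strand": strand,
        # Transcript bounds are copied from the gene bounds
        "start": start,
        "end": end,
        # Coding bounds are also copied from the gene bounds
        "cds_start": start,
        "cds_end": end,
        # One exon covering the whole gene
        "exons": [[start, end]],
    }


# Made-up gene values, purely to show the shape of the result
tx = make_placeholder_transcript("YP_XXX", "GENE1", "NC_012920.1", 100, 400, "+")
print(tx["id"], tx["start"], tx["end"], tx["cds_start"], tx["cds_end"])
```

Keeping everything in one exon sidesteps the question of how these genes are actually transcribed; it only gives downstream code a start/end to work with.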
@holtgrewe - see https://github.com/SACGF/cdot/releases/tag/data_v0.2.26
import gzip
import json

filename = "/data/annotation/cdot/refseq/GRCh38/cdot-0.2.26.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz"
# Load the gzipped cdot JSON in text mode and close the file afterwards
with gzip.open(filename, "rt") as f:
    data = json.load(f)

# Print the ID of every transcript on the mitochondrial contig
for t, td in data["transcripts"].items():
    contig = td["genome_builds"]["GRCh38"]["contig"]
    if contig == "NC_012920.1":
        print(t)
Output
fake-rna-ATP6
fake-rna-ATP8
fake-rna-COX1
fake-rna-COX2
fake-rna-COX3
fake-rna-CYTB
fake-rna-ND1
fake-rna-ND2
fake-rna-ND3
fake-rna-ND4
fake-rna-ND4L
fake-rna-ND5
fake-rna-ND6
In [10]: data["transcripts"]["fake-rna-COX3"]
Out[10]:
{'biotype': ['mRNA'],
 'gene_name': 'COX3',
 'gene_version': '4514',
 'genome_builds': {'GRCh38': {'cds_end': 9990,
                              'cds_start': 9206,
                              'contig': 'NC_012920.1',
                              'exons': [[9206, 9990, 0, 1, 784, None]],
                              'note': "TAA stop codon is completed by the addition of 3' A residues to the mRNA",
                              'strand': '+',
                              'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_10/GCF_000001405.40_GRCh38.p14_genomic.gff.gz'}},
 'hgnc': '7422',
 'id': 'fake-rna-COX3',
 'protein': 'YP_003024032.1',
 'start_codon': 0,
 'stop_codon': 784}
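As a quick sanity check on a record like this, the coordinates can be compared against each other. The sketch below inlines a subset of the COX3 record shown above and assumes that the first two elements of each exon entry are the genomic start and end (this matches cds_start/cds_end here, but the meaning of the remaining exon fields is not covered). The summarise function is a hypothetical helper, not part of cdot.

```python
# Subset of the fake-rna-COX3 record shown above
record = {
    "id": "fake-rna-COX3",
    "genome_builds": {
        "GRCh38": {
            "cds_start": 9206,
            "cds_end": 9990,
            "contig": "NC_012920.1",
            "exons": [[9206, 9990, 0, 1, 784, None]],
            "strand": "+",
        }
    },
    "start_codon": 0,
    "stop_codon": 784,
}


def summarise(record, build):
    """Hypothetical helper: collect a few lengths from one transcript record."""
    coords = record["genome_builds"][build]
    # Assumption: exon[0] and exon[1] are the genomic start and end
    exon_bases = sum(exon[1] - exon[0] for exon in coords["exons"])
    return {
        "contig": coords["contig"],
        "exon_bases": exon_bases,
        "cds_span": coords["cds_end"] - coords["cds_start"],
        "codon_span": record["stop_codon"] - record["start_codon"],
    }


summary = summarise(record, "GRCh38")
print(summary)
```

For this single-exon record all three lengths agree at 784, which is what you would expect when the coding start/end are taken directly from the gene start/end.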
Cc @tedil
The RefSeq GRCh37/GRCh38 joint historical data does not have any entries with contig "NC_012920.1"
There are 37 entries for contig NC_012920.1 in the latest GFF file:
We must not have been reading these in correctly
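One way to check how many GFF rows sit on a given contig is to count on the first column (the sequence ID). The sketch below is a minimal illustration using made-up rows held in memory; count_rows_by_contig is a hypothetical helper, and for the real file you would pass a handle from gzip.open(path, "rt") instead.

```python
import io
from collections import Counter


def count_rows_by_contig(handle):
    """Count GFF feature rows per sequence ID (column 1), skipping comments."""
    counts = Counter()
    for line in handle:
        # Skip directives/comments ("#...") and blank lines
        if line.startswith("#") or not line.strip():
            continue
        seqid = line.split("\t", 1)[0]
        counts[seqid] += 1
    return counts


# Made-up rows for illustration only; not taken from the real GFF
example = io.StringIO(
    "##gff-version 3\n"
    "NC_012920.1\tRefSeq\tgene\t1\t10\t.\t+\t.\tID=gene-EXAMPLE1\n"
    "NC_012920.1\tRefSeq\tCDS\t1\t10\t.\t+\t0\tID=cds-EXAMPLE1\n"
    "NC_000001.11\tRefSeq\tgene\t5\t50\t.\t-\t.\tID=gene-EXAMPLE2\n"
)

counts = count_rows_by_contig(example)
print(counts["NC_012920.1"])
```

Comparing this count against the number of transcripts loaded for the same contig is a cheap way to spot rows that the reader is silently dropping.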