Biotype - missing in Ensembl transcripts

SACGF / cdot

Transcript versions for HGVS libraries

MIT License


The transcript biotype is being lost for Ensembl transcripts: it is always [], whereas there should be many different values.
The gene biotype is a comma-separated string, while the transcript biotype looks to be a JSON array. We should be consistent.
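One way to make the two representations consistent is to normalise both to a list when reading them. This is only a sketch: `normalize_biotype` is a hypothetical helper, not part of cdot, and it assumes the only forms seen are `None`, a comma-separated string, or a list of strings.

```python
def normalize_biotype(value):
    """Return biotypes as a sorted list of unique strings.

    Hypothetical helper (not a cdot function). Accepts None, a
    comma-separated string (gene style) or a list (transcript style).
    """
    if value is None:
        return []
    if isinstance(value, str):
        parts = value.split(",")
    else:
        parts = list(value)
    # Drop empty entries and surrounding whitespace, then de-duplicate
    return sorted({p.strip() for p in parts if p and p.strip()})


print(normalize_biotype("non_coding,protein_coding"))
print(normalize_biotype(["protein_coding"]))
print(normalize_biotype(None))
```

Whether the stored format should end up as a list or a string is a separate decision; the point is that readers would then only see one shape.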

Here are the gene/transcript biotypes produced:

Ensembl 37
Transcripts - biotype: all []
Genes - biotype: "protein_coding"

Ensembl 38
Transcripts - biotype: all []
Genes - 'biotype': 'protein_coding'

RefSeq 37
Genes - biotype: non_coding,protein_coding
Transcripts - transcript_biotypes: Counter({"['protein_coding']": 107302, 'None': 19034, "['non_coding']": 23057, '[]': 40073})

RefSeq 38
Genes - 'biotype': 'non_coding'
Transcripts - 'biotype': ['protein_coding']

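    import gzip
    import json
    import sys

    filename = sys.argv[1]
    with gzip.open(filename, "rt") as f:
        data = json.load(f)

    # Transcript biotype is a JSON array (it may also be missing or None)
    biotypes = set()
    for transcript_id, td in data["transcripts"].items():
        biotypes.update(td.get("biotype") or [])

    # Gene biotype is a comma-separated string, so split it before adding;
    # calling set.update() on the raw string would add single characters
    gene_biotypes = set()
    for gid, gd in data["genes"].items():
        biotype = gd.get("biotype")
        if biotype:
            gene_biotypes.update(biotype.split(","))

    print("Transcript biotypes:", sorted(biotypes))
    print("Gene biotypes:", sorted(gene_biotypes))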
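The RefSeq 37 transcript counts listed earlier have keys like "['protein_coding']" and 'None', which suggests they came from counting the string form of each transcript's biotype. A minimal sketch of that, assuming the same "transcripts" layout as the script above; `count_transcript_biotypes` and the tiny `sample` dict (with made-up transcript IDs) are hypothetical and only there to make it runnable without a data file:

```python
from collections import Counter


def count_transcript_biotypes(data):
    """Count transcripts per biotype value, keyed by the value's str() form.

    Hypothetical helper: a missing "biotype" key is counted as 'None'.
    """
    return Counter(str(td.get("biotype")) for td in data["transcripts"].values())


# Made-up sample standing in for a loaded cdot JSON file
sample = {
    "transcripts": {
        "tx1": {"biotype": ["protein_coding"]},
        "tx2": {"biotype": ["non_coding"]},
        "tx3": {"biotype": []},
        "tx4": {},
    }
}

transcript_biotypes = count_transcript_biotypes(sample)
print(transcript_biotypes)
```

Run against an Ensembl file, the symptom in this issue would show up as a single '[]' key holding every transcript.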


Biotype - missing in Ensembl transcripts #50