SACGF / cdot

Transcript versions for HGVS libraries
MIT License

Biotype - missing in Ensembl transcripts #50

Closed: davmlaw closed this issue 1 year ago

davmlaw commented 1 year ago
  1. The transcript biotype is being lost for Ensembl - it's always [] - when there should be heaps of different ones

  2. The gene biotype is a comma-separated string, while the transcript biotype looks to be a JSON array - we should be consistent
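
One way to make the two fields consistent would be to normalise both into a list before writing them out. This is only a minimal sketch, assuming the values take the shapes seen in this issue (None, a comma-separated string, or a list). `normalize_biotype` is a hypothetical helper name, not an existing cdot function.

```python
def normalize_biotype(value):
    """Return a biotype value as a sorted list of unique strings.

    Hypothetical helper: accepts None, a comma-separated string, or an
    iterable of strings (the shapes reported in this issue).
    """
    if value is None:
        return []
    if isinstance(value, str):
        value = value.split(",")
    # Drop empty entries and surrounding whitespace, then de-duplicate
    return sorted({b.strip() for b in value if b and b.strip()})


print(normalize_biotype("non_coding,protein_coding"))
print(normalize_biotype(["protein_coding"]))
print(normalize_biotype(None))
```

Using a list for both genes and transcripts keeps the JSON type the same in both places, so consumers would not need to special-case either one.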

Here are the gene/transcript biotypes currently produced:

Ensembl 37
  Transcripts - biotype: all []
  Genes - biotype: "protein_coding"

Ensembl 38
  Transcripts - biotype: all []
  Genes - 'biotype': 'protein_coding'

RefSeq 37
  Genes - biotype: non_coding,protein_coding
  Transcripts - In [26]: transcript_biotypes
                Out[26]: Counter({"['protein_coding']": 107302, 'None': 19034, "['non_coding']": 23057, '[]': 40073})

RefSeq 38
  Genes - 'biotype': 'non_coding'
  Transcripts - 'biotype': ['protein_coding']
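
A tally like the Counter shown for RefSeq 37 could be produced roughly as follows. This is a sketch, assuming the loaded JSON has a top-level "transcripts" dict whose values may carry a "biotype" key. The transcript IDs in the example are made up for illustration and are not real cdot output.

```python
from collections import Counter


def count_transcript_biotypes(data):
    """Count the string form of each transcript's biotype field.

    str() is used so that lists, None and empty lists all become
    hashable keys ("['protein_coding']", "None", "[]").
    """
    return Counter(str(td.get("biotype")) for td in data["transcripts"].values())


# Made-up example data (hypothetical IDs), just to show the shape
example = {
    "transcripts": {
        "TX_A.1": {"biotype": ["protein_coding"]},
        "TX_B.1": {"biotype": ["non_coding"]},
        "TX_C.1": {"biotype": []},
        "TX_D.1": {},  # biotype key missing entirely
    }
}
print(count_transcript_biotypes(example))
```

Counting the string form keeps missing values ("None") separate from empty lists ("[]"), which is the distinction that matters for this bug.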

davmlaw commented 1 year ago

Running

~/localwork/cdot/generate_transcript_data/all_transcripts.sh

Will check output on Monday and see if it's good for all files

To do: write a script that reads over all the files and reports the transcript/gene biotypes in each, i.e. something like:

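import gzip
import json
import sys

filename = sys.argv[1]
# Data files are gzipped JSON
with gzip.open(filename, "rt") as f:
    data = json.load(f)

# Transcript biotype should be a list, but may be missing or None
biotypes = set()
for transcript_id, td in data["transcripts"].items():
    biotypes.update(td.get("biotype") or [])

# Gene biotype may be a comma-separated string; split it so the set
# holds whole biotype names rather than single characters
gene_biotypes = set()
for gid, gd in data["genes"].items():
    biotype = gd.get("biotype") or []
    if isinstance(biotype, str):
        biotype = biotype.split(",")
    gene_biotypes.update(biotype)

print("Transcript biotypes:", sorted(biotypes))
print("Gene biotypes:", sorted(gene_biotypes))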
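
The snippet above only looks at one file. To check every generated file in one go, the same idea could be wrapped in a function and run over all paths given on the command line. This is a sketch under the same assumptions about the file layout (gzipped JSON with "transcripts" and "genes" dicts); `as_list` and `collect_biotypes` are hypothetical names.

```python
import gzip
import json
import sys


def as_list(value):
    """Accept None, a comma-separated string, or a list; return a list."""
    if not value:
        return []
    if isinstance(value, str):
        return [b for b in value.split(",") if b]
    return list(value)


def collect_biotypes(filename):
    """Return (transcript_biotypes, gene_biotypes) sets for one gzipped JSON file."""
    with gzip.open(filename, "rt") as f:
        data = json.load(f)
    transcript_biotypes = set()
    for td in data.get("transcripts", {}).values():
        transcript_biotypes.update(as_list(td.get("biotype")))
    gene_biotypes = set()
    for gd in data.get("genes", {}).values():
        gene_biotypes.update(as_list(gd.get("biotype")))
    return transcript_biotypes, gene_biotypes


if __name__ == "__main__":
    for filename in sys.argv[1:]:
        t_biotypes, g_biotypes = collect_biotypes(filename)
        print(filename)
        # An empty set here is the symptom described in this issue
        print("  transcripts:", sorted(t_biotypes) or "EMPTY")
        print("  genes:", sorted(g_biotypes) or "EMPTY")
```

Any file that prints EMPTY for transcripts would still have the problem from point 1.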