davmlaw closed 1 year ago
Running ~/localwork/cdot/generate_transcript_data/all_transcripts.sh
Will check the output on Monday and see if it's good for all files.
Write a script to read over all files and check the transcript/gene biotypes, i.e. something like:
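import gzip
import json
import sys

filename = sys.argv[1]
with gzip.open(filename, "rt") as f:
    data = json.load(f)

# Transcript biotype looks to be a list (can be missing/None)
biotypes = set()
for transcript_id, td in data["transcripts"].items():
    biotypes.update(td.get("biotype") or [])

# Gene biotype is a comma-separated string; split it so we don't add single characters
gene_biotypes = set()
for gid, gd in data["genes"].items():
    gene_biotypes.update(b for b in (gd.get("biotype") or "").split(",") if b)

print("transcript biotypes:", sorted(biotypes))
print("gene biotypes:", sorted(gene_biotypes))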
The transcript biotype is being lost in Ensembl - it's always [] - there should be heaps of different ones.
The gene biotype is a comma-separated string, while the transcript biotype looks to be a JSON array - we should be consistent.
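One way to make the two consistent would be to normalise both to a list. This is only a rough sketch: normalize_biotype is a hypothetical helper (not something in cdot), and it assumes a biotype is either None, a comma-separated string, or a list of strings.

```python
def normalize_biotype(biotype):
    """Return the biotype as a sorted list of unique strings.

    Hypothetical helper, not part of cdot. Assumes the value is None,
    a comma-separated string (as seen for genes), or a list of strings
    (as seen for transcripts).
    """
    if biotype is None:
        return []
    if isinstance(biotype, str):
        parts = biotype.split(",")
    else:
        parts = list(biotype)
    # Drop empty entries and surrounding whitespace, then de-duplicate
    return sorted({p.strip() for p in parts if p and p.strip()})
```

With this, a gene value of "non_coding,protein_coding" and a transcript value of ["protein_coding"] both come out as lists, and a missing value becomes an empty list.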
Here are the gene/transcript biotypes produced:
Ensembl 37: Transcripts - biotype: all []. Genes - biotype: "protein_coding"
Ensembl 38: Transcripts - biotype: all []. Genes - 'biotype': 'protein_coding'
RefSeq 37: Genes - biotype: non_coding,protein_coding. Transcripts - transcript_biotypes: Counter({"['protein_coding']": 107302, 'None': 19034, "['non_coding']": 23057, '[]': 40073})
RefSeq 38: Genes - 'biotype': 'non_coding'. Transcripts - 'biotype': ['protein_coding']
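Counts like the RefSeq 37 ones can be produced with something along these lines. It's a sketch only: count_biotypes is a made-up name, and it assumes each file is gzipped JSON with top-level "transcripts" and "genes" dicts, as in the script above. Raw values are keyed by str() so that lists and None can be counted as-is.

```python
import gzip
import json
from collections import Counter


def count_biotypes(filename):
    """Count raw biotype values for transcripts and genes in one file.

    Hypothetical helper. Assumes gzipped JSON with "transcripts" and
    "genes" dicts; a missing "biotype" key is counted as 'None'.
    """
    with gzip.open(filename, "rt") as f:
        data = json.load(f)
    # str() makes list values hashable and keeps [] distinct from None
    transcript_biotypes = Counter(
        str(td.get("biotype")) for td in data["transcripts"].values()
    )
    gene_biotypes = Counter(
        str(gd.get("biotype")) for gd in data["genes"].values()
    )
    return transcript_biotypes, gene_biotypes
```

Calling it once per output file (e.g. looping over sys.argv[1:]) gives a quick per-file view of which biotypes are empty or missing.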