Closed holtgrewe closed 6 months ago
Proposing #74, please have a look.
Merged the pull request, am currently re-generating the data. Will double check that there are no genes with description/gene symbol as NULL.
    import gzip
    import json

    # Load the cdot Ensembl GRCh37 release file
    with gzip.open("./ensembl/GRCh37/cdot-0.2.25.ensembl.grch37.json.gz", "rt") as f:
        data = json.load(f)

    # Collect the biotypes of every gene that has no gene symbol
    biotypes = set()
    for gene_id, gene_data in data["genes"].items():
        if gene_data["gene_symbol"] is None:
            for biotype in gene_data.get("biotype", []):
                biotypes.add(biotype)
    biotypes
There are quite a few feature types that have genes in them...
{'C_gene_segment',
'IG_D_gene',
'J_gene_segment',
'TR_C_gene',
'TR_J_gene',
'TR_V_gene',
'V_gene_segment',
'aberrant_processed_transcript',
'lincRNA',
'mRNA',
'miRNA',
'misc_RNA',
'ncRNA',
'processed_transcript',
'rRNA',
'snRNA',
'snoRNA'}
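For the double check on NULL description/gene symbol, here is a rough sketch that tallies the missing fields per biotype rather than only collecting the biotype names. It assumes the gene records are dicts with `gene_symbol`, `description` and `biotype` keys (the `description` key is an assumption here). `count_missing` and the sample records are made up for illustration and are not part of cdot.

```python
from collections import Counter


def count_missing(genes, fields=("gene_symbol", "description")):
    """Count, per field and biotype, the genes where that field is missing or None."""
    missing = {field: Counter() for field in fields}
    for gene_data in genes.values():
        # Fall back to a placeholder label when a gene has no biotype list
        gene_biotypes = gene_data.get("biotype") or ["(no biotype)"]
        for field in fields:
            if gene_data.get(field) is None:
                missing[field].update(gene_biotypes)
    return missing


# Hypothetical records, shaped like the "genes" mapping assumed above
sample_genes = {
    "GENE_A": {"gene_symbol": "ABC1", "description": "example", "biotype": ["mRNA"]},
    "GENE_B": {"gene_symbol": None, "description": None, "biotype": ["lincRNA"]},
    "GENE_C": {"gene_symbol": None, "description": "example", "biotype": ["miRNA", "ncRNA"]},
}

result = count_missing(sample_genes)
for field, counts in result.items():
    print(field, dict(counts))
```

On the real file this would be called as `count_missing(data["genes"])`. Empty counters for both fields would mean nothing is left as NULL.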
I think hardcoding the accepted names is wrong, so I will just try pulling out the attributes, i.e. simplify to:
gene_name = feature.attr.get("gene_name") or feature.attr.get("Name")
description = feature.attr.get("description")
Will run it and check results tomorrow
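To show how that fallback behaves, here is a minimal sketch using plain dicts in place of `feature.attr`. `extract_gene_fields` and the attribute values are hypothetical and only mirror the two lines above; the real parser objects may differ.

```python
def extract_gene_fields(attr):
    """Return (gene_name, description) from a dict of feature attributes."""
    # Prefer "gene_name"; fall back to "Name" when it is missing or empty
    gene_name = attr.get("gene_name") or attr.get("Name")
    description = attr.get("description")
    return gene_name, description


# Hypothetical attribute dicts, one per naming style
with_gene_name = {"gene_name": "ABC1", "description": "example gene"}
with_name_only = {"Name": "XYZ2", "biotype": "lincRNA"}
with_neither = {"biotype": "misc_RNA"}

print(extract_gene_fields(with_gene_name))
print(extract_gene_fields(with_name_only))
print(extract_gene_fields(with_neither))
```

Because `or` is used, an empty-string `gene_name` also falls through to `Name`. A feature carrying neither attribute still ends up with a gene name of `None`, so a check for NULL symbols is still worth running afterwards.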
generated new release: https://github.com/SACGF/cdot/releases/tag/data_v0.2.25
@davmlaw thanks, you're the best
In the v0.2.24 release file cdot-0.2.24.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz, there are still genes with a missing HGNC ID.
The information is in the GFF3 file: