SACGF / cdot

Transcript versions for HGVS libraries
MIT License

HGNC ID still missing for some chrMT genes in ENSEMBL #73

Closed holtgrewe closed 6 months ago

holtgrewe commented 6 months ago

In the v0.2.24 release file cdot-0.2.24.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz, there are still genes with a missing HGNC ID; their gene symbol and description are both null:

    "ENSG00000209082": {
      "biotype": [
        "Mt_tRNA",
        "ncRNA"
      ],
      "description": null,
      "gene_symbol": null,
      "url": "ftp://ftp.ensembl.org/pub/grch37/release-87/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz"
    },
    "ENSG00000210049": {
      "biotype": [
        "Mt_tRNA",
        "ncRNA"
      ],
      "description": null,
      "gene_symbol": null,
      "url": "ftp://ftp.ensembl.org/pub/grch37/release-87/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz"
    },
    "ENSG00000210077": {
      "biotype": [
        "Mt_tRNA",
        "ncRNA"
      ],
      "description": null,
      "gene_symbol": null,
      "url": "ftp://ftp.ensembl.org/pub/grch37/release-87/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz"
    },

The information is in the GFF3 file:

# zgrep ENSG00000209082 Homo_sapiens.GRCh37.87.gff3.gz
MT      insdc   mt_gene 3230    3304    .       +       .       ID=gene:ENSG00000209082;Name=MT-TL1;biotype=Mt_tRNA;description=mitochondrially encoded tRNA leucine 1 (UUA/G) [Source:HGNC Symbol%3BAcc:7490];gene_id=ENSG00000209082;logic_name=mt_genbank_import;version=1
MT      insdc   transcript      3230    3304    .       +       .       ID=transcript:ENST00000386347;Parent=gene:ENSG00000209082;Name=MT-TL1-201;biotype=Mt_tRNA;tag=basic;transcript_id=ENST00000386347;version=1
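To illustrate that the symbol and HGNC accession can be recovered from the record above, here is a minimal sketch that parses the attributes column of the `mt_gene` line. The function name `parse_gff3_attributes`, the regular expression, and the `HGNC:<number>` output format are assumptions made for illustration; this is not cdot's actual parsing code.

```python
import re
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters, e.g. ';' becomes %3B
        attrs[key] = unquote(value)
    return attrs


# Attributes column of the mt_gene record quoted above
column9 = (
    "ID=gene:ENSG00000209082;Name=MT-TL1;biotype=Mt_tRNA;"
    "description=mitochondrially encoded tRNA leucine 1 (UUA/G) "
    "[Source:HGNC Symbol%3BAcc:7490];gene_id=ENSG00000209082;"
    "logic_name=mt_genbank_import;version=1"
)

attrs = parse_gff3_attributes(column9)
gene_symbol = attrs.get("Name")
description = attrs.get("description")

# Assumption: the HGNC accession is embedded in the description as
# "[Source:HGNC Symbol;Acc:<number>]", as in the record above
match = re.search(r"\[Source:HGNC Symbol;Acc:(\d+)\]", description or "")
hgnc_id = f"HGNC:{match.group(1)}" if match else None

print(gene_symbol, hgnc_id)
```

Decoding the percent-escapes only after splitting on `;` matters here, because the description itself contains an encoded semicolon.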

holtgrewe commented 6 months ago

Proposing #74, please have a look.

davmlaw commented 6 months ago

Merged pull request, am currently re-generating data. Will double check that there are no genes with description/gene symbol as NULL

davmlaw commented 6 months ago

import gzip
import json

# Load the regenerated cdot data file
with gzip.open("./ensembl/GRCh37/cdot-0.2.25.ensembl.grch37.json.gz") as f:
    data = json.load(f)

# Collect the biotypes of every gene that still has no gene symbol
biotypes = set()
for gene_id, gene_data in data["genes"].items():
    if gene_data["gene_symbol"] is None:
        for biotype in gene_data.get("biotype", []):
            biotypes.add(biotype)
biotypes

There are quite a few feature types (biotypes) that still have genes without a gene symbol in them...

{'C_gene_segment',
 'IG_D_gene',
 'J_gene_segment',
 'TR_C_gene',
 'TR_J_gene',
 'TR_V_gene',
 'V_gene_segment',
 'aberrant_processed_transcript',
 'lincRNA',
 'mRNA',
 'miRNA',
 'misc_RNA',
 'ncRNA',
 'processed_transcript',
 'rRNA',
 'snRNA',
 'snoRNA'}

I think hardcoding the accepted feature names is wrong, so I will just try pulling out the attributes, i.e. simplify to:

        gene_name = feature.attr.get("gene_name") or feature.attr.get("Name")
        description = feature.attr.get("description")

Will run it and check results tomorrow
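The simplified lookup can be tried in isolation with the sketch below. A plain dict stands in for `feature.attr`, and the function name `gene_symbol_and_description` and the sample records are hypothetical, so this only illustrates the fallback logic rather than the code that went into the release.

```python
from typing import Optional, Tuple


def gene_symbol_and_description(attr: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (symbol, description) without checking the feature type.

    `attr` stands in for `feature.attr`; using a plain dict is an
    assumption so the sketch runs without a GFF parser.
    """
    # Try "gene_name" first, then fall back to "Name"
    # (the GFF3 record quoted in this issue carries the symbol in "Name")
    gene_name = attr.get("gene_name") or attr.get("Name")
    description = attr.get("description")
    return gene_name, description


# Attributes as they appear on the mt_gene record quoted in this issue
mt_gene = {
    "ID": "gene:ENSG00000209082",
    "Name": "MT-TL1",
    "description": "mitochondrially encoded tRNA leucine 1 (UUA/G) "
                   "[Source:HGNC Symbol;Acc:7490]",
}
# Hypothetical record with neither attribute present
unnamed = {"ID": "gene:HYPOTHETICAL"}

symbol, description = gene_symbol_and_description(mt_gene)
missing = gene_symbol_and_description(unnamed)
print(symbol, missing)
```

Because the lookup no longer depends on a list of accepted feature names, a record such as `mt_gene` is handled the same way as any other gene feature, and a record with no name attributes simply yields `None` values.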

davmlaw commented 6 months ago

Generated new release: https://github.com/SACGF/cdot/releases/tag/data_v0.2.25

holtgrewe commented 6 months ago

@davmlaw thanks, you're the best