SACGF / cdot

Transcript versions for HGVS libraries
MIT License

Implement HGVS data provider method get_gene_info #20

Closed: davmlaw closed 2 years ago

davmlaw commented 2 years ago
        """
        returns basic information about the gene.
        :param gene: HGNC gene name
        :type gene: str
        # database results
        hgnc    | ATM
        maploc  | 11q22-q23
        descr   | ataxia telangiectasia mutated
        summary | The protein encoded by this gene belongs to the PI3/PI4-kinase family. This...
        aliases | AT1,ATA,ATC,ATD,ATE,ATDC,TEL1,TELO1
        added   | 2014-02-04 21:39:32.57125
        """

This data isn't in the GTFs, so we'll probably have to download it from HGNC.

davmlaw commented 2 years ago

HGNC REST API:

{'hgnc_id': 'HGNC:795',
    'symbol': 'ATM',
    'name': 'ATM serine/threonine kinase',
    'status': 'Approved',
    'locus_type': 'gene with protein product',
    'prev_symbol': ['ATA', 'ATDC', 'ATC', 'ATD'],
    'prev_name': ['ataxia telangiectasia mutated (includes complementation groups A, C and D)',
     'ataxia telangiectasia mutated'],
    'alias_symbol': ['TEL1', 'TELO1'],
    'alias_name': ['TEL1, telomere maintenance 1, homolog (S. cerevisiae)'],
    'location': '11q22.3',
    'date_approved_reserved': '1995-07-07T00:00:00Z',
    'date_modified': '2021-04-13T00:00:00Z',
    'date_name_changed': '2014-06-17T00:00:00Z',
    'ena': ['AB209133'],
    'entrez_id': '472',
    'mgd_id': ['MGI:107202'],
    'iuphar': 'objectId:1934',
    'cosmic': 'ATM',
    'orphanet': 121474,
    'refseq_accession': ['NM_000051'],
    'gene_group': ['Armadillo like helical domain containing'],
    'vega_id': 'OTTHUMG00000166480',
    'lsdb': ['Ataxia Telangiectasia Mutated (ATM)|http://www.LOVD.nl/ATM',
     'Global Variome shared LOVD|https://databases.lovd.nl/shared/genes/ATM',
     'LRG_135|http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_135.xml'],
    'ensembl_gene_id': 'ENSG00000149311',
    'ccds_id': ['CCDS31669', 'CCDS86245'],
    'locus_group': 'protein-coding gene',
    'omim_id': ['607585'],
    'uniprot_ids': ['Q13315'],
    'ucsc_id': 'uc001pkb.1',
    'rgd_id': ['RGD:1593265'],
    'gene_group_id': [1492],
    'location_sortable': '11q22.3',
    'agr': 'HGNC:795',
    'mane_select': ['ENST00000675843.1', 'NM_000051.4'],
    'gencc': 'HGNC:795',
    'uuid': '78f0dba4-ee0b-41ef-83be-471847c1b3ad',
    '_version_': 1743737007974121473}

NCBI gene_info (https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/Homo_sapiens.gene_info.gz):

{'#tax_id': {393: 9606},
 'GeneID': {393: 472},
 'Symbol': {393: 'ATM'},
 'LocusTag': {393: '-'},
 'Synonyms': {393: 'AT1|ATA|ATC|ATD|ATDC|ATE|TEL1|TELO1'},
 'dbXrefs': {393: 'MIM:607585|HGNC:HGNC:795|Ensembl:ENSG00000149311|AllianceGenome:HGNC:795'},
 'chromosome': {393: '11'},
 'map_location': {393: '11q22.3'},
 'description': {393: 'ATM serine/threonine kinase'},
 'type_of_gene': {393: 'protein-coding'},
 'Symbol_from_nomenclature_authority': {393: 'ATM'},
 'Full_name_from_nomenclature_authority': {393: 'ATM serine/threonine kinase'},
 'Nomenclature_status': {393: 'O'},
 'Other_designations': {393: 'serine-protein kinase ATM|A-T mutated|AT mutated|TEL1, telomere maintenance 1, homolog|ataxia telangiectasia mutated|serine/threonine kinase ATM'},
 'Modification_date': {393: 20220906},
 'Feature_type': {393: '-'}}
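
The NCBI columns above line up fairly directly with the fields in the docstring. A minimal sketch of that mapping, assuming the file is read as tab-separated text with the header row shown above; `read_gene_info` is a hypothetical helper name, and `summary` is left empty because gene_info does not carry it:

```python
import csv
import io


def read_gene_info(handle):
    """Yield provider-style dicts from an open text handle of gene_info TSV."""
    # QUOTE_NONE: treat quote characters in descriptions as literal text
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        synonyms = row["Synonyms"]
        yield {
            "hgnc": row["Symbol"],
            "maploc": row["map_location"],
            "descr": row["description"],
            # NCBI uses "-" for empty and "|" as separator; the docstring uses commas
            "aliases": "" if synonyms == "-" else synonyms.replace("|", ","),
            "summary": "",  # not in gene_info; would come from Entrez esummary
        }


# Tiny in-memory stand-in for the real file (only the columns used here)
sample = (
    "#tax_id\tGeneID\tSymbol\tSynonyms\tmap_location\tdescription\n"
    "9606\t472\tATM\tAT1|ATA|ATC|ATD|ATDC|ATE|TEL1|TELO1\t11q22.3\t"
    "ATM serine/threonine kinase\n"
)
genes = {g["hgnc"]: g for g in read_gene_info(io.StringIO(sample))}
print(genes["ATM"]["aliases"])  # AT1,ATA,ATC,ATD,ATDC,ATE,TEL1,TELO1
```

For the real download, the same function should work on `gzip.open("Homo_sapiens.gene_info.gz", "rt")`.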

davmlaw commented 2 years ago

I originally started trying to make separate API calls to HGNC/NCBI, but that requires E-utilities etc. It's probably best to move it into cdot, then implement a gene info API on the cdot REST service.

I think the easiest way is to download Homo_sapiens.gene_info.gz, then make batch calls to retrieve the Entrez data and produce the summaries.
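
A rough sketch of the batching part, kept separate from the network call so it can be tried offline. `batched`, `fetch_summaries` and the `fetch_batch` callable are hypothetical names, and the batch size of 200 is an arbitrary assumption rather than a documented limit; `fetch_batch` is assumed to return a dict of id to summary:

```python
from itertools import islice


def batched(ids, batch_size):
    """Yield successive lists of at most batch_size ids."""
    it = iter(ids)
    while batch := list(islice(it, batch_size)):
        yield batch


def fetch_summaries(entrez_ids, fetch_batch, batch_size=200):
    """Collect summaries for all ids, calling fetch_batch once per batch."""
    summaries = {}
    for batch in batched(entrez_ids, batch_size):
        # fetch_batch is any callable: list of ids -> {id: summary}
        summaries.update(fetch_batch(batch))
    return summaries


# Offline stand-in for the real Entrez call
def fake_fetch(batch):
    return {gene_id: f"summary for {gene_id}" for gene_id in batch}


result = fetch_summaries(["472", "672", "675"], fake_fetch, batch_size=2)
print(len(result))  # 3
```

In practice `fetch_batch` would wrap the Entrez epost/esummary pair and key the results by gene ID.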

Copy/paste of the code:

    @staticmethod
    def _get_entrez_gene_summary(id_list):
        from Bio import Entrez

        # NCBI asks callers to set Entrez.email before making requests.
        # Post the IDs to the history server, then fetch all summaries in one call.
        request = Entrez.epost("gene", id=",".join(id_list))
        result = Entrez.read(request)
        web_env = result["WebEnv"]
        query_key = result["QueryKey"]
        data = Entrez.esummary(db="gene", webenv=web_env, query_key=query_key)
        document = Entrez.read(data)
        return document["DocumentSummarySet"]["DocumentSummary"]

    def get_gene_info(self, gene):
        import requests

        url = f"http://rest.genenames.org/fetch/symbol/{gene}"
        r = requests.get(url, headers={'Accept': "application/json"})
        r.raise_for_status()
        json_data = r.json()
        docs = json_data["response"]["docs"]
        if not docs:
            # Unknown symbol: HGNC returns an empty docs list
            raise ValueError(f"No HGNC record found for symbol '{gene}'")
        record = docs[0]
        gene_info = {
            "hgnc": record["symbol"],
            "maploc": record.get("location"),
            "descr": record["name"],
            # Placeholder: this is the date the name changed, not the date added
            "added": record.get("date_name_changed"),
        }

        if entrez_id := record.get("entrez_id"):
            entrez_gene_summary = self._get_entrez_gene_summary([entrez_id])
            d = entrez_gene_summary[0]
            gene_info.update({
                "aliases": d["OtherAliases"],
                "summary": d["Summary"],
            })

        else:
            # Just use what we can from HGNC
            gene_info.update({
                "summary": "",
                "aliases": ",".join(record.get("prev_symbol", []) + record.get("alias_symbol", [])),
            })
        return gene_info

Still need to deal with "added"; perhaps get it from Homo_sapiens.gene_info.gz.
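
One option, sketched here as an assumption rather than a decision: reuse the `Modification_date` column from gene_info (an integer like 20220906 in the dump above). Note that this is a last-modified date, not a true "added" date, and `modification_date_to_added` is a hypothetical helper name:

```python
from datetime import datetime


def modification_date_to_added(value):
    """Convert a gene_info Modification_date (YYYYMMDD) to an ISO date string."""
    return datetime.strptime(str(value), "%Y%m%d").date().isoformat()


print(modification_date_to_added(20220906))  # 2022-09-06
```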

davmlaw commented 2 years ago

Only 2.4 MB for all the gene info, so it seems OK to just put it in every cdot JSON.
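
A sketch of what embedding could look like, mainly to check the size cost on real data. The `gene_info` key, the `embed_gene_info` helper and the stand-in `cdot_data` dict are all assumptions for illustration, not the actual cdot JSON layout:

```python
import json


def embed_gene_info(cdot_data, gene_info_by_symbol, key="gene_info"):
    """Return a copy of the cdot dict with gene info stored under `key`."""
    merged = dict(cdot_data)  # shallow copy, leaves the input untouched
    merged[key] = gene_info_by_symbol
    return merged


cdot_data = {"transcripts": {}}  # stand-in, not the real cdot layout
gene_info = {"ATM": {"maploc": "11q22.3", "descr": "ATM serine/threonine kinase"}}
merged = embed_gene_info(cdot_data, gene_info)

# Extra serialized size contributed by the gene info
extra_bytes = len(json.dumps(merged)) - len(json.dumps(cdot_data))
print(extra_bytes)
```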

Once done, we should also raise an issue in variantgrid to be able to take gene summaries from cdot, which would save doing the Entrez queries ourselves.