Closed by davmlaw 2 years ago
HGNC REST API:
{'hgnc_id': 'HGNC:795',
'symbol': 'ATM',
'name': 'ATM serine/threonine kinase',
'status': 'Approved',
'locus_type': 'gene with protein product',
'prev_symbol': ['ATA', 'ATDC', 'ATC', 'ATD'],
'prev_name': ['ataxia telangiectasia mutated (includes complementation groups A, C and D)',
'ataxia telangiectasia mutated'],
'alias_symbol': ['TEL1', 'TELO1'],
'alias_name': ['TEL1, telomere maintenance 1, homolog (S. cerevisiae)'],
'location': '11q22.3',
'date_approved_reserved': '1995-07-07T00:00:00Z',
'date_modified': '2021-04-13T00:00:00Z',
'date_name_changed': '2014-06-17T00:00:00Z',
'ena': ['AB209133'],
'entrez_id': '472',
'mgd_id': ['MGI:107202'],
'iuphar': 'objectId:1934',
'cosmic': 'ATM',
'orphanet': 121474,
'refseq_accession': ['NM_000051'],
'gene_group': ['Armadillo like helical domain containing'],
'vega_id': 'OTTHUMG00000166480',
'lsdb': ['Ataxia Telangiectasia Mutated (ATM)|http://www.LOVD.nl/ATM',
'Global Variome shared LOVD|https://databases.lovd.nl/shared/genes/ATM',
'LRG_135|http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_135.xml'],
'ensembl_gene_id': 'ENSG00000149311',
'ccds_id': ['CCDS31669', 'CCDS86245'],
'locus_group': 'protein-coding gene',
'omim_id': ['607585'],
'uniprot_ids': ['Q13315'],
'ucsc_id': 'uc001pkb.1',
'rgd_id': ['RGD:1593265'],
'gene_group_id': [1492],
'location_sortable': '11q22.3',
'agr': 'HGNC:795',
'mane_select': ['ENST00000675843.1', 'NM_000051.4'],
'gencc': 'HGNC:795',
'uuid': '78f0dba4-ee0b-41ef-83be-471847c1b3ad',
'_version_': 1743737007974121473}
NCBI gene_info (https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/Homo_sapiens.gene_info.gz):
{'#tax_id': {393: 9606},
'GeneID': {393: 472},
'Symbol': {393: 'ATM'},
'LocusTag': {393: '-'},
'Synonyms': {393: 'AT1|ATA|ATC|ATD|ATDC|ATE|TEL1|TELO1'},
'dbXrefs': {393: 'MIM:607585|HGNC:HGNC:795|Ensembl:ENSG00000149311|AllianceGenome:HGNC:795'},
'chromosome': {393: '11'},
'map_location': {393: '11q22.3'},
'description': {393: 'ATM serine/threonine kinase'},
'type_of_gene': {393: 'protein-coding'},
'Symbol_from_nomenclature_authority': {393: 'ATM'},
'Full_name_from_nomenclature_authority': {393: 'ATM serine/threonine kinase'},
'Nomenclature_status': {393: 'O'},
'Other_designations': {393: 'serine-protein kinase ATM|A-T mutated|AT mutated|TEL1, telomere maintenance 1, homolog|ataxia telangiectasia mutated|serine/threonine kinase ATM'},
'Modification_date': {393: 20220906},
'Feature_type': {393: '-'}}
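For reference, a minimal sketch of reading that file without pandas, assuming the tab-separated layout and column names shown in the row above. The function names and the output field names are made up for illustration and are not part of cdot.

```python
import csv
import gzip


def read_gene_info(handle):
    """Yield one dict per gene from an open text handle over a gene_info TSV."""
    # QUOTE_NONE: treat quote characters in free-text columns as plain text
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        synonyms = row["Synonyms"]
        yield {
            "gene_id": row["GeneID"],
            "symbol": row["Symbol"],
            # "-" marks an empty field and "|" separates list items
            "aliases": [] if synonyms == "-" else synonyms.split("|"),
            "map_location": row["map_location"],
            "description": row["description"],
        }


def load_gene_info(path):
    """Read a gzipped gene_info file into a dict keyed by GeneID (as a string)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return {gene["gene_id"]: gene for gene in read_gene_info(handle)}
```

`load_gene_info("Homo_sapiens.gene_info.gz")` would then give a lookup by Entrez gene ID, e.g. `genes["472"]["symbol"]` for the ATM row above.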
I originally started making separate API calls to HGNC/NCBI, but the NCBI side requires E-utilities etc. It's probably best to move this into cdot, then implement a gene info API on the cdot REST service.
I think the easiest way is to download Homo_sapiens.gene_info.gz, then retrieve the Entrez data in batch calls and produce the summaries from that.
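A rough sketch of the batching part, assuming the gene IDs have already been pulled out of gene_info. `fetch_batch` is a hypothetical callable supplied by the caller that takes a list of IDs and returns a dict of gene ID to summary record. The batch size of 500 is an arbitrary placeholder, not a documented limit.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_summaries_in_batches(gene_ids, fetch_batch, batch_size=500):
    """Collect {gene_id: summary text} by calling fetch_batch once per chunk.

    fetch_batch(ids) is assumed to return {gene_id: record}, where each record
    is a dict that may contain a "Summary" key.
    """
    summaries = {}
    for batch in chunked(list(gene_ids), batch_size):
        records = fetch_batch(batch)
        for gene_id in batch:
            # IDs missing from the response get an empty summary
            summaries[gene_id] = records.get(gene_id, {}).get("Summary", "")
    return summaries
```

In practice `fetch_batch` could wrap the `_get_entrez_gene_summary` code pasted in this issue, keyed by whichever ID field the parsed summaries expose. Keeping the network call injectable also means the batching can be tested offline.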
Copy/paste of the code (both functions are methods lifted out of the same class):

import requests

# The first method is called via self below, so it has to be a staticmethod.
# NCBI asks E-utilities callers to set Entrez.email before making requests.
@staticmethod
def _get_entrez_gene_summary(id_list):
    from Bio import Entrez

    # Post the IDs to the Entrez history server, then fetch all summaries in one call
    request = Entrez.epost("gene", id=",".join(id_list))
    result = Entrez.read(request)
    web_env = result["WebEnv"]
    query_key = result["QueryKey"]
    data = Entrez.esummary(db="gene", webenv=web_env, query_key=query_key)
    document = Entrez.read(data)
    return document["DocumentSummarySet"]["DocumentSummary"]

def get_gene_info(self, gene):
    url = f"http://rest.genenames.org/fetch/symbol/{gene}"
    r = requests.get(url, headers={'Accept': "application/json"})
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    json_data = r.json()
    # Raises IndexError if HGNC has no record for this symbol
    record = json_data["response"]["docs"][0]
    gene_info = {
        "hgnc": record["symbol"],
        "maploc": record["location"],
        "descr": record["name"],
        "added": record["date_name_changed"],
    }
    if entrez_id := record.get("entrez_id"):
        entrez_gene_summary = self._get_entrez_gene_summary([entrez_id])
        d = entrez_gene_summary[0]
        gene_info.update({
            "aliases": d["OtherAliases"],
            "summary": d["Summary"],
        })
    else:
        # Just use what we can from HGNC
        gene_info.update({
            "summary": "",
            "aliases": ",".join(record.get("prev_symbol", []) + record.get("alias_symbol", [])),
        })
    return gene_info
Still need to deal with "added" - perhaps get it from Homo_sapiens.gene_info.gz
Only 2.4 MB for all the gene info, so it seems OK to just put it in every cdot JSON
Once done, we should also raise an issue in variantgrid so it can take gene summaries from cdot, which will save doing the Entrez queries ourselves
This data isn't in the GTFs, so we'll probably have to download it from HGNC
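A small sketch of what that could look like, assuming the gene info is a plain dict keyed by gene ID. The `"gene_info"` key name is hypothetical and not an existing cdot field. The helper for checking the compressed size is only there to sanity-check the overhead.

```python
import gzip
import json


def attach_gene_info(cdot_data, gene_info, key="gene_info"):
    """Return a shallow copy of cdot_data with gene_info stored under `key`.

    `key` is a placeholder name for illustration, not a real cdot field.
    """
    merged = dict(cdot_data)
    merged[key] = gene_info
    return merged


def gzipped_json_size(data):
    """Size in bytes of `data` once serialised to JSON and gzip-compressed."""
    raw = json.dumps(data, sort_keys=True).encode("utf-8")
    return len(gzip.compress(raw))
```

Comparing `gzipped_json_size(merged)` with `gzipped_json_size(cdot_data)` for a real file would show how much the extra section adds per JSON.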