Closed by jvwong 6 years ago
- We could grab some basic things straight from PC like its name and potentially some synonyms, unless there is a better place to grab some information from?
How about using esummary from NCBI Gene E-Utils?
The following will fetch a summary using the NCBI Gene IDs for TP53 and MDM2: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=7157,4193
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary gene 20150202//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20150202/esummary_gene.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build180328-2250m.1</DbBuild>
<DocumentSummary uid="7157">
<Name>TP53</Name>
<Description>tumor protein p53</Description>
<Status>0</Status>
<CurrentID>0</CurrentID>
<Chromosome>17</Chromosome>
<GeneticSource>genomic</GeneticSource>
<MapLocation>17p13.1</MapLocation>
<OtherAliases>BCC7, LFS1, P53, TRP53</OtherAliases>
<OtherDesignations>cellular tumor antigen p53|antigen NY-CO-13|mutant tumor protein 53|p53 tumor suppressor|phosphoprotein p53|transformation-related protein 53|tumor protein 53|tumor supressor p53</OtherDesignations>
<NomenclatureSymbol>TP53</NomenclatureSymbol>
<NomenclatureName>tumor protein p53</NomenclatureName>
<NomenclatureStatus>Official</NomenclatureStatus>
<Mim>
<int>191170</int>
</Mim>
<GenomicInfo>
<GenomicInfoType>
<ChrLoc>17</ChrLoc>
<ChrAccVer>NC_000017.11</ChrAccVer>
<ChrStart>7687549</ChrStart>
<ChrStop>7668401</ChrStop>
<ExonCount>12</ExonCount>
</GenomicInfoType>
</GenomicInfo>
<GeneWeight>1000000</GeneWeight>
<Summary>This gene encodes a tumor suppressor protein containing transcriptional activation, DNA binding, and oligomerization domains. The encoded protein responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. Mutations in this gene are associated with a variety of human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing of this gene and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants (PMIDs: 12032546, 20937277). [provided by RefSeq, Dec 2016]</Summary>
<ChrSort>17</ChrSort>
<ChrStart>7668401</ChrStart>
<Organism>
<ScientificName>Homo sapiens</ScientificName>
<CommonName>human</CommonName>
<TaxID>9606</TaxID>
</Organism>
<LocationHist>
...
</LocationHist>
</DocumentSummary>
<DocumentSummary uid="4193">
<Name>MDM2</Name>
...
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>
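To make the suggestion concrete, here is a minimal Python sketch of pulling the name, description, and synonyms out of an esummary response like the one above. The element names come from the response itself; the function name `parse_gene_summaries`, the shape of the returned dictionaries, and the trimmed `SAMPLE` string are assumptions for illustration, not part of the project.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the esummary response shown above (illustration only).
SAMPLE = """<eSummaryResult>
<DocumentSummarySet status="OK">
<DocumentSummary uid="7157">
<Name>TP53</Name>
<Description>tumor protein p53</Description>
<OtherAliases>BCC7, LFS1, P53, TRP53</OtherAliases>
<Organism>
<ScientificName>Homo sapiens</ScientificName>
<TaxID>9606</TaxID>
</Organism>
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>"""


def parse_gene_summaries(xml_text):
    """Return one dict per <DocumentSummary> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    genes = []
    for doc in root.iter("DocumentSummary"):
        # OtherAliases is a single comma-separated string in the response.
        aliases = doc.findtext("OtherAliases", default="")
        genes.append({
            "id": doc.get("uid"),
            "name": doc.findtext("Name", default=""),
            "description": doc.findtext("Description", default=""),
            "aliases": [a.strip() for a in aliases.split(",") if a.strip()],
            "organism": doc.findtext("Organism/ScientificName", default=""),
        })
    return genes


genes = parse_gene_summaries(SAMPLE)
```

Missing elements fall back to empty strings here, so a record without aliases yields an empty list rather than an error.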
- We can use a recognized identifier. I will need to change how an ID is recognized: right now an ID counts as recognized if it has a UniProt ID. PR #535 works the same way but uses the gene validator, which would still return unrecognized. Correct me if I am wrong, but the geneValidator service recognizes only one database at a time, so is there a database that would cover what we need? Another option may be to return the recognized IDs from the search along with the search results.
This is where we should talk strategy:
a. Map a token to an NCBI Gene ID
b. If valid result, search using token
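The two steps above could be sketched roughly as follows. This is only an illustration of the control flow: `lookup_gene_id` and `search` are hypothetical callables standing in for whatever the validator and search services end up being, and the small lookup table exists only for the demo.

```python
def search_by_token(token, lookup_gene_id, search):
    """Step (a): map the token to an NCBI Gene ID; step (b): search if valid.

    lookup_gene_id and search are injected stand-ins (assumptions),
    not real project functions.
    """
    token = token.strip()
    gene_id = lookup_gene_id(token)
    if gene_id is None:
        # Unrecognized token: skip the search entirely.
        return {"recognized": False, "gene_id": None, "results": []}
    return {"recognized": True, "gene_id": gene_id, "results": search(token)}


# Demo with stand-in services (hypothetical data).
_TABLE = {"TP53": "7157", "MDM2": "4193"}


def demo_lookup(token):
    return _TABLE.get(token.upper())


def demo_search(query):
    return ["result for " + query]


hit = search_by_token(" tp53 ", demo_lookup, demo_search)
miss = search_by_token("notagene", demo_lookup, demo_search)
```

Returning the recognized ID alongside the results also matches the option raised earlier of sending recognized IDs back with the search results.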
The only information it looks like I can't get directly from this source is the other links. Other than that, I have everything else on the search implemented. The other thing I will need to change is how I am currently trimming the network: if we pass only the query, we will likely need to call the gene validator to figure out which of the nodes in the network represents the queried node.
The only information it looks like I can't get directly from this source is the other links. Other than that, I have everything else on the search implemented.
This is a little tricky: some genes may have no protein product (so a UniProt record won't exist); the others can be fetched from the validator, I guess. Maybe look through the other EUtils (ELink).
The other thing I will need to change is how I am currently trimming the network: if we pass only the query, we will likely need to call the gene validator to figure out which of the nodes in the network represents the queried node.
Yes. Let's think on this more.
On May 1, 2018, NCBI will begin enforcing the use of API keys that will offer enhanced levels of supported access to the E-utilities. After that date, any site (IP address) posting more than 3 requests per second to the E-utilities without an API key will receive an error message. By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request (eutilities@ncbi.nlm.nih.gov). Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter.
Example request including an API key: esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345
Example error message if rates are exceeded: {"error":"API rate limit exceeded","count":"11"}
Only one API key is allowed per NCBI account; however, a user may request a new key at any time. Such a request will invalidate any existing API key associated with that NCBI account.
We encourage regular E-utility users to obtain an API key as soon as possible and begin the process of incorporating it into code. We also encourage users to monitor their request rates to determine if they will require rates higher than 10 per second. As stated above, we can potentially have higher rates negotiated prior to the beginning of enforcement on May 1, 2018.
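For reference, a rough Python sketch of what complying with the announcement might look like on our side: append `api_key` to each request and space requests out so we stay under the stated limits (3 per second without a key, 10 with one). The `api_key` parameter and the limits come from the announcement; `build_esummary_url`, `Throttle`, and the even-spacing strategy are assumptions for illustration.

```python
import time
from urllib.parse import urlencode

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(db, ids, api_key=None):
    """Build an esummary URL, adding api_key only when one is supplied."""
    params = {"db": db, "id": ",".join(str(i) for i in ids)}
    if api_key:
        params["api_key"] = api_key
    # safe="," keeps the comma-separated ID list readable in the URL.
    return ESUMMARY_URL + "?" + urlencode(params, safe=",")


class Throttle:
    """Space calls at least 1/max_per_second apart (hypothetical helper).

    Even spacing is stricter than a per-second quota, which errs on the
    safe side. clock and sleep are injectable so the class can be tested.
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


url = build_esummary_url("gene", [7157, 4193], api_key="ABCDE12345")
```

Calling `Throttle(10).wait()` before each request would keep a keyed client at or below 10 requests per second; `Throttle(3)` would do the same for an unkeyed one.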
Should we not use NCBI then?
Motivation for new feature
The goal of our landing page is to provide context for users entering via another biological database. Up to this point, much of our focus has been on proteins, in particular, handling the specific use case in which a user links from UniProt to Search.
However, the existing approach is not generic enough to handle the case where the referring site describes a gene, in particular, NCBI Gene and GeneCards.
To illustrate the problem, consider a user who enters "hsa-let-7a-1", a microRNA encoded by the MIRLET7A1 gene, into NCBI Gene. A linkout from NCBI would contain the Gene ID '406881'.