Closed Genomewide closed 2 weeks ago
This looks to me like a normalization issue: the CHEMBL identifier doesn't merge with the UniProt IDs. The CHEMBL one is considered a chemical entity while the UniProt one is a Protein. I guess it makes sense to merge these? Do we have some other examples of CHEMBL identifiers for proteins? @gaurav what do you think?
from TAQA: we'll push to the NodeNorm folks for more investigation, probably not a data/UI question - SPOKE has plenty of interactions :)
Repeated process. Looks like BDNF gets returned as a small molecule called "ABRINEURIN" when using the node selection tool in the ARAX UI. Manually changing the query to the gene CURIE HGNC:1033 gave results.
https://arax.ncats.io/?r=351c7c8b-9bbf-4818-bff5-22e6e2b09529
{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL2108230"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}
Changing to the HGNC CURIE and removing the small molecule category results in:
But even trying to get the synonyms of the HGNC CURIE returns the small molecule.
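The manual change described here (swapping the ChEMBL compound for the gene CURIE and dropping the SmallMolecule category) can be sketched roughly as follows. The query graph is the one posted in this thread; `pin_node_to_curie` is a hypothetical helper written for illustration, not part of ARAX or any Translator library.

```python
import copy
import json

# Query graph as posted in this thread (n0 pinned to the ChEMBL compound).
original_query_graph = {
    "edges": {"e0": {"subject": "n0", "object": "n1"}},
    "nodes": {
        "n0": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL2108230"],
            "categories": ["biolink:SmallMolecule"],
            "is_set": False,
            "name": "ABRINEURIN",
        },
        "n1": {
            "categories": ["biolink:Gene", "biolink:Protein"],
            "is_set": False,
        },
    },
}


def pin_node_to_curie(query_graph, node_key, curie):
    """Return a copy of query_graph with node_key pinned to a single CURIE.

    Hypothetical helper: it drops "categories" and "name" so the stale
    SmallMolecule category and label no longer constrain the node.
    """
    patched = copy.deepcopy(query_graph)
    node = patched["nodes"][node_key]
    node["ids"] = [curie]
    node.pop("categories", None)
    node.pop("name", None)
    return patched


patched_query_graph = pin_node_to_curie(original_query_graph, "n0", "HGNC:1033")
print(json.dumps(patched_query_graph, indent=2))
```

The original graph is left untouched, so both versions can be submitted side by side to compare what each one returns.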
How would this impact queries?
from TAQA: ABRINEURIN and BDNF are two names for the same thing and give different answers depending on which you select. Lots of results are tied to BDNF as a gene/protein with an HGNC id (NCBI, OMIM, UniProtKB, etc., but no ChEMBL identifier). Any non-ChEMBL id will get back all these good results.
ABRINEURIN is not a SmallMolecule? That isn't a SmallMolecule.
Is there a modeling issue here? maybe ARAX normalizer?
Maybe CHEMBL brings in gene identifiers? -- should CHEMBL be in the id_prefixes section of 'biolink:Gene'
@Genomewide in the last screenshot in your most recent post, you showed the ARAX UI mapping HGNC:1033 to CHEMBL.COMPOUND:CHEMBL2108230. Can you tell me how you got that? When I plug HGNC:1033 into the ARAXI synonym mapper, I get this:
I ask because I can't find a good database that maps between gene/protein IDs and chemical/drug IDs in cases where the therapeutic is a protein. If ARAX has found a source for that, I'm guessing the NodeNorm folks would be very interested in that mapping, and it could help solve this issue...
Note that our production maturity is still using an older database and old Node Synonymizer system, while everything else (testing, staging, dev) is using a newer database and newer Node Synonymizer. So you will get different answers at these two sites:
Old system (production): https://arax.transltr.io/?term=HGNC:1033
New system (staging): https://arax.ci.transltr.io/?term=HGNC:1033
@andrewsu, @Genomewide - What can we do here for BDNF having no gene or protein interactions? @gaurav - can node normalizer help in any way here?
Two thoughts from me:
HGNC:1033 with CHEMBL.COMPOUND:CHEMBL2108230? If yes, I think NN could consider importing that (but it would need to be cautious of over-merging, which presumably is the reason why CI ARAX does not merge those two IDs)
I agree with Andrew about the likely group that is affected by this. If there was a way to identify some group of genes that needed this by looking at the list of biologics or some low-hanging fruit, I would think it would be worth it. If there was a somewhat targeted way to get 50-80% of them linked up, I would think it would be worth a little time.
I don't know how insulin does not suffer from the same problem. Is there a fix that was done before?
It's not easy anymore to be sure, but I think it probably was this record: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2108230/, which says that CHEMBL.COMPOUND:CHEMBL2108230 is called abrineurin. UniProtKB (https://www.uniprot.org/uniprotkb/P23560/entry) also gives it an alternate name of abrineurin, and that entry is equivalent to HGNC:1033.
So the link was based on ChEMBL and UniProtKB both giving it the alternate name abrineurin. The old NodeSynonymizer made lots of links that way, but it often overmerged. Name-based merging is often very good, but it generated a lot of tickets when it went bad.
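To make the name-based linking concrete, here is a toy sketch of how a shared name can pull a compound ID and gene/protein IDs into one clique. It is only an illustration under simplified assumptions (case-insensitive exact name matches, a three-entry name table built from the identifiers mentioned in this thread); it is not the actual NodeSynonymizer code, and `merge_by_name` is a hypothetical function.

```python
from collections import defaultdict

# Names taken from this thread; a real synonymizer would load far more.
names_by_curie = {
    "CHEMBL.COMPOUND:CHEMBL2108230": ["ABRINEURIN"],
    "UniProtKB:P23560": ["BDNF", "Abrineurin"],
    "HGNC:1033": ["BDNF"],
}


def merge_by_name(names_by_curie):
    """Group CURIEs that share any case-insensitive name (union-find)."""
    parent = {curie: curie for curie in names_by_curie}

    def find(curie):
        # Follow parent links to the clique root, compressing the path.
        while parent[curie] != curie:
            parent[curie] = parent[parent[curie]]
            curie = parent[curie]
        return curie

    first_seen = {}  # normalized name -> first CURIE that carried it
    for curie, names in names_by_curie.items():
        for name in names:
            key = name.casefold()
            if key in first_seen:
                # A shared name is treated as proof of equivalence.
                parent[find(curie)] = find(first_seen[key])
            else:
                first_seen[key] = curie

    cliques = defaultdict(set)
    for curie in names_by_curie:
        cliques[find(curie)].add(curie)
    return sorted(sorted(members) for members in cliques.values())


cliques = merge_by_name(names_by_curie)
print(cliques)
```

With these three entries everything collapses into a single clique: "abrineurin" links the ChEMBL compound to the UniProtKB protein, and "BDNF" links the protein to the HGNC gene. The same transitivity is what causes overmerging when two unrelated entities happen to share a name.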
If name merging is the only way to bridge between the compound/drug identifiers and the gene/protein identifiers, and we've already established that name merging comes with some undesirable properties, then I suspect that there is not "low lying fruit" to be harvested here. And given that this type of drug-protein edge is not directly the subject of one of our MVP queries, my vote is to table this issue until later (post-fall).
I don't know how insulin does not suffer from the same problem.
I think insulin will have exactly the same problem. The ARAX synonymizer merges the compound/drug IDs and the gene/protein IDs on prod (https://arax.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c), but not on CI (https://arax.ci.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c)
I agree with @andrewsu that this is a tough problem that we will probably not solve by Sept. I think (?) that the correct response might be to allow a conflation between chemical/drug and protein but it's going to take some work to implement that and test it out. FWIW, I am not a big fan of name merging for the reason that Eric mentions - I think that there is plenty of structured equivalence or other relationship information that we should try to take advantage of first.
since there are two votes in favor of tabling this until the fall (and no opposed), going to adjust the milestone now...
NodeNorm seems to have split abrineurin into multiple cliques: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=HGNC%3A1033&curie=UniProtKB%3AP23560&curie=CHEMBL.COMPOUND%3ACHEMBL2108230&curie=MESH:C415772&curie=PR:000004716&curie=UNII:A1ED6W905I&curie=UNII:86ZE5V51WT&conflate=false&drug_chemical_conflate=false&description=false
Some of these might be genuine splits, but I think something is going wrong in protein conflation here. I'm tracking this at https://github.com/TranslatorSRI/NodeNormalization/issues/224. Is it correct to assume that this is at the same level of priority as other cliquing issues, or is there something particularly bad about this issue?
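For anyone who wants to repeat this check, the lookup can be scripted along these lines. The endpoint and query parameters are copied from the URL above; the response shape assumed in `group_by_clique` (each queried CURIE mapping to an object with an `id.identifier` field, or to null when unknown) is my understanding of the NodeNorm output and should be verified against a live response. Both function names are hypothetical.

```python
from urllib.parse import urlencode

NODENORM_URL = "https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes"

# CURIEs from the URL posted in this thread.
CURIES = [
    "HGNC:1033",
    "UniProtKB:P23560",
    "CHEMBL.COMPOUND:CHEMBL2108230",
    "MESH:C415772",
    "PR:000004716",
    "UNII:A1ED6W905I",
    "UNII:86ZE5V51WT",
]


def build_url(curies, conflate=False, drug_chemical_conflate=False):
    """Build a get_normalized_nodes URL with one `curie` parameter per ID."""
    params = [("curie", curie) for curie in curies]
    params += [
        ("conflate", str(conflate).lower()),
        ("drug_chemical_conflate", str(drug_chemical_conflate).lower()),
        ("description", "false"),
    ]
    return NODENORM_URL + "?" + urlencode(params)


def group_by_clique(response):
    """Map each preferred identifier to the queried CURIEs normalized to it.

    Assumes `response` is the parsed JSON body: queried CURIE -> entry,
    where entry["id"]["identifier"] is the clique leader, or entry is None.
    """
    cliques = {}
    for curie, entry in response.items():
        leader = entry["id"]["identifier"] if entry else None
        cliques.setdefault(leader, []).append(curie)
    return cliques


print(build_url(CURIES))
```

Fetching the printed URL (for example with `urllib.request.urlopen`) and passing the parsed JSON to `group_by_clique` shows at a glance how many cliques the seven CURIEs fall into, with and without conflation.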
@Genomewide can you please retest? I'm not clear what we're looking for.
Synonyms from ARAX production don't work at all now.
Appears to be connected on CI though and returns results.
Searching for any edge with BDNF to another gene or protein returns zero results from the ARAX UI. This is a well-studied gene and just looking at SPOKE shows a number of connected proteins.
This is the query that failed to return results:
BTE returns results when I use HGNC:1033. It appears that BTE adjusts the categories from small molecule to small molecule and gene.
ARAX returned results without modifying it. HGNC:1033 still worked to return results. However, it will not work when using the ARAX query builder, which returns CHEMBL.COMPOUND:CHEMBL2108230 for BDNF.