Open jh111 opened 1 year ago
@gaurav - is this something for Name Resolver?
Tagging the ace team David and Gaurav.
It isn't clear what the UI team can do about this issue. Is the idea of a "canonical name" available in the attribute server? @newgene
I think Jenn means that the returned results were not normalized properly. I used Jenn's PK to load the results back using the ARAX CI UI (note that this is an "old" query and the ARAs are failing validation):
I found that BTE was responsible for this result:
I retested on Test today and the unusual name is still popping up:
and appears twice (one with the MeSH ID and one with the UMLS ID). Apparently both BTE and RTX-KG2 are returning that result.
I looked at the RENCI Name Resolver for monoclonal antibody AN100226 and found that the 2 identifiers get properly pooled together.
Natalizumab is among the synonyms but is not the label. I do not know what the rule is for deciding the drug label, but my guess is that the label is decided at the NodeNorm stage, so is this a NodeNorm issue?
EDIT: So there are 2 issues here I think:
@sandrine-m the UI does not do any normalization, we use the normalization the ARS provides. The ARS relies on the node normalizer so most likely it is an issue with that service @gaurav @cbizon
From a conversation on Slack:
@cbizon: The label is probably coming from nodenorm, which is where we are choosing the 'best' label. We currently have an approach that has not always been well received.
@gaurav (Gaurav Vaidya, SRI): I've added "Investigate strategies for improved preferred labels for cliques." to our priorities. I know we have some tickets with individual examples we can start working on, but if people have ideas about improving this at scale -- if a particular chemical provider has really good labels, say -- please let us know!
I think we should move away from tickets with open ended definitions of success. "Give better labels" is way too broad and basically can never be finished. It would be better to create tickets with a finite set of items that should be corrected.
@gprice1129 If I understand correctly, I think what you're pointing out is that we can't implement this until we define what output is expected, and whether it's possible to do it.
Re: deciding on the label for NodeNorm. My understanding was that sometimes the label NodeNorm chooses is not the one users prefer. Although this issue cannot be fixed right away (it is a long-term issue, perhaps needing some user surveys, as Jenn is pointing out), I started a test asset sheet for testing chemical names based on a few searches I made using the system. Please note that this sheet was done back in November 2023, I think, so the system may have changed since then. The MolePro team was particularly interested in looking at chemical labels chosen differently by MolePro and NodeNorm to see how we can improve our system.
@jh111 Having a definition for "better chemical labels" would definitely be a good idea. However, even if we had a perfect definition for "chemical label", it's still unclear when the ticket can be closed: are we talking about all the chemical labels in the system right now, or all of them for all time? In my opinion it would be better if we constrained tickets of this nature to some finite set of chemical labels so whoever is working on it can have a clear goal.
I have given this a better title, to reflect the problem/opportunity with the experience for specific users, and the fact that different users might want different names. There are several different technical options for how this could be addressed.
For the INN, I think the RxNorm ingredient would be a fine level of detail. For example, inFLIXimab, as opposed to inFLIXimab-abda. I don't think we need to use the uppercase (which is designed for prescription safety).
I think this is a node norm issue. We display whatever the canonical name is. So, @gaurav can you tell us what the rules are for this? Then maybe @jh111 can see if there are examples that are not optimized, and whether optimizing those would break other terms? So, the rubric could change. However, I don't think this is a UI issue.
Another example of suboptimal labeling is using the name "Activated Charcoal" for carbon:
The rule that's being applied is to get the name from each source and then rank them by the same source priority as used in biolink to pick which curie is the best one.
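To make that rule concrete, here is a minimal sketch of how I read it. This is not NodeNorm's actual code: the `PREFIX_PRIORITY` list, the `pick_label` helper, and the example identifiers and labels are all hypothetical placeholders.

```python
# Hypothetical prefix priority, highest first. The real order comes from Biolink.
PREFIX_PRIORITY = ["CHEMBL.COMPOUND", "CHEBI", "MESH", "UMLS"]


def pick_label(clique, priority=PREFIX_PRIORITY):
    """Return the label of the highest-priority identifier that has one.

    `clique` maps CURIEs to labels (None when the source supplies no label).
    """
    def rank(curie):
        prefix = curie.split(":", 1)[0]
        # Unknown prefixes sort after every known prefix.
        return priority.index(prefix) if prefix in priority else len(priority)

    for curie in sorted(clique, key=rank):
        if clique[curie]:
            return clique[curie]
    return None


# Placeholder clique: identifiers and labels are made up for illustration.
example = {
    "UMLS:EXAMPLE": "label from UMLS",
    "MESH:EXAMPLE": "label from MESH",
    "CHEBI:EXAMPLE": None,  # higher priority, but no label available
}
print(pick_label(example))
```

With this ordering the MESH label is chosen, because CHEBI has no label to offer and MESH outranks UMLS. The point is that the label depends only on the source ranking, not on how readable the name is.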
When you say source, do you mean the original sources or each team within Translator? Would it then be useful to collect the name that each source provides and learn a rule (= a set of weights) that best predicts what users prefer (= the desired result in the test asset sheet)? The idea being that some sources have more user-friendly naming strategies than others (= higher weights).
To deal with the simpler issue first, CHEBI:27594 "CHARCOAL, ACTIVATED" still has the wrong label (should be "carbon"). This is because we prefer CHEMBL.COMPOUND labels over others. I think I've seen other examples of CHEMBL labels being suboptimal; I wonder if we should promote CHEBI above it and see if that improves this situation (it should definitely fix this bug). I'm going to look for other reports of this before deciding whether to try this.
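A rough sketch of how a reordering like this could be checked against a finite list of expected labels before committing to it. Everything here is an assumption for illustration: the `best_label` and `count_passing` helpers are hypothetical, the two-prefix orderings are a simplification of the real priority list, and I am assuming the CHEBI label for this clique is "carbon", as implied above.

```python
def best_label(labels_by_prefix, priority):
    """Pick the label from the first prefix in `priority` that supplies one."""
    for prefix in priority:
        label = labels_by_prefix.get(prefix)
        if label:
            return label
    return None


# Labels per source prefix for the charcoal clique.
# The two label strings come from this thread; the structure is illustrative.
charcoal_clique = {
    "CHEMBL.COMPOUND": "CHARCOAL, ACTIVATED",
    "CHEBI": "carbon",
}

current_order = ["CHEMBL.COMPOUND", "CHEBI"]   # CHEMBL preferred (today)
proposed_order = ["CHEBI", "CHEMBL.COMPOUND"]  # CHEBI promoted above CHEMBL

# A finite list of (clique, expected label) pairs, e.g. rows from a test sheet.
expectations = [(charcoal_clique, "carbon")]


def count_passing(priority):
    """Count how many expected labels a given priority order reproduces."""
    return sum(best_label(c, priority) == want for c, want in expectations)


print(count_passing(current_order), count_passing(proposed_order))
```

Running the same count over a longer list of expectations would show whether promoting CHEBI fixes more labels than it breaks.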
Now for the more complex issue: UMLS:C0665297 is present twice in NodeNorm Test -- once in a UMLS-only Protein clique, and once in a UMLS+MESH ChemicalEntity clique. These should really be merged into a single clique, but proteins and chemicals are currently produced by independent modules, so there isn't any way to merge those cliques given how NodeNorm is currently architected.
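Even without merging, a cross-module check could at least list identifiers that end up in more than one clique. A minimal sketch, assuming each module's output can be read as a mapping from a clique name to its set of identifiers; the `find_shared_identifiers` helper, the clique names, and the `MESH:EXAMPLE` identifier are hypothetical.

```python
from collections import defaultdict


def find_shared_identifiers(cliques):
    """Map each identifier found in more than one clique to those clique names."""
    seen = defaultdict(list)
    for name, identifiers in cliques.items():
        for curie in identifiers:
            seen[curie].append(name)
    return {curie: names for curie, names in seen.items() if len(names) > 1}


# Two cliques produced by independent modules (names are placeholders).
cliques = {
    "protein_clique": {"UMLS:C0665297"},
    "chemical_clique": {"UMLS:C0665297", "MESH:EXAMPLE"},
}
print(find_shared_identifiers(cliques))
```

This only reports the overlap; deciding which clique should win, or how to merge them, is the harder architectural question.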
Is there a way we can gather all of the examples together to look at the flavors we are talking about? "Charcoal, activated" is wrong for different reasons than "A synthetic peptide of 20 amino acids, comprising D-Phe, Pro, Arg, Pro, Gly, Gly, Gly, Gly, Asn, Gly, Asp, Phe, Glu, Glu, Ile, Pro, Glu, Glu, Tyr, and Leu in sequence. A congener of hirudin (a naturally occurring drug found in the saliva of the medicinal leech), it a specific and reversible inhibitor of thrombin, and is used as an anticoagulant."
@gaurav Do you have a dart board or a stress ball (or some other place) where you keep all of our complaints? I would be interested in seeing how to break these down and then looking at some examples from each group.
@Genomewide I started this sheet on my side (perhaps to become a set of tests for @gaurav in future). It does not contain all examples, and surely Gaurav has a lot more.
How do I find what to put for MolePro? I added asset # 25
Thank you for adding a row to the sheet. Here is how you can query MolePro, using ["CID:75007581"] as the input (MolePro internally uses a different set of CURIEs). However, MolePro does not know about this ID (we are tracking down why at the moment) but does know about collagenase. To query by name, use the "by_name" endpoint. I do see on the PubChem page that it was modified in July (2024-07-20), so perhaps the ID changed. We are investigating. I'll keep you posted.
Thanks, @sandrine-muller-research! My list is actually much shorter :) I'll start moving your entries over in Hammerhead.
Just putting these here in case people are unaware of other convos:
Thank you Colleen! Putting this query here as it has a good number of extremely long names. I will need to see whether we have better chemical names, and update the test asset sheet. Will come back to this.
Search: "What drug may treat Multiple Sclerosis". https://ui.test.transltr.io/results?l=Multiple%20Sclerosis&i=MONDO:0005301&t=0&q=bf9d0342-0966-4cec-8122-8d87187b1ef3
One of the answers that comes up is "Monoclonal antibody an100226".
This is the early name/number for natalizumab. It would be much more helpful for users to have this normalized to the current name, natalizumab.
Options: