NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Give better chemical labels to returned responses #462

Open jh111 opened 1 year ago

jh111 commented 1 year ago

Search: What drug may treat Multiple Sclerosis? https://ui.test.transltr.io/results?l=Multiple%20Sclerosis&i=MONDO:0005301&t=0&q=bf9d0342-0966-4cec-8122-8d87187b1ef3

One of the answers that comes up is Monoclonal antibody an100226.

This is the early name/number for natalizumab. It will be much more helpful for users to have this normalized to the current name, natalizumab.

Options:

  1. A PubChem search for Monoclonal antibody an100226 brings up natalizumab and a list of depositor-supplied synonyms: https://pubchem.ncbi.nlm.nih.gov/substance/481101759
  2. The evidence for "treats" includes many PubMed papers that clearly have natalizumab in the title. Perhaps there's a way to get SemMedDB results to provide natalizumab as an answer, or to map SemMedDB answers to it.

sierra-moxon commented 1 year ago

@gaurav - is this something for Name Resolver?

sandrine-m commented 1 year ago

Tagging the ace team David and Gaurav.

gprice1129 commented 11 months ago

It isn't clear what the UI team can do about this issue. Is the idea of a "canonical name" available in the attribute server? @newgene

sandrine-m commented 11 months ago

I think Jenn means that the returned results were not normalized properly. I used Jenn's PK to load the results back in the ARAX CI UI (note that this is an "old" query and the ARAs are failing validation):

image

I found that BTE was responsible for this result:

image

I retested on Test today and the unusual name is still popping up:

image

and it appears twice (once with the MeSH ID and once with the UMLS ID). Apparently both BTE and RTX-KG2 are returning that result.

I looked up monoclonal antibody AN100226 in the RENCI Name Resolver and found that the 2 identifiers get properly pooled together.

Natalizumab is among the synonyms but is not the label. I do not know what the rule is for deciding the drug label, but my guess is that the label is decided at the NodeNorm stage, so is this a NodeNorm issue?

EDIT: So there are 2 issues here I think:

gprice1129 commented 11 months ago

@sandrine-m the UI does not do any normalization; we use the normalization the ARS provides. The ARS relies on the node normalizer, so most likely it is an issue with that service. @gaurav @cbizon

sandrine-muller-research commented 11 months ago

From a conversation on Slack:

@cbizon: The label is probably coming from nodenorm, which is where we are choosing the 'best' label. We currently have an approach that has not always been well received.

@gaurav (Gaurav Vaidya, SRI): I've added "Investigate strategies for improved preferred labels for cliques." to our priorities. I know we have some tickets with individual examples we can start working on, but if people have ideas about improving this at scale (if a particular chemical provider has really good labels, say), please let us know!

gprice1129 commented 7 months ago

I think we should move away from tickets with open-ended definitions of success. "Give better labels" is way too broad and basically can never be finished. It would be better to create tickets with a finite set of items that should be corrected.

jh111 commented 7 months ago

@gprice1129 If I understand correctly, I think what you're pointing out is that we can't implement this until we define what output is expected, and whether it's possible to do it.

sandrine-muller commented 7 months ago

Re: deciding on the label in NodeNorm. My understanding was that the label NodeNorm chooses is sometimes not the one users prefer. Although this issue cannot be fixed right away (it is a long-term issue, perhaps needing some user surveys, as Jenn is pointing out), I started a test asset sheet for testing chemical names, based on a few searches I made using the system. Please note that this sheet was done back in November 2023, I think, so the system may have changed since then. The MolePro team was particularly interested in looking at chemical labels that MolePro and NodeNorm choose differently, to see how we can improve our system.

gprice1129 commented 7 months ago

@jh111 Having a definition for "better chemical labels" would definitely be a good idea. However, even if we had a perfect definition for "chemical label", it's still unclear when the ticket can be closed: are we talking about all the chemical labels in the system right now, or all of them for all time? In my opinion it would be better if we constrained tickets of this nature to some finite set of chemical labels, so whoever is working on it can have a clear goal.

jh111 commented 7 months ago

I have put on a better title, to reflect the problem/opportunity with the experience for specific users, and the fact that different users might want different names. There are several different technical options for how this could be addressed.

For the INN, I think the RxNorm ingredient would be a fine level of detail. For example, inFLIXimab, as opposed to inFLIXimab-abda. I don't think we need to use the uppercase (which is designed for prescription safety).

Genomewide commented 7 months ago

I think this is a node norm issue. We display whatever the canonical name is. So, @gaurav can you tell us what the rules are for this? Then maybe @jh111 can see whether there are examples that are not optimized, and whether optimizing those would break other terms? So, the rubric could change. However, I don't think this is a UI issue.

cbizon commented 6 months ago

Another example of suboptimal labeling is using the name "Activated Charcoal" for carbon:

https://nodenorm.test.transltr.io/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A5462310&conflate=true&drug_chemical_conflate=false&description=false

The rule that's being applied is to get the name from each source and then rank the names by the same source priority that Biolink uses to pick which CURIE is the best one.
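
A rough sketch of that rule, for illustration only. This is not NodeNorm's actual code: the `PREFIX_PRIORITY` order, the `pick_label` helper, the placeholder CHEMBL identifier, and the labels attached to each CURIE are all assumptions made up for this example.

```python
# Hypothetical prefix order; the real order is defined by the Biolink Model.
PREFIX_PRIORITY = ["CHEMBL.COMPOUND", "CHEBI", "PUBCHEM.COMPOUND", "MESH", "UMLS"]


def pick_label(labels_by_curie, priority=PREFIX_PRIORITY):
    """Return the label supplied by the highest-priority prefix, or None."""
    def rank(curie):
        prefix = curie.split(":", 1)[0]
        # Prefixes missing from the list sort after every known prefix.
        return priority.index(prefix) if prefix in priority else len(priority)

    # Skip CURIEs that come without a label, then take the best-ranked one.
    candidates = [(rank(c), c, label) for c, label in labels_by_curie.items() if label]
    if not candidates:
        return None
    return min(candidates)[2]


# Illustrative clique; the CHEMBL identifier is a placeholder, not a real ID.
clique = {
    "CHEMBL.COMPOUND:EXAMPLE": "CHARCOAL, ACTIVATED",
    "CHEBI:27594": "carbon",
    "PUBCHEM.COMPOUND:5462310": "Carbon",
}

print(pick_label(clique))  # CHARCOAL, ACTIVATED
```

Under this assumed order the label of the top-ranked prefix always wins, even when a lower-ranked source has a friendlier name. Passing a different `priority` list changes which label is chosen.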

sandrine-muller commented 6 months ago

When you say source, do you mean the original sources or each team within Translator? Would it be useful, then, to collect the name that each source provides and learn a rule (a set of weights) that best predicts what users prefer (the desired result in the test asset sheet)? The idea is that some sources have more user-friendly naming strategies than others (and so would get higher weights).
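
One very simple way to picture this, as a sketch under stated assumptions: a per-source hit rate stands in for learned weights here, which is much cruder than fitting a real model. The row layout, the `SOURCE_A`/`SOURCE_B` names, and both helper functions are hypothetical and do not reflect the actual test asset sheet.

```python
from collections import defaultdict


def source_hit_rates(rows):
    """Fraction of rows in which each source's label matches the desired one."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for row in rows:
        desired = row["desired"].casefold()
        for source, label in row["labels"].items():
            seen[source] += 1
            # Case-insensitive comparison, so "Carbon" counts as "carbon".
            if label.casefold() == desired:
                hits[source] += 1
    return {source: hits[source] / seen[source] for source in seen}


def pick_by_weight(labels, weights):
    """Choose the label offered by the source with the highest weight."""
    best = max(labels, key=lambda source: weights.get(source, 0.0))
    return labels[best]


# Made-up rows using names mentioned in this thread; the sources are placeholders.
rows = [
    {"desired": "natalizumab",
     "labels": {"SOURCE_A": "Monoclonal antibody an100226", "SOURCE_B": "natalizumab"}},
    {"desired": "carbon",
     "labels": {"SOURCE_A": "CHARCOAL, ACTIVATED", "SOURCE_B": "Carbon"}},
]

weights = source_hit_rates(rows)
print(weights)
```

With weights like these, `pick_by_weight` would prefer whichever source agreed most often with the desired labels in the sheet.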

gaurav commented 1 month ago

To deal with the simpler issue first, CHEBI:27594 "CHARCOAL, ACTIVATED" still has the wrong label (should be "carbon"). This is because we prefer CHEMBL.COMPOUND labels over others. I think I've seen other examples of CHEMBL labels being suboptimal; I wonder if we should promote CHEBI above it and see if that improves this situation (it should definitely fix this bug). I'm going to look for other reports of this before deciding whether to try this.

Now for the more complex issue: UMLS:C0665297 is present twice in NodeNorm Test -- once in a UMLS-only Protein clique, and once in a UMLS+MESH ChemicalEntity clique. These should really be merged into a single clique, but proteins and chemicals are currently produced by independent modules, so there isn't any way to merge those cliques given how NodeNorm is currently architected.
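
A minimal sketch of how such duplicates could be spotted, assuming each clique is available as a type plus a list of CURIEs. The dictionary layout, the `find_shared_curies` helper, and the placeholder MeSH identifier are assumptions for illustration, not NodeNorm's data model.

```python
from collections import defaultdict


def find_shared_curies(cliques):
    """Map each CURIE found in more than one clique to those cliques' types."""
    membership = defaultdict(list)
    for clique in cliques:
        # Use a set so a CURIE repeated inside one clique is counted once.
        for curie in set(clique["curies"]):
            membership[curie].append(clique["type"])
    return {curie: types for curie, types in membership.items() if len(types) > 1}


# Illustrative input; the MeSH identifier is a placeholder, not a real ID.
cliques = [
    {"type": "Protein", "curies": ["UMLS:C0665297"]},
    {"type": "ChemicalEntity", "curies": ["UMLS:C0665297", "MESH:EXAMPLE"]},
]

print(find_shared_curies(cliques))
# {'UMLS:C0665297': ['Protein', 'ChemicalEntity']}
```

A check like this only reports the overlap. Merging the cliques would still require the protein and chemical modules to share state, which is the architectural limitation described above.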

Genomewide commented 1 month ago

Is there a way we can gather all of the examples together to look at the flavors we are talking about? "Charcoal, activated" is wrong for different reasons than "A synthetic peptide of 20 amino acids, comprising D-Phe, Pro, Arg, Pro, Gly, Gly, Gly, Gly, Asn, Gly, Asp, Phe, Glu, Glu, Ile, Pro, Glu, Glu, Tyr, and Leu in sequence. A congener of hirudin (a naturally occurring drug found in the saliva of the medicinal leech), it a specific and reversible inhibitor of thrombin, and is used as an anticoagulant."

@gaurav Do you have a dart board or a stress ball (or some other place) where you keep all of our complaints? I would be interested in seeing how to break these down and then looking at some examples from each group.

sandrine-muller-research commented 1 month ago

@Genomewide I started this sheet on my side (perhaps to become a set of tests for @gaurav in the future). It does not contain all the examples, and surely Gaurav has a lot more.

Genomewide commented 1 month ago

How do I find what to put for MolePro? I added asset #25.

sandrine-muller-research commented 1 month ago

Thank you for adding a row to the sheet. Here is how you can query MolePro, where you put ["CID:75007581"] as the input (MolePro internally has a different set of CURIEs). However, MolePro does not know about this ID (we are tracking down why at the moment) but does know about collagenase. To query by name, use the "by_name" endpoint. I do see on the PubChem page that it was modified at the beginning of July (2024-07-20), so that is perhaps a change of ID. We are investigating. I'll keep you posted.

gaurav commented 1 month ago

Thanks, @sandrine-muller-research! My list is actually much shorter :) I'll start moving your entries over in Hammerhead.

colleenXu commented 4 weeks ago

Just putting these here in case people are unaware of other convos:

sandrine-muller-research commented 3 weeks ago

Thank you, Colleen! Putting this query here as it has a good number of extremely long names. I will need to see whether we have a better chemical name for each, and update the test asset sheet. Will come back to this.