NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

PubChem Name - Can another name be used when it's too long? #759

Closed sstemann closed 4 weeks ago

sstemann commented 5 months ago

As an example, in MVP1 for Breast Cancer on Test: https://ui.test.transltr.io/main/results?l=Breast%20Cancer&i=MONDO:0007254&t=0&r=0&q=bbbc4bf4-8ad8-4d6f-9a37-6aa1c5658d3a

This result: 6-(2,5-dioxo-3-sulfanylpyrrolidin-1-yl)-N-[2-[[2-[[(2S)-1-[[2-[[2-[[(10S,23S)-10-ethyl-18-fluoro-10-hydroxy-19-methyl-5,9-dioxo-8-oxa-4,15-diazahexacyclo[14.7.1.02,14.04,13.06,11.020,24]tetracosa-1,6(11),12,14,16,18,20(24)-heptaen-23-yl]amino]-2-oxoethoxy]methylamino]-2-oxoethyl]amino]-1-oxo-3-phenylpropan-2-yl]amino]-2-oxoethyl]amino]-2-oxoethyl]hexanamide


When I look up the CURIE PUBCHEM.COMPOUND:132508503 in NodeNorm, a few other systems return easier-to-read labels:


  {
    "identifier": "CHEMBL.COMPOUND:CHEMBL4297844",
    "label": "TRASTUZUMAB DERUXTECAN"
  },
  {
    "identifier": "UNII:5384HK7574",
    "label": "TRASTUZUMAB DERUXTECAN"
  },

dnsmith124 commented 5 months ago

@sstemann is your intention for the UI to resolve this problem? For example, when the UI gets a result whose name is longer than a certain number of characters, it would query NodeNorm for a more readable name? Given how many results tend to come back with these sorts of names, we'd be doing this processing on more than 20 results for any given query.

Or should this problem be resolved upstream from the UI?
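
A minimal sketch of the UI-side option, assuming the lookup returns a list of equivalent identifiers shaped like the excerpt above (`identifier` and `label` keys). The function name `pick_readable_label` and the default `max_len` of 50 are hypothetical choices for illustration, not part of any Translator codebase, and the network call to NodeNorm is deliberately left out.

```python
from typing import Dict, List


def pick_readable_label(
    current_label: str,
    equivalents: List[Dict[str, str]],
    max_len: int = 50,  # assumed threshold, not taken from the UI code
) -> str:
    """Keep current_label unless it is too long and a shorter one exists."""
    if len(current_label) <= max_len:
        return current_label

    # Collect alternative labels that fit within the threshold.
    candidates = [
        entry["label"]
        for entry in equivalents
        if entry.get("label") and len(entry["label"]) <= max_len
    ]
    if not candidates:
        # Nothing better is available, so fall back to the original label.
        return current_label

    # Prefer the shortest alternative; ties keep the first one seen.
    return min(candidates, key=len)
```

Because the function only does work when the label exceeds the threshold, the extra processing applies to the long-name results alone, though each of those would still need its own lookup.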

gprice1129 commented 5 months ago

@gaurav what do you think about ranking labels lower if they fit certain criteria (a lot of non-alpha characters, for example)?
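
One way this ranking idea could look, as a rough sketch only: score each candidate label by its fraction of non-alphabetic characters and sort so that cleaner labels come first, with length as a tie-breaker. The helper names `non_alpha_fraction` and `rank_labels` are hypothetical, and this is not how Babel or NodeNorm currently choose labels.

```python
from typing import List


def non_alpha_fraction(label: str) -> float:
    """Fraction of characters that are neither letters nor whitespace."""
    if not label:
        return 1.0  # treat an empty label as the worst case
    non_alpha = sum(
        1 for ch in label if not (ch.isalpha() or ch.isspace())
    )
    return non_alpha / len(label)


def rank_labels(labels: List[str]) -> List[str]:
    """Order labels so that mostly-alphabetic, shorter labels come first."""
    return sorted(labels, key=lambda lb: (non_alpha_fraction(lb), len(lb)))
```

With this ordering, a systematic name full of digits, brackets and hyphens sorts after a plain name such as "trastuzumab deruxtecan", without excluding it when it is the only label available.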

gprice1129 commented 2 months ago

UI team needs to discuss internally before assigning a release.

gaurav commented 2 months ago

PUBCHEM.COMPOUND:132508503 now has a label of "trastuzumab deruxtecan" on NodeNorm CI, NodeNorm Test and NodeNorm Prod. This is because we prioritize DrugCentral labels above the other available labels.

(Oddly enough, when drug-chemical conflation is turned on, we get PUBCHEM.COMPOUND:132508503 "trastuzumab emtansine" -- I hope that is close enough!)

Label prioritization by prefix should have eliminated a lot of these issues, but there are still MANY identifiers with very long labels (especially from PubChem). I like @gprice1129's idea of ranking long labels lower (I've created an issue for it at https://github.com/TranslatorSRI/Babel/issues/313). As a quick fix, the UI could replace any label longer than a particular length (somewhere around 35 characters or so) with the identifier itself (i.e. display "PUBCHEM.COMPOUND:132508503" instead of "6-(2,5-dioxo-3-sulfanylpyrrolidin-1-yl)-N-[...]"). Would that be worth implementing?
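
The quick fix proposed above can be sketched as follows. The function name `display_label` is hypothetical, and the default of 35 simply mirrors the rough figure suggested in this comment. It is not a tuned value.

```python
from typing import Optional


def display_label(
    identifier: str,
    label: Optional[str],
    max_len: int = 35,  # "somewhere around 35 characters or so"
) -> str:
    """Show the identifier instead of a missing or overly long label."""
    if not label or len(label) > max_len:
        return identifier
    return label
```

For example, `display_label("PUBCHEM.COMPOUND:132508503", "trastuzumab deruxtecan")` keeps the readable name, while a label over the threshold is swapped for the CURIE.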

gprice1129 commented 2 months ago

The UI team has discussed this and we will be implementing a short term solution.

sstemann commented 4 weeks ago

This result isn't being returned any longer. Of the long-name compounds, I'm not seeing any where a shorter name is available, so I'm closing this.

https://ui.transltr.io/results?l=Breast%20Cancer&i=MONDO:0007254&t=0&r=0&q=5b4ce30a-b1ae-4c82-9e33-9e1881ae3d75