NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

CHEMBL.TARGET:CHEMBL204 "UNIPROTKB:P00734" should not be preferred name, Prothrombin should be #760

Open TranslatorIssueCreator opened 6 months ago

TranslatorIssueCreator commented 6 months ago

Type: Bug Report

URL: https://ui.test.transltr.io/main/results?l=Lepirudin&i=PUBCHEM.COMPOUND:118856773&t=4&r=0&q=9c30d7a5-be5f-4f44-a12e-376369954f57

ARS PK: 9c30d7a5-be5f-4f44-a12e-376369954f57

Steps to reproduce:

search for UNIPROTKB:P00734 (current top answer)

sandrine-muller commented 6 months ago

What genes may be downregulated by: Lepirudin

Also added to the chemical names asset sheet.

Note that the returned CURIE is a CHEMBL target type, not the actual protein CURIE for Thrombin, which should be CHEMBL:2108110.

sandrine-muller-research commented 6 months ago

Tested today at RENCI dev endpoints.

NameRes message:

{
  "curies": [
    "UniProtKB:P00734"
  ]
}

Response:

{
  "UniProtKB:P00734": {
    "curie": "UniProtKB:P00734",
    "names": [
      "F2",
      "DCP",
      "hF2",
      "PIVKA-II",
      "Factor II",
      "EC 3.4.21.5",
      "Prothrombin",
      "F2 protein, human",
      "prothrombin (human)",
      "Coagulation Factor II",
      "Des-Gamma Carboxyprothrombin",
      "Des-Gamma-Carboxy Prothrombin",
      "coagulation factor II (human)",
      "THRB_HUMAN Prothrombin (sprot)",
      "Protein Induced by Vitamin K Absence-II",
      "Protein Induced by Vitamin K Absence/Antagonist-II",
      "Protein Induced by Vitamin K Absence or Antagonist II"
    ],
    "types": [
      "Protein",
      "GeneProductMixin",
      "Polypeptide",
      "ChemicalEntityOrGeneOrGeneProduct",
      "ChemicalEntityOrProteinOrPolypeptide",
      "BiologicalEntity",
      "ThingWithTaxon",
      "NamedThing",
      "Entity",
      "GeneOrGeneProduct",
      "MacromolecularMachineMixin"
    ],
    "preferred_name": "THRB_HUMAN Prothrombin (sprot)",
    "shortest_name_length": 2,
    "clique_identifier_count": 6,
    "id": "82926183-9370-4db8-b0bd-99922e2f8fd1",
    "_version_": 1796561715672907800
  }
}

NodeNorm:

{
  "UNIPROTKB:P00734": {
    "id": {
      "identifier": "NCBIGene:2147",
      "label": "F2"
    },
    "equivalent_identifiers": [
      { "identifier": "NCBIGene:2147", "label": "F2" },
      { "identifier": "ENSEMBL:ENSG00000180210", "label": "F2 (Hsap)" },
      { "identifier": "HGNC:3535", "label": "F2" },
      { "identifier": "OMIM:176930" },
      { "identifier": "UMLS:C1414504", "label": "F2 gene" },
      { "identifier": "UniProtKB:P00734", "label": "THRB_HUMAN Prothrombin (sprot)" },
      { "identifier": "PR:P00734", "label": "prothrombin (human)" },
      { "identifier": "ENSEMBL:ENSP00000308541" },
      { "identifier": "ENSEMBL:ENSP00000308541.5" },
      { "identifier": "UMLS:C3540506", "label": "Des-Gamma Carboxyprothrombin" },
      { "identifier": "UMLS:C5552806", "label": "F2 protein, human" }
    ],
    "type": [
      "biolink:Gene",
      "biolink:BiologicalEntity",
      "biolink:NamedThing",
      "biolink:GeneOrGeneProduct",
      "biolink:GenomicEntity",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:PhysicalEssence",
      "biolink:OntologyClass",
      "biolink:ThingWithTaxon",
      "biolink:PhysicalEssenceOrOccurrent",
      "biolink:MacromolecularMachineMixin",
      "biolink:Protein",
      "biolink:Polypeptide",
      "biolink:GeneProductMixin",
      "biolink:ChemicalEntityOrProteinOrPolypeptide"
    ],
    "information_content": 76.2
  }
}
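For anyone repeating this comparison, here is a minimal Python sketch that puts the two labels side by side. It assumes both responses have already been parsed into dicts shaped like the ones above; `find_entry` and `compare_labels` are hypothetical helper names, not part of NameRes or NodeNorm. The lookup ignores case because the two services key the same CURIE differently ("UniProtKB:P00734" vs "UNIPROTKB:P00734").

```python
def find_entry(response, curie):
    """Look up a CURIE in a response dict, ignoring letter case in the key."""
    wanted = curie.lower()
    for key, value in response.items():
        if key.lower() == wanted:
            return value
    return None


def compare_labels(nameres_response, nodenorm_response, curie):
    """Return (NameRes preferred_name, NodeNorm preferred label, labels match?)."""
    nameres_entry = find_entry(nameres_response, curie) or {}
    nodenorm_entry = find_entry(nodenorm_response, curie) or {}
    nameres_label = nameres_entry.get("preferred_name")
    nodenorm_label = (nodenorm_entry.get("id") or {}).get("label")
    return nameres_label, nodenorm_label, nameres_label == nodenorm_label


# Trimmed copies of the two responses above (only the fields used here).
nameres = {
    "UniProtKB:P00734": {
        "curie": "UniProtKB:P00734",
        "preferred_name": "THRB_HUMAN Prothrombin (sprot)",
    }
}
nodenorm = {
    "UNIPROTKB:P00734": {
        "id": {"identifier": "NCBIGene:2147", "label": "F2"},
    }
}

result = compare_labels(nameres, nodenorm, "UniProtKB:P00734")
print(result)
```

With the trimmed data this reports the mismatch described in this comment: the NameRes preferred name is the UniProtKB label, while the NodeNorm preferred label is "F2".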

At this point (perhaps just a versioning issue), I am not sure why NameRes is choosing "THRB_HUMAN Prothrombin (sprot)" as the preferred label given the NodeNorm output...

gaurav commented 5 months ago

Note that the returned CURIE is a CHEMBL target type, not the actual protein CURIE for Thrombin, which should be CHEMBL:2108110.

Yup, that's the key to what's going on here! It looks like "UNIPROTKB:P00734" is the label for the identifier http://identifiers.org/chembl.target/CHEMBL204 (presumably CHEMBL.TARGET:CHEMBL204). However, we don't have a CHEMBL.TARGET:CHEMBL204 in NodeNorm at all. I'm searching through our Proteins to see if it has a different prefix.

At this point (perhaps just a versioning issue), I am not sure why NameRes is choosing "THRB_HUMAN Prothrombin (sprot)" as the preferred label given the NodeNorm output...

This is because that NodeNorm output has gene-protein conflation turned on, so it returns the preferred ID of NCBIGene:2147 ("F2"). NameRes currently has gene-protein conflation turned off, so UNIPROTKB:P00734 returns the label for UniProtKB:P00734, which is "THRB_HUMAN Prothrombin (sprot)". If you look this up on NodeNorm Prod with gene-protein conflation turned off, you'll see it's more similar to the NameRes output.
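To try the lookup both ways, a small sketch like the following builds the two request URLs. The base URL, the `get_normalized_nodes` path, and the `curie` and `conflate` query parameter names are assumptions here and should be checked against the API documentation of the NodeNorm deployment being tested; `normalized_nodes_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL; substitute the NodeNorm deployment you are testing.
NODENORM_BASE = "https://nodenormalization-sri.renci.org"


def normalized_nodes_url(curie, conflate, base=NODENORM_BASE):
    """Build a lookup URL with gene-protein conflation switched on or off.

    The endpoint path and parameter names are assumptions, not confirmed
    by this thread.
    """
    params = {"curie": curie, "conflate": "true" if conflate else "false"}
    return f"{base}/get_normalized_nodes?{urlencode(params)}"


url_on = normalized_nodes_url("UniProtKB:P00734", conflate=True)
url_off = normalized_nodes_url("UniProtKB:P00734", conflate=False)
print(url_on)
print(url_off)
```

Fetching both URLs and comparing the `id` block of each response should show whether the preferred identifier switches between the gene and the protein.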

gaurav commented 3 months ago

We should still come up with a better name for NCBIGene:2147 ("F2"). I think we can try using the label prefix boosting to do this. I'm tracking this at https://github.com/TranslatorSRI/Babel/issues/312

gaurav commented 3 months ago

However, we don't have a CHEMBL.TARGET:CHEMBL204 in NodeNorm at all. I'm searching through our Proteins to see if it has a different prefix.

We really don't have CHEMBL.TARGET:CHEMBL204 "UNIPROTKB:P00734" at all in NodeNorm, but I agree that that is a terrible label :)

@cbizon Looking at https://arax.test.transltr.io/?r=22907b79-f50e-42fe-a9ba-5da28b380cbc, it looks like CHEMBL.TARGET:CHEMBL204 is being returned by Aragorn -- do you know where it's coming from? Should we add CHEMBL.TARGET to NodeNorm?
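A quick way to confirm which CURIEs NodeNorm does not know about is to scan a batch response for empty entries. This sketch assumes NodeNorm maps an unknown CURIE to null (None once parsed in Python); that behavior and the sample dict below are illustrative assumptions, not captured output, and `unknown_curies` is a hypothetical helper name.

```python
def unknown_curies(nodenorm_response):
    """Return the CURIEs whose entry is None, i.e. not found in NodeNorm."""
    return sorted(
        curie for curie, entry in nodenorm_response.items() if entry is None
    )


# Illustrative response shape only (assumed, not real service output).
sample = {
    "CHEMBL.TARGET:CHEMBL204": None,
    "UNIPROTKB:P00734": {"id": {"identifier": "NCBIGene:2147", "label": "F2"}},
}

missing = unknown_curies(sample)
print(missing)
```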

cbizon commented 3 months ago

It looks like those edges are coming from MolePro. Tagging @vdancik

vdancik commented 3 months ago

I'll check what went wrong