NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

BDNF has no gene or protein interactions #150

Closed Genomewide closed 2 weeks ago

Genomewide commented 1 year ago

Searching the ARAX UI for any edge from BDNF to another gene or protein returns zero results. This is a well-studied gene, and just looking at SPOKE shows a number of connected proteins.

This is the query that failed to return results:

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL2108230"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}

BTE returned results when I used HGNC:1033. It appears that BTE expands the categories on n0 from small molecule alone to small molecule plus gene.

{
  "edges": {
    "e0": {
      "object": "n1",
      "subject": "n0"
    }
  },
  "nodes": {
    "n0": {
      "categories": [
        "biolink:SmallMolecule",
        "biolink:Gene"
      ],
      "ids": [
        "HGNC:1033"
      ],
      "is_set": true,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}
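The change between the failing query and the ones that returned results comes down to re-pinning n0 to a different CURIE. A minimal sketch of that edit in Python, assuming the query graph is handled as a plain dictionary; `repin_node` is a hypothetical helper for illustration, not part of ARAX or BTE, and the choice of `biolink:Gene` as the new category is an assumption:

```python
import copy

# The failing query from the issue: the ARAX node picker resolved "BDNF"
# to a ChEMBL compound CURIE typed as a small molecule.
failing_query = {
    "edges": {"e0": {"subject": "n0", "object": "n1"}},
    "nodes": {
        "n0": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL2108230"],
            "categories": ["biolink:SmallMolecule"],
            "is_set": False,
            "name": "ABRINEURIN",
        },
        "n1": {
            "categories": ["biolink:Gene", "biolink:Protein"],
            "is_set": False,
        },
    },
}


def repin_node(query, node_key, curie, categories):
    """Return a copy of `query` with one node pinned to a different CURIE.

    Hypothetical helper for illustration only.
    """
    patched = copy.deepcopy(query)  # leave the original query untouched
    node = patched["nodes"][node_key]
    node["ids"] = [curie]
    node["categories"] = list(categories)
    # The display name belonged to the old identifier, so drop it.
    node.pop("name", None)
    return patched


# Assumption: typing the re-pinned node as a gene is the intended fix.
gene_query = repin_node(failing_query, "n0", "HGNC:1033", ["biolink:Gene"])
```

Working on a deep copy keeps the failing query available for side-by-side comparison when the same test is repeated against different ARAs.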

ARAX returned results for HGNC:1033 without modifying the categories. However, it does not return results when the ARAX query builder is used, because the builder resolves BDNF to CHEMBL.COMPOUND:CHEMBL2108230.

{
  "edges": {
    "N1": {
      "attribute_constraints": [],
      "object": "n1",
      "predicates": [
        "biolink:has_normalized_google_distance_with"
      ],
      "qualifier_constraints": [],
      "subject": "n0"
    },
    "e0": {
      "attribute_constraints": [],
      "object": "n1",
      "qualifier_constraints": [],
      "subject": "n0"
    }
  },
  "nodes": {
    "n0": {
      "categories": [
        "biolink:SmallMolecule"
      ],
      "constraints": [],
      "ids": [
        "HGNC:1033"
      ],
      "is_set": false
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "constraints": [],
      "is_set": false
    }
  }
}

cbizon commented 1 year ago

This looks to me like a normalization issue: the CHEMBL identifier doesn't merge with the UniProt IDs. The CHEMBL one is considered a chemical entity while the UniProt one is a Protein. I guess it makes sense to merge these? Do we have some other examples of CHEMBL identifiers for proteins? @gaurav what do you think?

sierra-moxon commented 1 year ago

from TAQA: we'll push to the NodeNorm folks for more investigation, probably not a data/UI question - SPOKE has plenty of interactions :)

Genomewide commented 1 year ago

Repeated the process. It looks like BDNF gets returned as a small molecule called "ABRINEURIN" when using the node selection tool in the ARAX UI. Manually changing the query to the gene CURIE HGNC:1033 gave results.

https://arax.ncats.io/?r=351c7c8b-9bbf-4818-bff5-22e6e2b09529

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL2108230"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}

Changing to the HGNC CURIE and removing the small molecule category returns results. image

But even trying to get the synonyms of the HGNC CURIE returns the small molecule:

image

How would this impact queries?

sierra-moxon commented 1 year ago

from TAQA: ABRINEURIN and BDNF are two names for the same thing and give different answers depending on which you select. Lots of results are tied to BDNF as a gene/protein with an HGNC ID (also NCBI, OMIM, UniProtKB, etc., but no ChEMBL identifier). Any non-ChEMBL ID will get back all these good results.

ABRINEURIN is labeled a SmallMolecule? That isn't a SmallMolecule.
Is there a modeling issue here? Maybe in the ARAX normalizer? Maybe CHEMBL brings in gene identifiers? Should CHEMBL be in the id_prefixes section of 'biolink:Gene'?

andrewsu commented 1 year ago

@Genomewide in the last screenshot in your most recent post, you showed the ARAX UI mapping HGNC:1033 to CHEMBL.COMPOUND:CHEMBL2108230. Can you tell me how you got that? When I plug HGNC:1033 into the ARAXI synonym mapper, I get this:

image

I ask because I can't find a good database that maps between gene/protein IDs and chemical/drug IDs in cases where the therapeutic is a protein. If ARAX has found a source for that, I'm guessing the NodeNorm folks would be very interested in that mapping, and it could help solve this issue...

edeutsch commented 1 year ago

Note that our production maturity is still using an older database and old Node Synonymizer system, while everything else (testing, staging, dev) is using a newer database and newer Node Synonymizer. So you will get different answers at these two sites:

Old system (production): https://arax.transltr.io/?term=HGNC:1033

New system (staging): https://arax.ci.transltr.io/?term=HGNC:1033

sierra-moxon commented 1 year ago

@andrewsu, @Genomewide - What can we do here for BDNF having no gene or protein interactions? @gaurav - can node normalizer help in any way here?

andrewsu commented 1 year ago

Two thoughts from me:

Genomewide commented 1 year ago

I agree with Andrew about the likely group that is affected by this. If there were a way to identify the set of genes that need this, by looking at the list of biologics or some other low-hanging fruit, I think it would be worth doing. A somewhat targeted way to get 50-80% of them linked up would be worth a little time.

I don't know how insulin does not suffer from the same problem. Is there a fix that was done before?

edeutsch commented 1 year ago

It's not easy to be sure anymore, but I think it was probably this record: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2108230/, which says that CHEMBL.COMPOUND:CHEMBL2108230 is called abrineurin. UniProtKB (https://www.uniprot.org/uniprotkb/P23560/entry) also gives its entry an alternate name of abrineurin, and that entry is equivalent to HGNC:1033.

So the link was based on ChEMBL and UniProtKB both giving it the name abrineurin. The old NodeSynonymizer made lots of links that way, but it often overmerged. Name-based merging is often very good, but it generated a lot of tickets when it went bad.
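To make the name-based route concrete, here is a small Python sketch of merging identifiers that share a name. It is an illustration under stated assumptions, not the old NodeSynonymizer's actual code: `merge_by_name` is a hypothetical function, the name lists are simplified stand-ins, and the shared "BDNF" name stands in for the UniProtKB-to-HGNC equivalence mentioned above.

```python
from collections import defaultdict


def merge_by_name(records):
    """Group identifiers that share any case-insensitive name.

    `records` maps an identifier to a list of names. Returns a list of
    sets of identifiers, one set per merged group. Hypothetical sketch
    built on a small union-find.
    """
    parent = {curie: curie for curie in records}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # normalized name -> first identifier carrying it
    for curie, names in records.items():
        for name in names:
            key = name.casefold()
            if key in first_seen:
                # Same name seen before: union the two groups.
                parent[find(curie)] = find(first_seen[key])
            else:
                first_seen[key] = curie

    groups = defaultdict(set)
    for curie in records:
        groups[find(curie)].add(curie)
    return list(groups.values())


# Simplified, illustrative name lists (assumptions, not database dumps).
records = {
    "CHEMBL.COMPOUND:CHEMBL2108230": ["ABRINEURIN"],
    "UniProtKB:P23560": ["BDNF", "Abrineurin"],
    "HGNC:1033": ["BDNF"],
}
cliques = merge_by_name(records)
```

With these inputs the compound, protein, and gene identifiers all end up in a single group. The same transitive behavior is what causes overmerging: any two unrelated entities that happen to share one name string would be joined the same way.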

andrewsu commented 1 year ago

If name merging is the only way to bridge between the compound/drug identifiers and the gene/protein identifiers, and we've already established that name merging comes with some undesirable properties, then I suspect that there is not "low lying fruit" to be harvested here. And given that this type of drug-protein edge is not directly the subject of one of our MVP queries, my vote is to table this issue until later (post-fall).

"I don't know how insulin does not suffer from the same problem."

I think insulin will have exactly the same problem. The ARAX synonymizer merges the compound/drug IDs and the gene/protein IDs on prod (https://arax.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c), but not on CI (https://arax.ci.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c)

cbizon commented 1 year ago

I agree with @andrewsu that this is a tough problem that we will probably not solve by Sept. I think (?) that the correct response might be to allow a conflation between chemical/drug and protein but it's going to take some work to implement that and test it out. FWIW, I am not a big fan of name merging for the reason that Eric mentions - I think that there is plenty of structured equivalence or other relationship information that we should try to take advantage of first.

andrewsu commented 1 year ago

since there are two votes in favor of tabling this until the fall (and no opposed), going to adjust the milestone now...

gaurav commented 1 year ago

NodeNorm seems to have split abrineurin into multiple cliques: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=HGNC%3A1033&curie=UniProtKB%3AP23560&curie=CHEMBL.COMPOUND%3ACHEMBL2108230&curie=MESH:C415772&curie=PR:000004716&curie=UNII:A1ED6W905I&curie=UNII:86ZE5V51WT&conflate=false&drug_chemical_conflate=false&description=false

Some of these might be genuine splits, but I think something is going wrong in protein conflation here. I'm tracking this at https://github.com/TranslatorSRI/NodeNormalization/issues/224. Is it correct to assume that this is at the same level of priority as other cliquing issues, or is there something particularly bad about this issue?
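For anyone retesting, a check like the one linked above can be scripted. This Python sketch builds the same kind of request URL and groups the input CURIEs by the clique they land in. The base URL and query parameters are copied from the link in this comment; the response shape (each input CURIE mapping to either null or an object with an `id.identifier` field) is an assumption about NodeNorm's output, and both function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Base URL and parameter names taken from the link in the comment above.
NODENORM_URL = "https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes"


def build_url(curies):
    """Build a get_normalized_nodes URL with conflation switched off."""
    params = [("curie", curie) for curie in curies]
    params += [
        ("conflate", "false"),
        ("drug_chemical_conflate", "false"),
        ("description", "false"),
    ]
    return NODENORM_URL + "?" + urlencode(params)


def group_by_clique(response):
    """Group input CURIEs by the preferred identifier assigned to them.

    Assumes `response` is the parsed JSON body: a dict keyed by input
    CURIE whose values are None (unrecognized) or a dict containing
    {"id": {"identifier": ...}}. CURIEs that were not recognized are
    collected under the key None.
    """
    cliques = defaultdict(list)
    for curie, entry in response.items():
        key = entry["id"]["identifier"] if entry else None
        cliques[key].append(curie)
    return dict(cliques)


url = build_url(["HGNC:1033", "UniProtKB:P23560", "CHEMBL.COMPOUND:CHEMBL2108230"])
```

If `group_by_clique` returns more than one key for CURIEs that are expected to be the same entity, the split is visible at a glance, and the same call with the conflation flags set to true shows whether conflation closes the gap.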

sstemann commented 3 weeks ago

@Genomewide can you please retest, i'm not clear what we're looking for

Genomewide commented 2 weeks ago

Synonyms from ARAX production don't work at all now. image

Appears to be connected on CI though and returns results. image