RTX-KG2 edge merging example

saramsey commented 2 years ago

At today's AHM, someone asked for an example of an edge in RTX-KG2 that results from merging more than one source triple. Using the Neo4j endpoint for KG2.7.6pre, kg2endpoint3.rtx.ai, the following Cypher query produces an example:

match (n)-[r]->(m) where size(r.knowledge_source) > 1 return r limit 1;

Here is the example triple:

{
  "original_predicate": "UMLS:RB",
  "predicate": "biolink:related_to",
  "knowledge_source": [
    "infores:semmeddb",
    "infores:umls-metathesaurus"
  ],
  "negated": "False",
  "relation_label": "inverse_of_rn",
  "publications_info": "{'PMID:3995051': {'publication date': '1985 Jun 10', 'sentence': 'The calculated exchange fluxes have been compared with measurements of 15N label exchange between creatine and phosphocreatine and 14C label exchange between ATP and ADP.', 'subject score': 1000, 'object score': 1000}, 'PMID:6335584': {'publication date': '1984 Oct', 'sentence': 'In comparison with sartorius muscles of untreated frogs, they contained more free creatine and less phosphocreatine, but the same content in total creatine and ATP.', 'subject score': 802, 'object score': 861}, 'PMID:7562608': {'publication date': '1995 Jun 15', 'sentence': 'Creatine (6 mM) with or without 2 mM Na2ATP was less effective than phosphocreatine in maintaining Icat.', 'subject score': 1000, 'object score': 1000}, 'PMID:10668041': {'publication date': '1999 Dec', 'sentence': 'A comparison of pertinent metabolite concentrations revealed a magnetization transfer attenuation factor of the methyl and methylene resonances of creatine and phosphocreatine of 0.87 +/- 0.05 (p < 0.01).', 'subject score': 1000, 'object score': 1000}, 'PMID:12765850': {'publication date': '2003', 'sentence': 'In comparison with control animals receiving magnetic stimulation over the lumbar spine, quantitative evaluations of cerebral metabolite concentrations by proton MRS revealed no significant alterations of N-acetyl-aspartate, creatine and phosphocreatine, choline-containing compounds, myo-inositol, glucose and lactate after chronic rTMS.', 'subject score': 1000, 'object score': 1000}, 'PMID:12450068': {'publication date': '2002 Oct', 'sentence': 'Muscle total creatine and phosphocreatine were greater in the extensor digitorum longus in the CD and CD-PRED groups as compared with the CON and PRED groups (P < 0.05); however, total creatine and phosphocreatine in the soleus were not different.', 'subject score': 888, 'object score': 1000}, 'PMID:15930147': {'publication date': '2005 Oct', 'sentence': 'Comparison of 
the distribution patterns of the CRT in vascular and avascular vertebrate retinas and studies of the mouse retina during development indicate that creatine and phosphocreatine are important for ATP homeostasis.', 'subject score': 1000, 'object score': 1000}, 'PMID:22431193': {'publication date': '2012 Nov', 'sentence': 'At physiological temperature and pH, the exchange rate of amine protons in Cr was found to be 7-8 times higher than PCr and ATP.', 'subject score': 1000, 'object score': 1000}, 'PMID:22567176': {'publication date': '2012', 'sentence': 'Glutamate concentrations in the occipital cortex were found to be lower in the patients compared to controls and the concentrations of creatine and phosphocreatine were significantly lower in the parietal cortex of the patients.', 'subject score': 1000, 'object score': 1000}, 'PMID:22055567': {'publication date': '1986', 'sentence': 'This was compared with the lactate, creatine and phosphocreatine development as measured by proton NMR.', 'subject score': 1000, 'object score': 888}, 'PMID:28961344': {'publication date': '2017 Dec', 'sentence': 'Comparison between the WT and GAMT-/- mice provided strong evidence for three types of contribution to the peak in the Z-spectrum at 1.95 ppm, namely proteins, Cr and PCr, the latter fitted as tCr.', 'subject score': 1000, 'object score': 1000}, 'PMID:23412909': {'publication date': '2014 Jan', 'sentence': 'The CEST effect from Cr results were compared with (31) P magnetic resonance spectroscopy results showing good agreement in the Cr and phosphocreatine recovery kinetics.', 'subject score': 1000, 'object score': 851}, 'PMID:17097074': {'publication date': '2007 Jun 01', 'sentence': 'Bonferroni-adjusted comparisons revealed that ACC levels of N-acetyl aspartate (NAA)-creatine and phosphocreatine (Cr) were lower and that levels of choline (Cho)-NAA were higher in the methamphetamine abusers compared with the controls, at the adjusted p value of .0125.', 'subject score': 916, 
'object score': 1000}, 'PMID:7585833': {'publication date': '1995 Sep', 'sentence': 'While no change was seen in the placebo group compared to baseline, creatine supplementation increased skeletal muscle total creatine and creatine phosphate by 17 +/- 4% (P < 0.05) and 12 +/- 4% (P < 0.05), respectively.', 'subject score': 861, 'object score': 1000}}",
  "subject": "UMLS:C0010286",
  "predicate_label": "related_to",
  "id": "UMLS:C0010286---UMLS:RB---UMLS:C0031634---umls_source:MTH",
  "update_date": "2020",
  "publications": [
    "PMID:10668041",
    "PMID:12450068",
    "PMID:12765850",
    "PMID:15930147",
    "PMID:17097074",
    "PMID:22055567",
    "PMID:22431193",
    "PMID:22567176",
    "PMID:23412909",
    "PMID:28961344",
    "PMID:3995051",
    "PMID:6335584",
    "PMID:7562608",
    "PMID:7585833",
    "PMID:7562608",
    "PMID:22431193"
  ],
  "object": "UMLS:C0031634"
}
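Two details of the record above matter when consuming it. First, `publications_info` is a string holding a Python-style dict literal (single-quoted keys), not nested JSON, so `json.loads` would reject it. Second, the merged `publications` list contains PMID:7562608 and PMID:22431193 twice. Below is a minimal sketch of handling both, assuming other merged edges follow the same format. The abbreviated `edge` dict and the helper names `parse_publications_info` and `unique_publications` are illustrative and are not part of RTX-KG2.

```python
import ast

def parse_publications_info(edge):
    """Return publications_info as a dict keyed by PMID CURIE."""
    raw = edge.get("publications_info", "{}")
    # literal_eval parses Python literals only; it does not execute code.
    return ast.literal_eval(raw)

def unique_publications(edge):
    """De-duplicate the publications list, keeping first-seen order."""
    return list(dict.fromkeys(edge.get("publications", [])))

# Abbreviated copy of the edge above (one publications_info entry,
# sentence text omitted, publications list shortened).
edge = {
    "knowledge_source": ["infores:semmeddb", "infores:umls-metathesaurus"],
    "publications_info": "{'PMID:22431193': {'publication date': '2012 Nov', "
                         "'subject score': 1000, 'object score': 1000}}",
    "publications": [
        "PMID:7562608",
        "PMID:22431193",
        "PMID:7562608",
        "PMID:22431193",
    ],
}

info = parse_publications_info(edge)
print(info["PMID:22431193"]["publication date"])
print(unique_publications(edge))
```

`dict.fromkeys` is used instead of `set` so that the original ordering of the list survives de-duplication.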

This question came up in a discussion of the agenda item "New ask from Architecture", which is summarized in this PR in the NCATSTranslator/TranslatorArchitecture project.

saramsey commented 2 years ago

Tagging @edeutsch

edeutsch commented 2 years ago

Great, thanks! Here is also an example in TRAPI: https://arax.ncats.io/?r=41651

One question we pondered: in the TRAPI attributes, is it better to represent multiple knowledge sources in a single attribute with a list as the value, or as multiple attributes?

Currently it is the latter (see link above). But maybe the former is better? Thoughts?
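To make the two options concrete, here is a sketch of both shapes as plain Python dicts. Only the `attribute_type_id` and `value` fields of a TRAPI attribute are shown and all other fields are omitted. The helper `knowledge_sources` is hypothetical; it only illustrates that a consumer can read either shape with a few lines of code.

```python
# Option A: a single attribute whose value is a list of sources.
single_attribute = [
    {
        "attribute_type_id": "biolink:knowledge_source",
        "value": ["infores:semmeddb", "infores:umls-metathesaurus"],
    }
]

# Option B: one attribute per source (the current behavior).
multiple_attributes = [
    {"attribute_type_id": "biolink:knowledge_source",
     "value": "infores:semmeddb"},
    {"attribute_type_id": "biolink:knowledge_source",
     "value": "infores:umls-metathesaurus"},
]

def knowledge_sources(attributes, type_id="biolink:knowledge_source"):
    """Collect source identifiers from either representation."""
    sources = []
    for attr in attributes:
        if attr.get("attribute_type_id") != type_id:
            continue
        value = attr["value"]
        # A list value (option A) is flattened; a scalar (option B) is appended.
        sources.extend(value if isinstance(value, list) else [value])
    return sources

print(knowledge_sources(single_attribute) == knowledge_sources(multiple_attributes))
```

Both shapes carry the same information here, so the choice mainly affects how much branching consumers need.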

saramsey commented 2 years ago

@edeutsch does the TRAPI spec indicate which way we should go in this case (a list-valued attribute or multiple attributes)?

edeutsch commented 2 years ago

It does not. But it could. Any reason not to recommend that they be combined (unlike what we're currently doing)?

amykglen commented 2 years ago

the only reason I'm aware of is if the attribute_type_id differs for different knowledge_sources. since it's kind of difficult to decide whether each source is an aggregator vs. original source or whatever, I think we just decided to call all of them biolink:knowledge_source for KG2 for now, and wait to see if it became important to get more fine-grained. so far I don't think we've heard any complaints?
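A rough sketch of that constraint: attributes can only be folded into one list-valued attribute when they share an `attribute_type_id`. The helper `combine_by_type` is hypothetical, and the primary/aggregator labels in `mixed_types` are made up purely for illustration; they are not a claim about how these two sources should actually be classified.

```python
from collections import defaultdict

def combine_by_type(attributes):
    """Merge attributes sharing an attribute_type_id into one list-valued attribute."""
    grouped = defaultdict(list)
    for attr in attributes:
        grouped[attr["attribute_type_id"]].append(attr["value"])
    return [
        {"attribute_type_id": type_id, "value": values}
        for type_id, values in grouped.items()
    ]

# Everything labeled biolink:knowledge_source, as KG2 does for now.
same_type = [
    {"attribute_type_id": "biolink:knowledge_source",
     "value": "infores:semmeddb"},
    {"attribute_type_id": "biolink:knowledge_source",
     "value": "infores:umls-metathesaurus"},
]

# Hypothetical finer-grained labels, for illustration only.
mixed_types = [
    {"attribute_type_id": "biolink:primary_knowledge_source",
     "value": "infores:semmeddb"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source",
     "value": "infores:umls-metathesaurus"},
]

print(len(combine_by_type(same_type)))    # one combined attribute
print(len(combine_by_type(mixed_types)))  # two attributes remain
```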

[Screenshot: Screen Shot 2022-05-18 at 1 20 09 PM]

edeutsch commented 2 years ago

Use of biolink:knowledge_source is discouraged: "In practice, implementers should use one of the more specific subtypes of this generic property."

There were no complaints because no one is really looking carefully, I suspect.

Probably biolink:primary_knowledge_source is what we should be using.

Here are the docs: https://biolink.github.io/biolink-model/docs/knowledge_source.html

biolink:knowledge_source An Information Resource from which the knowledge expressed in an Association was retrieved, directly or indirectly. This can be any resource through which the knowledge passed on its way to its currently serialized form. In practice, implementers should use one of the more specific subtypes of this generic property.

biolink:aggregator_knowledge_source An intermediate aggregator resource from which knowledge expressed in an Association was retrieved downstream of the original source, on its path to its current serialized form.

biolink:primary_knowledge_source The most upstream source of the knowledge expressed in an Association that an implementer can identify (may or may not be the ‘original’ source).

biolink:original_knowledge_source The Information Resource that created the original record of the knowledge expressed in an Association (e.g. via curation of the knowledge from the literature, or generation of the knowledge de novo through computation, reasoning, inference over data).

edeutsch commented 2 years ago

Ongoing discussion of this in the Architecture call.

edeutsch commented 2 years ago

The emerging consensus on the Architecture call is that RTX-KG2 should NOT be doing this semantic merging, and that this scenario should be represented as 3 different edges, each with a SINGLE biolink:primary_knowledge_source.

See architecture PR 73 to make your thoughts known (the update there has not been made yet as I write this, but it is planned).
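As a rough sketch of what un-merging could look like, the function below turns one merged edge into one edge per source. Several things here are assumptions rather than anything decided in this thread: the helper name `split_by_source`, storing the source under a `primary_knowledge_source` key, the id suffix scheme, and treating every listed source as primary. The merged edge also does not record which publications came from which source, so the sketch copies them to every resulting edge unchanged.

```python
import copy

def split_by_source(edge):
    """Return one edge per knowledge source, each with a single primary source."""
    split_edges = []
    for source in edge["knowledge_source"]:
        new_edge = copy.deepcopy(edge)
        del new_edge["knowledge_source"]
        # Assumption: each listed source is treated as the primary source.
        new_edge["primary_knowledge_source"] = source
        # Hypothetical id scheme: append the source so ids stay unique.
        new_edge["id"] = f"{edge['id']}---{source}"
        split_edges.append(new_edge)
    return split_edges

# Abbreviated copy of the merged KG2 edge from the first comment.
merged_edge = {
    "id": "UMLS:C0010286---UMLS:RB---UMLS:C0031634---umls_source:MTH",
    "subject": "UMLS:C0010286",
    "predicate": "biolink:related_to",
    "object": "UMLS:C0031634",
    "knowledge_source": ["infores:semmeddb", "infores:umls-metathesaurus"],
    "publications": ["PMID:10668041", "PMID:12450068"],
}

for split_edge in split_by_source(merged_edge):
    print(split_edge["id"], split_edge["primary_knowledge_source"])
```

`copy.deepcopy` keeps the resulting edges independent, so editing the publications of one does not change the others or the original.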

RTXteam / RTX

RTX-KG2 edge merging example #1840