RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

In ARAX, for the KnowledgeGraph object, fill in the `url` field for `edge_attributes` for which we have a URL? #750

Closed saramsey closed 4 years ago

saramsey commented 4 years ago

Request posted by @isbluis in #743:

Hi @saramsey , For the edge_attributes that have a value that is an obvious URL (e.g. "http://w3id.org/biolink/vocab/subclassOf"), could you populate the "url" property of that element, either instead of, or in addition to its "value"? Not sure what would be more elegant/simple/correct. Thanks!

saramsey commented 4 years ago

My response to this request, reproduced here for convenience:

Hi @isbluis OK I can see this being very useful downstream. I'm just hesitating about where to code it up. Isn't the KnowledgeGraph object created by expand? It is just passed to resultify. If we were to have resultify start modifying the KnowledgeGraph, isn't that mixing two somewhat orthogonal operations? That would add a lot of code in resultify to be cycling through KG nodes and modifying them, when presumably expand is already visiting every KG node in the course of generating the KG? @amykglen and @edeutsch what do you think?

amykglen commented 4 years ago

Yes - I think this is something for me - all nodes/edges are loaded into the API model in expand. (So that's where the NodeAttributes/EdgeAttributes are created.)

And yep, I can easily put urls into EdgeAttribute.url!

isbluis commented 4 years ago

Great, thanks!

edeutsch commented 4 years ago

Agreed, expand() is the right place to handle this. (or possibly during the building of KG2 depending on how this information is currently encoded in KG2)

amykglen commented 4 years ago

So if an EdgeAttribute consists of a list of URLs, should that go in the url property as well?

amykglen commented 4 years ago

Ok - I've pushed changes for this to master - so now urls are put in EdgeAttribute.url/NodeAttribute.url rather than EdgeAttribute.value/NodeAttribute.value.

(If it's a list of urls, it's still going in value currently, but that could be changed. I think it's pretty infrequent that a list of URLs is stored in an EdgeAttribute/NodeAttribute..)

isbluis commented 4 years ago

Hi @amykglen . Do you have an example of an edge instance where its value (url, value,or otherwise) is a list instead of a scalar/string? (or ideally, a simple query that returns one?)

I currently do not have code to handle any of the elements within edge_attributesas anything other than single values. Or maybe @edeutsch is handling these via the API before passing the json object to the front-end? I would need to add code to handle lists if that is a requirement.

Thanks!

edeutsch commented 4 years ago

At the moment the spec says that edge_attribute elements should all be scalars. Technically the value should be a string. but we have reason to lobby to allow that to be float or int. As far as a list, I am hesitant. Is there a good reason to need to have a list there? That is not in the spec and not even a discussion item for addition.

amykglen commented 4 years ago

Ah, I didn't realize that about EdgeAttributes/NodeAttributes requiring scalars.

I've been dumping extra KG2 node/edge properties that do not map to existing API object model properties into EdgeAttributes/NodeAttributes (per #630), but perhaps that will need to change if lists aren't allowed.

I think it's actually only NodeAttributes that lists are currently being put into... (specifically the node publications and synonym fields from KG2 neo4j). If you run this query:

add_qnode(id=n00, curie=CHEMBL.COMPOUND:CHEMBL1771)
expand(kp=ARAX/KG2, synonym_handling=add_all)
resultify()

https://arax.rtx.ai/?m=2078

Here are some examples of NodeAttributes with lists in the resulting knowledge graph:

For node CHEMBL.COMPOUND:CHEMBL1771:

{
   "name":"publications",
   "type":"None",
   "url":"None",
   "value":[
      "PMID:26807542",
      "PMID:24164581",
      "PMID:24900388",
      "PMID:25075638",
      "PMID:24393581",
      "PMID:23899349",
      "PMID:22197392",
      "PMID:16248836",
      "PMID:24859764",
      "PMID:24513044",
      "PMID:24387347",
      "PMID:19932971",
      "PMID:20014752",
      "PMID:22055718",
      "PMID:22409598",
      "PMID:26948801",
      "PMID:27996269",
      "PMID:24931384",
      "PMID:27908762",
      "PMID:15646539",
      "PMID:22428882",
      "PMID:28653847",
      "PMID:22194678",
      "PMID:21467212",
      "PMID:28219047",
      "PMID:22232427"
   ]
}
{
   "name":"synonym",
   "type":"None",
   "url":"None",
   "value":[
      "clopidogrel",
      "PUBCHEM_BIOASSAY:49666423",
      "SID49666423",
      "GKTWGGQPFAXNFI-HNNXBMFYSA-N",
      "None",
      "InChI=1S/C16H16ClNO2S/c1-20-16(19)15(12-4-2-3-5-13(12)17)18-8-6-14-11(10-18)7-9-21-14/h2-5,7,9,15H,6,8,10H2,1H3/t15-/m0/s1",
      "ASTRAZENECA:1793",
      "COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl",
      "BNF:16755",
      "CLOPIDOGREL",
      "PUBCHEM_BIOASSAY:49666065",
      "ATC:16755",
      "DRUGMATRIX:228",
      "SID49666065",
      "Clopidogrel"
   ]
}
amykglen commented 4 years ago

Also - I think my question about what to do with a list of URLs is actually irrelevant - occasionally the node publications list contains URLs, but not always. So to my knowledge we actually don't have a need to store a list of (only) URLs in a NodeAttribute/EdgeAttribute.

edeutsch commented 4 years ago

Well, the spec is here: https://github.com/NCATS-Tangerine/NCATS-ReasonerStdAPI/blob/6a8a86b3b5cba67af9c750f64e738a52bb8a86e8/API/TranslatorReasonersAPI.yaml#L787

But we can lobby for change. I am a fan of providing whatever information you can. I would be willing to lobby for changing this to allow more types.

But Luis may have other ideas.

But I would say that it would be best to change this: "type":"None", to this: "type":"list", in your above examples. The 'type' field is there specifically to tell the reader what type is in the value.

edeutsch commented 4 years ago

oops, actually 'type' should probably be an ontology concept of "list of publications" for example. The 'type' should probably be the semantic type instead of the data type. So my above comment isn't really right. Although 'list' might still be better than 'None'.

amykglen commented 4 years ago

Ah, got it re: setting the type...

Rather than storing these extra KG2 properties in NodeAttributes/EdgeAttributes, is it worth considering adding these as further properties in our API object model? Not to the translator-wide standard, but just to our output? I thought I recall Patrick Wang saying during the hackathon that that's perfectly legal for the API standard (you're allowed to have extra properties).

edeutsch commented 4 years ago

It is true that it is perfectly legal. But I question whether it is wise. Since we expected that we would want to provide additional information on nodes and edges, we created an explicit mechanism to do so, the node_attribute and edge_attribute. It would seem less desirable to circumvent this mechanism and just add our own custom attributes to node and edge. Also, by using node_attribute and edge_attribute, via the 'type' mechanism, these attributes could in theory be machine interpretable. If the 'type' defined the true semantic type, then smart software could potentially do something useful with it automatically. If it's just an additional attribute, then it can be displayed and humans might recognize it, but it would be difficult for software to automatically recognize it. But that's all just my opinion and not necessarily widely shared.

So after, further reflection, I think this would be best:

{ "name":"publications", "type":"data:1187", "url":"https://pubmed.ncbi.nlm.nih.gov/$value", "value":[ "PMID:26807542", "PMID:24164581", "PMID:24900388", "PMID:25075638", "PMID:24393581", "PMID:23899349", "PMID:22197392", "PMID:16248836", "PMID:24859764", "PMID:24513044", "PMID:24387347", "PMID:19932971", "PMID:20014752", "PMID:22055718", "PMID:22409598", "PMID:26948801", "PMID:27996269", "PMID:24931384", "PMID:27908762", "PMID:15646539", "PMID:22428882", "PMID:28653847", "PMID:22194678", "PMID:21467212", "PMID:28219047", "PMID:22232427" ] }

We call it "publications" but the true semantic type is provided. I suppose the fact that it's a list can be obtained through introspection and does not need to be explicit.

Clearly I've changed my mind 3 times in 3 posts, so maybe it's all wrong. Or maybe one of them is right!

isbluis commented 4 years ago

I am also a fan of providing all relevant info; I just need to know what to expect so I can handle it.

I already handle lists in provided_by and publications; glad to add more!

And at least from my end, checking the data type directly from the json is equivalent to looking at the value of the type attribute; so I don't really need it. On the other hand, I would need to know what the extra properties are so I can deal with them.

edeutsch commented 4 years ago

or yet one more thought: instead of the silly $value thing, if value is a list, then url should be a list with all the urls of the individual items in the list:

"url":[ "https://pubmed.ncbi.nlm.nih.gov/26807542", "https://pubmed.ncbi.nlm.nih.gov/24164581" ], "value":[ "PMID:26807542", "PMID:24164581" ]

amykglen commented 4 years ago

Sounds good! I'm fine with continuing to use NodeAttributes/EdgeAttributes.

I like where you're going with the publications attribute, @edeutsch, although one problem I foresee is that it's not only PMIDs that are put into this list - here are some other examples:

{
   "type":"None",
   "name":"publications",
   "value":[
      "",
      "PMID:9862332",
      "PMID:11513144",
      "PMID:15057824",
      "DOI:10.1111/j.1399-0039.2007.00930.x",
      "DOI:10.1111/j.1399-0039.2006.00567.x",
      "PMID:11960306",
      "DOI:10.1002/(SICI)1521-4141(199812)28:12<3959::AID-IMMU3959>3.0.CO",
      "PMID:17900289",
      "PMID:16634863",
      "DOI:10.1038/nature02399",
      "DOI:10.1034/j.1600-065X.2001.1810119.x",
      "DOI:10.1038/sj.gene.6363836",
      "2-2"
   ],
   "url":"None"
}
{
   "type":"None",
   "name":"publications",
   "value":[
      "ISBN:0198506732"
   ],
   "url":"None"
}
{
   "type":"None",
   "name":"publications",
   "value":[
      "https://rarediseases.info.nih.gov/diseases/5878/babesiosis"
   ],
   "url":"None"
}

(I'm not sure what all the different possible things that might appear in this property in KG2 look like... these are just some of the first I came across)

edeutsch commented 4 years ago

hmm, that make it more complicated. Perhaps the better thing to do is instead of making the node_attribute value a list with N items, instead there should be N node_attributes, each one with the appropriate type. EDAM has a type for DOI which is data:1188 http://bioportal.bioontology.org/ontologies/EDAM/?p=classes&conceptid=http%3A%2F%2Fedamontology.org%2Fdata_1188&jump_to_nav=true etc.

ISBN is data:2634 http://bioportal.bioontology.org/ontologies/EDAM/?p=classes&conceptid=http%3A%2F%2Fedamontology.org%2Fdata_2634&jump_to_nav=true

If it's not a PMID or a DOI or a ISBN, maybe just a generic article: http://bioportal.bioontology.org/ontologies/EDAM/?p=classes&conceptid=http%3A%2F%2Fedamontology.org%2Fdata_0971&jump_to_nav=true

It's not clear if such care will be rewarded, but that's probably the right way to do it.

amykglen commented 4 years ago

Ok cool - I just separated this out into a new issue: #756. Will pick up there!

amykglen commented 4 years ago

And also created #757 to deal with the missing types.

Ok, so now that I separated out that stuff, I think this issue is complete - all (scalar) Node/EdgeAttributes that are urls are now put in the url field rather than the value field. Just needs to be rolled out/confirmed working.

amykglen commented 4 years ago

Tested on production - appears to be functioning as desired. This is what these fields are now looking like in 'results' in the UI, fyi:

Screen Shot 2020-05-20 at 4 50 15 PM

edeutsch commented 4 years ago

I suggest actually displaying the URL itself (hyperlinked) because a human can look at the URLs and understand what they signify.