Closed d2fong closed 5 years ago
Passage: ‘Canonical, SMAD-dependent TGFβ signalling pathway.’ “The transforming growth factor beta (TGFB) superfamily of cytokines are involved in a multitude of activities. The TGFB1 ligand can activate the cell-surface receptor ALK5. In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein. Phosphorylated SMAD2 is able to bind to the common mediator SMAD4, which regulates transcription.”
REACH seems to give a larger response. I have truncated it for clarity. For reference, the REACH response is 26 kB and the INDRA response is 1300 B for a 4-line abstract.
{
"events": {
"frames": [
{
"frame-id": "evem-api149750-UAZ-r1-Reach-1-13",
"text": "TGFB1 ligand can activate the cell-surface receptor ALK5",
"arguments": [
{
"text": "ALK5",
"argument-type": "entity",
"type": "controlled",
"object-type": "argument",
"index": 0,
"arg": "ment-api149750-UAZ-r1-Reach-1-16"
},
{
"text": "TGFB1",
"argument-type": "entity",
"type": "controller",
"object-type": "argument",
"index": 0,
"arg": "ment-api149750-UAZ-r1-Reach-1-15"
}
],
"type": "activation",
"frame-type": "event-mention",
"subtype": "positive-activation",
"is-direct": false,
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 171,
"object-type": "relative-pos"
},
"trigger": "activate",
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 115,
"object-type": "relative-pos"
},
"sentence": "sent-api149750-UAZ-r1-Reach-1",
"found-by": "Positive_activation_syntax_1_verb",
"verbose-text": "The TGFB1 ligand can activate the cell-surface receptor ALK5."
}
]
},
"entities": {
"frames": [
{
"text": "TGFB",
"frame-id": "ment-api149750-UAZ-r1-Reach-0-14",
"type": "family",
"frame-type": "entity-mention",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 41,
"object-type": "relative-pos"
},
"xrefs": [
{
"namespace": "interpro",
"species": "human",
"object-type": "db-reference",
"id": "IPR015615"
}
],
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 37,
"object-type": "relative-pos"
},
"sentence": "sent-api149750-UAZ-r1-Reach-0"
},
{
"text": "transforming growth factor beta",
"frame-id": "ment-api149750-UAZ-r1-Reach-0-13",
"type": "protein",
"frame-type": "entity-mention",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 35,
"object-type": "relative-pos"
},
"xrefs": [
{
"namespace": "uaz",
"object-type": "db-reference",
"id": "UAZ00196"
}
],
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 4,
"object-type": "relative-pos"
},
"sentence": "sent-api149750-UAZ-r1-Reach-0"
}
]
},
"sentences": {
"frames": [
{
"text": "The transforming growth factor beta (TGFB) superfamily of cytokines are involved in a multitude of activities. The TGFB1 ligand can activate the cell-surface receptor ALK5. In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein. Phosphorylated SMAD2 is able to bind to the common mediator SMAD4, which regulates transcription",
"frame-id": "pass-api149750-UAZ-r1-Reach",
"section-id": "NoSection",
"frame-type": "passage",
"is-title": false,
"section-name": "NoSection",
"object-type": "frame",
"index": "Reach",
"object-meta": {
"component": "nxml2fries",
"object-type": "meta-info"
}
},
{
"text": "The transforming growth factor beta ( TGFB ) superfamily of cytokines are involved in a multitude of activities .",
"frame-id": "sent-api149750-UAZ-r1-Reach-0",
"passage": "pass-api149750-UAZ-r1-Reach",
"frame-type": "sentence",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 110,
"object-type": "relative-pos"
},
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 0,
"object-type": "relative-pos"
},
"object-meta": {
"component": "BioNLPProcessor",
"object-type": "meta-info"
}
},
{
"text": "The TGFB1 ligand can activate the cell-surface receptor ALK5 .",
"frame-id": "sent-api149750-UAZ-r1-Reach-1",
"passage": "pass-api149750-UAZ-r1-Reach",
"frame-type": "sentence",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 172,
"object-type": "relative-pos"
},
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 111,
"object-type": "relative-pos"
},
"object-meta": {
"component": "BioNLPProcessor",
"object-type": "meta-info"
}
},
{
"text": "In the cytoplasm , activated ALK5 phosphorylates the SMAD2 protein .",
"frame-id": "sent-api149750-UAZ-r1-Reach-2",
"passage": "pass-api149750-UAZ-r1-Reach",
"frame-type": "sentence",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 239,
"object-type": "relative-pos"
},
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 173,
"object-type": "relative-pos"
},
"object-meta": {
"component": "BioNLPProcessor",
"object-type": "meta-info"
}
},
{
"text": "Phosphorylated SMAD2 is able to bind to the common mediator SMAD4 , which regulates transcription",
"frame-id": "sent-api149750-UAZ-r1-Reach-3",
"passage": "pass-api149750-UAZ-r1-Reach",
"frame-type": "sentence",
"end-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 336,
"object-type": "relative-pos"
},
"object-type": "frame",
"start-pos": {
"reference": "pass-api149750-UAZ-r1-Reach",
"offset": 240,
"object-type": "relative-pos"
},
"object-meta": {
"component": "BioNLPProcessor",
"object-type": "meta-info"
}
}
],
"object-type": "frame-collection",
"object-meta": {
"processing-end": "2018-02-22T15:19:15Z",
"doc-id": "api149750",
"component-type": "machine",
"component": "Reach",
"processing-start": "2018-02-22T15:19:14Z",
"object-type": "meta-info",
"organization": "UAZ"
}
}
}
{
"statements": [
{
"type": "Phosphorylation",
"enz": {
"name": "TGFBR1",
"db_refs": {
"TEXT": "ALK5",
"UP": "P36897",
"HGNC": "11772"
},
"sbo": "http://identifiers.org/sbo/SBO:0000460"
},
"sub": {
"name": "SMAD2",
"db_refs": {
"TEXT": "SMAD2",
"UP": "Q15796",
"HGNC": "6768"
},
"sbo": "http://identifiers.org/sbo/SBO:0000015"
},
"evidence": [
{
"source_api": "reach",
"pmid": "api149751",
"text": "In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein.",
"annotations": {
"found_by": "Phosphorylation_syntax_1a_verb"
},
"epistemics": {
"section_type": null,
"direct": true
}
}
],
"id": "5d0c4427-a78f-4e66-a2a2-43dbe9ea3e09",
"sbo": "http://identifiers.org/sbo/SBO:0000216"
},
{
"type": "Activation",
"subj": {
"name": "TGFB1",
"db_refs": {
"TEXT": "TGFB1",
"UP": "P01137",
"HGNC": "11766"
},
"sbo": "http://identifiers.org/sbo/SBO:0000459"
},
"obj": {
"name": "TGFBR1",
"db_refs": {
"TEXT": "ALK5",
"UP": "P36897",
"HGNC": "11772"
},
"sbo": "http://identifiers.org/sbo/SBO:0000643"
},
"obj_activity": "activity",
"evidence": [
{
"source_api": "reach",
"pmid": "api149751",
"text": "The TGFB1 ligand can activate the cell-surface receptor ALK5.",
"annotations": {
"found_by": "Positive_activation_syntax_1_verb"
},
"epistemics": {
"section_type": null,
"direct": false
}
}
],
"id": "6bae5845-c459-4bd4-b928-3bc2eeb73230",
"sbo": "http://identifiers.org/sbo/SBO:0000182"
}
]
}
Note: I am not familiar with the REACH response format, but it seems to me that REACH recognizes more statements than INDRA
"sub": {
"name": "SMAD2",
"db_refs": {
"TEXT": "SMAD2",
"UP": "Q15796",
"HGNC": "6768"
},
"sbo": "http://identifiers.org/sbo/SBO:0000015"
}
2 statements:
It would be nice to know:
Edit: Found https://github.com/sorgerlab/indra/blob/master/indra/statements.py which describes the INDRA statement logic. This could potentially be used to determine which statements have which contents, etc.
Edit: Found the documentation on Read the Docs: http://indra.readthedocs.io/en/latest/modules/statements.html?highlight=statement
Passage 2. FASL/CD95 signaling The Fas family of cell surface receptors initiate the apoptotic pathway through interaction with the external ligand, FasL. The cytoplasmic domain of Fas interacts with a number of molecules in the transduction of the external signal to the cytoplasmic side of the cell membrane. The most notable cytoplasmic domain is the Death Domain (DD) that is involved in recruiting the FAS-associating death domain-containing protein (FADD).
REACH returns a 25kb response. INDRA returns an empty statements array.
Passage 1. p53-Dependent G1 DNA Damage Response
“Under normal conditions, p53 is a short-lived protein. The MDM2 protein, usually interacts with p53, and by virtue of its E3 ubiquitin ligase activity, mediates its degradation by the ubiquitin-proteasome machinery. Upon detection of DNA damage, the ATM kinase mediates the phosphorylation of the Mdm2 protein to block its interaction with p53. The p53 protein activates the transcription of cyclin-dependent kinase inhibitor, p21. p21 inactivates the CyclinE:Cdk2 complexes, which prevents entry into S-phase of the cell cycle.”
REACH and INDRA both return something. 25kb and 2.7kb responses respectively.
Thanks for the extensive testing @d2fong , this is excellent. It would be also interesting to know if what REACH returns in https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367794192 is actually garbage, which would justify indra's empty response. But https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367720381 is interesting, as in theory indra is just a broker to Reach, with some added functionalities.
@bgyori @johnbachman is this behavior attributable to the confidence measure (and some threshold) you mentioned last Thursday?
Hi @d2fong, thanks for the analysis! Just to clarify INDRA is a model assembly system with input interfaces to multiple natural language processing systems, including REACH. INDRA doesn't read the text itself, it is "at the mercy of" the reading systems' output. When you read text through the reach/process_text interface of the INDRA web service, it gets sent to REACH for reading, and the events in the resulting JSON response are processed to extract INDRA Statements.
Looking at the size of the responses (i.e. 25 kb vs 2.7 kb) is not relevant - the difference is due to the large amount of meta-information that REACH returns, the largest portion of which is the "sentences" section, which simply contains meta-data about the input text that was being read. The relevant section is the events section of the REACH response. This is where extracted mechanisms are represented. INDRA extracts Statements from this event section.
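As a rough illustration of the point above, the sketch below counts and groups the frames in the "events" section of a REACH response and ignores the "sentences" section entirely. The `reach_response` dict is a hand-trimmed stand-in modelled on the output quoted earlier in this thread, and `count_events` and `events_by_type` are hypothetical helper names, not part of REACH or INDRA.

```python
def count_events(reach_response):
    """Return the number of event frames in a REACH JSON response (as a dict)."""
    # A response with no extractions may lack the section or have no frames.
    events = reach_response.get("events", {})
    return len(events.get("frames", []))


def events_by_type(reach_response):
    """Group event texts by their REACH event type (e.g. 'activation')."""
    grouped = {}
    for frame in reach_response.get("events", {}).get("frames", []):
        grouped.setdefault(frame.get("type"), []).append(frame.get("text"))
    return grouped


# Hand-trimmed stand-in for a REACH response (assumption: real responses
# carry many more fields per frame).
reach_response = {
    "events": {"frames": [
        {"type": "activation",
         "text": "TGFB1 ligand can activate the cell-surface receptor ALK5"},
    ]},
    "sentences": {"frames": [{"text": "..."}, {"text": "..."}, {"text": "..."}]},
}

print(count_events(reach_response))
print(events_by_type(reach_response))
```

Under this view, a response can be tens of kilobytes and list every input sentence while still containing zero events.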
In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367720381, you list "sentences recognized by REACH" vs "sentences recognized by INDRA". This is somewhat misleading since REACH will list every sentence that it reads in the "sentence" section of its extractions; this doesn't mean that anything was extracted from that sentence. So simply because REACH read a sentence, doesn't mean that the sentence contained any interesting events or that if it did, REACH was able to extract it. Again, the relevant place to look for what was extracted is the "events" section of the REACH response. Let's look at each sentence specifically:
In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367794192, the paragraph "The Fas family of cell surface receptors initiate the apoptotic pathway through interaction with the external ligand, FasL. The cytoplasmic domain of Fas interacts with a number of molecules in the transduction of the external signal to the cytoplasmic side of the cell membrane. The most notable cytoplasmic domain is the Death Domain (DD) that is involved in recruiting the FAS-associating death domain-containing protein (FADD)." yields 0 events from REACH! So there is nothing extracted by the reading at all, hence no Statements from INDRA. Having looked at literally thousands of these over the last few years, I am not surprised that nothing was extracted from this paragraph since it doesn't contain the type of declarative descriptions of mechanisms that a machine would understand.
In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367795250, from the paragraph, we get 4 INDRA Statements:
This result looks perfectly good to me: everything usable from the events extracted by REACH is picked up by INDRA. Looking at the REACH output, the things that are missing are lost at the level of reading.
Hope this is helpful!
Thanks @bgyori for the clarification on how INDRA generates statements from REACH. Yes I also agree that size is not necessarily relevant, I was just illustrating how much more verbose REACH is compared to INDRA and potential implications of that from the perspective of a user.
So can you confirm that INDRA only looks at the "events" section of the REACH output?
@sacdallago The current REACH-Factoid Document converter looks at "events", "entities", and "sentences" to generate the model. There is also complexity that I don't fully understand yet in the REACH-Factoid document converter, but my initial hypothesis is that the model generated from INDRA might be a bit more barren than the model generated from REACH.
INDRA processes "events" extracted by REACH and instantiates INDRA Statements from these. But "events" refer to "entities" as arguments, and so INDRA does look these up in the "entities" section of the REACH output to gather information on database identifiers. Finally, to find the evidence sentence from which the given event was extracted, INDRA looks up the associated text in the "sentences" section of the REACH output and includes it in the Evidence of the INDRA Statement.
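The lookup described above can be sketched roughly as follows. This is not INDRA's actual processor code; `resolve_event` is a hypothetical helper, and the sample response is hand-assembled with shortened frame IDs, using the key names from the REACH output quoted earlier and the UniProt IDs from the INDRA output.

```python
def resolve_event(reach_response, event):
    """Collect argument groundings and the evidence sentence for one event frame."""
    # Index entity and sentence frames by frame-id for direct lookup.
    entities = {f["frame-id"]: f
                for f in reach_response.get("entities", {}).get("frames", [])}
    sentences = {f["frame-id"]: f
                 for f in reach_response.get("sentences", {}).get("frames", [])}

    arguments = []
    for arg in event.get("arguments", []):
        # Each argument points at an entity frame through its "arg" key.
        entity = entities.get(arg["arg"], {})
        arguments.append({
            "role": arg["type"],
            "text": arg["text"],
            "xrefs": [(x["namespace"], x["id"]) for x in entity.get("xrefs", [])],
        })

    # The event's "sentence" key points at the sentence frame it came from.
    sentence = sentences.get(event.get("sentence"), {})
    return {"type": event["type"], "arguments": arguments,
            "evidence": sentence.get("text")}


# Hand-assembled sample (assumption: frame IDs shortened for readability).
reach_response = {
    "events": {"frames": [{
        "type": "activation",
        "sentence": "sent-1",
        "arguments": [
            {"text": "ALK5", "type": "controlled", "arg": "ment-16"},
            {"text": "TGFB1", "type": "controller", "arg": "ment-15"},
        ],
    }]},
    "entities": {"frames": [
        {"frame-id": "ment-16", "text": "ALK5",
         "xrefs": [{"namespace": "uniprot", "id": "P36897"}]},
        {"frame-id": "ment-15", "text": "TGFB1",
         "xrefs": [{"namespace": "uniprot", "id": "P01137"}]},
    ]},
    "sentences": {"frames": [
        {"frame-id": "sent-1",
         "text": "The TGFB1 ligand can activate the cell-surface receptor ALK5 ."},
    ]},
}

resolved = resolve_event(reach_response, reach_response["events"]["frames"][0])
print(resolved)
```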
Okay, maybe INDRA's output will be more similar to the current REACH-Factoid document converter than I initially thought.
I generated a Factoid Document from input 2 and it gave an empty model too.
Hi @bgyori - thanks for the insight. Is the REACH instance INDRA is wrapping local or somewhere else?
There are three usage modes implemented in INDRA by which REACH can be used for reading (and a similar setup exists for other reading systems that we interface with):
from indra.sources import reach
reach.process_text('MEK binds ERK.')
reach.process_pmc('PMC1234...')
from indra.sources import reach
reach.process_text('MEK binds ERK.', offline=True)
reach.process_nxml_file('PMC1234... .nxml', offline=True)
Having said this, the web service of INDRA you guys are currently testing is set up to just call the REACH web service to read. As we discussed with @sacdallago, we could change this setting to use a local instance of REACH (i.e. option 2 above).
Thanks, you answered my question! (congrats on the paper btw).
@bgyori
I have another question about INDRA.
Here is a snippet from the "entities" section of a REACH response.
"entities": {
"frames": [
{
"text": "p53",
"frame-id": "ment-api150222-UAZ-r1-Reach-0-375",
"type": "protein",
"frame-type": "entity-mention",
"end-pos": {
"reference": "pass-api150222-UAZ-r1-Reach",
"offset": 28,
"object-type": "relative-pos"
},
"xrefs": [
{
"namespace": "uniprot",
"species": "homo sapiens",
"object-type": "db-reference",
"id": "P04637"
}
],
If you look at the "xrefs" key there is some information that the current Factoid model uses to "ground" entities. Does INDRA not provide this information?
The grounding information REACH gives is standardized and extended by INDRA, and is represented in each Agent's db_refs attribute. For instance, for the above example with "p53", this is the corresponding argument representing the entity:
{
"name": "TP53",
"db_refs": {
"UP": "P04637",
"HGNC": "11998",
"TEXT": "p53"
}
}
What you see here is that the name has been standardized by INDRA to the official gene symbol, TP53 (this is done whenever possible). The original raw text string as it appeared in the text is in db_refs['TEXT'], the UniProt ID is in db_refs['UP'], and the HGNC ID that was added by INDRA is in db_refs['HGNC']. Grounding for chemicals, etc. is similar, but they refer to IDs in other databases like CHEBI or PUBCHEM. One special name space we developed, which has been integrated into REACH, is BE, a new name space and ontology for protein families and complexes. You can read more about it here: https://github.com/sorgerlab/bioentities
Okay. I am actually not that familiar with the concept of grounding but I do know that:
@jvwong @maxkfranz do any of you have comments or questions about how INDRA handles grounding and what that means for Factoid?
Good process, here. :-)
@sacdallago see me Monday re this?
Great work. Mihai told us we need to use since the web service version of REACH is very out of date and not production ready. So if this analysis was done with the reach web service, I think we need to redo it with the latest reach version. So how soon can we get INDRA working in our hands with the latest version of REACH via local JAR? Or can we ask the INDRA team for their perspective of how much better the latest version of reach is via indra vs. what we evaluated here?
As discussed today, let's keep this for a later point in time. @d2fong you can close if you feel like it's necessary
Closing for now. Can be reopened if we need to discuss this again in the future.
Reopening based on new learnings from the presentation today. We might eventually use it.
Closing for now.
@IgorRodchenkov
Based on dev call today, and https://github.com/PathwayCommons/factoid/issues/388#issuecomment-441108836
Based on above I think we can close.
INDRA uses REACH for NLP. I'm not a fan of adding another layer of concept mapping/assumptions between Factoid and REACH. To boot we'd need to write a mapper between INDRA and Factoid which I'm not sure provides more benefits than updating (#388) our own from raw REACH output. There is the notion that INDRA could be a universal model for different NLP and sources. However, we're being pretty careful about our Factoid model types, so we'd probably need to rewrite any NLP-To-INDRA mapper if we wanted to easily switch between them. Close unless there's a reason here...
URL differences
Reach
Reach has you specify a text query param and it is only available using POST. IMO it is kind of weird that they require you to POST but it doesn't work if you send the text to process in the body of the POST request.
Example REACH url:
INDRA
Requires you to POST as well as have a "text" key in a JSON body. This seems like the more standard way to interact with a service.
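To make the difference concrete, here is a small sketch that builds both request shapes without sending them. The host names are placeholders (assumptions, not the real service URLs), the `/api/text` path is made up for the example, and `/reach/process_text` is taken from the interface name mentioned earlier in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoints (assumptions): substitute the real service URLs.
REACH_URL = "http://reach.example.org/api/text"
INDRA_URL = "http://indra.example.org/reach/process_text"


def build_reach_request(text):
    """REACH style: POST, with the text carried in the query string."""
    url = REACH_URL + "?" + urlencode({"text": text})
    return Request(url, data=b"", method="POST")


def build_indra_request(text):
    """INDRA style: POST, with the text under a "text" key in a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(INDRA_URL, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


reach_req = build_reach_request("MEK binds ERK.")
indra_req = build_indra_request("MEK binds ERK.")

print(reach_req.get_method(), reach_req.full_url)
print(indra_req.get_method(), indra_req.full_url, indra_req.data)
# Sending is left out on purpose; urllib.request.urlopen(req) would do the call.
```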