PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

Evaluate INDRA, compare it to REACH #188

Closed d2fong closed 5 years ago

d2fong commented 6 years ago

URL differences

Reach

REACH takes the input as a text query parameter and only accepts POST. IMO it is kind of weird that they require a POST, yet the request doesn't work if you send the text to process in the body of the POST request.

Example REACH url:

http://agathon.sista.arizona.edu:8080/odinweb/api/text?text=The%20transforming%20growth%20factor%20beta%20(TGFB)%20superfamily%20of%20cytokines%20are%20involved%20in%20a%20multitude%20of%20activities.%20The%20TGFB1%20ligand%20can%20activate%20the%20cell-surface%20receptor%20ALK5.%20In%20the%20cytoplasm,%20activated%20ALK5%20phosphorylates%20the%20SMAD2%20protein.%20Phosphorylated%20SMAD2%20is%20able%20to%20bind%20to%20the%20common%20mediator%20SMAD4,%20which%20regulates%20transcription

INDRA

INDRA also requires a POST, with the text under a text key in a JSON body. This seems like the more standard way to interact with a service, and it is what I would expect.
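To make the difference concrete, here is a minimal sketch that only builds the two requests with Python's standard library and does not send them. The REACH URL is the one quoted above. The INDRA URL is a placeholder, since the actual endpoint is not given in this thread, and both function names are made up for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

REACH_URL = "http://agathon.sista.arizona.edu:8080/odinweb/api/text"
# Placeholder host: the real INDRA web service address is not stated in this thread.
INDRA_URL = "http://indra.example.org/reach/process_text"


def build_reach_request(text):
    """POST with the text in the query string and an empty body."""
    query = urlencode({"text": text}, quote_via=quote)  # spaces become %20
    return Request(REACH_URL + "?" + query, data=b"", method="POST")


def build_indra_request(text):
    """POST with the text under a 'text' key in a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(INDRA_URL, data=body, headers=headers, method="POST")


reach_req = build_reach_request("MEK binds ERK.")
indra_req = build_indra_request("MEK binds ERK.")
print(reach_req.get_method(), reach_req.full_url)
print(indra_req.get_method(), indra_req.full_url, indra_req.data)
```

Passing either object to urllib.request.urlopen would send it, assuming the services are reachable.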

d2fong commented 6 years ago

Response differences

Input

Passage: ‘Canonical, SMAD-dependent TGFβ signalling pathway.’ “The transforming growth factor beta (TGFB) superfamily of cytokines are involved in a multitude of activities. The TGFB1 ligand can activate the cell-surface receptor ALK5. In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein. Phosphorylated SMAD2 is able to bind to the common mediator SMAD4, which regulates transcription.”

REACH

REACH seems to give a much larger response, so I have truncated it for clarity. For reference, the REACH response is about 26 kB and the INDRA response is about 1,300 bytes for a 4-line abstract.

{
    "events": {
        "frames": [
            {
                "frame-id": "evem-api149750-UAZ-r1-Reach-1-13",
                "text": "TGFB1 ligand can activate the cell-surface receptor ALK5",
                "arguments": [
                    {
                        "text": "ALK5",
                        "argument-type": "entity",
                        "type": "controlled",
                        "object-type": "argument",
                        "index": 0,
                        "arg": "ment-api149750-UAZ-r1-Reach-1-16"
                    },
                    {
                        "text": "TGFB1",
                        "argument-type": "entity",
                        "type": "controller",
                        "object-type": "argument",
                        "index": 0,
                        "arg": "ment-api149750-UAZ-r1-Reach-1-15"
                    }
                ],
                "type": "activation",
                "frame-type": "event-mention",
                "subtype": "positive-activation",
                "is-direct": false,
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 171,
                    "object-type": "relative-pos"
                },
                "trigger": "activate",
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 115,
                    "object-type": "relative-pos"
                },
                "sentence": "sent-api149750-UAZ-r1-Reach-1",
                "found-by": "Positive_activation_syntax_1_verb",
                "verbose-text": "The TGFB1 ligand can activate the cell-surface receptor ALK5."
            }
        ]
    },
    "entities": {
        "frames": [
            {
                "text": "TGFB",
                "frame-id": "ment-api149750-UAZ-r1-Reach-0-14",
                "type": "family",
                "frame-type": "entity-mention",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 41,
                    "object-type": "relative-pos"
                },
                "xrefs": [
                    {
                        "namespace": "interpro",
                        "species": "human",
                        "object-type": "db-reference",
                        "id": "IPR015615"
                    }
                ],
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 37,
                    "object-type": "relative-pos"
                },
                "sentence": "sent-api149750-UAZ-r1-Reach-0"
            },
            {
                "text": "transforming growth factor beta",
                "frame-id": "ment-api149750-UAZ-r1-Reach-0-13",
                "type": "protein",
                "frame-type": "entity-mention",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 35,
                    "object-type": "relative-pos"
                },
                "xrefs": [
                    {
                        "namespace": "uaz",
                        "object-type": "db-reference",
                        "id": "UAZ00196"
                    }
                ],
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 4,
                    "object-type": "relative-pos"
                },
                "sentence": "sent-api149750-UAZ-r1-Reach-0"
            }
        ]
    },
    "sentences": {
        "frames": [
            {
                "text": "The transforming growth factor beta (TGFB) superfamily of cytokines are involved in a multitude of activities. The TGFB1 ligand can activate the cell-surface receptor ALK5. In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein. Phosphorylated SMAD2 is able to bind to the common mediator SMAD4, which regulates transcription",
                "frame-id": "pass-api149750-UAZ-r1-Reach",
                "section-id": "NoSection",
                "frame-type": "passage",
                "is-title": false,
                "section-name": "NoSection",
                "object-type": "frame",
                "index": "Reach",
                "object-meta": {
                    "component": "nxml2fries",
                    "object-type": "meta-info"
                }
            },
            {
                "text": "The transforming growth factor beta ( TGFB ) superfamily of cytokines are involved in a multitude of activities .",
                "frame-id": "sent-api149750-UAZ-r1-Reach-0",
                "passage": "pass-api149750-UAZ-r1-Reach",
                "frame-type": "sentence",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 110,
                    "object-type": "relative-pos"
                },
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 0,
                    "object-type": "relative-pos"
                },
                "object-meta": {
                    "component": "BioNLPProcessor",
                    "object-type": "meta-info"
                }
            },
            {
                "text": "The TGFB1 ligand can activate the cell-surface receptor ALK5 .",
                "frame-id": "sent-api149750-UAZ-r1-Reach-1",
                "passage": "pass-api149750-UAZ-r1-Reach",
                "frame-type": "sentence",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 172,
                    "object-type": "relative-pos"
                },
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 111,
                    "object-type": "relative-pos"
                },
                "object-meta": {
                    "component": "BioNLPProcessor",
                    "object-type": "meta-info"
                }
            },
            {
                "text": "In the cytoplasm , activated ALK5 phosphorylates the SMAD2 protein .",
                "frame-id": "sent-api149750-UAZ-r1-Reach-2",
                "passage": "pass-api149750-UAZ-r1-Reach",
                "frame-type": "sentence",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 239,
                    "object-type": "relative-pos"
                },
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 173,
                    "object-type": "relative-pos"
                },
                "object-meta": {
                    "component": "BioNLPProcessor",
                    "object-type": "meta-info"
                }
            },
            {
                "text": "Phosphorylated SMAD2 is able to bind to the common mediator SMAD4 , which regulates transcription",
                "frame-id": "sent-api149750-UAZ-r1-Reach-3",
                "passage": "pass-api149750-UAZ-r1-Reach",
                "frame-type": "sentence",
                "end-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 336,
                    "object-type": "relative-pos"
                },
                "object-type": "frame",
                "start-pos": {
                    "reference": "pass-api149750-UAZ-r1-Reach",
                    "offset": 240,
                    "object-type": "relative-pos"
                },
                "object-meta": {
                    "component": "BioNLPProcessor",
                    "object-type": "meta-info"
                }
            }
        ],
        "object-type": "frame-collection",
        "object-meta": {
            "processing-end": "2018-02-22T15:19:15Z",
            "doc-id": "api149750",
            "component-type": "machine",
            "component": "Reach",
            "processing-start": "2018-02-22T15:19:14Z",
            "object-type": "meta-info",
            "organization": "UAZ"
        }
    }
}

INDRA

{
    "statements": [
        {
            "type": "Phosphorylation",
            "enz": {
                "name": "TGFBR1",
                "db_refs": {
                    "TEXT": "ALK5",
                    "UP": "P36897",
                    "HGNC": "11772"
                },
                "sbo": "http://identifiers.org/sbo/SBO:0000460"
            },
            "sub": {
                "name": "SMAD2",
                "db_refs": {
                    "TEXT": "SMAD2",
                    "UP": "Q15796",
                    "HGNC": "6768"
                },
                "sbo": "http://identifiers.org/sbo/SBO:0000015"
            },
            "evidence": [
                {
                    "source_api": "reach",
                    "pmid": "api149751",
                    "text": "In the cytoplasm, activated ALK5 phosphorylates the SMAD2 protein.",
                    "annotations": {
                        "found_by": "Phosphorylation_syntax_1a_verb"
                    },
                    "epistemics": {
                        "section_type": null,
                        "direct": true
                    }
                }
            ],
            "id": "5d0c4427-a78f-4e66-a2a2-43dbe9ea3e09",
            "sbo": "http://identifiers.org/sbo/SBO:0000216"
        },
        {
            "type": "Activation",
            "subj": {
                "name": "TGFB1",
                "db_refs": {
                    "TEXT": "TGFB1",
                    "UP": "P01137",
                    "HGNC": "11766"
                },
                "sbo": "http://identifiers.org/sbo/SBO:0000459"
            },
            "obj": {
                "name": "TGFBR1",
                "db_refs": {
                    "TEXT": "ALK5",
                    "UP": "P36897",
                    "HGNC": "11772"
                },
                "sbo": "http://identifiers.org/sbo/SBO:0000643"
            },
            "obj_activity": "activity",
            "evidence": [
                {
                    "source_api": "reach",
                    "pmid": "api149751",
                    "text": "The TGFB1 ligand can activate the cell-surface receptor ALK5.",
                    "annotations": {
                        "found_by": "Positive_activation_syntax_1_verb"
                    },
                    "epistemics": {
                        "section_type": null,
                        "direct": false
                    }
                }
            ],
            "id": "6bae5845-c459-4bd4-b928-3bc2eeb73230",
            "sbo": "http://identifiers.org/sbo/SBO:0000182"
        }
    ]
}

d2fong commented 6 years ago

Key Response Differences

Note: I am not familiar with the REACH response format, but it seems to me that REACH recognizes more statements than INDRA

Sentences recognized by REACH:

Sentences recognized by INDRA:

INDRA includes identity references in a simple JSON structure

            "sub": {
                "name": "SMAD2",
                "db_refs": {
                    "TEXT": "SMAD2",
                    "UP": "Q15796",
                    "HGNC": "6768"
                },
                "sbo": "http://identifiers.org/sbo/SBO:0000015"
            }

The REACH response seems more verbose, and its keys are harder to understand and traverse than INDRA's.

d2fong commented 6 years ago

The keys in an INDRA statement seem to differ based on the statement type

2 statements:

It would be nice to know:

Edit: Found https://github.com/sorgerlab/indra/blob/master/indra/statements.py which describes the INDRA statement logic. Could potentially use this to determine what statements have what contents, etc.

Edit:

Found the documentation on Read the Docs: http://indra.readthedocs.io/en/latest/modules/statements.html?highlight=statement
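Until the per-type keys are known, one way to cope is to treat any top-level value that carries a db_refs dictionary as an agent. This is only a sketch based on the two statements shown earlier in this thread (enz/sub for Phosphorylation, subj/obj for Activation); the helper name is hypothetical, and statement types that keep their agents in a list would need extra handling.

```python
def statement_agents(statement):
    """Return {role_key: agent} for each agent-like value in a statement dict.

    A value counts as an agent if it is a dict with a "db_refs" key, which
    matches the two statement types shown in this thread.
    """
    return {
        key: value
        for key, value in statement.items()
        if isinstance(value, dict) and "db_refs" in value
    }


# Trimmed copies of the two statements from the INDRA response above.
phosphorylation = {
    "type": "Phosphorylation",
    "enz": {"name": "TGFBR1", "db_refs": {"TEXT": "ALK5", "UP": "P36897", "HGNC": "11772"}},
    "sub": {"name": "SMAD2", "db_refs": {"TEXT": "SMAD2", "UP": "Q15796", "HGNC": "6768"}},
    "evidence": [],
}
activation = {
    "type": "Activation",
    "subj": {"name": "TGFB1", "db_refs": {"TEXT": "TGFB1", "UP": "P01137", "HGNC": "11766"}},
    "obj": {"name": "TGFBR1", "db_refs": {"TEXT": "ALK5", "UP": "P36897", "HGNC": "11772"}},
    "obj_activity": "activity",
    "evidence": [],
}

for stmt in (phosphorylation, activation):
    roles = statement_agents(stmt)
    print(stmt["type"], {role: agent["name"] for role, agent in roles.items()})
```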

d2fong commented 6 years ago

Input 2

Passage 2. FASL/CD95 signaling The Fas family of cell surface receptors initiate the apoptotic pathway through interaction with the external ligand, FasL. The cytoplasmic domain of Fas interacts with a number of molecules in the transduction of the external signal to the cytoplasmic side of the cell membrane. The most notable cytoplasmic domain is the Death Domain (DD) that is involved in recruiting the FAS-associating death domain-containing protein (FADD).

REACH returns a 25kb response. INDRA returns an empty statements array.

d2fong commented 6 years ago

Input 3

Passage 1. p53-Dependent G1 DNA Damage Response

“Under normal conditions, p53 is a short-lived protein. The MDM2 protein, usually interacts with p53, and by virtue of its E3 ubiquitin ligase activity, mediates its degradation by the ubiquitin-proteasome machinery. Upon detection of DNA damage, the ATM kinase mediates the phosphorylation of the Mdm2 protein to block its interaction with p53. The p53 protein activates the transcription of cyclin-dependent kinase inhibitor, p21. p21 inactivates the CyclinE:Cdk2 complexes, which prevents entry into S-phase of the cell cycle.”

REACH and INDRA both return something. 25kb and 2.7kb responses respectively.

jvwong commented 6 years ago

Just leaving a paper trail:

sacdallago commented 6 years ago

Thanks for the extensive testing @d2fong, this is excellent. It would also be interesting to know whether what REACH returns in https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367794192 is actually garbage, which would justify INDRA's empty response. But https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367720381 is interesting, as in theory INDRA is just a broker to REACH, with some added functionality.

@bgyori @johnbachman is this behavior to be attributed to the confidence measure (and some threshold) you mentioned last Thursday?

bgyori commented 6 years ago

Hi @d2fong, thanks for the analysis! Just to clarify, INDRA is a model assembly system with input interfaces to multiple natural language processing systems, including REACH. INDRA doesn't read the text itself; it is "at the mercy of" the reading systems' output. When you read text through the reach/process_text interface of the INDRA web service, it gets sent to REACH for reading, and the events in the resulting JSON response are processed to extract INDRA Statements.

Looking at the size of the responses (i.e. 25 kb vs 2.7 kb) is not relevant - the difference is due to the large amount of meta-information that REACH returns, the largest portion of which is the "sentences" section, which simply contains meta-data about the input text that was being read. The relevant section is the events section of the REACH response. This is where extracted mechanisms are represented. INDRA extracts Statements from this event section.
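As a rough illustration of this point, the sketch below counts event frames instead of measuring response size. It assumes the response has been parsed into a Python dict shaped like the REACH JSON shown earlier in this thread; the function names and the tiny sample dicts are made up for illustration, and the exact shape of a zero-event response is an assumption.

```python
import json


def extracted_events(reach_response):
    """Return the list of event frames, or [] if the section is missing."""
    return reach_response.get("events", {}).get("frames", [])


def summarize(reach_response):
    """Compare serialized size (mostly meta-information) with the event count."""
    events = extracted_events(reach_response)
    return {
        "approx_chars": len(json.dumps(reach_response)),
        "n_events": len(events),
        "event_types": [event.get("type") for event in events],
    }


# Hand-trimmed samples: one response with an event, one with sentences only.
with_event = {
    "events": {"frames": [{"type": "activation", "subtype": "positive-activation"}]},
    "sentences": {"frames": [{"text": "The TGFB1 ligand can activate the cell-surface receptor ALK5 ."}]},
}
sentences_only = {
    "events": {"frames": []},
    "sentences": {"frames": [{"text": "A sentence that was read but yielded no events ."}]},
}

print(summarize(with_event))
print(summarize(sentences_only))
```

A response can therefore be large and still report n_events of 0, which is the case where no Statements can be produced.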

In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367720381, you list "sentences recognized by REACH" vs "sentences recognized by INDRA". This is somewhat misleading since REACH will list every sentence that it reads in the "sentences" section of its extractions; this doesn't mean that anything was extracted from that sentence. So simply because REACH read a sentence doesn't mean that the sentence contained any interesting events or that, if it did, REACH was able to extract them. Again, the relevant place to look for what was extracted is the "events" section of the REACH response. Let's look at each sentence specifically:

In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367794192, the paragraph "The Fas family of cell surface receptors initiate the apoptotic pathway through interaction with the external ligand, FasL. The cytoplasmic domain of Fas interacts with a number of molecules in the transduction of the external signal to the cytoplasmic side of the cell membrane. The most notable cytoplasmic domain is the Death Domain (DD) that is involved in recruiting the FAS-associating death domain-containing protein (FADD)." yields 0 events from REACH! So there is nothing extracted by the reading at all, hence no Statements from INDRA. Having looked at literally thousands of these over the last few years, I am not surprised that nothing was extracted from this paragraph since it doesn't contain the type of declarative descriptions of mechanisms that a machine would understand.

In the comment https://github.com/PathwayCommons/factoid/issues/188#issuecomment-367795250, from the paragraph, we get 4 INDRA Statements:

This result looks perfectly good to me; everything usable from the events extracted by REACH is picked up by INDRA. Looking at the REACH output, the things that are missing are lost at the level of reading.

Hope this is helpful!

d2fong commented 6 years ago

Thanks @bgyori for the clarification on how INDRA generates statements from REACH. Yes I also agree that size is not necessarily relevant, I was just illustrating how much more verbose REACH is compared to INDRA and potential implications of that from the perspective of a user.

So can you confirm that INDRA only looks at the events section of the REACH output?

@sacdallago The current REACH-Factoid Document converter looks at events, entities, and sentences to generate the model. There is also complexity that I don't fully understand yet in the REACH-Factoid document converter, but my initial hypothesis is that the model generated from INDRA might be a bit more barren than the model generated from REACH.

bgyori commented 6 years ago

INDRA processes events extracted by REACH and instantiates INDRA Statements from these. But events refer to entities as arguments, and so INDRA does look these up in the entities section of the REACH output to gather information on database identifiers. Finally, to find the evidence sentence from which the given event was extracted, INDRA looks up the associated text in the sentences section of the REACH output and includes it in the Evidence of the INDRA Statement.
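The three lookups described here can be sketched as a join on frame IDs. This is not INDRA's actual code, only an illustration over a dict shaped like the REACH response quoted earlier in the thread. The function names are hypothetical, and the ALK5 entity frame with its xref is assumed for the example because that part of the response was truncated above.

```python
def index_frames(section):
    """Map frame-id -> frame for one section of a REACH response."""
    return {frame["frame-id"]: frame for frame in section.get("frames", [])}


def resolve_events(reach_response):
    """Attach entity xrefs and the evidence sentence to each event."""
    entities = index_frames(reach_response.get("entities", {}))
    sentences = index_frames(reach_response.get("sentences", {}))
    resolved = []
    for event in reach_response.get("events", {}).get("frames", []):
        participants = []
        for arg in event.get("arguments", []):
            entity = entities.get(arg.get("arg"), {})  # {} if not found
            participants.append({
                "role": arg.get("type"),
                "text": arg.get("text"),
                "xrefs": [(x.get("namespace"), x.get("id")) for x in entity.get("xrefs", [])],
            })
        sentence = sentences.get(event.get("sentence"), {})
        resolved.append({
            "type": event.get("type"),
            "participants": participants,
            "evidence": sentence.get("text"),
        })
    return resolved


sample = {
    "events": {"frames": [{
        "type": "activation",
        "sentence": "sent-api149750-UAZ-r1-Reach-1",
        "arguments": [
            {"text": "ALK5", "type": "controlled", "arg": "ment-api149750-UAZ-r1-Reach-1-16"},
            {"text": "TGFB1", "type": "controller", "arg": "ment-api149750-UAZ-r1-Reach-1-15"},
        ],
    }]},
    # Assumed entity frame (not shown in the truncated response above).
    "entities": {"frames": [{
        "frame-id": "ment-api149750-UAZ-r1-Reach-1-16",
        "text": "ALK5",
        "xrefs": [{"namespace": "uniprot", "id": "P36897"}],
    }]},
    "sentences": {"frames": [{
        "frame-id": "sent-api149750-UAZ-r1-Reach-1",
        "text": "The TGFB1 ligand can activate the cell-surface receptor ALK5 .",
    }]},
}

for item in resolve_events(sample):
    print(item)
```

The TGFB1 entity is deliberately left out of the sample to show that a failed lookup yields an empty xrefs list rather than an error.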

d2fong commented 6 years ago

Okay, maybe INDRA's output will be more similar to the current REACH-Factoid document converter than I initially thought.

I generated a Factoid Document from input 2 and it gave an empty model too.

jvwong commented 6 years ago

Hi @bgyori - thanks for the insight. Is the REACH instance INDRA is wrapping local or somewhere else?

bgyori commented 6 years ago

There are three usage modes implemented in INDRA by which REACH can be used for reading (and a similar setup exists for other reading systems that we interface with):

  1. Call the web service with some text or a PMC ID to read, e.g.
    from indra.sources import reach
    reach.process_text('MEK binds ERK.')
    reach.process_pmc('PMC1234...')
  2. Use a compiled REACH JAR locally (some setup required) to read text or PMC NXML e.g.
    from indra.sources import reach
    reach.process_text('MEK binds ERK.', offline=True)
    reach.process_nxml_file('PMC1234... .nxml', offline=True)
  3. Use a command line interface to the REACH CLI to run in parallel on a large amount of text content in batch mode and then process the resulting output files (this, I think is not that relevant for Factoid since it's meant for high-throughput reading of large quantities of literature)

Having said this, the web service of INDRA you guys are currently testing is set up to just call the REACH web service to read. As we discussed with @sacdallago, we could change this setting to use a local instance of REACH (i.e. option 2 above).

jvwong commented 6 years ago

Thanks, you answered my question! (congrats on the paper btw).

d2fong commented 6 years ago

@bgyori

I have another question about INDRA.

Here is a snippet from the entities section of a REACH response.

    "entities": {
        "frames": [
            {
                "text": "p53",
                "frame-id": "ment-api150222-UAZ-r1-Reach-0-375",
                "type": "protein",
                "frame-type": "entity-mention",
                "end-pos": {
                    "reference": "pass-api150222-UAZ-r1-Reach",
                    "offset": 28,
                    "object-type": "relative-pos"
                },
                "xrefs": [
                    {
                        "namespace": "uniprot",
                        "species": "homo sapiens",
                        "object-type": "db-reference",
                        "id": "P04637"
                    }
                ],

If you look at the xrefs key there is some information that the current Factoid model uses to "ground" entities. Does INDRA not provide this information?

bgyori commented 6 years ago

The grounding information REACH gives is standardized and extended by INDRA, and is represented in each Agent's db_refs attribute. For instance, for the above example with "p53" this is the corresponding argument representing the entity

{
  "name": "TP53",
  "db_refs": {
    "UP": "P04637",
    "HGNC": "11998",
    "TEXT": "p53"
  }
}

What you see here is that INDRA has standardized the name to the official gene symbol, TP53 (this is done whenever possible). The original raw text string as it appeared in text is in db_refs['TEXT'], the UniProt ID is in db_refs['UP'], and the HGNC ID that was added by INDRA is in db_refs['HGNC']. Grounding for chemicals, etc. is similar, but they would refer to IDs in other databases like CHEBI or PUBCHEM. One special name space we developed, which has been integrated into REACH, is BE, a new name space and ontology for protein families and complexes. You can read more about it here: https://github.com/sorgerlab/bioentities
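For Factoid's purposes, reading this grounding back out of an agent could look like the sketch below. It relies only on the name and db_refs layout shown in the example above; the helper name and the returned structure are made up for illustration.

```python
def grounding(agent):
    """Split an agent's db_refs into the raw text and database identifiers."""
    refs = dict(agent.get("db_refs", {}))  # copy so the input is not modified
    raw_text = refs.pop("TEXT", None)      # TEXT is the original string, not an ID
    return {
        "name": agent.get("name"),
        "text": raw_text,
        "ids": sorted(refs.items()),       # e.g. [("HGNC", ...), ("UP", ...)]
    }


tp53 = {
    "name": "TP53",
    "db_refs": {"UP": "P04637", "HGNC": "11998", "TEXT": "p53"},
}

info = grounding(tp53)
print(info["name"], "from text", repr(info["text"]))
for namespace, identifier in info["ids"]:
    print(namespace, identifier)
```

An agent whose db_refs holds only TEXT would come back with an empty ids list, which is one way to detect an ungrounded entity.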

d2fong commented 6 years ago

Okay. I am actually not that familiar with the concept of grounding but I do know that:

@jvwong @maxkfranz do any of you have comments or questions about how INDRA handles grounding and what that means for Factoid?

cccsander commented 6 years ago

Good process, here. :-)

@sacdallago see me Monday re this?

gbader commented 6 years ago

Great work. Mihai told us we need to use the local JAR, since the web service version of REACH is very out of date and not production ready. So if this analysis was done with the REACH web service, I think we need to redo it with the latest REACH version. So how soon can we get INDRA working in our hands with the latest version of REACH via local JAR? Or can we ask the INDRA team for their perspective on how much better the latest version of REACH is via INDRA vs. what we evaluated here?

sacdallago commented 6 years ago

As discussed today, let's keep this for a later point in time. @d2fong you can close if you feel like it's necessary

d2fong commented 6 years ago

Closing for now. Can be reopened if we need to discuss this again in the future.

d2fong commented 6 years ago

Reopening based on new learnings from the presentation today. We might eventually use it.

sacdallago commented 6 years ago

Closing for now.

d2fong commented 5 years ago

@IgorRodchenkov

sacdallago commented 5 years ago

Based on dev call today, and https://github.com/PathwayCommons/factoid/issues/388#issuecomment-441108836

jvwong commented 5 years ago

Based on dev call today, and #388 (comment)

Based on above I think we can close.

INDRA uses REACH for NLP. I'm not a fan of adding another layer of concept mapping/assumptions between Factoid and REACH. To boot, we'd need to write a mapper between INDRA and Factoid, which I'm not sure provides more benefit than updating our own mapper from raw REACH output (#388). There is the notion that INDRA could be a universal model for different NLP systems and sources. However, we're being pretty careful about our Factoid model types, so we'd probably need to rewrite any NLP-to-INDRA mapper if we wanted to easily switch between them. Close unless there's a reason here...