Closed johnbachman closed 5 years ago
Thanks John! We'll take a look at these soon.
@enoriega, @maxaalexeeva: something more for you. To be discussed tomorrow.
@enoriega: please check if this is caused by broken syntax or something else. If it's syntax, I propose to postpone this until we merge UD in master. Please update on this thread.
@johnbachman: I am relatively new on the project, so I apologize if my question does not make much sense. Below is the output we get from the system*--tyrosine is extracted as 'site' and 'IRS-1' as 'theme'. What would your desired output look like? Did you mean that 'IRS-1' should be extracted as 'site' instead of 'tyrosine', that 'tyrosine' should come as modification to a protein, or maybe some other option? Thank you in advance.
*just the relevant part of one of your examples to make the relations in question easier to see
@maxaalexeeva Actually, now that you point it out, that particular sentence is fairly ambiguous (and poorly written). "Tyrosine phosphorylation of IRS1 and pSer 473 AKT" could, in principle, indicate "tyrosine phosphorylation of IRS1" and "tyrosine phosphorylation of pSer 473 AKT". However, "pSer 473 AKT" refers (in abbreviated form) to "AKT phosphorylated at serine 473" (i.e., not tyrosine). So the interpretation should actually be that IGF-1 induces two distinct things: phosphorylation of IRS-1 at tyrosine and phosphorylation of AKT at serine 473.
Looking at the output, I can see that there is a phosphorylation event with IRS-1 as the theme and tyrosine as the site, which is correct. However, there is also a phosphorylation event stemming from the same "phosphorylation" word with "AKT" as the theme, which (in this particular case) is not correct. However, I can definitely see now that this sentence is linguistically quite ambiguous!
The third sentence is more clear:
Insulin stimulated rapid tyrosine phosphorylation of IR, phosphorylation of the PI3K effector Akt at Ser473...
IR (insulin receptor) is phosphorylated at tyrosine, whereas AKT is phosphorylated at Serine 473.
Here's another simpler example from the set of 44 sentences:
Insulin stimulated tyrosine phosphorylation of IRS-1, PI 3-kinase activity, and Akt phosphorylation.
The "tyrosine" applies only to the phosphorylation of IRS-1 here, not to the phosphorylation of AKT.
Btw I just realized that the JSON txt file I attached above was not pretty printed, which makes it really difficult to manually scan for sentences. I've attached a new version so it's easier to review.
Thanks @johnbachman!
@maxaalexeeva: can you please paste here the outputs for the latter 2 sentences sent by John? Also, can you please book a 1-hour slot for next week (separate from the regular ASDF meeting), so we can plan how to address these issues? For this meeting, I would like @maxaalexeeva, @zhengzhongliang, and ideally @enoriega (if you can join remotely) to participate. Please find a slot that works for the 3 of you. Thanks!
@johnbachman: it is possible that there are 2 issues at play here:
Here they are (run on master):
Thanks @maxaalexeeva!
Let's look at the second figure, it's simpler. Further, let's call the first instance of "phosphorylation" "p1", and the second instance "p2". Then, the following relations are incorrect: p1-PI, p1-Akt, p2-IRS-1, and p2-PI
Can you please see which rules create these extractions? These rules should be constrained to not cross commas in enumerations of events.
@MihaiSurdeanu those rules are syntax rules (syntax_1a_noun and syntax_4_noun).
Here's the dep. parse for the second figure:
My thinking was to limit what words hops can go through, e.g., we could prevent p1 relating to Akt through putting a constraint along these lines: (?! nmod_of [word=phosphorylation] )
Your suggestion makes more sense, but I haven't found a way to write that constraint yet.
Also, maybe this can be fixed through some sort of mention filter? E.g., if the entity has been previously found as a theme for some other simple event, filter this event out?
Does this look more like what we want?
That is from trying to solve it the first way I suggested (still haven't found a way to put a constraint on commas). Still need to try to make it more generalizable by not limiting the constraint to just the word 'phosphorylation' and see how it influenced the other tests. For now, all the changes are local and on my branch.
@maxaalexeeva: this is much better! What exactly did you do?
@MihaiSurdeanu, in the syntax rules I mentioned, I basically disallowed hops if they involve a relation of this kind: 'phosphorylation of phosphorylation' (?! <nmod_of [word=phosphorylation]).
I also adjusted a token rule to avoid 'phosphorylation of IRS-1 and Akt' to be extracted as an event in a hypothetical scenario like this:
Here are all the changes I made so far:
diff --git a/main/src/main/resources/org/clulab/reach/biogrammar/events/simple-event_template.yml b/main/src/main/resources/org/clulab/reach/biogrammar/events/simple-event_template.yml
index 66f1a12..01c6384 100644
--- a/main/src/main/resources/org/clulab/reach/biogrammar/events/simple-event_template.yml
+++ b/main/src/main/resources/org/clulab/reach/biogrammar/events/simple-event_template.yml
@@ -186,7 +186,7 @@
pattern: |
trigger = [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix}]
cause:BioChemicalEntity? = ((<dobj|<'nmod:npmod')? (nmod_by|nmod_agent|"nsubj:xsubj") | compound | nmod_of nmod_by) /compound|conj_(and|or|nor)/{,2} | "nmod:poss"
- theme:BioChemicalEntity = nmod_of appos? /compound|conj_(and|or|nor)/{,2}
+ theme:BioChemicalEntity = (?! nmod_of [word=phosphorylation]) nmod_of appos? /compound|conj_(and|or|nor)/{,2}
site:Site? = (/nmod_/? compound{,2}){1,2}
@@ -243,7 +243,7 @@
action: ${ actionFlow }
pattern: |
trigger = [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix} & !outgoing=/nmod_(by|of)/]
- theme:BioChemicalEntity = </conj_(and|or|nor)/? /conj_(and|or|nor)|compound|nmod_of/{,2}
+ theme:BioChemicalEntity = (?! <nmod_of [word=phosphorylation]) </conj_(and|or|nor)/? /conj_(and|or|nor)|compound|nmod_of/{,2}
site:Site? = compound | <dobj? /nmod_(at|on)/ nummod?
@@ -303,7 +303,7 @@
label: ${ label }
action: ${ actionFlow }
pattern: |
- (?<trigger> [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix}]) /of/ [!tag=/^(V|JJ)/]{,2}? @theme:BioChemicalEntity (/^(on|at)$/ @site:Site)?
+ @site:Site? (?<trigger> [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix}]) /of/ [!tag=/^(V|JJ)/]{,2}? @theme:BioChemicalEntity (/^(on|at)$/ @site:Site)?
# verbose, but nec. to handle coordination
@@ -314,7 +314,7 @@
label: ${ label }
action: ${ actionFlow }
pattern: |
- (?<trigger> [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix}]) /of/ [!tag=/^(V|JJ)/]{,2}? @BioChemicalEntity [tag="CC"] @theme:BioChemicalEntity
+ (?<trigger> [lemma=/${ nominalTriggerLemma }/ & ${triggerPrefix}]) /of/ [!tag=/^(V|JJ)/]{,2}? @BioChemicalEntity [tag="CC"] @theme:BioChemicalEntity (?! [lemma=/${ nominalTriggerLemma }/])
I have rerun all the event tests, and there are no new tests failing. If the above is okay with you, similar changes will need to be made in other rules--now that these rules are fixed, incorrect events are extracted with different rules in other sentences.
@maxaalexeeva: this is a great intuition. I think we'll have to generalize this into a mention filter that is executed after the simple event grammar. This filter will disable any simple events that contain the trigger of another simple event in the syntactic dependency path between its own trigger and the theme. This has multiple advantaged:
What do you think? Is this clear? Do you agree?
@maxaalexeeva agreed to work on a mention filter for this issue.
I need help writing the filter. Will anyone be available Monday between 11 and 4:30? @MihaiSurdeanu @marcovzla @enoriega.
Alternatively, does anyone have tips on how to approach this? I've been trying to use .paths (like these: https://github.com/clulab/reach/blob/afff6776cf2449c6de1c16c79776a499e1f06dc6/main/src/main/scala/org/clulab/reach/mentions/CorefMention.scala#L118), but I don't know how to filter mentions based on the information that is so deep in a nested map. And this might not be the best way.
If @marcovzla and @enoriega can't make it, can you pick a 30 min slot Monday from my meetings?
I won't be at campus but we can zoom at 3:00 for this. Does that work for you @maxaalexeeva ?
@enoriega it does! Thank you!
A solution suggested by @myedibleenso to avoid the check being too expensive (the steps below are a direct quote):
Find any intersecting args + triggers for competing events.
Grab SynPaths corresponding to any intersecting args (the .paths attribute).
Check if any SynPaths (note: these are not necessarily shortest paths) pass through one of the trigger tokens for any competing events.
This makes sense!
@MihaiSurdeanu, that's a lot of information below, but I could use some feedback.
@bsharpataz wrote the bulk of the filter for me. See here: https://github.com/clulab/reach/blob/e5e11dc1a8534e5eb0d018759ce66ad9265c7e95/main/src/main/scala/org/clulab/reach/darpa/MentionFilter.scala#L27
I finally managed to get it to run without breaking any tests. I call it here: https://github.com/clulab/reach/blob/4b59961dccbffaabb9023aec0dd91db6b30664ed/main/src/main/scala/org/clulab/reach/ReachSystem.scala#L106
The issue with that version of the filter was that it only filtered out mentions based on the shortest path and that is not sufficient to filter out all the erroneous mentions. In 1a and 1b, you can see that it filtered out the relation between P1 (the first instance of phosphorylation) and Akt (compare with the original 2a and 2b; the dependencies are shown in 4a and 4b).
I discussed the possibility of checking the whole path with @myedibleenso, and he seemed to think (please, correct me if I am wrong) that checking the whole path is expensive. I tried a few other things and the one that looked promising was checking the incoming relations to the overlapping argument --- filter out the mentions in which there is an incoming edge connecting the argument in question to the trigger of another mention (here: https://github.com/clulab/reach/blob/cfbb46316b856040977985181a55d732c4c96afc/main/src/main/scala/org/clulab/reach/darpa/MentionFilter.scala#L72). You can see the result of applying that filter in 3a and 3b.
Several apparent issues with these filters:
Actual questions:
The results of applying the filter that only checks the shortest path---it only filters out P1 -- Akt: a. b.
The relations produced by the system in the master branch: a.
b.
The results after applying the filter that checks if there is an incoming edge connecting the argument in question to the trigger of another mention---it filters out P2 -- IRS in both cases: a. b.
Dependencies for examples a and b in 1-3:
a. b.
Broken and correct (based on master) transcription test:
a.
b.
Broken and correct (based on master) translocation test: a. b.
Thanks @maxaalexeeva and @bsharpataz!
should be optimized. For example, it's simpler to just pass ms
to getOverlappingMentions, and, in getOverlappingMentions add to the filter that m != _
(to skip itself)
I think this filter will not be perfect but it is already useful!
I agree with adding some constraints to the filter, e.g., ignoring translocation, transcription, amount events makes sense (unless the original event is a translocation, transcription, amount; that is, never allow the overlap of events of same label). This would fix tests 5 and 6. Not sure about appositives... Do you have an example?
This filter is still much better and more generic than modifying rules, so I would say to try to improve this as much as possible and not modify the rules.
I think you should check all the paths. But you should be careful about it, to keep it efficient. That is, I recommend you build a single HashSet from the positions covered by all paths, and check that the other trigger is in there. But wait, isn't this what you're doing in this line:
If so, I am not sure what the question is. Are you saying that m.paths("theme").get(theme)
gets only the shortest paths? If so, I think we should leave it as is. Searching for all paths is expensive...
But can you please comment it better? - will do.
Line 33 -- done. I realized that this line I added for something else already makes sure a mention is not compared to itself: https://github.com/clulab/reach/blob/f520fc2d9efb154b6a02e4e4db0909883094f0ab/main/src/main/scala/org/clulab/reach/darpa/MentionFilter.scala#L53
I agree with adding some constraints to the filter - I will try that.
Not sure about appositives... Do you have an example? - in example 6, protein is part of an appositive to ASPP2 (see the dependency in 6a), i.e., it's part of a phrase that provides additional identifying information. In the second version of the filter, I check if the argument is connected to the trigger of a different event with an incoming edge and filter the mention out if that's true (just a heuristic). However, with appositives, the argument in question (e.g., ASPP2 in 6) will in many cases be correctly connected to two triggers: its own trigger and the one that the appositive (protein) gets since they refer to the same entity (ASPP2 = protein); this will trigger the filter unnecessarily. Does that make sense? I will want to look if this holds on more examples and if it will make sense to add a constraint based on that.
Are you saying that m.paths("theme").get(theme) gets only the shortest paths? If so, I think we should leave it as is. It looks like it only gets the shortest path. I'll check again. It might be originating here? https://github.com/clulab/reach/blob/afff6776cf2449c6de1c16c79776a499e1f06dc6/assembly/src/main/scala/org/clulab/reach/assembly/relations/classifier/PathFinder.scala#L47
On Jan 2, 2019, at 9:10 PM, Maria (Masha) Alexeeva notifications@github.com wrote:
But can you please comment it better? - will do.
Line 33 -- done. I realized that this line I added for something else already makes sure a mention is not compared to itself: https://github.com/clulab/reach/blob/f520fc2d9efb154b6a02e4e4db0909883094f0ab/main/src/main/scala/org/clulab/reach/darpa/MentionFilter.scala#L53
Ok.
I agree with adding some constraints to the filter - I will try that.
Not sure about appositives... Do you have an example? - in example 6, protein is part of an appositive to ASPP2 (see the dependency in 6a), i.e., it's part of a phrase that provides additional identifying information. In the second version of the filter, I check if the argument is connected to the trigger of a different event with an incoming edge and filter the mention out if that's true (just a heuristic). However, with appositives, the argument in question (e.g., ASPP2 in 6) will in many cases be correctly connected to two triggers: its own trigger and the one that the appositive (protein) gets since they refer to the same entity (ASPP2 = protein); this will trigger the filter unnecessarily. Does that make sense? I will want to look if this holds on more examples and if it will make sense to add a constraint based on that.
I agree. Good catch.
Are you saying that m.paths("theme").get(theme) gets only the shortest paths? If so, I think we should leave it as is. It looks like it only gets the shortest path. I'll check again. It might be originating here? https://github.com/clulab/reach/blob/afff6776cf2449c6de1c16c79776a499e1f06dc6/assembly/src/main/scala/org/clulab/reach/assembly/relations/classifier/PathFinder.scala#L47
Yep. I propose to leave it as is.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@MihaiSurdeanu, I am running the incorrect mention examples that John provided. It seems that in many cases, the filter is doing a reasonable job---it does not remove all of the incorrect relations, but it does some. I can't always be sure the result what it should be, but there is this one example that is clearly wrong (the parse is wrong and that interferes with the logic of the filter). Any suggestions? Maybe some sort of constraint?
Also, above John said this:
Actually, now that you point it out, that particular sentence is fairly ambiguous (and poorly written). "Tyrosine phosphorylation of IRS1 and pSer 473 AKT" could, in principle, indicate "tyrosine phosphorylation of IRS1" and "tyrosine phosphorylation of pSer 473 AKT". However, "pSer 473 AKT" refers (in abbreviated form) to "AKT phosphorylated at serine 473" (i.e., not tyrosine).
Is there some place in the system where pSer 473 AKT can be de-abbreviated? If it is stays in the abbreviated form, the rules won't handle it correctly, will they?
Thanks @maxaalexeeva!
My brain can't handle that many regulations at once... Which ones do you think are incorrect? I think most of those are fine. IL-1alpha indeed controls the two phosphorylations mentioned later in the sentence, which would lead to many regulations.
Also, John's comments does not seem to be about this text, no?
@MihaiSurdeanu, sorry, I should have been more specific. The incorrect rel that I meant was IRS-1 being attached to phosphorylation2, but not phosphorylation1. The parser basically does what I tried to avoid with the filter -- it attaches an argument (IRS-1) by going through a different phosphorylation.
Maybe we can have some sort of token rule/action for 'of' (only for cases when it is right after the trigger to avoid other rels being broken) to take precedence over what parse does?
The example comes from the set of 44 examples that John attached in one of his previous messages on this thread.
Hmm... This clearly a broken parse. I wonder if we need to add a surface rule like: SITE simple_event_trigger of PROTEIN? Can you please look into this?
In any case, this comment is independent of your filter, which I think is good and should be merged. Anything stopping us from merging this in master?
@MihaiSurdeanu the tests seem to be passing locally with the filter. I do not think the filter should cause any major issues, but we will hear from the end users if it does and can take it out, right? I will start a pull request.
I will look into the surface rule.
@MihaiSurdeanu, I looked into the possibility of using a surface rule to help with the broken parse (we discussed that a few messages up, but here is what the wrong output looks like---notice how IRS-1 is the theme to phosphorylation2 here instead of to phosphorylation1):
This token rule that I wrote based on your suggestion takes care of the incorrect parse, but only if its priority is higher than the priority of all the other simple events in simple-events_template.yml:
to produce this output (IRS-1 is attached as a theme to phosphorylation1):
If it's okay to use this rule, what would be the best way to change the priority? I see three options (all of which don't sound too good to me but I can't think of anything else):
1) hardcode the priority in the rule 2) create another variable (something like the ${ edgeCasePriority } in the rule screenshot above) and add it to the list of variables in events_master.yml like this: 3) put this rule in a separate file; that will mean that simple event entries in events_master.yml will have to have duplicates to import rules from two different files and with different priorities, right? Here's an example of what I think that would look like:
Is there a better way to change priority of one rule?
Hey -- why would the priority matter here? i.e., I completely believe you about what's happening, but why? It could be the case that the fix lies elsewhere...?
On Mon, Feb 4, 2019 at 11:08 PM Maria (Masha) Alexeeva < notifications@github.com> wrote:
@MihaiSurdeanu https://github.com/MihaiSurdeanu, I looked into the possibility of using a surface rule to help with the broken parse (we discussed that a few messages up, but here is what the wrong output looks like---notice how IRS-1 is the theme to phosphorylation2 here instead of to phosphorylation1):
[image: image] https://user-images.githubusercontent.com/31713912/52253598-48b57d00-28c6-11e9-88cd-440e61bc38e6.png
This token rule that I wrote based on your suggestion takes care of the incorrect parse, but only if its priority is higher than the priority of all the other simple events in simple-events_template.yml: [image: image] https://user-images.githubusercontent.com/31713912/52253670-a0ec7f00-28c6-11e9-862c-175bc9fccc7d.png
to produce this output (IRS-1 is attached as a theme to phosphorylation1):
[image: image] https://user-images.githubusercontent.com/31713912/52253679-b8c40300-28c6-11e9-9685-cab0b3b3dbee.png
If it's okay to use this rule, what would be the best way to change the priority? I see three options (all of which don't sound too good to me but I can't think of anything else):
- hardcode the priority in the rule
- create another variable (something like the ${ edgeCasePriority } in the rule screenshot above) and add it to the list of variables in events_master.yml like this: [image: image] https://user-images.githubusercontent.com/31713912/52254207-a9928480-28c9-11e9-9dea-3e76c49d868b.png
- put this rule in a separate file; that will mean that simple event entries in events_master.yml will have to have duplicates to import rules from two different files and with different priorities, right? Here's an example of what I think that would look like:
[image: image] https://user-images.githubusercontent.com/31713912/52255780-27a65980-28d1-11e9-9d31-07d8078c83f4.png
Is there a better way to change priority of one rule?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/issues/605#issuecomment-460524628, or mute the thread https://github.com/notifications/unsubscribe-auth/AFIniUPICaL9hF2xINNS2EeQduWkCTOOks5vKR_RgaJpZM4YW84U .
@bsharpataz: I suspect the priority is needed because we do not allows overlapping events. In the regular priority layer, parts of this correct event are captured for other things, which prevent the correct rule to create the right event.
@maxaalexeeva: my preference is to create a new grammar file with edge cases for simple events. Say edge_cases_simple_events.yml. This one will have only edge-case rules, all of them running with the higher priority. These rules would apply to all simple events, so you can reuse the same variables from the master file.
@bsharpataz, @MihaiSurdeanu, thank you both for the responses. I will deal with this in the next couple of days.
@mihai -- I know we don't allow overlapping mentions, but priority wouldn't affect that, depending on how you filter (i.e. are you using a keepLongest / keep most complete type of method?). That would be overriding if it were simply an overlap issue, right? (I could be misunderstanding the situation)
On Tue, Feb 5, 2019 at 10:03 AM Maria (Masha) Alexeeva < notifications@github.com> wrote:
@bsharpataz https://github.com/bsharpataz, @MihaiSurdeanu https://github.com/MihaiSurdeanu, thank you both for the responses. I will deal with this in the next couple of days.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/issues/605#issuecomment-460718496, or mute the thread https://github.com/notifications/unsubscribe-auth/AFIniWeB4Mdz3Hh1a0pCJfSynM2O5nJpks5vKblagaJpZM4YW84U .
Yeah, I remember that the order in which the mentions were created matters. Something along the lines of allow only certain overlaps with existing mentions.
I think the keepLongest heuristic applies for conflicting mentions generated in the same layer.
Hi guys, I've included a set of 44 sentences where REACH makes the same systematic error each time. REACH is extracting (essentially) that Insulin phosphorylates AKT on tyrosine, when in fact the sentences are referring to a tyrosine residue on another protein (either IR or IRS1) in each case. Here are three examples:
I've attached a file containing INDRA Statements in JSON format (Github made me set the extension to .txt) containing all 44 sentences.
reach_ins_akt_phos_tyrosine_errors.txt