clulab / eidos

Machine reading system for World Modelers
Apache License 2.0
37 stars 24 forks source link

Compositional grounding is not composing #1115

Open kwalcock opened 2 years ago

kwalcock commented 2 years ago

I wonder whether at

https://github.com/clulab/eidos/blob/8758be8a5860fc3404f1d6f66b5095a161a6d87c/src/main/scala/org/clulab/wm/eidos/groundings/grounders/SRLCompositionalGrounder.scala#L285

the exact and inexact groundings should not be combined in a sequence but rather into a combined predicate grounding.

kwalcock commented 2 years ago

The reason it doesn't stick with just drought is the noisyOr. Drought as an exact match has a local score of 1.0, but a noisyOr with epsilon = 0.01f of 0.99. When composed with other things, the 0.99 can be beaten. Impact is matching too well with preparedness. If is was some other word, it could be a better answer than simple drought.

THEME: wm/concept/crisis_or_disaster/environmental/drought (1.0) Theme properties: wm/property/preparedness (0.9787039) Total: 0.99479353

THEME: wm/concept/crisis_or_disaster/environmental/drought (1.0) Theme properties: wm/property/geographic_resolution/local (0.8740176) Total: 0.9942701

THEME: wm/concept/crisis_or_disaster/environmental/drought (1.0) Theme properties: wm/property/stability (0.9661405) Total: 0.9947307

MihaiSurdeanu commented 2 years ago
val text = "The impact of the drought has been exacerbated by high local cereal prices , excess livestock mortality , conflict and restricted humanitarian access in some areas ."
val effectGroundings = Seq("wm/concept/crisis_or_disaster/environmental/drought", "", "", "")
canonicalName: drought

Expected :"[]"
Actual   :"[wm/property/preparedness]"
// I will check this one.  Does it go on searching after finding an exact match?

I think the grounding above is not too bad, assuming the mention is "impact of the drought", especially considering your comments about NoisyOr.

val text = "As of the first quarter of 2011 the Djiboutian government remains genuinely worried that a potential Afar insurgency in the north could quickly spread to the south , especially in view of the fact that the Djiboutian National Army is weak and the population in Djibouti City is facing deteriorating economic conditions due to high unemployment and inflation , which surged to 3,8 per cent in 2010 ."
val causeGroundings = Seq("wm/concept/economy/unemployment", "", "", "")
canonicalName: high unemployment

Expected :"[]"
Actual   :"[wm/property/quality]"
// May have gotten unemployment and then gone back for high

The above I think is actually better than the ground truth!

val text = "The brewing conflict had already had a serious impact in disrupting farming which had led to higher prices ."
val effectGroundings = Seq("", "wm/property/price_or_cost", "", "")
canonicalName: higher prices

Expected :"[wm/property/price_or_cost]"
Actual   :"[]"

I don't understand the output above. what is the actual output for the mention "higher prices"?

val text = "The brewing conflict had already had a serious impact in disrupting farming which had led to higher prices ."
val effectGroundings = Seq("", "wm/property/price_or_cost", "", "")
canonicalName: higher prices

Expected :"[]"
Actual   :"[wm/process/consumption]"
// This seems to fill in the process rather than the property.  Perhaps price_or_cost doesn't make the cutoff.
MihaiSurdeanu commented 2 years ago

@kwalcock: please see my question above about some of the outputs. Also, looking at the new code for findInexactPredicateGroundings, I don't see any grounding outputs that combine the 3 tuple elements, i.e., a property, a concept, and a process. This would be needed for something like "the cost of the transportation of oil". It looks to me like all the PredicateTuples generated have at most two elements populated. Did I miss something?

kwalcock commented 2 years ago

It might take ~1 hour to investigate each of these cases. You are correct about there only being two slots filled in. Mostly I forgot, but if there is a potential property and one position to the right an argument, two to the right a predicate, one to the left a predicate, two to the left an argument and of these two are potential properties themselves, what should the property be connected to and how many of the others should be included? It will help if isPred = !isArg is changed to !isArg && !stopword, because otherwise there are almost always both arguments and predicates.

MihaiSurdeanu commented 2 years ago

I agree about !isArg && !stopword. The attachment of properties is tricky wo/ syntactic info, but I propose a similar heuristic as what you currently use:

kwalcock commented 2 years ago

In the unit tests there are two cases which call for three slots and one case where the property is attached to a vacant concept:

    val text = "* Late onset of rains and long midseason dry spells led to localized household food production shortfalls ."
    val effectGroundings = Seq("wm/concept/goods/food", "", "wm/process/production", "wm/property/unavailability")

    val text = "Despite the large impact of the FFD program on growth in food consumption , results show that receipt of free food distribution causes a significant increase in perceived famine risk ."    
    val effectGroundings = Seq("wm/concept/crisis_or_disaster/famine", "", "wm/process/perceive", "wm/property/risk") //todo: add 'perceive' as a process? 'perception' exists now as a concept.

    val text = "Attempts at stabilizing prices are rarely completely successful because they need to be combined with safety nets and other social protection measures to mitigate the impact of higher food prices and to help prevent violent conflicts ."
    val effectGroundings = Seq("", "wm/property/price_or_cost", "wm/process/stabilization", "")

These can't be produced (yet).

kwalcock commented 2 years ago

The code for exact matches can presently only ever fill in two slots as well :-(

I can change the inexact matches to use the closest concept or process rather than the first one of the mention. Either can be a potential property, so it is unclear what do to with things that are both. I just picked the easiest option.

kwalcock commented 2 years ago

The stopwords are coming from https://github.com/clulab/eidos/blob/master/src/main/resources/org/clulab/wm/eidos/english/filtering/stops.txt and https://github.com/clulab/eidos/blob/master/src/main/resources/org/clulab/wm/eidos/english/filtering/transparent.txt and aren't good words to rule out. I'll try POS:

Tag != IN*
!= DT*
!= PRP*

But also add . and , and probably lots of other things. It's expensive to unnecessarily ground against things.

kwalcock commented 2 years ago

Now only 11 instead of 13 originally passing tests fail, which is good. Overall, though, 186 fail instead of the low of 141. 718 pass, down from 774. I will spot check for stupid mistakes, but otherwise it will probably have to be left at close to this.

It is implemented at:

https://github.com/clulab/eidos/blob/50b60762265e4456e29dd56fae4aadb8b8695a09/src/main/scala/org/clulab/wm/eidos/groundings/grounders/SRLCompositionalGrounder.scala#L359-L432

kwalcock commented 2 years ago

This seems in retrospect to be ambiguous:

    4. Check if properties in 2 attach to them (for each property look first to the right, then to left if none found; pick the first process/concept)

By look to the right I understood to look anywhere to the right all the way until the end of the mention, and to the left to look until the beginning of the mention. I've got a mention of 17 words here and there are some very long distance relationships. I wonder whether it should be one to the right, one to the left, two to the right, two to the left, etc. to find the closest one but with the right still at an advantage. There's probably not time to experiment with that.

To the end doesn't really make sense, because that would always include the first one (in the sentence) which would always be found and not need to be a default. On the other hand, a different one in the sentence might be closer so that the first one is wrong.

kwalcock commented 2 years ago

Taking the closest one, not going all the way to the end, performs worse, with 191 failed tests instead of 186. If it eventually looks like the properties are being over-generated, the search could be limited to the immediate vicinity and then be abandoned.

kwalcock commented 2 years ago

Updated code:

https://github.com/clulab/eidos/blob/106b1ce1d2a33b988a54f85ea80985edf1203476/src/main/scala/org/clulab/wm/eidos/groundings/grounders/SRLCompositionalGrounder.scala#L412-L470

MihaiSurdeanu commented 2 years ago

The code looks great @kwalcock ! I think we're very close to being done. Before that:

Thanks!

kwalcock commented 2 years ago

I'm working on it.

kwalcock commented 2 years ago

water transportation, which has an exact match on water:

{
    "@type" : "Extraction",
    "@id" : "_:Extraction_3",
    "type" : "concept",
    "subtype" : "entity",
    "labels" : [ "Concept", "Entity" ],
    "text" : "water transportation",
    "rule" : "simple-np++Decrease_ported_syntax_1_verb",
    "canonicalName" : "water transportation",
    "groundings" : [ {
      "@type" : "Groundings",
      "name" : "wm_compositional",
      "version" : "7df8e4a4237596bb40ceea9e72bb46821fc01827",
      "versionDate" : "2021-10-19T19:46:27Z",
      "values" : [ {
        "@type" : "PredicateGrounding",
        "theme" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/water",
          "value" : 1.0
        } ],
        "themeProcess" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/",
          "value" : 1.2696834802627563
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/commercial_transportation",
          "value" : 1.1144535541534424
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/personal_transportation",
          "value" : 1.1033762693405151
        } ],
        "value" : 1.0025968551635742,
        "display" : "THEME: wm/concept/goods/water (1.0); THEME PROCESS: wm/process/transportation/ (1.2696835)"
      }, {
        "@type" : "PredicateGrounding",
        "theme" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/water",
          "value" : 1.0
        } ],
        "value" : 0.9900000095367432,
        "display" : "THEME: wm/concept/goods/water (1.0)"
      } ]
    }
kwalcock commented 2 years ago

transportation of oil, only inexact matches:

{
    "@type" : "Extraction",
    "@id" : "_:Extraction_3",
    "type" : "concept",
    "subtype" : "entity",
    "labels" : [ "Concept", "Entity" ],
    "text" : "transportation of oil",
    "rule" : "simple-np++Decrease_ported_syntax_1_verb",
    "canonicalName" : "transportation oil",
    "groundings" : [ {
      "@type" : "Groundings",
      "name" : "wm_compositional",
      "version" : "7df8e4a4237596bb40ceea9e72bb46821fc01827",
      "versionDate" : "2021-10-19T19:46:27Z",
      "values" : [ {
        "@type" : "PredicateGrounding",
        "theme" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/fuel",
          "value" : 1.8379604816436768
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/environment/natural_resources/fossil_fuels",
          "value" : 1.727161169052124
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/protein_foods",
          "value" : 0.9827450513839722
        } ],
        "themeProcess" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/",
          "value" : 1.2696834802627563
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/commercial_transportation",
          "value" : 1.1144535541534424
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/personal_transportation",
          "value" : 1.1033762693405151
        } ],
        "value" : 0.7849923372268677,
        "display" : "THEME: wm/concept/goods/fuel (1.8379605); THEME PROCESS: wm/process/transportation/ (1.2696835)"
      } ]
    }
kwalcock commented 2 years ago

cost of transportation of oil:

{
    "@type" : "Extraction",
    "@id" : "_:Extraction_3",
    "type" : "concept",
    "subtype" : "entity",
    "labels" : [ "Concept", "Entity" ],
    "text" : "cost of transportation of oil",
    "rule" : "gazetteer++simple-np++Decrease_ported_syntax_1_verb",
    "canonicalName" : "transportation oil",
    "groundings" : [ {
      "@type" : "Groundings",
      "name" : "wm_compositional",
      "version" : "7df8e4a4237596bb40ceea9e72bb46821fc01827",
      "versionDate" : "2021-10-19T19:46:27Z",
      "values" : [ {
        "@type" : "PredicateGrounding",
        "theme" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/fuel",
          "value" : 1.8379604816436768
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/environment/natural_resources/fossil_fuels",
          "value" : 1.727161169052124
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/concept/goods/protein_foods",
          "value" : 0.9827450513839722
        } ],
        "themeProcess" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/",
          "value" : 1.2696834802627563
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/commercial_transportation",
          "value" : 1.1144535541534424
        }, {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/process/transportation/personal_transportation",
          "value" : 1.1033762693405151
        } ],
        "themeProcessProperties" : [ {
          "@type" : "Grounding",
          "ontologyConcept" : "wm/property/price_or_cost",
          "value" : 1.0
        } ],
        "value" : 0.890346109867096,
        "display" : "THEME: wm/concept/goods/fuel (1.8379605); THEME PROCESS: wm/process/transportation/ (1.2696835), Process properties: wm/property/price_or_cost (1.0)"
      } ]
    }
MihaiSurdeanu commented 2 years ago

If I can read the JSON correctly, it looks to me like all these 3 examples are correct! Nice job @kwalcock !

kwalcock commented 2 years ago
canonicalName: drought
words: of the drought
[info] - should process "The impact of the drought has been exacerbated by high local cereal prices , excess livestock mortality , conflict and restricted humanitarian access in some areas ." effect themeProperty correctly *** FAILED ***
[info]   "[wm/property/preparedness]" was not equal to "[]" (TestGrounding.scala:45)
[info]   Analysis:
[info]   "[wm/property/preparedness]" -> "[]"

It wanted
val effectGroundings = Seq("wm/concept/crisis_or_disaster/environmental/drought", "", "", "")
but apparently got
val effectGroundings = Seq("wm/concept/crisis_or_disaster/environmental/drought", "wm/property/preparedness", "", "")
kwalcock commented 2 years ago

Most of the errors are from a slot being filled in that wasn't before, particularly for properties and processes. Perhaps the tests paid less attention to those, or maybe the isPred = !isArg results in too many potential processes and maybe there are too many long-distance properties or the property match threshold is too low.

This is already a week overdue. I'm not sure there's time for analysis.

Examples:

[info]   "[wm/property/stability]" was not equal to "[]" (TestGrounding.scala:47)
[info]   "[wm/property/condition]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/preparedness]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/process/population/death]" was not equal to "[]" (TestGrounding.scala:46)
[info]   "[wm/process/combining]" was not equal to "[]" (TestGrounding.scala:46)
[info]   "[wm/property/violent]" was not equal to "[]" (TestGrounding.scala:47)
[info]   "[wm/property/quality]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/aftermath]" was not equal to "[]" (TestGrounding.scala:47)
[info]   "[wm/property/stability]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/stability]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/preparedness]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/price_or_cost]" was not equal to "[]" (TestGrounding.scala:45)
[info]   "[wm/property/preparedness]" was not equal to "[]" (TestGrounding.scala:47)
etc.

Fewer are changes:

[info]   "wm/process/[farming/herding]" was not equal to "wm/process/[population/migrate/]" (TestGrounding.scala:46)
[info]   "wm/concept/[entity/muslim_communities]" was not equal to "wm/concept/[agriculture/]" (TestGrounding.scala:44)
[info]   "wm/concept/[entity/hamas]" was not equal to "wm/concept/[plan/]" (TestGrounding.scala:44)
[info]   "wm/concept/[entity/muslim_communities]" was not equal to "wm/concept/[government/]" (TestGrounding.scala:44)
[info]   "...ocess/communication/[campaign]" was not equal to "...ocess/communication/[informing]" (TestGrounding.scala:46)
etc.
MihaiSurdeanu commented 2 years ago

Agreed that we can't spend too much time on this at this point. But the above example for "of the drought" worries me. Which tokens were used for the preparedness grounding? It seems "drought" was used for drought. The other two "of" and "the" are stop words. So, then what was used?

kwalcock commented 2 years ago

It is grounding to "impact of the drought". "Impact" doesn't get printed because it is special, probably transparent.

MihaiSurdeanu commented 2 years ago

Then I think it is fine!

Should we merge this PR?

kwalcock commented 2 years ago

I would need to disable failing tests before it can go into master. Right now it is more important to go into kwalcock/wmexchanger which has all the updates for the new DART protocols. That hadn't gone to master yet because it was only just put into use. It is headed there for the code freeze.

MihaiSurdeanu commented 2 years ago

Thanks! Let me know when it's done so we can email MITRE.

kwalcock commented 2 years ago

The documents were regrounded and uploaded. I slacked Ben and Robyn.

MihaiSurdeanu commented 2 years ago

Beautiful!! Thanks!

kwalcock commented 2 years ago

@MihaiSurdeanu

Regarding this:

Do NOT use them if they were used as triggers for increase/decrease attributes, AND they are tagged as JJ* or VBN.

There is code intended to do very much the same at these places which use a set of words which to exclude:

https://github.com/clulab/eidos/blob/d9fa808f725f23af0b2037b39655ba1dc519a25d/src/main/scala/org/clulab/wm/eidos/groundings/grounders/srl/SRLCompositionalGrounder.scala#L108

https://github.com/clulab/eidos/blob/d9fa808f725f23af0b2037b39655ba1dc519a25d/src/main/scala/org/clulab/wm/eidos/groundings/grounders/srl/SRLCompositionalGrounder.scala#L116

https://github.com/clulab/eidos/blob/d9fa808f725f23af0b2037b39655ba1dc519a25d/src/main/scala/org/clulab/wm/eidos/groundings/grounders/srl/SRLCompositionalGrounder.scala#L558-L568

The conditions are slightly different: the POS is not checked. It is also not used consequently throughout the code. It only rules out predicates right now. I wonder whether I should update the conditions or reuse them.

kwalcock commented 2 years ago

While I'm at it, the predicates are calculated in SentenceHelper and that definition might be reused. It is slightly different: https://github.com/clulab/eidos/blob/d9fa808f725f23af0b2037b39655ba1dc519a25d/src/main/scala/org/clulab/wm/eidos/groundings/grounders/srl/SentenceHelper.scala#L18-L34

MihaiSurdeanu commented 2 years ago

Yep, I think the attachment strings should be excluded from grounding. Do these include increase/decrease words? What type are they, TriggeredAttachment?

kwalcock commented 2 years ago

TriggeredAttachment is a superclass of things like Decrease, Increase, Quantification, Property, Pos(itive)Change, NegChange, Heding, Negation. This is code that Becky wrote, so probably well thought-out. It could be narrowed down to some of the subclasses.

I'm not liking that the present code rules out words rather than words at positions. If a word is repeated in a sentence, removing based only text will take it out everywhere even if it is used differently.

MihaiSurdeanu commented 2 years ago

I agree with everything you said. But having the same word repeated in the same phrase is pretty unlikely…

On Thu, Mar 10, 2022 at 20:21 Keith Alcock @.***> wrote:

TriggeredAttachment is a superclass of things like Decrease, Increase, Quantification, Property, Pos(itive)Change, NegChange, Heding, Negation. This is code that Becky wrote, so probably well thought-out. It could be narrowed down to some of the subclasses.

I'm not liking that the present code rules out words rather than words at positions. If a word is repeated in a sentence, removing based only text will take it out everywhere even if it is used differently.

— Reply to this email directly, view it on GitHub https://github.com/clulab/eidos/issues/1115#issuecomment-1064733688, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TQVC3VTJ7X4W4ZINT3U7K32LANCNFSM5PK663SQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

kwalcock commented 2 years ago

Exact matches are only run on the ontology node names. The ontology node examples have the approximate matches. We've been recently using lemmas to compare against the unlemmatized (raw) node names. The means that "higher temperatures" would be changed to "higher temperature" before comparison and not match the node wm/concept/environment/higher_temperatures. I don't think we can ask the ontology creators to use lemmas, but it is easy for us to test both words and lemmas and take the best. I have made that change.

kwalcock commented 2 years ago

Less easy is dealing with "higher". It is one of those words ruled out because it is an attribute or something and therefore would not be used in an exact match. I think that creators of the ontology might put specific texts in the node names that they want to match no matter what. I'm waiting to invalidate the attributes and parts of speech until after the exact match calculation has completed.

kwalcock commented 2 years ago

Another example seems to be armed conflict.

kwalcock commented 2 years ago

African governments works better in this version. It is a node. Also "cash crops". There are some times in which the match is a distraction, however. So, I've made it so that at least one exactly matching word needs to be valid. In the sentence

"exceptional prolonged and persistent agro-pastoral drought sequence Highlights 1"

it changed the match from prolonged to drought. wm/concept/time/prolonged is strange. It seems more like a property.

MihaiSurdeanu commented 2 years ago

Thanks @kwalcock ! I agree with the changes you made. I like the idea of having >= 1 valid words in the exact match!