clulab / eidos

Machine reading system for World Modelers
Apache License 2.0

Parentheses escaped as -LRB-, -RRB- #235

Closed adarshp closed 6 years ago

adarshp commented 6 years ago

When parentheses are present in the text, they get escaped as -LRB-, -RRB-, etc. This gets propagated to the sentence text in the JSON-LD output file. I suspect it might also cause other problems, such as the entity "Conflict" not being grounded in the following sentence:

"Conflict affects mostly the Greater Upper Nile Region (states of Upper Nile, Unity and Jonglei) with Central Equatoria remaining by and large unaffected after the early stages of the conflict."

Minimal working example with sbt console:

    import org.clulab.wm.eidos.EidosSystem
    val reader = new EidosSystem()
    reader.extractFromText("X (Y) causes Z").document.sentences.head.getSentenceText
    res2: String = X -LRB- Y -RRB- causes Z

I think the issue is related to Universal Dependencies - I managed to find the following issues filed in 2015:

https://github.com/UniversalDependencies/UD_English-EWT/issues/1
https://github.com/UniversalDependencies/docs/issues/148

Any ideas on how to fix this?

MihaiSurdeanu commented 6 years ago

This is necessary due to the fact that the syntactic parser expects parentheses to be represented this way. But I think this can (and should) be reverted in the canonical name.

BeckySharp commented 6 years ago

Agreed, I can make that change, thanks for the issue!

kwalcock commented 6 years ago

Thanks, volunteer.

adarshp commented 6 years ago

Excellent, thanks Becky!

BeckySharp commented 6 years ago

Hey all -- I think the only place they are occurring in the JSON-LD is the text field, not canonicalName, which kinda makes sense looking at the code. When I grep our example file, I can't find any examples of either -LRB- or -RRB- in a canonicalName. @adarshp, if you saw one, can you please send me an MWE so I can replicate it and write it up as a test? Thanks! @MihaiSurdeanu, as I understand it, we're not concerned about reverting parens in the doc text output, correct?
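
For anyone who wants to repeat that grep-style check on their own output, here is a minimal sketch in Scala. It is not part of Eidos: `CanonicalNameCheck` and `escapedNames` are hypothetical names, and the sketch assumes the JSON-LD is pretty-printed with fields of the form `"canonicalName" : "..."` and that values contain no escaped quotes. It also assumes that square and curly brackets, if they are escaped at all, use the usual Penn Treebank tokens.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of Eidos.
object CanonicalNameCheck {
  // Matches a pretty-printed field such as:  "canonicalName" : "Conflict"
  private val canonicalName: Regex = "\"canonicalName\"\\s*:\\s*\"([^\"]*)\"".r
  // Penn Treebank bracket tokens: -LRB-, -RRB-, -LSB-, -RSB-, -LCB-, -RCB-
  private val escapedBracket: Regex = "-[LR][RSC]B-".r

  /** Return every canonicalName value that still contains an escaped bracket token. */
  def escapedNames(jsonLd: String): Seq[String] =
    canonicalName.findAllMatchIn(jsonLd)
      .map(_.group(1))
      .filter(name => escapedBracket.findFirstIn(name).isDefined)
      .toSeq
}
```

An empty result from `CanonicalNameCheck.escapedNames(fileContents)` would agree with the grep above; any returned value would be a candidate for a test case.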

adarshp commented 6 years ago

You're right, they don't occur in the canonicalName. However, we were extracting the "text" field from the "DirectedRelation" entries as part of the provenance (if MITRE/analysts wish to inspect the original sentence). The other puzzling thing I saw was that the word 'conflict' was not being grounded: see the output of Eidos run on 10_FAO_a-i5505e.txt (one of the 52 docs from MITRE):

    "@type" : "Entity",
    "@id" : "_:Entity_3447",
    "labels" : [ "NounPhrase", "Entity" ],
    "text" : "Conflict",
    "rule" : "simple-np",
    "canonicalName" : "Conflict",
    "grounding" : [ {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/human/nation",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/natural/crop",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/events/human/human_migration",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/human/fertilizer",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/human/livelihood",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/natural/soil/soil_contents",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/temporal/months",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/human/financial/economic/revenue",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/entities/measurement/weight",
      "value" : 0.0
    }, {
      "@type" : "Grounding",
      "ontologyConcept" : "/events/human/famine",
      "value" : 0.0
    } ],
    "provenance" : [ {
      "@type" : "Provenance",
      "document" : {
        "@id" : "_:Document_3573"
      },
      "sentence" : {
        "@id" : "_:Sentence_3835"
      },
      "positions" : {
        "@type" : "Interval",
        "start" : 4,
        "end" : 4
      }
    } ]

I thought that this might be due to the parentheses escaping - but I could be wrong - could you check your output on 10_FAO_a-i5505e.txt (it's in the Google Drive folder I think)?

kwalcock commented 6 years ago

I appreciate people's detective work. Sometimes it's a thankless task.

MihaiSurdeanu commented 6 years ago

Fwiw, we should not revert parens in the doc text. Just in canonical names, if they occur there.

BeckySharp commented 6 years ago

ok -- then I am closing the Issue for now, based on the content of the thread. we can reopen if needed. thanks all!

adarshp commented 6 years ago

Sounds good - I can unescape them downstream to make the text more human-readable.
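
A minimal sketch of what that downstream unescaping could look like, in Scala. This is not code from Eidos: `PtbBrackets`, `unescape`, and `unescapeAndTighten` are hypothetical names. Only -LRB- and -RRB- are confirmed in this thread; the square- and curly-bracket tokens are the standard Penn Treebank ones and are included on the assumption that they may show up too.

```scala
// Hypothetical helper, not part of Eidos.
object PtbBrackets {
  // Penn Treebank bracket tokens and the characters they stand for.
  private val replacements: Seq[(String, String)] = Seq(
    "-LRB-" -> "(", "-RRB-" -> ")",
    "-LSB-" -> "[", "-RSB-" -> "]",
    "-LCB-" -> "{", "-RCB-" -> "}"
  )

  /** Replace each escaped bracket token with its literal character. */
  def unescape(text: String): String =
    replacements.foldLeft(text) { case (acc, (token, char)) => acc.replace(token, char) }

  /** Also drop the tokenizer's spaces just inside round brackets: "( Y )" becomes "(Y)". */
  def unescapeAndTighten(text: String): String =
    unescape(text).replace("( ", "(").replace(" )", ")")
}
```

On the example from the top of the thread, `PtbBrackets.unescape("X -LRB- Y -RRB- causes Z")` gives `X ( Y ) causes Z`, and `unescapeAndTighten` gives `X (Y) causes Z`. The tightening step is cosmetic only: the sentence text is rebuilt from tokens, so the original whitespace is not guaranteed to be recovered exactly.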