clulab / eidos

Machine reading system for World Modelers
Apache License 2.0

Extension of JSON-LD format to accommodate BBN data and our own future work #278

Closed kwalcock closed 6 years ago

kwalcock commented 6 years ago

Disclaimer: There are many ways to do this and picking one will likely involve unknown forces of subjectivity, intuition, and conjecture that won't match mine. In other words, this may be totally off base.

I'd like to divide the participants in our (binary) DirectedRelation and (binary) UndirectedRelation into essential and non-essential participants and not mix them. This is so that the focus is clear and simple and not everyone is forced to deal with the non-essential. This comes at the expense of needing different code to process the two types.

In the case of DirectedRelation, we have to somehow specify the two (sets of) participants and the direction. Without these there just isn't a DirectedRelation. These should be placed in unambiguous locations and not require extra reasoning to extract. We have something called source and destination to fulfill these requirements.

The situation is then complicated by the fact that there may be different kinds of DirectedRelation. We've been focussed on Causal, but there are more in our own pipeline and other groups will have theirs. Right now we have some indication of the particular relationship in the labels field and less directly, a rule field. The former contains a list for us and it is only by convention that someone would know that the first item is most specific. It doesn't seem likely that all groups will have a list or a rule field. I suggest a new field called relationName that specifies in a single string fit for human consumption the kind of DirectedRelation.

Each kind of relation may have special names for the participants that will help a human understand what is meant by the generic source and destination. These might be used to label nodes even when the relationName is not recognized. I'd call these sourceName and destinationName.

It is possible to store this information in many different ways, such as participants: [ { role: "source", name: "cause", value: "lack of food" }, { role: "destination", name: "effect", value: "hunger" } ] and I will suggest that for the non-essential participants that have more open class roles. For the essential participants I think this complicates extraction of information such as the name of the destination. Finding the answer requires a search through a list.
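A minimal Python sketch of the trade-off described above, contrasting fixed fields with a generic participant list. Both dictionaries are hypothetical illustrations built from the "lack of food" example, not actual Eidos output, and `find_participant` is a helper name invented for this sketch.

```python
# Hypothetical illustration only: neither dictionary is actual Eidos output.

# Layout A: essential participants stored in fixed, named fields.
relation_fields = {
    "relationName": "causal",
    "sourceName": "cause",
    "destinationName": "effect",
    "sources": ["lack of food"],
    "destinations": ["hunger"],
}

# Layout B: every participant stored in one generic list.
relation_list = {
    "participants": [
        {"role": "source", "name": "cause", "value": "lack of food"},
        {"role": "destination", "name": "effect", "value": "hunger"},
    ]
}


def find_participant(relation, role):
    """Return the first participant with the given role, or None."""
    for participant in relation["participants"]:
        if participant["role"] == role:
            return participant
    return None


# Layout A: the destination name is a direct field access.
name_a = relation_fields["destinationName"]

# Layout B: the same answer needs a search, which may come back empty.
found = find_participant(relation_list, "destination")
name_b = found["name"] if found is not None else None
```

Both layouts carry the same information; the difference is only in how much work the reader of the file has to do to reach the essential participants.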

The non-essential participants, which may be thought of as modifying the main relationship in any of a myriad of ways (and drawn as such) need a field name. Modifier and argument are already in use once, attachment is used in the program, but not JSON-LD. A brainstorm of other names is pasted below. I'd like to require that these non-essential participants be other instances of Entity, DirectedRelation, or UndirectedRelation and be included by reference. They can be in charge of their own rules, triggers, labels, provenance, etc. Other fields can be "role" for a formal, potentially standardized explanation of what the non-essential is doing there, and then "name" for a human friendly version, and lastly, "value".

Here is a simplified example:

She gave him a book yesterday at the library. (This is an increase event.)

    {
      @type: "DirectedRelation",
      trigger: { text: "gave" },
      relationName: "transfer",
      sourceName: "transferrer",
      destinationName: "transferred",
      sources: [ "She" ],
      destinations: [ "book, a" ],
      relatedItems: [
        { role: "indirectObject", name: "recipient", value: "him" },
        { role: "time", name: "when", value: "yesterday" },
        { role: "location", name: "where", value: "at the library" }
      ]
    }

For UndirectedRelation, I would add relationName and just argumentName, since there is no distinction between source and destination. After that, the non-essential items can be added.

Brainstorm: environment, details, observers, hangersOn, audience, players, relatedItems, relatedInfo, relatives, accomplices, helpers, cast, crew, costars, public, detours, waymarks, context, supportingRoles, supportingArguments, sidekicks, optional

kwalcock commented 6 years ago

Those assignees are there just so they can get pinged for feedback (except for me).

MihaiSurdeanu commented 6 years ago

I personally think this is overcomplicating things... I suggest we get inspired in this representation by FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), who has been doing this for a while... What I would suggest is:

  1. We remove the distinction between Directed vs. Undirected because this will become clear from other things, see below.
  2. We instead allow the event @type to be something custom, from a fixed taxonomy. In our case, we currently have two types: Causal, Correlation. BBN will add a bunch more. We understand directionality from these types, e.g., Causal is directed, Correlation is not. Most of the others extracted by BBN are not directed. For example, I would argue that Keith's Give event, "She gave a book", is not directed.
  3. We simplify the participants. That is, instead of having both @role and @name, we can keep just @role. Following FrameNet, essential arguments are then Agent and Patient, e.g., "She" is Agent, "him" is Patient, "a book" is Object. All the others are non-essential, and can take other arbitrary names.

Directionality comes naturally out of this: for Causal events, it goes from Agent to Patient. And is not enforced on other frames, where we don't need it.
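As a rough Python sketch of how directionality could be derived from the event type rather than stored separately: the `EVENT_TYPES` table, the `participants` key, and the `role`/`value` field names are assumptions made for illustration, not part of any agreed format.

```python
# Hypothetical taxonomy: entries and flags are assumptions for illustration.
EVENT_TYPES = {
    "Causal": {"directed": True},
    "Correlation": {"directed": False},
}


def direction(event):
    """Return an (agent, patient) pair for directed event types, else None."""
    info = EVENT_TYPES.get(event.get("@type"))
    if info is None or not info["directed"]:
        return None
    by_role = {p["role"]: p["value"] for p in event.get("participants", [])}
    if "Agent" not in by_role or "Patient" not in by_role:
        return None
    return by_role["Agent"], by_role["Patient"]


causal = {
    "@type": "Causal",
    "participants": [
        {"role": "Agent", "value": "lack of food"},
        {"role": "Patient", "value": "hunger"},
    ],
}
correlation = {
    "@type": "Correlation",
    "participants": [
        {"role": "Agent", "value": "rainfall"},
        {"role": "Patient", "value": "crop yield"},
    ],
}

causal_direction = direction(causal)
correlation_direction = direction(correlation)
```

Note that a receiver who does not have the type in its table gets no direction at all, which is the cost of not stating directionality explicitly in the data.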

What do you all think? Also, I think we should invite BBN to this discussion. @bsharpataz: can you please ask Bonan for his github id?

adarshp commented 6 years ago

I agree with Mihai's suggestions. Since we are thinking of going to a 'hyperedge' representation, perhaps there should be a meta-label called participants at the same level as @type, which maps to a list. This would allow for arbitrary numbers of agents and patients.

kwalcock commented 6 years ago

My first reaction is "Collin Baker, Chuck Fillmore, ICSI? I used to know these people. Cool." (Not that I especially knew their work.) There are definitely some things I like about the idea. I saw the word revenge which is associated with avenger, degree, depictive, injured_party, injury, instrument, manner, offender, place, punishment, purpose, result, time. This sounds like the kind of thing heard at the meeting. Some group wanted to fill in a large number of blanks/slots like these. On the other hand, I can't find much information about their file format other than XML for both frame and LU. I'm hoping maybe for a large, possibly grandiose change in how we think about what seemed like simple nodes and edges before and yet somehow a small change in their representation.

  1. I'm slightly worried about this distinction as well. One advantage of distinguishing it explicitly somewhere is that the information is sitting right there even if the kind of relation, maybe a revenge relation, is unknown to the receiver of the data.

  2. It seemed like we had quite a list: causal, isA, origin, transparentLink for directed; correlation and sameAs for undirected. Maybe they were just ideas. I'd be mostly concerned with the "fixed" part and how this taxonomy is maintained and distributed. It would be possible to include the necessary parts of the taxonomy in the exchanged file format. This is similar to how context is used in the current format. "If/when you see causal below, treat it as directed and expect a cause and effect. If you see gave, treat it as undirected and expect a giver and recipient and maybe a place and time..."

  3. I agree that role and name overlap greatly and that one can probably be eliminated. On the other hand, if there is an avenger and an injuredParty for revenge, it might be useful to compare them to an agent and patient for gave. I have no idea what our use cases are; this one does seem unlikely. How do we specify which are essential vs. non-essential? Arbitrary names causes a small alarm bell to go off for me, perhaps falsely.

So, my concern is mostly for the taxonomy and maybe what comes naturally. However, I think that these high level design decisions are best left to people with more background than me--those who know, for example, that this FrameNet even existed.

Should we use the online tool they have to try to process some sentences? GitHub isn't too bad at this, but do we have something better to use for further discussion with BBN?

BeckySharp commented 6 years ago

@MihaiSurdeanu sorry for the delay -- just emailed Bonan

@kwalcock about the list of relations -- those are largely going away/getting set aside for now. We'll end up with Causal and Correlation (and Inc/Dec/Quant, though they don't make it to the output)

BeckySharp commented 6 years ago

OK -- trying to add @bnmin now...

bnmin commented 6 years ago

Thanks for looping me in @bsharpataz ! I prefer the simplified solution as @MihaiSurdeanu suggested, though I might not know as much context as many of you do. If you could share a JSON-LD representation for a few sentences, that would be very helpful.

kwalcock commented 6 years ago

A draft of the new format can be found at https://github.com/clulab/eidos/wiki/JSON-LD2. The two relations have been reduced to one relation with the addition of a "type" field to specify something like "causation" or "correlation". Those must come from a second table which lists relations and their expected arguments. Arguments are described in a single list which is similar to what was there before, but each argument has its own "type" like "cause" or "effect" which again should come from an agreed upon list.

I had to use a Wiki page for the formatting, but will look for feedback here. There should be some real, working output in this format shortly. Thanks.

adarshp commented 6 years ago

Just as a heads-up @kwalcock : You can make formatted tables in Github issues as well: you just need to convert each + in the second row to a minimum of three dashes ---.

    |Name|Property|Type|Description|
    |---|---|---|---|
    |Corpus|`@type`|"Corpus"|A corpus is typed.|
    ||documents|[Document]|It has a list of documents|

becomes:

|Name|Property|Type|Description|
|---|---|---|---|
|Corpus|`@type`|"Corpus"|A corpus is typed.|
||documents|[Document]|It has a list of documents|

(You need to put the backticks around @type to prevent Github from automatically creating a mention link and notifying the Github user with the username type.)

The JSON-LD description in the wiki looks good.

kwalcock commented 6 years ago

Thanks. I had concluded before that Org-mode was necessary, but I guess that's not the case. I just hadn't used the secret word.

bnmin commented 6 years ago

Thanks a lot! In general, BBN's CAG representation is similar to this.

Here are a few places that are different:

  1. Our CAG doesn't include sentences, words, and dependencies. We keep these for internal analysis but not as output to downstream applications.

For any extraction (entity, relation, event), we include a document ID and a pair of character offsets (start and end) as provenance. Our offsets are character offsets to the beginning of the document (with the beginning of the document as 0).

  2. BBN's causal factors (arguments of a causal relation) are mostly events. We define event broadly to include occurrences, action, process, state, change of states, etc. An event has 1) a trigger (a word or a phrase), 2) properties such as polarity (positive, negative), tense (past, present, and future), etc, and optionally 3) a list of arguments.

It looks like event could be defined as a sub-class that combines Entity and Relation.

How about making a new concept Event in this way (it has all attributes from Entity and Relation)?

  3. Arguments of a relation can be entities/events/relations.

  4. We also output entities, their mentions, and value mentions such as dates. These entities and dates are used as arguments of events. Our entity types include Person, GeoPoliticalEntity (GPE), and Organization. We follow the ACE definition of entities.

We could reuse the "Entity" concept to represent our ACE-style entities as well as value mentions (as an instance of the temporal entity in your ontology).

  5. More relation types

We extract temporal relations (e.g., occurs_before) between events, as well as entity-entity relations such as <GPE1, part_of, GPE2>. We can provide a list of types to be added into the ontology (their representation will be similar to the relation "causation").

A few questions:

  1. Does grounding here mean assigning a type to an entity?

  2. The usage of "label" is not clear to us. It looks like it can refer to relation type ("Correlation"), directed or undirected ("UndirectedRelation"), among other things (e.g., "EntityLinker", "Event")

Thanks, Bonan

MihaiSurdeanu commented 6 years ago

Hi Bonan, Thanks for the detailed comments!

Let me try to answer: "Our CAG doesn't include sentences, words, and dependencies. We keep these for internal analysis but not as output to downstream applications." - this is fine. Sentences/words/deps are optional. Some systems may produce them, some not.

Offsets: we use token offsets because during tokenization we may transform the text (e.g., replace Unicode Greek characters with their ASCII version). But maybe we can change the provenance to allow for either token or character offsets?

Representation of events: I think these fit very well under Relation (btw, we can change the name "Relation" to something more descriptive...). Our Relations can take state as well, we just don't use it now. Correct, @bsharpataz? I think the simplest solution is to adjust Relation to accommodate your Events. I think these adjustments will be minimal.

"Arguments of a relation can be entities/events/relations" - yep, same for us.

"We also output entities, their mentions, and value mentions such as dates" - We have a representation for entities (as you saw), which I think accommodates all these types. But we output only entities that participate in events to avoid overwhelming the downstream user, since in our case any NP is a potential entity. But this doesn't really matter for the format. I think our entities are similar (but yours have more types).

"We extract temporal relations" - this format supports an arbitrary number of relation types, so no issues here.

"Does grounding here mean assigning a type to an entity" - essentially yes. But we allow a 1-to-n mapping between one entity and possible types. Further, we plan to ground to multiple name spaces. For example, we will continue to ground to our in-house ontology, but we will add another that contains FAO indicators, which is much more fine grained.

"The usage of "label" is not clear to us" - yes, it does seem that @type and @labels are now redundant. @kwalcock, @bsharpataz: what did you have in mind?

kwalcock commented 6 years ago

The @type with @ is a JSON-LD thing. labels without @ and type without @ are largely redundant. If nobody needs the list of labels (or any of the other fields), they can be trimmed, of course. Some elements may be present more for completeness than usefulness. I would say that JSON is pretty good about ignoring things, though, so missing items may be more important than extra items.

MihaiSurdeanu commented 6 years ago

So, should we simply use @type then?

On Thu, Apr 19, 2018, 10:16 Keith Alcock notifications@github.com wrote:

The @type with @ is a JSON-LD thing. labels without @ and type without @ are largely redundant. If nobody needs the list of labels (or any of the other fields), they can be trimmed, of course. Some elements may be present more for completeness than usefulness. I would say that JSON is pretty good about ignoring things, though, so missing items may be more important than extra items.


kwalcock commented 6 years ago

I don't think so, at least not for Relation, but perhaps for Argument. The value for @type is a closed class value that is used to define the JSON syntax, much like a class name. Perhaps the second type should be renamed to avoid confusion. There we're just expecting a string from an open (but agreed upon with others as need be) class to describe the relation or argument, but not influence any syntax. As far as I know, if we see ("@type" : "newfangled"), we'd be expected to say "I don't know what you're talking about." If we see ("type" : "newfangled"), we say "That's a strange value, but whatever. You still have to have a this and that field because your @type tells us so." If the syntax of certain kinds of relations or arguments needs to be different, though, then we'd have to do more with @type.
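To illustrate the difference in how a receiver might treat the two fields, here is a small Python sketch under stated assumptions: the `KNOWN_STRUCTURES` and `KNOWN_RELATION_TYPES` tables and the `check` function are hypothetical, and the required-field sets are invented for the example rather than taken from the actual format.

```python
# Hypothetical tables: the required fields listed here are assumptions.
KNOWN_STRUCTURES = {
    "Relation": {"type", "arguments"},
    "Argument": {"type", "value"},
}
KNOWN_RELATION_TYPES = {"causal", "correlation"}


def check(node):
    """Reject an unknown "@type"; only warn about an unknown "type"."""
    structure = node.get("@type")
    if structure not in KNOWN_STRUCTURES:
        # Closed class: the syntax of the node is undefined.
        raise ValueError(f"unknown @type: {structure!r}")
    missing = KNOWN_STRUCTURES[structure] - set(node)
    if missing:
        # The "@type" dictates which fields must be present.
        raise ValueError(f"{structure} is missing fields: {sorted(missing)}")
    warnings = []
    if structure == "Relation" and node["type"] not in KNOWN_RELATION_TYPES:
        # Open class: a strange value, but the node is still well formed.
        warnings.append(f"unrecognized relation type: {node['type']!r}")
    return warnings


known = check({"@type": "Relation", "type": "causal", "arguments": []})
strange = check({"@type": "Relation", "type": "newfangled", "arguments": []})
```

Here `known` is empty and `strange` holds a single warning, while `check({"@type": "newfangled"})` raises a `ValueError`.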

MihaiSurdeanu commented 6 years ago

Thanks! To make sure we're on the same page, can you please include here a simple example, say, what is the JSON format for the causal relation extracted from "A causes B"?

bnmin commented 6 years ago

Thanks, Mihai!

Offsets: I think supporting both character offsets in raw documents and token offsets would be great! This allows systems that assume raw character offsets to be incorporated.

Representation of events: that matches my understanding. "relation" is less of a descriptive word to me:) Adjusting Relation to include states/event properties would be useful. I think it would be really helpful to rename it to something that's more "inclusive". I also heard "causal events" from other performers when they were describing causal relations.

Thanks, Bonan

kwalcock commented 6 years ago

It might look like this with the new design, whereby the changed part is at the bottom:

  {
    "@type" : "Relation",
    "@id" : "_:Relation_1",
    "labels" : [ "Causal", "DirectedRelation", "EntityLinker", "Event" ],
    "text" : "A causes B",
    "rule" : "ported_syntax_1_verb-Causal",
    "canonicalName" : "a cause b",
    "provenance" : [ {
      "@type" : "Provenance",
      "document" : {
        "@id" : "_:Document_1"
      },
      "sentence" : {
        "@id" : "_:Sentence_1"
      },
      "positions" : {
        "@type" : "Interval",
        "start" : 1,
        "end" : 3
      }
    } ],
    "trigger" : {
      "@type" : "Trigger",
      "text" : "causes",
      "provenance" : [ {
        "@type" : "Provenance",
        "document" : {
          "@id" : "_:Document_1"
        },
        "sentence" : {
          "@id" : "_:Sentence_1"
        },
        "positions" : {
          "@type" : "Interval",
          "start" : 2,
          "end" : 2
        }
      } ]
    },
    "type" : "causal",
    "arguments" : [ {
      "@type" : "Argument",
      "type" : "cause",
      "value" : {
        "@id" : "_:Entity_1"
      }
    }, {
      "@type" : "Argument",
      "type" : "effect",
      "value" : {
        "@id" : "_:Entity_2"
      }
    } ]
  }
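
For anyone consuming this, a minimal Python sketch of reading the changed part might look like the following. The JSON is a trimmed copy of the example above (provenance and trigger omitted), and `arguments_by_type` is a helper name made up for this sketch.

```python
import json

# Trimmed copy of the example above; provenance and trigger are omitted.
relation_json = """
{
  "@type" : "Relation",
  "@id" : "_:Relation_1",
  "type" : "causal",
  "arguments" : [ {
    "@type" : "Argument",
    "type" : "cause",
    "value" : { "@id" : "_:Entity_1" }
  }, {
    "@type" : "Argument",
    "type" : "effect",
    "value" : { "@id" : "_:Entity_2" }
  } ]
}
"""


def arguments_by_type(relation):
    """Group the referenced @id values by the open-class argument type."""
    grouped = {}
    for argument in relation["arguments"]:
        grouped.setdefault(argument["type"], []).append(argument["value"]["@id"])
    return grouped


relation = json.loads(relation_json)
grouped = arguments_by_type(relation)
```

Each type maps to a list because nothing in the format limits a relation to a single cause or a single effect.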
MihaiSurdeanu commented 6 years ago

I see. So "@type" is simply the data structure type encoded in the JSON. I vote to rename this to Event from Relation.

"@labels" are the labels that apply to this event. In Keith's example above, these are hypernymy labels from our taxonomy. That is, Causal IS-A DirectedRelation IS-A EntityLinker IS-A Event, where Event is the top of the taxonomy, and Causal is the terminal. Bonan, I think what we store in here can be adjusted. Minimally, of course, we want at least the actual type of the relation/event.

I think we're close to convergence, no?

kwalcock commented 6 years ago

I renamed Relation to Event as was suggested.

For each of our words in a sentence, we have startOffset and endOffset. These appear to be offsets in characters from the start of the document. Perhaps it isn't the original document, though, and I'll check. The value in the sentence texts has at least spaces added and there is some conversion of things like ( to -LRB-. For those, the offsets indicate a single character and not five. We may want to add a text field for the entire document which preserves as pristine a copy as possible.

Still, the provenance we use is based on start and end words. We have wordPositionsInSentence rather than characterPositionsInDocument. It takes an extra mapping to follow the start and end words to startOffset and endOffset of the document. The words are contained in sentences, not directly in the document, so we have an extra layer there. Both could be included in something like the following (where documentWordPositions and sentenceCharPositions would also be possible). It seems like allowing for all possibilities may be more expensive than standardizing on just one of them.

    "provenance" : [ {
      "@type" : "Provenance",
      "document" : {
        "@id" : "_:Document_6"
      },
      "documentCharPositions" : {
        "@type" : "Interval",
        "start" : 0,
        "end" : 5
      },
      "sentence" : {
        "@id" : "_:Sentence_6"
      },
      "sentenceWordPositions" : {
        "@type" : "Interval",
        "start" : 12,
        "end" : 12
      }
    } ],
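
The extra mapping from word positions to document character positions could be sketched in Python as below. This assumes word positions are 1-based and inclusive (as they appear to be in the "A causes B" example, where the trigger "causes" sits at 2 to 2) and that `endOffset` is exclusive; the sample sentence and the function name are made up for illustration.

```python
# Hypothetical sentence: texts and offsets are invented for this sketch.
sentence = {
    "@id": "_:Sentence_6",
    "words": [
        {"text": "Hunger", "startOffset": 0, "endOffset": 6},
        {"text": "rises", "startOffset": 7, "endOffset": 12},
    ],
}


def word_span_to_char_span(sentence, start_word, end_word):
    """Map a 1-based inclusive word span to document character offsets."""
    words = sentence["words"]
    first = words[start_word - 1]
    last = words[end_word - 1]
    return first["startOffset"], last["endOffset"]


whole_span = word_span_to_char_span(sentence, 1, 2)
last_word_span = word_span_to_char_span(sentence, 2, 2)
```

The lookup itself is cheap, but it only works when the file also carries the sentences and words, which is part of why standardizing on one kind of position may be simpler.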

It sounds like some of the other BBN information could be encoded like

"extractions" : [ {
  "@type" : "Event",
  "type" : "occurrence|action|process|state|change_of_state|occurs_before|part_of",
  "trigger" : ...
  "states" : [ {
    "@type" : "State",
    "type" : "polarity",
    "text" : "positive|negative" /* maybe value is a better name than text */
  }, {
    "@type" : "State",
    "type" : "tense",
    "text" : "past|present|future"
  } ],
  "arguments" : [ {
    "@type" : "Argument",
    "type" : ?,
    "value" : {
      "@id" : "_:_Entity|Event|Relation_1"
    }
  } ]
} , {
  "@type" : "Entity",
  "type" : "Person|GeoPoliticalEntity|Organization|Temporal?",
  "mentions" : [
    "?mention?"
  ],
  "valueMentions" : [
    "?date?"
  ]
} ]

Please feel free to copy and edit.

MihaiSurdeanu commented 6 years ago

Thank you @kwalcock! It seems to me that we have a format.

@bnmin: I wonder if you could produce the output of some representative BBN relations/events in this format, to make sure that we are on the same page? Then we can write a spec for this format, and share it with the rest of the program.

Thanks!

bnmin commented 6 years ago

Thank you! @kwalcock @MihaiSurdeanu

@MihaiSurdeanu Sure! We are working on implementing a serializer that can output this format. I'll keep you posted. There are a few issues we found. I will also post our comments here.

Thanks, Bonan

bnmin commented 6 years ago

Here is a CAG produced by our preliminary implementation of this JSON-LD format: BBN_wm_m6_debug_10doc.v0.1.json-ld.zip

We haven't implemented all required/useful features (for example, provenances for relations are missing). My apologies if this looks half-baked. We will send an updated version in the next few days.

The following block shows examples of a document (with sentences), an entity (with mentions), a "Cause-Effect" relation between a pair of events, and a "PART-WHOLE.Geographical" between two GeoPolitical Entities, etc.

Please let me know if you have any questions or suggested changes. I also plan to post our comments later on.

{
    "@context": {
        "Argument": "https://github.com/clulab/eidos/wiki/JSON-LD#Argument",
        "Corpus": "https://github.com/clulab/eidos/wiki/JSON-LD#Corpus",
        ...
    },
    "@type": "Corpus",
    "documents": [
        {
            "@id": "ENG_NW_20180124",
            "@type": "Document",
            "sentences": [
                {
                    "@id": "SEN-ENG_NW_20180124-42",
                    "@type": "Sentence",
                    "text": "1.5 million S. Sudanese risk facing famine, says UN"
                },
                {
                    "@id": "SEN-ENG_NW_20180124-43",
                    "@type": "Sentence",
                    "text": "January 24, 2018 (JUBA) –"
                },
                {
                    "@id": "SEN-ENG_NW_20180124-44",
                    "@type": "Sentence",
                    "text": "At least 1.5 million South Sudanese could face famine while up to 20,000 of them are experiencing famine conditions, a United Nations humanitarian officials told the Security Council on Wednesday."
                }
            ]
        }
    ],
    "extractions": [
        {
            "@id": "ENT-ENG_NW_20160629-64",
            "@type": "Entity",
            "canonicalName": "some 80 million people",
            "grounding": [
                {
                    "@type": "Grounding",
                    "ontologyConcept": "/entity/PER/Group",
                    "value": 0.5
                }
            ],
            "labels": [
                "Entity"
            ],
            "mentions": [
                {
                    "provenance": {
                        "@type": "Provenance",
                        "document": {
                            "@id": "ENG_NW_20160629"
                        },
                        "positions": {
                            "@type": "Interval",
                            "end": 4940,
                            "start": 4935
                        }
                    },
                    "text": "some 80 million people"
                }
            ]
        },
        {
            "@id": "EVE-ENG_NW_20180101-340",
            "@type": "Event",
            "arguments": [
                {
                    "@type": "Argument",
                    "type": "Place",
                    "value": { 
                        "@id": "ENT-ENG_NW_20180101-392"
                    }
                }
            ], 
            "grounding": [
                {
                    "@type": "Grounding", 
                    "ontologyConcept": "/event/Agriculture",
                    "value": 1.0
                }
            ], 
            "labels": [
                "Event"
            ], 
            "provenance": [
                {
                    "@type": "Provenance",
                    "document": {
                        "@id": "ENG_NW_20180101"
                    }, 
                    "positions": {
                        "@type": "Interval",
                        "end": 213, 
                        "start": 204
                    }
                }
            ], 
            "states": [
                {
                    "@type": "State", 
                    "text": "Asserted",
                    "type": "modality"
                },
                {
                    "@type": "State", 
                    "text": "Specific", 
                    "type": "genericity"
                },
                {
                    "@type": "State", 
                    "text": "Positive",
                    "type": "polarity"
                }
            ], 
            "trigger": { 
                "@type": "Trigger",
                "provenance": [
                    {
                        "@type": "Provenance",
                        "document": {
                            "@id": "ENG_NW_20180101"
                        }, 
                        "positions": {
                            "@type": "Interval",
                            "end": 213, 
                            "start": 204
                        }
                    }
                ], 
                "text": "production"
            }
        },
        {
            "@id": "REL-ENG_NW_20170811-229",
            "@type": "Relation",
            "arguments": [
                {
                    "@type": "Argument",
                    "type": "has_cause",
                    "value": {
                        "@id": "EVE-ENG_NW_20170811-247"
                    }
                },
                {
                    "@type": "Argument",
                    "type": "has_effect",
                    "value": {
                        "@id": "EVE-ENG_NW_20170811-248"
                    }
                }
            ],
            "labels": [
                "Cause-Effect"
            ]
        },
        {
            "@id": "REL-ENG_NW_20180117-19",
            "@type": "Relation",
            "arguments": [
                {
                    "@type": "Argument",
                    "type": "left_arg",
                    "value": {
                        "@id": "ENT-ENG_NW_20180117-75"
                    }
                },
                {
                    "@type": "Argument",
                    "type": "right_arg",
                    "value": {
                        "@id": "ENT-ENG_NW_20180117-76"
                    }
                }
            ],
            "labels": [
                "PART-WHOLE.Geographical"
            ]
        }
    ]
}
MihaiSurdeanu commented 6 years ago

Thanks @bnmin!

We are close, but there are a few differences:

kwalcock commented 6 years ago

|Event type|Argument type|Notes|
|---|---|---|
|?Event?|Place||
|Cause-Effect|has_cause||
||has_effect||
|PART-WHOLE.Geographical|left_arg||
||right_arg||

|State type|State text|
|---|---|
|modality|Asserted|
|genericity|Specific|
|polarity|Positive|

Here are some notes from a very low level perspective.

bnmin commented 6 years ago

@MihaiSurdeanu To answer your questions: 1) Entity mentions: Yes, we grouped all entity mentions under an entity block, if possible (for some NPs we have only one mention per entity) 2) Merge events and relations: Yes. They are similar in representation. We need to tweak our code a bit to support that.

Thanks! Bonan

bnmin commented 6 years ago

Please find a newer version of our CAG in this JSON-LD format: wm_m6_debug_10doc.json-ld.zip

A few issues/questions, or our thoughts:

1) Word, Dependency, and Sentence

We did not output Word or Dependency. Words are not necessary because we only output document character offsets as provenances.

In fact, Sentences aren't very useful in our current output. We did output Sentence objects because we plan to provide event provenance pointing to sentences (for some events, provenance comes from non-consecutive words, so it is more useful to just use the sentence as provenance).

2) We use "documentCharPositions" instead of "positions", as suggested by @kwalcock

3) "labels", "type", and "groundings"

We include "labels", "type" for entities, events, and relations. We include "groundings" for entities and events.

4) Mentions and value mentions

Our entities (e.g., Person, GeoPolitical entity) will have a list of mentions. I think it is still useful to use the following structure because it allows richer information such as "text", mention level (pronoun, name, descriptor) in addition to provenance.

"mentions" : [ { "provenance": {}, "text": "Xyz Inc.", ... } ],

5) "Unifies" relation

To represent soft grouping of entities, as well as of events, we propose to add a new relation type "Unifies". This is similar to cross-document coreference, but can be defined more broadly. For example:

    Event1: "food insecurity"
    Event2: "famine in South Sudan"
    Event3: "famine in Sudan"

We will create another event, Event4 ("food insecurity"), and add

    Event4 Unifies Event1
    Event4 Unifies Event2
    Event4 Unifies Event3

This kind of grouping allows a higher level of abstraction of causal semantics, and better visualization. Related events can be grouped (and similarly for relations).

The current JSON-LD representation is already sufficient for this purpose - We just have to add a relation type "Unifies".
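
As a rough illustration, the grouping could be produced with something like the Python sketch below. The `left_arg`/`right_arg` argument types, the generated `@id` scheme, and the function name are assumptions for this example; the actual serialization may differ.

```python
def unifies_events(group_id, member_ids):
    """Build one "Unifies" event per member, linking the group to that member."""
    events = []
    for index, member_id in enumerate(member_ids, start=1):
        events.append({
            # Hypothetical identifier scheme, for illustration only.
            "@id": f"{group_id}-unifies-{index}",
            "@type": "Event",
            "labels": ["Unifies"],
            "arguments": [
                {"@type": "Argument", "type": "left_arg", "value": {"@id": group_id}},
                {"@type": "Argument", "type": "right_arg", "value": {"@id": member_id}},
            ],
        })
    return events


links = unifies_events("Event4", ["Event1", "Event2", "Event3"])
```

This yields three events, one for each "Event4 Unifies EventN" statement, using only structures the format already has.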

6) Document location (path)

It is useful to include a path to the original document for each file. Neither "@id" nor "title" is a good placeholder.

Can we create a property "path" or "filename" for Document?

7) namespaces for contexts and concepts

This is a minor issue. At the ontology telecon, ISI suggested us to use a URI naming schema that all preformers can access, contribute and "negotiate" content. While the "https://github.com/clulab/eidos/wiki/JSON-LD" namespace is certainly very useful, is it possible to put it in a place that can faciliate colloboration. An example suggested by ISI is w3id.org.

A similar problem applies to ontology concepts. For example: "ontologyConcept" : "/entities/human/livelihood"

Where are the ontology concepts defined? It would be great if they resided at a URL similar to the contexts.

Thank you!

Best, Bonan

MihaiSurdeanu commented 6 years ago

Thanks @bnmin! I think we're close. Answers to your points:

  1. Agreed. Words/sentences are only needed if the system outputs token offsets. Otherwise they should be optional.
  2. Ok with me. @kwalcock?
  3. Ok. This is compatible with the format. I would still like to merge Event and Relation. Semantically they seem very close to me... Plus, I think State should be supported on all types (from Entity to Event). @kwalcock: can you please indicate how we would represent BBN's state as attributes of these objects?
  4. I see. I think we should include a mentions [] block in the representation. In your case, this may contain multiple mentions for you. For us, at least for now, it will be 1 mention per entity. @kwalcock: can you please adjust the format to include such mentions? (We can talk off line if needed).
  5. I like this. Similar to your Unifies relation, we will add others, including CorefersWith. This is fully compatible with the format.
  6. I agree! @kwalcock: can you please add a field for this?
  7. I agree. @kwalcock: any suggestions for more global name space that is equally available to all performers? @bnmin: our ontologies are defined in github as well. Maybe we should include a path to each ontology used?
bnmin commented 6 years ago

Thank you @MihaiSurdeanu !

  1. Re: merge Event and Relation

Yes. We have merged Event and Relation. There will only be Event (a Relation can be identified by looking at "labels").

  2. Re: #7

I might be missing something, but I couldn't find where the path to the ontology is specified in your JSON-LD file. Is there a hidden assumption about where it is?

Here I'm throwing random thoughts (not a big fan of either of these two):

2.1. Maybe we can use a prefix such as the following?

2.2 or it looks like we can also go with a single ontology that everyone can view and edit (this is more elegant, but might be hard to do at this moment)

Thanks, Bonan

MihaiSurdeanu commented 6 years ago

On 7: we do not report the path to the ontology in the grounding now, but we should. @kwalcock: what is the simple idea for this (see also the work that Ajay is implementing this week)?

MihaiSurdeanu commented 6 years ago

@bnmin: I slept on it, and I have an issue with 4 above: you seem to include mentions for entities but not for events/relations. Correct? If so, wouldn't it be more elegant to store individual mentions for all extractions, and add a CorefersWith relation to link mentions of the same entity together?

bnmin commented 6 years ago

Thanks @MihaiSurdeanu. For the CorefersWith relation, this would be more elegant. However, we would prefer to use "Unifies" relations, which are broadly defined (they can group extractions that don't corefer with each other). This relation allows better visualization and lets us show a reasonable, abstract causal graph from limited extractions.

kwalcock commented 6 years ago

That's a lot to process, but I'll try to update the Wiki document based on the comments and the one example file from above. In general, I did not try to indicate at all whether or not something was required. I think that the convention is to ignore anything unrecognized, such as an unexpected path/filename field, but on the other hand to try to explain anything that might reasonably be expected. "If you see this, then it means..."

Regarding the ontology or ontologies, we recently added the ability to use multiple ontologies. In the example below they have just been given a name, "un" or "fao". This could be something more involved like "/ontology/UA/un" but I would hesitate to combine the name (or some other description) of the ontology and the ontologyConcept into one string. Separating the parts requires extra knowledge about the conventions used. These ontologies are just local files which are published with the source code. Should they be published somewhere more public, with details in the JSON-LD output?

    "groundings" : [ {
      "@type" : "Groundings",
      "name" : "un",
      "values" : [ {
        "@type" : "Grounding",
        "ontologyConcept" : "/entities/human/livelihood",
        "value" : 0.506400226133985
      }, {
        "@type" : "Grounding",
        "ontologyConcept" : "/entities/human/government/government_entity",
        "value" : 0.4335030428381624
      } ]
    }, {
      "@type" : "Groundings",
      "name" : "fao",
      "values" : [ {
        "@type" : "Grounding",
        "ontologyConcept" : "/events/Value/Value of food imports over total merchandise exports (%) (3-year average)",
        "value" : 0.4484180772282565
      } ]
    } ]
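
With the ontology name kept separate from the ontologyConcept, a consumer can pick the best concept per ontology without parsing a combined string. A minimal Python sketch, assuming the structure shown above; `best_grounding` is a hypothetical helper and not part of Eidos:

```python
groundings = [
    {"@type": "Groundings", "name": "un", "values": [
        {"@type": "Grounding",
         "ontologyConcept": "/entities/human/livelihood",
         "value": 0.506400226133985},
        {"@type": "Grounding",
         "ontologyConcept": "/entities/human/government/government_entity",
         "value": 0.4335030428381624},
    ]},
    {"@type": "Groundings", "name": "fao", "values": [
        {"@type": "Grounding",
         "ontologyConcept": "/events/Value/Value of food imports over total "
                            "merchandise exports (%) (3-year average)",
         "value": 0.4484180772282565},
    ]},
]

def best_grounding(groundings, ontology_name):
    """Return (ontologyConcept, value) with the highest score for one ontology, or None."""
    for block in groundings:
        if block.get("name") == ontology_name and block.get("values"):
            top = max(block["values"], key=lambda g: g["value"])
            return top["ontologyConcept"], top["value"]
    return None

print(best_grounding(groundings, "un"))
print(best_grounding(groundings, "fao"))
```

Unrecognized ontology names simply yield `None`, which matches the convention of ignoring anything unrecognized.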
kwalcock commented 6 years ago

Right now we (UA) do not output the original document text, but only the processed sentence text. It would be good to see the original text without having to consult the original file. They (BBN) do record the original text at the sentence level (I'm not sure about any sentence separators), but then report documentCharPositions. To get to the text, one would have to concatenate all the sentence texts and then count to the correct position (give or take some separators?). Perhaps we should both include text at the document level. We could also allow for sentenceCharPositions if the provenance identifies a particular sentence.
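
To illustrate the counting involved, here is a minimal Python sketch that maps a document-level interval to a sentence-level one. It assumes the document text is exactly the sentence texts joined by a single known separator and that intervals are half-open; both are assumptions, and the function names are hypothetical.

```python
from bisect import bisect_right

def sentence_starts(sentence_texts, separator=" "):
    # Assumption: the document text is the sentence texts joined by `separator`.
    starts, pos = [], 0
    for text in sentence_texts:
        starts.append(pos)
        pos += len(text) + len(separator)
    return starts

def to_sentence_position(doc_start, doc_end, sentence_texts, separator=" "):
    """Map a half-open document interval [doc_start, doc_end) to (sentence index, start, end)."""
    starts = sentence_starts(sentence_texts, separator)
    # The containing sentence is the last one that starts at or before doc_start.
    index = bisect_right(starts, doc_start) - 1
    offset = starts[index]
    return index, doc_start - offset, doc_end - offset

sentences = ["Rainfall decreased.", "Food insecurity increased."]
document = " ".join(sentences)
index, start, end = to_sentence_position(20, 35, sentences)
print(index, sentences[index][start:end])
```

If the separators are not known, this mapping cannot be recovered reliably, which is one argument for including the text at the document level.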

adarshp commented 6 years ago

@kwalcock: I thought the original document text is contained in the Sentence objects (under Corpus/Document)? Is that not accurate?

kwalcock commented 6 years ago

FWIW, we have IDs like "@id" : "_:Document_1" rather than "@id": "SEN-ENG_NW_20180124-42" based on https://www.w3.org/TR/json-ld/#node-identifiers where it says

6.14 Identifying Blank Nodes This section is non-normative.

At times, it becomes necessary to be able to express information without being able to uniquely identify the node with an IRI. This type of node is called a blank node. JSON-LD does not require all nodes to be identified using @id. However, some graph topologies may require identifiers to be serializable. Graphs containing loops, e.g., cannot be serialized using embedding alone, @id must be used to connect the nodes. In these situations, one can use blank node identifiers, which look like IRIs using an underscore (_) as scheme. This allows one to reference the node locally within the document, but makes it impossible to reference the node from an external document. The blank node identifier is scoped to the document in which it is used.
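
As a rough illustration of what "reference the node locally" means for a consumer, here is a minimal Python sketch that indexes nodes by "@id" and resolves a blank node reference within one file. The corpus layout in the example is hypothetical and much simpler than our real output.

```python
def index_nodes(node, index=None):
    # Walk a parsed JSON-LD structure and index every node that carries
    # more than just an "@id" (a bare {"@id": ...} is only a reference).
    if index is None:
        index = {}
    if isinstance(node, dict):
        if "@id" in node and len(node) > 1:
            index[node["@id"]] = node
        for value in node.values():
            index_nodes(value, index)
    elif isinstance(node, list):
        for item in node:
            index_nodes(item, index)
    return index

def is_blank(node_id):
    # Blank node identifiers use an underscore as the scheme, e.g. "_:Document_1".
    return node_id.startswith("_:")

corpus = {
    "@type": "Corpus",
    "documents": [{"@type": "Document", "@id": "_:Document_1", "title": "example"}],
    "extractions": [{
        "@type": "Extraction",
        "@id": "_:Extraction_1",
        "provenance": {"document": {"@id": "_:Document_1"}},
    }],
}

nodes = index_nodes(corpus)
ref = corpus["extractions"][0]["provenance"]["document"]["@id"]
print(is_blank(ref), nodes[ref]["title"])
```

Because the identifiers are scoped to the file, the index has to be rebuilt for each JSON-LD document and cannot be shared across files.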

kwalcock commented 6 years ago

@adarshp, here's an example:

"text" : "The International Food Policy Research Institute -LRB- IFPRI -RRB- , established in 1975 , provides evidence-based"

adarshp commented 6 years ago

Ok, got it. I think it would be good to have the original sentence text included with the JSON-LD output. If we encounter a link in a CAG that seems off, we can examine the sentence that produced it and try to understand why it happened (broken syntax/entity not being captured, etc). In other words, increased transparency, ability to 'drill down', etc.

kwalcock commented 6 years ago

If what are known in the table (https://github.com/clulab/eidos/wiki/JSON-LD2) as Entity and Event are consolidated, something needs to be done with Entity.mentions, Entity.state, Event.trigger, and Event.arguments. Entity.mentions might morph well into Event.arguments

| Extraction type | Argument type | Notes |
| --- | --- | --- |
| "entity" | "mention" | |

but an argument is supposed to be an existing entity. Maybe the Mention is indeed a budding entity.

Maybe state and trigger are just not required for certain kinds of extractions.

kwalcock commented 6 years ago

The value "type": "/entity/GPE/Nation" for an entity, "labels": [ "Entity" ] found in the file, is not quite what we had in mind. I think that we would express it the other way around: "type": "entity" and "labels": [ "Nation", "GPE", "entity" ]. The extraction type (as in the table above) is thought to come from a small set of pre-defined values like causation, correlation, unification.

kwalcock commented 6 years ago
| State type | State text |
| --- | --- |
| modality | Asserted |
| genericity | Specific |
| polarity | |

There was one table above that describes State. For us the state is something like INC, DEC, and QUANT, which compares mostly OK. For state text we have actual text from the document rather than some pre-defined values like Asserted and Specific above. We're using these fields differently. The output may be mixed up in the file.

            "states": [
                {
                    "@type": "State", 
                    "text": "Asserted", 
                    "type": "modality"
                }, 
bnmin commented 6 years ago

Thanks @kwalcock! I was confused by the usage of "type" and "labels". In your example, Entity doesn't have "type" but has "labels" such as "labels" : [ "NounPhrase", "Entity" ]. That led us to believe that "labels" comes from a small set (e.g., whether this is a NounPhrase and/or an Entity) but "type" comes from a large set of ontological types such as "/entity/GPE/Nation", etc.

Thanks, Bonan

kwalcock commented 6 years ago

Re: "https://github.com/clulab/eidos/wiki/JSON-LD" namespace is certainly very useful, is it possible to put it in a place that can facilitate collaboration. An example suggested by ISI is w3id.org.

Most of the JSON-LD schemas are at a place called schema.org, but I don't see a way to easily add your own schemas there. Our Wiki page has the additional disadvantage of not completely supporting HTML, so we can't easily make an anchor https://github.com/clulab/eidos/wiki/JSON-LD#Corpus. The link just gets us pretty close.

For w3id.org we need to come up with a PROJECT-ID to fit their format https://w3id.org/PROJECT-ID/SUB-ID... I need to figure out how to get an .htaccess file to redirect to wherever the data really is and then update the data there. It may be that the rewrite rules in the .htaccess file can convert wiki/JSON-LD/Corpus to something like wiki/JSON-LD#WikiStyleAnchorToCorpus. That would be an added bonus.

kwalcock commented 6 years ago

Not much has been happening here, so this is supposed to restart the conversation. Our output has changed slightly to account for the more straightforward issues via commit #305. There is document text, documentCharIntervals, provenance on most everything, multiple groundings, and arrays of intervals. There are multiple somewhat sticky issues remaining. One is the combination of what we in the JSON-LD called Entity, DirectedRelation, and UndirectedRelation into some unified kind of node/object. We've used the term extraction for this before. A comment above asks what to do with Entity.mentions from BBN, Entity.state, Event.trigger, and Event.arguments. Most could just be optional values. The Entity.mentions may become part of a unifies relation. Mihai, in his most recent comment previous to this, expressed some concerns. Then there were some smaller issues with State and whether type and text were OK or whether it is more type and value. See the table above. In any case this all needs to be worked out. Please have at it.

MihaiSurdeanu commented 6 years ago

Hi @bnmin, I hope your LREC trip was great. If you're back, any comments on @kwalcock's comment above? Thank you! Mihai

bnmin commented 6 years ago

Thanks @MihaiSurdeanu. There are many points made in @kwalcock's comment above which I don't fully understand. Could you please elaborate on each of them?

I understand that states, trigger, arguments can be optional values.

I think Entity.mentions ("CoreferWith" relations between mentions) are different from what a broadly-defined "Unifies" relation would ideally represent ("Unifies" can group extractions that don't corefer with each other)

For provenance/offsets, we used documentCharPositions. Has it been removed from the latest JSON-LD2 format?

I think it would be nice to have a way to track changes in the JSON-LD format so that we know what has been changed.

@MihaiSurdeanu We would be happy to hear your thoughts on our latest JSON-LD files for iteration 2, including but not limited to representation:)

Thanks, Bonan

kwalcock commented 6 years ago

I don't think that documentCharPositions was ever in JSON-LD2, but I'll put it there. The plan is to adopt anything that is agreed upon from JSON-LD2 into the standard format and then drop this one. documentCharPositions was already advanced to JSON-LD.

I'll make an update to combine the three different kinds of extractions and note that some things are optional like the triggers.

To see the history of a Wiki page, click on the revisions text, apparently.


MihaiSurdeanu commented 6 years ago

Thanks @bnmin, @kwalcock!

On the Unifies vs. CoreferWith relations: yes, they are different. The point I was trying to make was that coreference relations could be represented in a similar fashion, with an explicit relation that links two mentions (CoreferWith), rather than packaging them in the same Entity block. I think having the explicit CoreferWith relation is more elegant for at least two reasons:

kwalcock commented 6 years ago

The JSON-LD2 page was updated. The Entity and Event have been combined into an Extraction (a better word is welcome). What used to be an Entity is now an Extraction with an Extraction type (see the bottom table) of "entity". Events have some other kind of Extraction type such as "causation", etc.

The trigger, state, and arguments may be difficult to tell apart, so here is a summary of the differences as far as JSON-LD is concerned.

| Type | Arity | Data | Description |
| --- | --- | --- | --- |
| Trigger | singular | has text | not an extraction, provenance only |
| State | plural | has text and string type from known list (INC, DEC, QUANT) | not an extraction, provenance only |
| Argument | plural | has a string type from known list (cause, effect, argument) | references an extraction with its own text and provenance |
| Modifier | plural | has text | attaches to state, has some grounding info, provenance only |

This may too closely match our software design in some places. The Trigger could easily be expressed as a State with a type TRIG.
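
For illustration, the distinctions can be sketched as Python dataclasses. This is only a sketch under assumptions: the class and field names are hypothetical, provenance and grounding are omitted, and `trigger_as_state` shows just one way the Trigger might be folded into State with a type TRIG.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Modifier:
    text: str               # attaches to a state

@dataclass
class State:
    type: str               # from a known list, e.g. "INC", "DEC", "QUANT"
    text: str
    modifiers: List[Modifier] = field(default_factory=list)

@dataclass
class Argument:
    type: str               # from a known list, e.g. "cause", "effect", "argument"
    extraction_id: str      # references another extraction by id

@dataclass
class Extraction:
    id: str
    type: str               # e.g. "entity", "causation"
    text: str
    trigger: Optional[str] = None                       # singular and optional
    states: List[State] = field(default_factory=list)   # plural and optional
    arguments: List[Argument] = field(default_factory=list)

def trigger_as_state(extraction):
    # Fold the singular trigger into the plural states using the type "TRIG".
    if extraction.trigger is None:
        return list(extraction.states)
    return [State("TRIG", extraction.trigger)] + list(extraction.states)

rain = Extraction("_:Extraction_1", "entity", "rainfall",
                  states=[State("DEC", "decreased")])
crops = Extraction("_:Extraction_2", "entity", "crop yields")
cause = Extraction(
    "_:Extraction_3", "causation", "rainfall decreased, which reduced crop yields",
    trigger="reduced",
    arguments=[Argument("cause", rain.id), Argument("effect", crops.id)],
)

print([(s.type, s.text) for s in trigger_as_state(cause)])
```

An "entity" extraction here simply leaves the trigger and arguments empty, which is the sense in which those fields would be optional.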

The ?mentions? will probably be discussed.

bnmin commented 6 years ago

Thanks! @kwalcock @MihaiSurdeanu Thanks for the hint on viewing revisions @kwalcock! As you can see I don't use github as much as I should:) Agree with @MihaiSurdeanu on the value of CoreferWith relations. I think we can use CoreferWith relations to represent coreference (for entities or other types), in addition to the Unifies relations.

On @kwalcock 's post above, in general, these all make sense. I'll review with the team and let you know if we have questions.

Thanks, Bonan