ACE dataset tasks, evaluators and redactors #149

danyaljj commented 8 years ago

Here is the dataset: https://github.com/cogcomp-dev/illinois-cogcomp-nlp/blob/master/corpusreaders/doc/ACEReader.md

Entity Recognition: Task variants:
- Raw text
- Sentence boundaries
- Gold mentions:

We are using the SpanLabelView, so I think the existing evaluators/cleansers should work.

Relation extraction:
- Raw text
- Sentence boundaries
- Gold mentions: We are using the PredicateArgumentView and we have an evaluator for it, but we still need to write a cleanser for it.

Coreference:
- Raw text
- Sentence boundaries
- Gold mentions:

We are using the CoreferenceView. I am in the process of checking in the evaluators for it: https://github.com/cogcomp-dev/illinois-cogcomp-nlp/pull/157. We still need to write a cleanser for it.
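To keep track of what is still missing, the status above can be written down as a small lookup table. This is only a sketch: the dictionary layout and the `missing_cleansers` helper are hypothetical names, and the entity recognition entry reflects the "should work" guess above, not something that has been verified.

```python
# Hypothetical summary of the status described above; not part of open-eval.
VARIANTS = ["raw text", "sentence boundaries", "gold mentions"]

TASKS = {
    # "existing" for entity recognition is the guess from the comment above.
    "entity recognition": {"view": "SpanLabelView", "evaluator": "existing", "cleanser": "existing"},
    "relation extraction": {"view": "PredicateArgumentView", "evaluator": "existing", "cleanser": "todo"},
    "coreference": {"view": "CoreferenceView", "evaluator": "in review", "cleanser": "todo"},
}


def missing_cleansers(tasks):
    """Return the names of tasks whose cleanser still has to be written."""
    return sorted(name for name, info in tasks.items() if info["cleanser"] == "todo")


print(missing_cleansers(TASKS))
```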

joshuacamp commented 8 years ago

@danyaljj So for the PredicateArgumentView, do we want to remove the predicates, the arguments, or the Relation between them?

joshuacamp commented 8 years ago

Also, for the CoreferenceView, do we want to keep the canonical mentions and remove the coreferent mentions?

danyaljj commented 8 years ago

For PredicateArgumentView:

Gold mentions: clean Relations
Sentence boundaries: clean Relations + Predicates + Arguments

For CoreferenceView:

Gold mentions: clean Relations
Sentence boundaries: clean Constituents + Relations
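A minimal sketch of these rules, applied to a view that has already been serialized to JSON and loaded as a Python dict. Several things here are assumptions and not the actual cleanser code: the `RULES` table and `clean_view` are hypothetical names, the `relations` key is a guess at the serialized form (only `constituents` appears in this thread), and predicates and arguments are assumed to be stored as the view's constituents.

```python
import copy

# Which parts of a view each task variant removes, following the lists above.
# Assumption: predicates and arguments are serialized under "constituents".
RULES = {
    ("PredicateArgumentView", "gold mentions"): {"relations"},
    ("PredicateArgumentView", "sentence boundaries"): {"relations", "constituents"},
    ("CoreferenceView", "gold mentions"): {"relations"},
    ("CoreferenceView", "sentence boundaries"): {"relations", "constituents"},
}


def clean_view(view_data, variant):
    """Return a copy of one serialized view with the redacted parts removed."""
    view_type = view_data["viewType"].rsplit(".", 1)[-1]  # short class name
    cleaned = copy.deepcopy(view_data)  # never modify the gold data in place
    for key in RULES.get((view_type, variant), set()):
        cleaned.pop(key, None)
    return cleaned


# Toy view; the relation entry is made up for illustration.
gold = {
    "viewType": "edu.illinois.cs.cogcomp.core.datastructures.textannotation.PredicateArgumentView",
    "constituents": [{"start": 5, "end": 8}, {"start": 13, "end": 15}],
    "relations": [{"relationName": "hypothetical", "srcConstituent": 0, "targetConstituent": 1}],
}
print(sorted(clean_view(gold, "gold mentions")))
print(sorted(clean_view(gold, "sentence boundaries")))
```

Working on a deep copy keeps the gold annotation intact, so the same record can be handed to the evaluator afterwards.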

danyaljj commented 8 years ago

@joshuacamp Here is what we expect the output JSON to look like for each of the subtasks:

1) Raw text:

{
  "corpusId": "ACE2005",
  "id": "/Users/bhargav/code/cs546_project/entity-relations-coreference/data/ace05/data/English/bn/CNN_ENG_20030312_223733.14.apf.xml",
  "text": "  CNN_ENG_20030312_223733.14   NEWS STORY   2003-03-12 22:57:55     the morning papers, because morning pape..... "
}
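For this variant the cleanser only has to keep the three top-level keys shown above. A minimal sketch, assuming the full record has been loaded as a Python dict; `RAW_TEXT_KEYS` and `to_raw_text` are hypothetical names and the input record below is a shortened toy example.

```python
# Top-level keys that survive in the raw text variant (taken from the example above).
RAW_TEXT_KEYS = ("corpusId", "id", "text")


def to_raw_text(record):
    """Return a new record holding only the keys allowed for the raw text variant."""
    return {key: record[key] for key in RAW_TEXT_KEYS if key in record}


full = {
    "corpusId": "ACE2005",
    "id": "toy-document",  # placeholder id, not a real ACE path
    "text": "the morning papers",
    "tokens": ["the", "morning", "papers"],
    "views": [],
}
print(to_raw_text(full))
```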

2) Sentence boundaries:

{
  "corpusId": "ACE2005",
  "id": "/Users/bhargav/code/cs546_project/entity-relations-coreference/data/ace05/data/English/bn/CNN_ENG_20030312_223733.14.apf.xml",
  "text": "  CNN_ENG_20030312_223733.14   NEWS STORY   2003-03-12 22:57:55     the morning papers, because morning papers around the country and around the world are remaking their front page to include the Elizabeth Smart case, and we end with that tonight. With her family on a day when their miracle finally came true.   she looks very, very healthy. She\u0027s grown a lot. And I\u0027m just so absolutely thrilled. ",
  "tokens": [
    "CNN_ENG_20030312_223733.14",
    "NEWS",
    "STORY",
    "2003-03-12",
    "22:57:55",
    "the",
    "morning",
    "papers",
    ",",
    "because",
    "morning",
    "papers",
    "around",
    "the",
    "country",
    "and",
    "around",
    "the",
    "world",
    "are",
    "remaking",
    "their",
    "front",
    "page",
    "to",
    "include",
    "the",
    "Elizabeth",
    "Smart",
    "case",
    ",",
    "and",
    "we",
    "end",
    "with",
    "that",
    "tonight",
    ".",
    "With",
    "her",
    "family",
    "on",
    "a",
    "day",
    "when",
    "their",
    "miracle"
  ],
  "sentences": {
    "generator": "UserSpecified",
    "score": 1.0,
    "sentenceEndPositions": [
      38,
      57,
      63,
      71,
      77
    ]
  }
}
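A consumer of this variant can recover the sentences from `tokens` and `sentenceEndPositions`. The sketch below assumes each end position is an exclusive token index; that matches the example above, where token 37 is the first "." and the first end position is 38, but it is an inference from this one record. `split_sentences` is a hypothetical helper and the tokens below are a toy input.

```python
def split_sentences(tokens, sentence_end_positions):
    """Group tokens into sentences, treating each end position as an exclusive token index."""
    sentences = []
    start = 0
    for end in sentence_end_positions:
        sentences.append(tokens[start:end])
        start = end
    return sentences


toy_tokens = ["She", "looks", "healthy", ".", "And", "thrilled", "."]
toy_ends = [4, 7]
for sentence in split_sentences(toy_tokens, toy_ends):
    print(" ".join(sentence))
```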

3) Gold mentions:

{
  "corpusId": "ACE2005",
  "id": "/Users/bhargav/code/cs546_project/entity-relations-coreference/data/ace05/data/English/bn/CNN_ENG_20030312_223733.14.apf.xml",
  "text": "  CNN_ENG_20030312_223733.14   NEWS STORY   2003-03-12 22:57:55     the morning papers, because morning papers around the country and around the world are remaking their front page to include the Elizabeth Smart case, and we end with that tonight. With her family on a day when their miracle finally came true.   she looks very, very healthy. She\u0027s grown a lot. And I\u0027m just so absolutely thrilled. ",
  "tokens": [
    "CNN_ENG_20030312_223733.14",
    "NEWS",
    "STORY",
    "2003-03-12",
    "22:57:55",
    "the",
    "morning",
    "papers",
    ",",
    "because",
    "morning",
    "papers",
    "around",
    "the",
    "country",
    "and",
    "around",
    "the",
    "world",
    "are",
    "remaking",
    "their",
    "front",
    "page",
    "to",
    "include",
    "the",
    "Elizabeth",
    "Smart",
    "case",
    ",",
    "and",
    "we",
    "end",
    "with",
    "that",
    "tonight",
    ".",
    "With",
    "her",
    "family",
    "on",
    "a",
    "day",
    "when",
    "their",
    "miracle"
  ],
  "sentences": {
    "generator": "UserSpecified",
    "score": 1.0,
    "sentenceEndPositions": [
      38,
      57,
      63,
      71,
      77
    ]
  },
  "views": [
    {
      "viewName": "ENTITYVIEW",
      "viewData": [
        {
          "viewType": "edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView",
          "viewName": "ENTITYVIEW",
          "generator": "edu.illinois.cs.cogcomp.nlp.corpusreaders.ACEReader",
          "score": 1.0,
          "constituents": [
            {
              "score": 1.0,
              "start": 5,
              "end": 8
            },
            {
              "score": 1.0,
              "start": 13,
              "end": 15,
              "properties": {
                "EntityHeadEndCharOffset": "128",
                "EntityHeadStartCharOffset": "122"
              }
            },
            {
              "score": 1.0,
              "start": 17,
              "end": 19
            }
          ]
        }
      ]
    }
  ]
}
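To sanity-check the gold mentions, the constituent spans can be mapped back to tokens. This is a sketch under two assumptions: the record parses into the `views` → `viewData` → `constituents` nesting, and `start`/`end` are token indices with an exclusive end. `mention_texts` is a hypothetical helper; the tokens and spans are copied from the example above.

```python
def mention_texts(record, view_name="ENTITYVIEW"):
    """Return the token text of every constituent in the named view."""
    tokens = record["tokens"]
    mentions = []
    for view in record.get("views", []):
        if view.get("viewName") != view_name:
            continue
        for data in view.get("viewData", []):
            for constituent in data.get("constituents", []):
                span = tokens[constituent["start"]:constituent["end"]]
                mentions.append(" ".join(span))
    return mentions


record = {
    "tokens": [
        "CNN_ENG_20030312_223733.14", "NEWS", "STORY", "2003-03-12", "22:57:55",
        "the", "morning", "papers", ",", "because", "morning", "papers",
        "around", "the", "country", "and", "around", "the", "world",
    ],
    "views": [{
        "viewName": "ENTITYVIEW",
        "viewData": [{
            "constituents": [
                {"start": 5, "end": 8},
                {"start": 13, "end": 15},
                {"start": 17, "end": 19},
            ],
        }],
    }],
}
print(mention_texts(record))  # ['the morning papers', 'the country', 'the world']
```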
