KorAP / Krill

:mag: A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

KWIC Json Snippet #72

Closed margaretha closed 2 years ago

margaretha commented 3 years ago

It would be nice to have another snippet representation, namely a simple JSON object containing the keyword in context (KWIC).

{
  "snippet": {
    "left": "sollte er auch in mehrere Kategorien.",
    "match": "Merkel",
    "right": " ist nicht gleichzeitig noch Ministerpräsidentin oder"
  },
  "matchID": "match-WUD17/G91/96055-p20826-20827",
  "UID": 0,
  ...
}

kupietz commented 3 years ago

In addition (or instead?) a tokenized array representation would be nice:

{
  "tokenizedSnippet": {
    "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien."],
    "match": ["Merkel"],
    "right": ["ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"]
  },
  "matchID": "match-WUD17/G91/96055-p20826-20827",
  "UID": 0
}

It doesn't make much sense to repeat the tokenization in the clients.
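
As a rough illustration of what a client would gain, the sketch below renders a KWIC line straight from the token arrays, with no tokenization step of its own. The function name `render_kwic` and its `width` parameter are hypothetical and not part of Krill. Joining tokens with single spaces is also an assumption, since the arrays do not record the original whitespace.

```python
import json

# Response shaped like the proposal above (any further fields left out).
response = json.loads("""
{
  "tokenizedSnippet": {
    "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien."],
    "match": ["Merkel"],
    "right": ["ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"]
  },
  "matchID": "match-WUD17/G91/96055-p20826-20827",
  "UID": 0
}
""")


def render_kwic(snippet, width=3):
    """Return 'left [match] right' with at most `width` context tokens per side.

    Hypothetical helper: tokens are joined with single spaces (an assumption).
    """
    left = snippet["left"][-width:] if width > 0 else []
    right = snippet["right"][:width]
    match = " ".join(snippet["match"])
    return " ".join(left + ["[" + match + "]"] + right)


print(render_kwic(response["tokenizedSnippet"]))
# in mehrere Kategorien. [Merkel] ist nicht gleichzeitig
```

The client only slices and joins lists here, which is the point of shipping the tokens pre-split.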

Akron commented 2 years ago

What would be a good approach to support classes in this scenario as well? I would propose something like this:

"tokens": {
  "list": ["sollte", "er", "auch", "in", "mehrere", "Kategorien", "Merkel", "ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
  "matchIdx": [6, 7],
  "classesIdx": [[1, 6, 7], [2, 6, 7]]
}
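
To see how this flat layout would be read, here is a minimal sketch that splits the list back into context and match. It assumes `matchIdx` holds `[start, end]` token positions with an inclusive end, which the comment does not state. `split_by_match` is a hypothetical helper, not a Krill function.

```python
def split_by_match(tokens):
    """Split a flat token list into left context, match, and right context.

    Assumption: matchIdx is [start, end] with an INCLUSIVE end position.
    """
    start, end = tokens["matchIdx"]
    words = tokens["list"]
    return {
        "left": words[:start],
        "match": words[start:end + 1],
        "right": words[end + 1:],
    }


tokens = {
    "list": ["sollte", "er", "auch", "in", "mehrere", "Kategorien",
             "Merkel", "ist", "nicht", "gleichzeitig", "noch",
             "Ministerpräsidentin", "oder"],
    "matchIdx": [6, 7],
    "classesIdx": [[1, 6, 7], [2, 6, 7]],
}

parts = split_by_match(tokens)
print(parts["match"])
# ['Merkel', 'ist']
```

Under that assumption the match comes out as `Merkel ist`, with six tokens of left context and five of right context.
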

Akron commented 2 years ago

Maybe this is better:

"tokens": {
  "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien"],
  "match": ["Merkel", "ist"],
  "right": ["nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
  "classes": [[1, 0, 1], [2, 0, 2]]
}

The match would be simple to analyze, and since classes can only occur inside the match, their offset positions should be clear as well.
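
A minimal sketch of how a client might resolve the classes, assuming each entry is `[classNumber, start, end]` with offsets counted from the first token of `match`. Whether `end` is inclusive or exclusive is not stated in the thread, so the hypothetical helper `resolve_classes` takes it as a flag and defaults to exclusive.

```python
def resolve_classes(tokens, end_exclusive=True):
    """Map each class number to the match tokens it covers.

    Assumptions (not confirmed by the thread): entries are
    [class_number, start, end], offsets are relative to the match,
    and `end` is exclusive unless end_exclusive is False.
    """
    match = tokens["match"]
    resolved = {}
    for class_nr, start, end in tokens["classes"]:
        stop = end if end_exclusive else end + 1
        resolved[class_nr] = match[start:stop]
    return resolved


tokens = {
    "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien"],
    "match": ["Merkel", "ist"],
    "right": ["nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
    "classes": [[1, 0, 1], [2, 0, 2]],
}

print(resolve_classes(tokens))
# {1: ['Merkel'], 2: ['Merkel', 'ist']}
```

With the exclusive reading, class 1 covers `Merkel` and class 2 covers `Merkel ist`. With an inclusive reading, the end offset 2 of class 2 would point past the two-token match, so the convention would need to be pinned down before clients rely on it.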