Closed by margaretha 2 years ago
In addition (or instead?), a tokenized array representation would be nice:
{
"tokenizedSnippet": {
"left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien."],
"match": ["Merkel"],
"right": ["ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"]
},
"matchID": "match-WUD17/G91/96055-p20826-20827",
"UID": 0
}
It doesn't make much sense to repeat the tokenization in the clients.
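To illustrate the point, here is a minimal client-side sketch in Python that renders a KWIC line straight from the pre-tokenized response, without running any tokenizer of its own. The field names follow the example above; `kwic_line` and its `context` parameter are hypothetical names used only for illustration.

```python
import json

# Example response shaped like the tokenizedSnippet proposal above.
response_text = """
{
  "tokenizedSnippet": {
    "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien."],
    "match": ["Merkel"],
    "right": ["ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"]
  },
  "matchID": "match-WUD17/G91/96055-p20826-20827",
  "UID": 0
}
"""


def kwic_line(snippet, context=3):
    """Hypothetical helper: build a KWIC line from server-side tokens.

    Keeps at most `context` tokens on each side of the match.
    """
    # Guard against context == 0, because list[-0:] would return the whole list.
    left = snippet["left"][-context:] if context > 0 else []
    right = snippet["right"][:context]
    return " ".join(left) + " [" + " ".join(snippet["match"]) + "] " + " ".join(right)


result = json.loads(response_text)
line = kwic_line(result["tokenizedSnippet"])
print(line)
```

The client only slices and joins lists here, so token boundaries stay exactly as the server decided them.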
What would be a good approach to supporting classes in this scenario as well? I would propose something like this:
"tokens": {
"list": ["sollte", "er", "auch", "in", "mehrere", "Kategorien", "Merkel", "ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
"matchIdx":[6,7],
"classesIdx":[[1,6,7],[2,6,7]]
}
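A rough sketch of how a client might resolve this flat layout, written in Python. The proposal does not say whether the end index in `matchIdx` and `classesIdx` is inclusive or exclusive, so the sketch takes that as a parameter; it also assumes each `classesIdx` entry is `[class number, start, end]`. The function names `span` and `resolve` are hypothetical.

```python
tokens = {
    "list": ["sollte", "er", "auch", "in", "mehrere", "Kategorien", "Merkel",
             "ist", "nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
    "matchIdx": [6, 7],
    "classesIdx": [[1, 6, 7], [2, 6, 7]],
}


def span(token_list, start, end, inclusive_end=False):
    """Slice tokens from start to end; the end convention is an assumption."""
    stop = end + 1 if inclusive_end else end
    return token_list[start:stop]


def resolve(tokens, inclusive_end=False):
    """Split the flat list into left context, match, right context and classes.

    Assumes classesIdx entries are [class number, start, end].
    """
    token_list = tokens["list"]
    m_start, m_end = tokens["matchIdx"]
    match = span(token_list, m_start, m_end, inclusive_end)
    classes = {}
    for class_no, start, end in tokens["classesIdx"]:
        # A class number may occur more than once, so collect spans per class.
        classes.setdefault(class_no, []).append(
            span(token_list, start, end, inclusive_end)
        )
    return {
        "left": token_list[:m_start],
        "match": match,
        "right": token_list[m_start + len(match):],
        "classes": classes,
    }


half_open = resolve(tokens)                      # treats [6, 7] as tokens 6..6
inclusive = resolve(tokens, inclusive_end=True)  # treats [6, 7] as tokens 6..7
print(half_open["match"], inclusive["match"])
```

The two readings give different matches for the same indices, so whichever convention is chosen would need to be documented.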
Maybe this is better:
"tokens": {
"left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien"],
"match": ["Merkel", "ist"],
"right": ["nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
"classes":[[1,0,1],[2,0,2]]
}
The match would be simple to analyze, and since classes can only occur inside the match, their offset positions (relative to the start of the match) should be clear as well.
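As a sketch of how a client could read the classes in this layout (Python): it assumes each `classes` entry is `[class number, start offset, end offset]`, with offsets counted from the start of `match` and the end offset exclusive. That convention is an assumption, not something stated above; it is simply the reading under which `[2, 0, 2]` stays inside the two-token match. `class_tokens` is a hypothetical helper name.

```python
tokens = {
    "left": ["sollte", "er", "auch", "in", "mehrere", "Kategorien"],
    "match": ["Merkel", "ist"],
    "right": ["nicht", "gleichzeitig", "noch", "Ministerpräsidentin", "oder"],
    "classes": [[1, 0, 1], [2, 0, 2]],
}


def class_tokens(tokens):
    """Map each class number to its tokens inside the match.

    Assumes entries are [class number, start, end) relative to the match.
    """
    match = tokens["match"]
    resolved = {}
    for class_no, start, end in tokens["classes"]:
        # Classes can only lie inside the match, so reject anything outside it.
        if not (0 <= start <= end <= len(match)):
            raise ValueError(f"class {class_no} lies outside the match")
        resolved[class_no] = match[start:end]
    return resolved


classes = class_tokens(tokens)
print(classes)
```

Because the offsets are relative to the match, a client never needs the length of the left context to locate a class.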
It would be nice to have another snippet representation, namely a simple JSON object containing the keyword in context (KWIC).
{
"snippet": {
"left": "sollte er auch in mehrere Kategorien.",
"match": "Merkel",
"right": " ist nicht gleichzeitig noch Ministerpräsidentin oder"
},
"matchID": "match-WUD17/G91/96055-p20826-20827",
"UID": 0,
...
}