KorAP / Krill

A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

Metadata fields for text do not support space delimiters #32

Closed Akron closed 6 years ago

Akron commented 7 years ago

When searching a metadata field like "author" as part of a virtual corpus, it is currently not possible to query it as a string containing a space delimiter; for example, "author eq 'Theodor Fontane'" does not work. Maybe text fields queried with "eq" should be treated as sequences of tokens delimited by spaces.

Akron commented 7 years ago

Another possibility would be to index all text fields both as sequences of tokens and as a string.

Akron commented 6 years ago

One problem is that metadata fields like "author" will be language-specific. For the moment I will make this German-only, but we may need to come up with a good solution that is not language-specific.

Akron commented 6 years ago

For the moment, I ignore language-specific indexing and use the StandardTokenizer with standard lowercasing. It is now possible to search a text field by its exact string with a term query and to search it with a phrase query as well. This is realized by prepending the verbatim string as a token, separated by a huge position gap, to the real token stream. After playing around with the preferred tokenizer pipeline in Lucene, I switched to creating the token stream myself.

(Will be closed once the changes are reviewed.)
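
For readers following the thread, the indexing scheme described in this comment can be sketched roughly as follows. This is an illustrative sketch in plain Python, not Krill's actual code: the function names, the regex standing in for the StandardTokenizer, and the gap value of 1000 are all assumptions (the comment only says the gap is "huge").

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical gap size; the thread only describes it as "huge".
POSITION_GAP = 1000


def build_token_stream(value: str) -> List[Tuple[str, int]]:
    """Return (term, position) pairs for one text field value.

    The verbatim string is emitted first at position 0. The lowercased
    word tokens follow, starting POSITION_GAP positions later, so that
    no phrase can span the verbatim token and the first word token.
    """
    stream = [(value, 0)]
    # Rough stand-in for a standard tokenizer plus lowercasing.
    words = re.findall(r"\w+", value.lower())
    for offset, word in enumerate(words):
        stream.append((word, POSITION_GAP + offset))
    return stream


def term_match(stream: List[Tuple[str, int]], term: str) -> bool:
    """True if the exact term occurs anywhere in the stream."""
    return any(t == term for t, _ in stream)


def phrase_match(stream: List[Tuple[str, int]], phrase: str) -> bool:
    """True if the phrase's words occur at consecutive positions."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return False
    positions: Dict[str, Set[int]] = {}
    for term, pos in stream:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(w, ()) for i, w in enumerate(words))
        for start in positions.get(words[0], ())
    )


stream = build_token_stream("Theodor Fontane")
print(term_match(stream, "Theodor Fontane"))    # exact string lookup
print(phrase_match(stream, "Theodor Fontane"))  # phrase lookup over tokens
```

The gap matters mostly for short values: if a single-word value such as "x" were indexed with the verbatim token directly adjacent to its word token, the phrase "x x" would match by accident.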

Akron commented 6 years ago

Changes are now in master. They require reindexing to take effect.