Closed Akron closed 6 years ago
Another possibility would be to index all text fields both as sequences of tokens and as a string.
One problem is that metadata fields like "author" will be language-specific. For the moment I will make this German-only, but we may need to come up with a good solution that is not language-specific.
For the moment, I ignore language-specific indexing and use the StandardTokenizer with standard lowercasing. It is now possible to search a text field for the whole string with a term query and to search within the string with a phrase query as well. This is realized by prepending the verbatim string as a token, separated by a huge position gap from the real token stream. After playing around with the preferred tokenizer pipeline in Lucene, I switched to creating the token stream myself.
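To illustrate the idea, here is a minimal, library-free sketch in Python. It is not the actual implementation and does not use Lucene's API: the regex tokenizer is only a rough stand-in for StandardTokenizer with lowercasing, and `VERBATIM_GAP`, `index_field`, `term_match` and `phrase_match` are hypothetical names (the gap value is an assumption; the comment above only says "huge").

```python
import re

# Hypothetical gap size; the thread only says the gap is "huge".
VERBATIM_GAP = 1000


def tokenize(text):
    """Rough stand-in for StandardTokenizer plus lowercasing."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def index_field(value):
    """Return (term, position) pairs: the verbatim string first,
    then the real tokens shifted by VERBATIM_GAP."""
    postings = [(value, 0)]
    for offset, term in enumerate(tokenize(value)):
        postings.append((term, VERBATIM_GAP + offset))
    return postings


def term_match(postings, term):
    """Term search: the exact term must occur at some position."""
    return any(t == term for t, _ in postings)


def phrase_match(postings, phrase):
    """Phrase search: the phrase tokens must occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms:
        return False
    positions = {}
    for t, p in postings:
        positions.setdefault(t, set()).add(p)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )
```

With this layout, `term_match(index_field("Theodor Fontane"), "Theodor Fontane")` hits the verbatim token, while `phrase_match(..., "Theodor Fontane")` matches the two consecutive real tokens. The gap is presumably there so that no phrase can span the verbatim token and the first real token.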
(Will be closed once the changes are reviewed.)
Changes are now in master. They require reindexing to take effect.
When searching a metadata field like "author" as part of a virtual corpus, it is currently not possible to query it as a string containing spaces; e.g. "author eq 'Theodor Fontane'" does not work. Maybe text fields queried with "eq" should be treated as sequences of tokens delimited by spaces.
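The proposal in the last sentence could look roughly like the following Python sketch. It is only an illustration under assumptions: `eq_to_query` and the dictionary layout are hypothetical and do not reflect the project's real query format.

```python
def eq_to_query(field, value):
    """Interpret `field eq 'value'` on a text field as a sequence of
    space-delimited, lowercased tokens (hypothetical query layout)."""
    tokens = value.lower().split()
    if not tokens:
        raise ValueError("empty value for 'eq'")
    if len(tokens) == 1:
        # A single token needs no positional constraint.
        return {"type": "term", "field": field, "term": tokens[0]}
    # Several tokens must match in this order at adjacent positions.
    return {"type": "phrase", "field": field, "terms": tokens}
```

For example, `eq_to_query("author", "Theodor Fontane")` would yield a phrase query over the tokens `theodor` and `fontane` instead of a single term containing a space.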