KorAP / Krill

:mag: A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

Handling maxContextTokenLength #140

Open margaretha opened 4 months ago

margaretha commented 4 months ago

The token context size is less precise than the character context size: depending on how long the tokens are, a context limited by token count may exceed the maximum character context size.

To prevent this, we could recheck the token context of an actual search result against the max character context size. If the token context is longer than the max character context size, the context should be cut to a lower number of tokens. For instance:

with maxContextTokenLength = 3 and maxContextCharLength = 10

"This is a nice [example] for a snippet"

should be reduced to

"a nice [example] for a"
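
The recheck described above could be sketched as follows. This is only an illustration in Python, not Krill's actual implementation: the function name `trim_context`, its parameters, and the counting rule (characters of the space-joined tokens, limit inclusive) are all assumptions made for this example.

```python
from typing import List


def trim_context(tokens: List[str], max_tokens: int, max_chars: int, side: str) -> List[str]:
    """Hypothetical helper: limit one side of a match context.

    First keep at most `max_tokens` tokens nearest the match, then keep
    dropping the token farthest from the match until the space-joined
    text is no longer than `max_chars`.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if max_tokens <= 0:
        return []
    # Tokens nearest the match sit at the end of a left context
    # and at the start of a right context.
    kept = tokens[-max_tokens:] if side == "left" else tokens[:max_tokens]
    # Assumed counting rule: length of the tokens joined by single spaces.
    while kept and len(" ".join(kept)) > max_chars:
        kept = kept[1:] if side == "left" else kept[:-1]
    return kept


# Right context from the example: "for a snippet" has 13 characters,
# so one token is dropped to fit into 10 characters.
print(trim_context(["for", "a", "snippet"], 3, 10, "right"))
```

For the right context of the example this yields "for a". Whether whitespace or the match itself counts toward the character limit is not stated here, so results at the boundary (such as the left context of the example) depend on that choice.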