abartov / bybeconv

Project Ben-Yehuda's content management system.
https://benyehuda.org/

max_analyzed_offset issue with fulltext search #299

Closed · damisul closed this 2 months ago

damisul commented 3 months ago

While working on the API adjustments for search_after and testing them with fulltext search, I ran into an unexpected error:

Elasticsearch::Transport::Transport::Errors::BadRequest ([400] {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The length [1243722] of field [fulltext] in doc[8853]/index[manifestations_1710940302313] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"manifestations_1710940302313","node":"bO_mL36HQnueTnzHIIBRWw","reason":{"type":"illegal_argument_exception","reason":"The length [1243722] of field [fulltext] in doc[8853]/index[manifestations_1710940302313] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."}}],"caused_by":{"type":"illegal_argument_exception","reason":"The length [1243722] of field [fulltext] in doc[8853]/index[manifestations_1710940302313] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them.","caused_by":{"type":"illegal_argument_exception","reason":"The length [1243722] of field [fulltext] in doc[8853]/index[manifestations_1710940302313] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."}}},"status":400}):

The reason is that for fulltext search we want to return a highlight (a snippet of the text matching the request), and by default ES will only analyze the first 1,000,000 characters of a field when building a highlight. For works with longer content we get this error.
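For context, here is a minimal sketch of the kind of request that hits the limit. The client object, index name, and query shape are illustrative assumptions; the app's real query is built elsewhere and may go through a different layer:

```ruby
# Illustrative only: a fulltext search that asks ES to build a highlight snippet.
# For documents whose `fulltext` field is longer than
# index.highlight.max_analyzed_offset (1,000,000 by default), ES rejects the
# request with the illegal_argument_exception shown above.
body = {
  query: { match: { fulltext: 'search terms' } },
  highlight: {
    fields: { fulltext: {} } # build a snippet from the matched text
  }
}
client.search(index: 'manifestations', body: body)
```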

There are two possible ways to handle this: 1) we can raise the index setting 'max_analyzed_offset' above the length of the longest work (e.g. to 5,000,000), which as usual leads to increased memory consumption; 2) we can pass max_analyzed_offset = 1000000 with the request, in which case ES will only analyze the first 1,000,000 characters while building the highlight text. Both options are sketched below.
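A hedged sketch of both options, assuming the standard Elasticsearch Ruby client; index name, field names, and limits are illustrative, not the actual values used in the app:

```ruby
# Option 1 (illustrative): raise the index-level limit above the longest work.
# Avoids the error entirely, but long documents cost more memory per
# highlighted request.
client.indices.put_settings(
  index: 'manifestations',
  body: { index: { highlight: { max_analyzed_offset: 5_000_000 } } }
)

# Option 2 (illustrative): cap analysis per request. ES then truncates instead
# of erroring, so highlights for works longer than 1,000,000 characters are
# built from the first 1,000,000 characters only.
body = {
  query: { match: { fulltext: 'search terms' } },
  highlight: {
    max_analyzed_offset: 1_000_000,
    fields: { fulltext: {} }
  }
}
client.search(index: 'manifestations', body: body)
```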

Personally I would prefer the second approach. @abartov, WDYT?

abartov commented 3 months ago

Yes, I have encountered this too. We do indeed have several works that exceed 1M characters, and will continue to have such. If the only downside to the second option is that we won't have highlights when such a long text appears in search results, that's acceptable.

damisul commented 3 months ago

OK, I've added a commit fixing this to the PR for the search_after functionality.