elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Optimising Index_prefixes with option to ignore shingled tokens #102105

Open markharwood opened 9 months ago

markharwood commented 9 months ago

Description

TL;DR: index_prefixes on shingled fields costs a lot and benefits only a little.

index_prefixes is a useful query-time optimisation, but it comes at the cost of the extra disk space needed to index partial tokens. That cost acts as a multiplier based on the number of unique tokens and the configured range of prefix lengths.
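To make the multiplier concrete, here is a minimal sketch that models prefix indexing as edge n-grams of each unique token. It is a simplified model, not Elasticsearch's implementation: the function names are made up for illustration, it counts distinct prefix terms only (postings and positions are not modelled), and the 2 to 5 character range is assumed to match the documented index_prefixes defaults.

```python
def edge_prefixes(token, min_chars=2, max_chars=5):
    """Return the leading substrings of `token` with length min_chars..max_chars."""
    longest = min(len(token), max_chars)
    return [token[:n] for n in range(min_chars, longest + 1)]


def count_prefix_terms(tokens, min_chars=2, max_chars=5):
    """Count the distinct prefix terms added for a collection of tokens."""
    prefixes = set()
    for token in set(tokens):
        prefixes.update(edge_prefixes(token, min_chars, max_chars))
    return len(prefixes)


# Made-up sample tokens, purely for illustration.
tokens = ["election", "electric", "elastic", "search"]

print(count_prefix_terms(tokens))               # narrow prefix range
print(count_prefix_terms(tokens, max_chars=8))  # wider range, more prefix terms
```

Widening the prefix range or adding more unique tokens both increase the number of extra terms, which is the multiplier effect described above.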

This is particularly costly when a field uses shingles, because shingles greatly increase the number of unique terms. To quantify the cost, I compared the index sizes for ~1m news headlines:
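As a rough illustration of why shingles inflate the term count, the sketch below builds two-word shingles from a few made-up headlines and compares the number of unique single words with the number of unique shingles. The headlines and the helper name are invented for this example and are not taken from the dataset measured in this issue.

```python
def two_word_shingles(words):
    """Join each pair of adjacent words into a single two-word shingle."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


# Made-up headlines, purely for illustration.
headlines = [
    "markets rally after election",
    "election results shake markets",
    "markets fall after election",
]

unigrams = set()
shingles = set()
for headline in headlines:
    words = headline.split()
    unigrams.update(words)
    shingles.update(two_word_shingles(words))

print(len(unigrams), "unique words")
print(len(shingles), "unique two-word shingles")
```

Words repeat across headlines far more often than word pairs do, so on a real corpus the set of unique shingles tends to grow much faster than the set of unique words.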

Index sizes with various options

| Field content | No prefix (MB) | With prefix (MB) | Prefix overhead (% of non-prefixed size) |
|---|---|---|---|
| Plain text | 347 | 435 | 25.36% |
| Two-word shingled text | 402 | 641 | 59.45% |
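The percentages can be reproduced from the reported sizes. The snippet below is just that arithmetic; the helper name is made up for this example.

```python
def overhead_pct(base_mb, larger_mb):
    """Size increase of `larger_mb` over `base_mb`, as a percentage of `base_mb`."""
    return round((larger_mb - base_mb) / base_mb * 100, 2)


# Index sizes in MB, as reported in this issue.
plain, plain_prefixed = 347, 435
shingled, shingled_prefixed = 402, 641

print(overhead_pct(plain, plain_prefixed))        # prefix overhead on plain text
print(overhead_pct(shingled, shingled_prefixed))  # prefix overhead on shingled text
print(overhead_pct(plain, shingled))              # shingle overhead without prefixes

# Absolute prefix cost in MB for each variant.
print(plain_prefixed - plain, shingled_prefixed - shingled)
```

In absolute terms, prefixes add 88 MB to the plain-text index and 239 MB to the shingled one, so the prefix cost is roughly 2.7 times larger once shingles are present.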

The overhead of prefixes on plain text is only 25%, but on a shingled index it is a disproportionate 59%. I'm happy to pay the shingle overhead over plain text (a 15.85% increase in space), but the prefix overhead on top of that is high. I need index_prefixes on my index because my interval queries that use prefixes routinely fail against Elasticsearch's hardcoded limit of 128 clauses, due to the high number of unique shingles.
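For readers unfamiliar with the setup, the sketch below shows roughly what such a mapping and query could look like, written as Python dictionaries that serialise to the JSON request bodies. The field name `headline`, the query text, and the prefix are assumptions made for illustration; the shingle analyzer configuration used in this issue is omitted, and the prefix lengths shown are assumed to be the documented defaults.

```python
import json

# Hypothetical field name; the issue does not say what the field is called.
FIELD = "headline"

# Mapping body that enables prefix indexing on a text field.
mapping = {
    "mappings": {
        "properties": {
            FIELD: {
                "type": "text",
                "index_prefixes": {"min_chars": 2, "max_chars": 5},
            }
        }
    }
}

# Intervals query that combines a phrase-like match with a prefix rule.
# Without index_prefixes, the prefix rule has to expand into matching terms,
# which is where the clause limit described above is hit.
query = {
    "query": {
        "intervals": {
            FIELD: {
                "all_of": {
                    "ordered": True,
                    "intervals": [
                        {"match": {"query": "interest rate"}},
                        {"prefix": {"prefix": "ri"}},
                    ],
                }
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```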

Solution

In practice I could live without my prefix queries being optimised for shingles. I tend to want two-word shingles for discovery via significant_text, but not necessarily as search terms that users type out. For this reason it would be useful to add an ignore_shingles option to index_prefixes. Lucene's analyzer framework tags tokens that are shingles, so these can easily be filtered out of the prefix indexing logic.
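The proposed behaviour could be sketched as below. This is a toy model in Python, not Elasticsearch or Lucene code: `ignore_shingles` is the option name proposed in this issue and does not exist today, the `Token` type is invented for the example, and the "shingle" and "word" type labels are simplified stand-ins for the token types an analysis chain assigns. A wider prefix range (8 characters) is used so that the shingle prefixes are visibly distinct from the single-word ones.

```python
from typing import Iterable, NamedTuple, Set


class Token(NamedTuple):
    text: str
    type: str  # simplified label: "word" or "shingle"


def prefix_terms(
    tokens: Iterable[Token],
    min_chars: int = 2,
    max_chars: int = 5,
    ignore_shingles: bool = False,  # hypothetical option proposed in this issue
) -> Set[str]:
    """Collect the distinct prefix terms, optionally skipping shingle tokens."""
    prefixes = set()
    for token in tokens:
        if ignore_shingles and token.type == "shingle":
            continue  # shingles stay searchable as terms but get no prefixes
        longest = min(len(token.text), max_chars)
        for n in range(min_chars, longest + 1):
            prefixes.add(token.text[:n])
    return prefixes


# Made-up token stream for the text "rates rise again".
tokens = [
    Token("rates", "word"),
    Token("rise", "word"),
    Token("again", "word"),
    Token("rates rise", "shingle"),
    Token("rise again", "shingle"),
]

with_shingles = prefix_terms(tokens, max_chars=8)
without_shingles = prefix_terms(tokens, max_chars=8, ignore_shingles=True)
print(len(with_shingles), len(without_shingles))
```

With the option on, the shingles would still be indexed as ordinary terms for significant_text, and only the prefix terms derived from them would be dropped.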

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)