elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add back a terms index for numeric fields #94047

Open rkophs opened 1 year ago

rkophs commented 1 year ago

Description

Before Elasticsearch 5.x, numeric and date fields were indexed into an inverted terms index. However, this PR changed the underlying data structure to Lucene points (implemented as block KD trees, or BKD trees). Lucene points offer performance gains for range queries, but they are expected to perform worse for exact queries (i.e. term/terms queries). This poses a problem when numeric identifiers (whose query patterns tend to be term-heavy) are stored as numbers.
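To make the trade-off concrete, here is a toy sketch in Python. It is not how Lucene implements either structure: a dict of posting lists stands in for the inverted terms index, and a sorted list searched with `bisect` stands in for the points structure. All names and data are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Toy data (made up): doc ID -> numeric identifier value.
docs = {0: 42, 1: 7, 2: 42, 3: 100, 4: 7}

# Stand-in for an inverted terms index: each distinct value maps to
# the list of doc IDs that contain it.
inverted = defaultdict(list)
for doc_id, value in sorted(docs.items()):
    inverted[value].append(doc_id)

# Stand-in for a points structure: (value, doc_id) pairs kept in value order.
points = sorted((value, doc_id) for doc_id, value in docs.items())
values = [v for v, _ in points]


def term_query(value):
    """Exact match: one lookup in the inverted index."""
    return inverted.get(value, [])


def range_query(low, high):
    """Inclusive range: binary search over the sorted values."""
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return sorted(doc_id for _, doc_id in points[start:end])


print(term_query(42))       # docs holding exactly 42
print(range_query(7, 42))   # docs with 7 <= value <= 42
```

The point of the sketch is only that the two access patterns favour different structures, which is why the proposal is to keep both for fields that need both.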

Elastic's recommendation from ES 5.x forward is to store numeric identifiers as a stringified keyword in order to get the query performance boost for term/terms queries. This poses several limitations:

All in all, for our application we created a plugin that re-implements the native ES numeric fields, indexing the terms into Lucene points AND into an inverted terms index. Doing so has provided several benefits:

It would be much simpler if we could index the data into the appropriate data structures within a single field, where the user can opt in to a terms index much as doc values can be enabled and disabled. In this approach, numerics continue to operate by default as they always have. However, the user may specify a field setting called terms: (true|false) in the index mapping to define whether the numeric should additionally be indexed into an inverted terms index. I've applied the change to all numeric and date fields.
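As a rough sketch of what such an opt-in mapping might look like, the Python snippet below builds an index body with the proposed setting. Note that `terms` is the setting proposed in this issue and is not an existing Elasticsearch mapping parameter; the field names and the `numeric_field` helper are hypothetical.

```python
import json


def numeric_field(field_type, terms=False):
    """Build a field mapping. `terms` is the PROPOSED setting from this
    issue, not an existing Elasticsearch mapping parameter."""
    mapping = {"type": field_type}
    if terms:
        mapping["terms"] = True
    return mapping


# Hypothetical index body: identifier-like fields opt in, others do not.
index_body = {
    "mappings": {
        "properties": {
            "user_id": numeric_field("long", terms=True),
            "price": numeric_field("double"),
            "created_at": numeric_field("date", terms=True),
        }
    }
}

print(json.dumps(index_body, indent=2))
```

Fields that omit the setting would keep today's behaviour, so existing mappings would not need to change.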

Spot the improvement in our query response times when we switched to using a numeric terms index to fulfill our term/terms queries:

[Screenshot 2023-02-22 at 23 25 11: query response times before and after switching to a numeric terms index]

I will post a PR to demonstrate how we implemented the terms index. I understand if this may not be the Elastic community's preferred approach, but I do hope the community will consider some of these changes given the limitations of the current system and the substantial benefits we have seen on our own workloads as described above.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

yunhsincynthiachen commented 8 months ago

@javanna Sorry for pinging you directly, but I noticed that you added the team-discuss label almost a year ago, and I was wondering whether the proposal in this PR has been discussed already, and whether it will be rejected or possibly approved for a later version.

It's been almost a year of waiting, and having this change approved would be hugely beneficial for our team and 2024 roadmap, which is why I wanted to check in. Even if it's a "no and why" answer, it will greatly help us figure out some next steps.

javanna commented 8 months ago

Hi @yunhsincynthiachen thanks for the ping and sorry about the lag. The team has been very busy and we have not gotten many reports around this specific issue, although we are aware of it. That is why we were not able to prioritize it.

We did recently discuss it with the team and this is a short summary of our discussion:

The change to index numerics with points was originally made in Elasticsearch 5.0, some years ago. We are not convinced we should index numeric fields as keywords by default, in addition to indexing them with points. One point that was brought up is the pain of having to make client-side changes to use a multi-field to improve performance for exact queries, yet the linked PR (#94048) and its proposed approach would require opting in to enable terms on numeric fields. What is the difference between using a multi-field, conceptually, and enabling indexing terms as part of the numeric field definition? Wouldn't both require client-side changes? In my mind, both require taking action on the client side when defining mappings. We have discussed the possibility of providing some syntactic sugar, perhaps an additional field type like we have done with match_only_text, but that would achieve the same outcome as using a multi-field.

Given this, we don't have concrete plans to prioritize this, especially as using a multi-field is a valid work-around. Please let us know if we missed some important aspects, and we would be happy to discuss further. What prevents you from using a multi-field and addressing the performance of exact queries that way?
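For readers unfamiliar with the work-around, a minimal sketch of a multi-field mapping follows, written as Python dicts that mirror the JSON request bodies. The field name `account_id`, the sub-field name `keyword`, and the sample values are assumptions for illustration; no client library is used.

```python
import json

# Numeric field with a keyword sub-field (multi-field). The numeric part
# is indexed with points; the sub-field is indexed as a keyword term.
mapping = {
    "mappings": {
        "properties": {
            "account_id": {
                "type": "long",
                "fields": {
                    "keyword": {"type": "keyword"}
                },
            }
        }
    }
}

# Exact lookups target the keyword sub-field...
term_query = {"query": {"term": {"account_id.keyword": "12345"}}}

# ...while range queries target the numeric field itself.
range_query = {"query": {"range": {"account_id": {"gte": 1000, "lt": 2000}}}}

for body in (mapping, term_query, range_query):
    print(json.dumps(body))
```

With this approach the query author has to pick `account_id` or `account_id.keyword` depending on the kind of query being written.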

rkophs commented 6 months ago

Hi @javanna - thanks for discussing the issue with the team. I appreciate the feedback. Indeed, using a multi-field with a numeric+keyword could lead to similar improvements to terms queries. To answer your questions directly:

What is the difference between using a multi-field, conceptually, and enabling indexing terms as part of the numeric field definition?

Using a multi-field means dynamically choosing which of the fields in the multi-field to use in the written query, based on whether you want to do a terms lookup or a range-style query. This is far more cumbersome to do client-side, especially in larger organizations that have dozens, if not hundreds, of different teams leveraging Elastic. As maintainer of the Elastic stack at our current organization, where we manage hundreds of Elasticsearch clusters in production used by several dozen separate product teams, I find it nearly impossible to ensure every team is leveraging the right field type for the query at hand. It's also extremely easy to overlook the key detail that id-style numerics should actually be a keyword instead of a numeric to support faster terms lookups, especially given that you would intuitively choose a numeric type for numeric data in so many other data stores. For our organization, it's far easier at scale to ensure the right settings are chosen only once, at the time the index is created. We can ensure this much more easily because index creation can be centralized behind a single service where validation can be put into place. Furthermore, this simplifies the query logic because teams no longer need to think about which field to use for range-style vs terms-style queries.

Wouldn't both require client-side changes?

Yes, but the key difference is client-side changes at query time vs index-creation time. It's much easier to manage and validate changes at index-creation time than at query time, especially since index creation is generally outside the critical path.
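To illustrate the kind of index-creation-time check described here, a sketch follows. It assumes a hypothetical naming convention (identifier fields end in `_id`) and the proposed, not yet existing, `terms` setting; the function name and the rule itself are illustrative, not our actual service.

```python
# Elasticsearch numeric field types.
NUMERIC_TYPES = {
    "long", "integer", "short", "byte", "unsigned_long",
    "double", "float", "half_float", "scaled_float",
}


def find_unindexed_ids(properties, id_suffix="_id"):
    """Return names of numeric identifier fields that lack term indexing.

    Assumptions (hypothetical): identifier fields end with `id_suffix`,
    and `terms: true` is the opt-in setting proposed in this issue.
    """
    problems = []
    for name, field in sorted(properties.items()):
        is_numeric_id = (
            name.endswith(id_suffix) and field.get("type") in NUMERIC_TYPES
        )
        if is_numeric_id and not field.get("terms", False):
            problems.append(name)
    return problems


# Example mapping as submitted to a central index-creation service.
submitted = {
    "order_id": {"type": "long"},
    "customer_id": {"type": "long", "terms": True},
    "total": {"type": "double"},
    "session_id": {"type": "keyword"},
}

print(find_unindexed_ids(submitted))
```

A check like this runs once per index, whereas steering every query toward the right sub-field would have to be enforced in every client.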

especially as using a multi-field is a valid work-around

Apart from the usability concerns mentioned above, there are some key limitations of keyword, put forth in the PR, that are equally important to call out:

javanna commented 6 months ago

Thanks for the feedback @rkophs. We discussed it again with the team, taking the points you made into consideration. We do see value in having something easier to interact with than multi-fields. It would be nice to have automatic resolution of the field being queried.

We also took a step back and discussed again whether such a visible regression for exact queries against the BKD tree is expected. We'd like to do some more digging to see if there are any fixes we can make to improve that.

From a product perspective, this is not high priority for the team to work on.

elasticsearchmachine commented 6 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)