elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Allow enrich indices to be replicated to ingest nodes as non-shard data #95969

Open dakrone opened 1 year ago

dakrone commented 1 year ago

Description

Currently we tell users of enrich that they should co-locate the nodes that perform the enrichment (ingest nodes) with the actual enrich data so that enrich operations don't require a remote search operation.

However, dedicated coordinating nodes (which are also ingest nodes) are prohibited from holding any data. One thing we could do to make this better is to treat the enrich index itself similarly to the GeoIP data. This would mean sticking it into an index as a raw file, then having whichever nodes need the data (ingest nodes) pull it and query it locally, independently of our regular shard querying semantics (i.e., use Lucene on the files directly). This would allow us to replicate it wherever needed.

This could also help in a serverless environment, where the ingest and search tiers are separated, since the indexing nodes (where ingest happens) won't hold the queryable data, and will have to make a remote search every time enrich occurs.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 1 year ago

Another thing we could consider doing here is getting rid of the enrich cache.

One case where the cache bites us is if the cluster has vastly more documents in the enrich indices than the feature was generally designed for (let's go with millions), and/or if the data in the enrich indices is used uniformly and doesn't have a 'hot' (and therefore cache-able) component.

Another problem is that we have a thundering herd around the cache when the policies are executed. Policy execution bumps the concrete index reference, and the cache key includes the concrete index reference, so a policy execution acts like a clearing of the cache (for the documents associated with that policy that were in the cache). The now-empty cache then results in many threads hitting the coordinator proxy action at once, which is the thundering herd.
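To make the invalidation effect concrete, here is a toy sketch in Python. It is not the actual enrich cache: the class name, the index names, and the miss counter are all hypothetical. It only models a cache whose key includes the concrete index name, so that pointing lookups at a new concrete index makes every previously cached entry unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichCacheSketch:
    """Toy cache keyed by (concrete index name, lookup value). Hypothetical."""

    entries: dict = field(default_factory=dict)
    remote_searches: int = 0  # counts cache misses that would go remote

    def lookup(self, concrete_index: str, value: str) -> str:
        key = (concrete_index, value)
        if key not in self.entries:
            # Cache miss: stand-in for the remote search via the coordinator.
            self.remote_searches += 1
            self.entries[key] = f"enriched:{value}"
        return self.entries[key]


def simulate_policy_reexecution():
    cache = EnrichCacheSketch()
    values = ["a", "b", "c"]
    old_index = "enrich-users-1"  # hypothetical concrete index names
    new_index = "enrich-users-2"

    for v in values:  # cold cache: every lookup misses
        cache.lookup(old_index, v)
    for v in values:  # warm cache: every lookup hits
        cache.lookup(old_index, v)
    misses_when_warm = cache.remote_searches

    # Policy execution bumps the concrete index, so the same values miss again.
    for v in values:
        cache.lookup(new_index, v)
    return misses_when_warm, cache.remote_searches


print(simulate_policy_reexecution())
```

In the real system those repeated misses arrive from many ingest threads at roughly the same time, which is what turns a cold cache into a herd.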

jbaiera commented 1 year ago

This is a great idea!

I think the biggest challenge we would need to address in making this a reality is likely storage. Ingest nodes are often also data nodes and thus generally equipped to hold data, but where this would obviously help most is on ingest-only nodes. The GeoIP processor databases are about 40 MB. Enrich indices can be significantly larger and may require infrastructure changes on some clusters to support storing larger side data files. One way to address this would be to trigger the replication only as an optimization when the enrich index is below a certain size, and to target just ingest-only nodes. Ingest nodes that have the data role should probably rely on regular data locality.
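The rule proposed above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the function name, the set of role names treated as "data" roles, and especially the 100 MB threshold, which is an arbitrary placeholder and not a value anyone has proposed.

```python
# Role names treated as "holds shard data" for this sketch (assumption).
DATA_ROLES = {
    "data", "data_content", "data_hot", "data_warm", "data_cold", "data_frozen",
}

# Arbitrary placeholder threshold, not a proposed value.
HYPOTHETICAL_MAX_BYTES = 100 * 1024 * 1024


def should_replicate_locally(node_roles, enrich_index_bytes,
                             max_bytes=HYPOTHETICAL_MAX_BYTES):
    """Return True if this node should pull a local copy of the enrich index."""
    roles = set(node_roles)
    if "ingest" not in roles:
        return False  # node never runs enrich processors
    if roles & DATA_ROLES:
        return False  # data nodes rely on regular shard locality instead
    return enrich_index_bytes <= max_bytes  # only small indices qualify


print(should_replicate_locally(["ingest"], 40 * 1024 * 1024))
print(should_replicate_locally(["ingest", "data_hot"], 40 * 1024 * 1024))
```

Under these assumptions an ingest-only node with a 40 MB enrich index would pull a local copy, while the same index on a node that also has a data role would not be replicated as side data.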