_id-less indices - Githubissues

jpountz commented 5 years ago

Various experiments over time have highlighted that the _id field, and to some extent the _seq_no field, use non-negligible disk space. When indexing tiny documents, these two fields combined can end up using more than 50% of the index size. Are there conditions in which we could enable users to drop the _id field in order to save resources?

elasticmachine commented 5 years ago

Pinging @elastic/es-distributed (:Distributed/Distributed)

jpountz commented 5 years ago

We discussed two special kinds of indices that could work without ids:

Read-only indices. These would work by creating a regular index, adding data to it and at some point turning it read-only. This operation would be non reversible and could allow to get rid of the _id, _seq_no and _primary_term fields, since those are not needed for searching. After an index is turned read-only, indexing, GET, delete and update operations would be rejected.
Append-only indices. These indices would need to be created as append-only from the beginning and would only allow append index requests (POST index/_doc). GET, delete and update operations would be rejected. It would work internally by dropping the _id, _seq_no and _primary_term fields as part of the merge process on all segments whose sequence numbers are part of no retention lease.

Read-only indices would be much simpler to implement and would fit nicely into index lifecycle management. Its main drawback is that savings are delayed to a later time, so this might not address the perception that Elasticsearch uses a lot of disk space, and Elasticsearch would still perform poorly disk-usage-wise in benchmarks unless turned into a read-only index at the end of the benchmark. Append-only indices would address these points but also bring more complexity in order to avoid duplicates because of client-side retries, or because of retries on the coordinating node.

There is potential for better compression of these fields, but reducing by 20% would already be huge and would only reduce the overall index size by 10% assuming that _id/_seq_no take 50% of the index size, which is why we are rather looking at removing these fields completely. We will probably want to look into some compression improvements, but this is not the focus of this issue.

Since the relative overhead of _id/_seq_no is mostly noticeable when documents are very small, a point was made that we could also look into providing better tooling to help e.g. merge several measurements into a single document in order to have fewer, larger documents.

Next steps:

Get a better sense of the complexity of append-only indices and whether their benefits are worth this complexity.
Better understand when users create small documents and whether better tooling could help create fewer larger documents.

martijnvg commented 4 years ago

Dropping the _id field in the context of data streams makes perhaps more sense than just on regular indices. Currently a data stream only accepts append-only writes, but updates/deletes are allowed directly via the backing indices. We can think about introducing new data stream specializations then we can better enforce id-less / append-only writes.

I think there is a place for both kinds of _id removal in the context of data streams. In case that updating (or client retries) should be allowed for a long period of time then perhaps a backing index should drop the _id when it is effectively read only (we could then drop _id, _seq_no and _primary_term as part of the force merging performed by ilm). In the case that client retries aren't that important then the _id field can be dropped as soon as the document's seqno is past the minimum retained seqno (so _id is available for replica replication and ccr). Even in the latter case, client retries could still be possible if the previous write's seqno is greater than the minimum retained seqno (the window of operations could be kept large if that is required in certain setups).

Both kinds of implementations would be centered around a merge policy that drops the _id and new data stream semantics, so I think having both options available shouldn't add that much work.

jpountz commented 4 years ago

Some data points regarding how much we can expect to save with this change:

On an index containing nginx.access data, _id and _seq_no were responsible for respectively 4.5% and 2.5% of disk usage, ie. a total of 7% of disk usage.
On an index containing metrics pulled from Prometheus, _id and _seq_no were responsible for respectively 8% and 5% of disk usage, ie. a total of 13%.
On an index containing Elasticsearch logs, _id and _seq_no were responsible for respectively 25% and 1% of disk usage, ie. a total of 26%.

tsg commented 4 years ago

On an index containing Elasticsearch logs, _id and _seq_no were responsible for respectively 25% and 1% of disk usage, ie. a total of 26%.

I'm curious about this one because the _id percentage seems very large. Are Elasticsearch logs ingested with the Elasticsearch module in Beats or directly ingested with just the message field or similar? Is there a place where I can see the test data?

jpountz commented 3 years ago

@tsg The data can't be made public but I'll share more details privately with you.

jpountz commented 3 years ago

One aspect that I had not considered until now is that dropping _id upon merge might help with the indexing rate, as follow-up merges wouldn't have to care about the _id field.

timroes commented 3 years ago

I am trying to understand the full-impact of that change, and there is one thing not entirely clear to me yet: If we'd have _id less datastreams, what would be the way to uniquely identify a document in them? Or wouldn't there be any way in those data streams anymore?

We rely in Kibana in a lot of places on uniquely identifying objects by _id (and index) at the moment, though we could exchange that by another unique identifier, but would have problems, if there wouldn't be any unique identifier available for documents.

A non exhaustive list of features currently using _id (and would require some unique identification of documents):

The Maps application can show a layer of individual documents on a map, when opening a tooltip of one of those documents we load the other (non Geo fields) we want to show about that document via a separate GET request to get them.
In Discover you can "View a single document". We link to something like /view/<index>/<id> and when accessing that view, use an id query to load that document to show. Similar we load the document for "Viewing surrounding documents".
In Discover you can look at documents in a flyout, not just the configured columns. We are/will use id queries there to load the document to show.
We require in our frameworks often to specify a unique identifier when rendering a list of rows (i.e. documents), so the framework can keep track of those rows, e.g. when scrolling. We're using the document _ids (in conjunction with indexes) to keep track of them.

weltenwort commented 3 years ago

What @timroes mentioned for discover applies similarly to the Logs UI. We use the _id to disambiguate log entries when fetching them individually.

We rely in Kibana in a lot of places on uniquely identifying objects by _id (and index) at the moment, though we could exchange that by another unique identifier, [...]

Even if we had a different way to uniquely fetch specific docs it would ideally have to be applicable across all types of indices (datastream-backing or not). Otherwise separate code paths could represent an increased maintenance burden for many such single-document use-cases.

jpountz commented 3 years ago

@timroes @weltenwort This is the hardest part of this change indeed. Removing _id and _seq_no from an Elasticsearch index is relatively easy, the hard part will be to migrate applications that rely on IDs to something else.

We have made good progress on other major contributors to storage-induced costs (introduction of the Cold and Frozen tiers, better compression of stored fields, introduction of match_only_text, compression of terms and doc-values dictionaries, runtime fields for rarely queried fields - to name a few). The two main levers for storage efficiency that we have left are the removal of _id and _seq_no, and index sorting, which we are loking into separately.

In my opinion, the high disk footprint of _id and _seq_no makes it worth our time to look into how we stop storing them, as this would help our users save lots of money.

There have been a few ideas how we could make this change easier on applications built on top of Elasticsearch. In some cases, documents can be uniquely identified through a few other fields, e.g. metrics samples can be uniquely identified by the combination of their @timestamp and their dimensions. In such cases, Elasticsearch might be able to effectively stop storing an ID internally while still having a _id field that can be searched and returned in search responses. Hopefully we can find a more general solution for the common case when documents can't be uniquely identified by a combination of other fields.

mattkime commented 3 years ago

While the argument for omitting ids is from a storage perspective is quite clear, the story around reading id-less documents is quite murky. I'd like to turn this around - "I want to read data (documents??) and I DON'T WANT IDS!" - how does that work? What is the user story?

thomasdullien commented 2 years ago

Commenting from our (profiling) perspective: The usual case for profiling data is write-once, retain for a given time period (perhaps 90 days), and read during that time. Data that is a few days old would never need to be changed, but storage efficiency for individual events is crucial.

rockdaboot commented 2 years ago

To add more details to what @thomasdullien is pointing out: For profiling events we currently are at ~45 bytes per events. Without _id and _seq_no we could reduce this to ~17 bytes per event. Querying the data does not involve _id. The time range and a few other properties are used as filter.

nik9000 commented 2 years ago

So tsdb has started thinking a lot about _id lately. Previously we'd been assuming tsdb could go fully _id-less one day but in the short term we'd just use ES's auto-generated _ids. We'd consider removing them a space saving measure. But we learned that Kibana needs _id in a bunch of places. And that we'd need _id to support ingest-time-duplication detection. And deletes. And cross cluster replication.

So we build an _id customized to tsdb. The inverted index costs us about 5.6 bytes per record at the moment. We also store it for now, but we plan to drop that because we can recalculate it on the fly.

And most things should just work with it. Kibana can search by _id. And fetch it. You can delete. And overwrite. The big difference is that you can't specify the _id when indexing - we build it from the document.

Such a path may be an alternative, at least in the short term, to fully _id-less indices. An inferred _id that's comparatively cheap. Still, that's 1/3 of the bytes per event.

For tsdb I've been wondering if we can go further and drop the inverted index on the _id and run the search with for _id with a query on the constituent portions. It wouldn't be super fast. But it may not be that slow either. The @timetamp is part of the _id we generate and it's quite possible to do a precise query on that and get couple dozen candidates. Rechecking them for the right _id wouldn't be too bad. Maybe. I'm not sure! I'd need time to play with it.

elastic / elasticsearch

_id-less indices #48699