Open jpountz opened 5 years ago
Pinging @elastic/es-distributed (:Distributed/Distributed)
We discussed two special kinds of indices that could work without ids:

- Read-only indices would drop the `_id`, `_seq_no` and `_primary_term` fields, since those are not needed for searching. After an index is turned read-only, indexing, GET, delete and update operations would be rejected.
- Append-only indices would only accept new documents (`POST index/_doc`). GET, delete and update operations would be rejected. It would work internally by dropping the `_id`, `_seq_no` and `_primary_term` fields as part of the merge process on all segments whose sequence numbers are part of no retention lease.

Read-only indices would be much simpler to implement and would fit nicely into index lifecycle management. Their main drawback is that savings are delayed to a later time, so this might not address the perception that Elasticsearch uses a lot of disk space, and Elasticsearch would still perform poorly disk-usage-wise in benchmarks unless turned into a read-only index at the end of the benchmark. Append-only indices would address these points but also bring more complexity in order to avoid duplicates caused by client-side retries, or by retries on the coordinating node.
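The retention-lease condition above can be sketched as a small predicate (this is an illustrative model, not Elasticsearch internals; the `Segment` type and field names are hypothetical): a segment may drop the id fields during a merge only if none of its sequence numbers are still covered by a retention lease.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    min_seq_no: int
    max_seq_no: int

def can_drop_id_fields(segment: Segment, min_retained_seq_no: int) -> bool:
    """True if no sequence number in the segment is still retained by a
    lease, i.e. the whole segment lies below the minimum retained seq_no."""
    return segment.max_seq_no < min_retained_seq_no

# With a minimum retained seq_no of 1000:
assert can_drop_id_fields(Segment(0, 999), 1000) is True
assert can_drop_id_fields(Segment(900, 1100), 1000) is False
```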
There is potential for better compression of these fields, but reducing their size by 20% would already be huge and would only reduce the overall index size by 10%, assuming that `_id`/`_seq_no` take 50% of the index size, which is why we are rather looking at removing these fields completely. We will probably want to look into some compression improvements, but this is not the focus of this issue.
Since the relative overhead of `_id`/`_seq_no` is mostly noticeable when documents are very small, a point was made that we could also look into providing better tooling to help e.g. merge several measurements into a single document in order to have fewer, larger documents.
Next steps:
Dropping the `_id` field in the context of data streams makes perhaps more sense than just on regular indices. Currently a data stream only accepts append-only writes, but updates/deletes are allowed directly via the backing indices. We can think about introducing new data stream specializations, so that we can better enforce id-less / append-only writes.
I think there is a place for both kinds of `_id` removal in the context of data streams. In case updating (or client retries) should be allowed for a long period of time, then perhaps a backing index should drop the `_id` when it is effectively read-only (we could then drop `_id`, `_seq_no` and `_primary_term` as part of the force merging performed by ILM). In case client retries aren't that important, the `_id` field can be dropped as soon as the document's seqno is past the minimum retained seqno (so `_id` is available for replica replication and CCR). Even in the latter case, client retries could still be possible if the previous write's seqno is greater than the minimum retained seqno (the window of operations could be kept large if that is required in certain setups).
Both kinds of implementations would be centered around a merge policy that drops the `_id` field and new data stream semantics, so I think having both options available shouldn't add that much work.
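The retry condition in the second option reduces to a one-line predicate (a sketch; the function and parameter names are illustrative):

```python
def retry_possible(previous_write_seq_no: int, min_retained_seq_no: int) -> bool:
    """A client retry can still be deduplicated only while the original
    write's seqno is above the minimum retained seqno, i.e. its _id has
    not yet been dropped by the merge policy."""
    return previous_write_seq_no > min_retained_seq_no

assert retry_possible(5000, 4000) is True    # original write still retained
assert retry_possible(3000, 4000) is False   # _id already dropped; retry unsafe
```

Keeping the retention window large, as suggested above, widens the range of seqnos for which this predicate stays true.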
Some data points regarding how much we can expect to save with this change:

- On nginx.access data, `_id` and `_seq_no` were responsible for respectively 4.5% and 2.5% of disk usage, i.e. a total of 7% of disk usage.
- `_id` and `_seq_no` were responsible for respectively 8% and 5% of disk usage, i.e. a total of 13%.
- On an index containing Elasticsearch logs, `_id` and `_seq_no` were responsible for respectively 25% and 1% of disk usage, i.e. a total of 26%.

> On an index containing Elasticsearch logs, _id and _seq_no were responsible for respectively 25% and 1% of disk usage, i.e. a total of 26%.
I'm curious about this one because the `_id` percentage seems very large. Are Elasticsearch logs ingested with the Elasticsearch module in Beats, or directly ingested with just the `message` field or similar? Is there a place where I can see the test data?
@tsg The data can't be made public but I'll share more details privately with you.
One aspect that I had not considered until now is that dropping `_id` upon merge might help with the indexing rate, as follow-up merges wouldn't have to care about the `_id` field.
I am trying to understand the full impact of that change, and there is one thing not entirely clear to me yet: if we had `_id`-less data streams, what would be the way to uniquely identify a document in them? Or wouldn't there be any way in those data streams anymore?

We rely in Kibana in a lot of places on uniquely identifying objects by `_id` (and index) at the moment. We could exchange that for another unique identifier, but we would have problems if there weren't any unique identifier available for documents at all.
A non-exhaustive list of features currently using `_id` (and that would require some unique identification of documents):

- In Discover, we link to single documents via `/view/<index>/<id>` and, when accessing that view, use an id query to load that document to show. Similarly, we load the document for "Viewing surrounding documents".

What @timroes mentioned for Discover applies similarly to the Logs UI. We use the `_id` to disambiguate log entries when fetching them individually.
> We rely in Kibana in a lot of places on uniquely identifying objects by _id (and index) at the moment, though we could exchange that by another unique identifier, [...]
Even if we had a different way to uniquely fetch specific docs it would ideally have to be applicable across all types of indices (datastream-backing or not). Otherwise separate code paths could represent an increased maintenance burden for many such single-document use-cases.
@timroes @weltenwort This is the hardest part of this change indeed. Removing `_id` and `_seq_no` from an Elasticsearch index is relatively easy; the hard part will be to migrate applications that rely on IDs to something else.
We have made good progress on other major contributors to storage-induced costs (introduction of the Cold and Frozen tiers, better compression of stored fields, introduction of `match_only_text`, compression of terms and doc-values dictionaries, runtime fields for rarely queried fields, to name a few). The two main levers for storage efficiency that we have left are the removal of `_id` and `_seq_no`, and index sorting, which we are looking into separately.
In my opinion, the high disk footprint of `_id` and `_seq_no` makes it worth our time to look into how we could stop storing them, as this would help our users save lots of money.
There have been a few ideas about how we could make this change easier on applications built on top of Elasticsearch. In some cases, documents can be uniquely identified through a few other fields, e.g. metrics samples can be uniquely identified by the combination of their `@timestamp` and their dimensions. In such cases, Elasticsearch might be able to effectively stop storing an ID internally while still having an `_id` field that can be searched and returned in search responses. Hopefully we can find a more general solution for the common case when documents can't be uniquely identified by a combination of other fields.
While the argument for omitting ids is quite clear from a storage perspective, the story around reading id-less documents is quite murky. I'd like to turn this around: "I want to read data (documents?) and I DON'T WANT IDS!" - how does that work? What is the user story?
Commenting from our (profiling) perspective: The usual case for profiling data is write-once, retain for a given time period (perhaps 90 days), and read during that time. Data that is a few days old would never need to be changed, but storage efficiency for individual events is crucial.
To add more details to what @thomasdullien is pointing out: for profiling events we are currently at ~45 bytes per event. Without `_id` and `_seq_no` we could reduce this to ~17 bytes per event. Querying the data does not involve `_id`; the time range and a few other properties are used as filters.
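Using those figures, the relative saving per profiling event would be roughly:

```python
bytes_with_id = 45      # current approximate bytes per profiling event
bytes_without_id = 17   # estimate without _id and _seq_no
saving = 1 - bytes_without_id / bytes_with_id
assert round(saving, 2) == 0.62  # roughly a 62% reduction per event
```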
So tsdb has started thinking a lot about `_id` lately. Previously we'd been assuming tsdb could go fully `_id`-less one day, but in the short term we'd just use ES's auto-generated `_id`s. We'd consider removing them a space-saving measure. But we learned that Kibana needs `_id` in a bunch of places, and that we'd need `_id` to support ingest-time duplication detection. And deletes. And cross-cluster replication.
So we built an `_id` customized to tsdb. The inverted index costs us about 5.6 bytes per record at the moment. We also store it for now, but we plan to drop that because we can recalculate it on the fly.
And most things should just work with it. Kibana can search by `_id`. And fetch it. You can delete. And overwrite. The big difference is that you can't specify the `_id` when indexing; we build it from the document.
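A sketch of the general idea of deriving the `_id` from the document itself (this is not the actual tsdb encoding; the field layout, hash choice and sizes here are illustrative assumptions):

```python
import hashlib
import struct

def derive_tsdb_id(timestamp_millis: int, dimensions: dict) -> bytes:
    """Build a deterministic id from @timestamp plus the document's
    dimensions, so the same document always maps to the same _id and a
    duplicate write overwrites instead of creating a second copy."""
    dim_bytes = "|".join(f"{k}={dimensions[k]}" for k in sorted(dimensions)).encode()
    dim_hash = hashlib.sha1(dim_bytes).digest()[:8]  # truncated for compactness
    return struct.pack(">q", timestamp_millis) + dim_hash

a = derive_tsdb_id(1700000000000, {"host": "web-1", "metric": "cpu"})
b = derive_tsdb_id(1700000000000, {"metric": "cpu", "host": "web-1"})
assert a == b        # dimension order does not matter
assert len(a) == 16  # fixed-size id: 8 timestamp bytes + 8 hash bytes
```

Because the id is a pure function of the document, it can also be recomputed on the fly instead of being stored, which is the plan mentioned above.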
Such a path may be an alternative, at least in the short term, to fully `_id`-less indices: an inferred `_id` that's comparatively cheap. Still, that's 1/3 of the bytes per event.
For tsdb I've been wondering if we can go further, drop the inverted index on the `_id`, and run searches for `_id` with a query on its constituent portions. It wouldn't be super fast, but it may not be that slow either. The `@timestamp` is part of the `_id` we generate, and it's quite possible to do a precise query on that and get a couple dozen candidates. Rechecking them for the right `_id` wouldn't be too bad. Maybe. I'm not sure! I'd need time to play with it.
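The "precise query on `@timestamp` plus recheck" idea could look roughly like this (a self-contained sketch over a hypothetical derived-id scheme; in Elasticsearch the candidate step would be a term/range query on `@timestamp` rather than a Python filter):

```python
import hashlib
import struct

def derive_id(ts_millis: int, dims: dict) -> bytes:
    """Hypothetical derived id: 8 timestamp bytes + 8 bytes of dimension hash."""
    dim_bytes = "|".join(f"{k}={dims[k]}" for k in sorted(dims)).encode()
    return struct.pack(">q", ts_millis) + hashlib.sha1(dim_bytes).digest()[:8]

def find_by_id(docs: list, target_id: bytes) -> list:
    """Resolve an _id without an inverted index on it: use the timestamp
    embedded in the id as a precise @timestamp query, then recheck the few
    candidates by recomputing their ids."""
    ts = struct.unpack(">q", target_id[:8])[0]
    candidates = [d for d in docs if d["@timestamp"] == ts]  # the "fast" filter
    return [d for d in candidates
            if derive_id(d["@timestamp"], d["dims"]) == target_id]  # recheck

docs = [
    {"@timestamp": 1000, "dims": {"host": "a"}},
    {"@timestamp": 1000, "dims": {"host": "b"}},  # same timestamp, other dims
    {"@timestamp": 2000, "dims": {"host": "a"}},
]
wanted = derive_id(1000, {"host": "b"})
assert find_by_id(docs, wanted) == [{"@timestamp": 1000, "dims": {"host": "b"}}]
```

The cost of the recheck stays small as long as a precise `@timestamp` query only yields a couple dozen candidates, which matches the intuition above.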
Various experiments over time have highlighted that the `_id` field, and to some extent the `_seq_no` field, use non-negligible disk space. When indexing tiny documents, these two fields combined can end up using more than 50% of the index size. Are there conditions in which we could enable users to drop the `_id` field in order to save resources?