elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Challenge of one metric per document #91775

Open ruflin opened 1 year ago

ruflin commented 1 year ago

Metricbeat and Elastic Agent ship multiple metrics with the same dimensions in a single document to Elasticsearch (example system memory doc). Having multiple metrics in a single document was the preferred way for Elasticsearch when Metricbeat was created. It is also nice for users to see many related metrics in a single document. But having many metrics in a single document only works as long as we are in control of the collection.
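
To make this concrete, here is a trimmed sketch of such a multi-metric document (field names and values are illustrative and loosely follow the system memory metricset; they are not copied from the linked example):

{
    "@timestamp": "2022-11-21T10:00:00.000Z",
    "host": {
        "name": "host-1"
    },
    "system": {
        "memory": {
            "total": 17179869184,
            "free": 4294967296,
            "used": {
                "bytes": 12884901888,
                "pct": 0.75
            }
        }
    }
}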

Many time series databases think of a single metric with labels as a time series. For example, Prometheus defines it as follows:

Every time series is uniquely identified by its metric name and optional key-value pairs called labels.

During the collection of Prometheus and OTel metrics, Metricbeat tries to group together all the metrics with the same labels so that it does not create one document per metric. In some cases this works well, in others not so much. In our data model, labels are mostly used for metadata; in Prometheus, labels are used as dimensions of metrics, where we instead use the metric names for that.

Looking at the following Prometheus example, {collector=*} is a label for the metric node_scrape_collector_duration_seconds. Each value of the label is unique to the metric, so our conversion creates one document for each entry.

# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="boottime"} 3.007e-05
node_scrape_collector_duration_seconds{collector="cpu"} 0.001108799
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001491901
node_scrape_collector_duration_seconds{collector="filesystem"} 0.001846135
node_scrape_collector_duration_seconds{collector="loadavg"} 0.001142997
node_scrape_collector_duration_seconds{collector="meminfo"} 0.000129408
node_scrape_collector_duration_seconds{collector="netdev"} 0.004051806
node_scrape_collector_duration_seconds{collector="os"} 0.00528108
node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.001323725
node_scrape_collector_duration_seconds{collector="textfile"} 0.001216545
node_scrape_collector_duration_seconds{collector="thermal"} 0.00207716
node_scrape_collector_duration_seconds{collector="time"} 0.001271639
node_scrape_collector_duration_seconds{collector="uname"} 0.000994093

This would result in documents similar to:

"prometheus": {
    "labels": {
        "instance": "localhost:9100",
        "job": "prometheus",
        "collector: "boottime"
    },
    "metrics": {
        "node_scrape_collector_duration_seconds": 3.007e-05
    }
}

In our metrics format it would look more like:

"metrics": {
    "labels": {
        "instance": "localhost:9100",
        "job": "prometheus",
    },
    "node_scrape_collector_bootime_duration_seconds": 3.007e-05
    "node_scrape_collector_cpu_duration_seconds": 0.001108799
    ...
}

As you can see above, the label made it into the metric name, but not necessarily in a predictable place. This becomes more obvious when there are many labels:

# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="//ruflin@111.111.111.111/ruflin",fstype="smbfs",mountpoint="/Volumes/.timemachine/111.111.111.111/212-1FFF-51FF-84F2/ruflin"} 3.952268886016e+12
node_filesystem_avail_bytes{device="/dev/disk1s1s1",fstype="apfs",mountpoint="/"} 6.62352285696e+11
node_filesystem_avail_bytes{device="/dev/disk1s2",fstype="apfs",mountpoint="/System/Volumes/Data"} 6.62352285696e+11
node_filesystem_avail_bytes{device="/dev/disk1s3",fstype="apfs",mountpoint="/System/Volumes/Preboot"} 6.64204152832e+11
...

How should we group this magically into a single document? Even in our scenario we send one document per file system, but we group together multiple metrics for the same file system.
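
For reference, a rough sketch of that per-file-system grouping (field names approximate the system filesystem metricset; values are illustrative):

{
    "@timestamp": "2022-11-21T10:00:00.000Z",
    "system": {
        "filesystem": {
            "device_name": "/dev/disk1s1s1",
            "mount_point": "/",
            "type": "apfs",
            "total": 994662584320,
            "available": 662352285696,
            "used": {
                "bytes": 332310298624,
                "pct": 0.334
            }
        }
    }
}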

Push metrics

In more and more scenarios, like OTel or Prometheus remote write, we don't collect the metrics; instead, metrics are pushed to us. In these scenarios, we lose control over which metrics we receive and in which order. Some caching would still allow us to group together some of the metrics but, as described above, not all. The number of documents increases further.

Too many docs

The number of metric documents keeps increasing, 10x to 100x compared to Metricbeat, even though we combine metrics, and the number of metrics per document keeps decreasing. In some scenarios we have reached 1 metric per document. But historically, Elasticsearch did not handle the overhead of single-metric docs well.

I'm creating this issue to raise awareness within the Elasticsearch team and to discuss potential solutions.

Links

martijnvg commented 1 year ago

Let's say each metric is always ingested as a single document (so the metric name would be a dimension). How many times more documents would be ingested, for example in this node scraper integration? In the second JSON example that you shared, I think we would end up with sparse fields, which isn't ideal either. In the first case we would end up with denser fields.

ruflin commented 1 year ago

What I posted above is a cut-out of the full output from the node exporter; I only took a very small subset of the metrics to make the problem easier to explain. To get a better sense of how many more documents we would be talking about when shipping one document per metric, let's have a look at some existing metricsets:

There are smaller and larger metricsets, but I think we are usually talking about 10x - 100x the number of documents if we had 1 metric per document.

If we follow an approach of only 1 metric per document, there is also the option to fundamentally change the structure to something like:

{
    "metric_name": "node_scrape_collector_duration_seconds",
    "value": "3.007e-05"
}

It would fundamentally change how we query for metrics.
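
As a hypothetical sketch of what querying could look like with that structure (the metrics-* index pattern and the aggregation are made up for illustration, and it assumes value is mapped as a numeric field), every query would need an extra filter on metric_name:

GET metrics-*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                { "term": { "metric_name": "node_scrape_collector_duration_seconds" } },
                { "range": { "@timestamp": { "gte": "now-15m" } } }
            ]
        }
    },
    "aggs": {
        "avg_value": { "avg": { "field": "value" } }
    }
}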

@martijnvg Can you elaborate more on dense vs sparse (with examples) and what the related challenges are in this context?

martijnvg commented 1 year ago

Are the mentioned metrics also from integrations that push metrics to us?

It would fundamentally change how we query for metrics.

True, this would result in very different queries. I do think labels / dimensions should be included in that structure? Otherwise we would need join-like functionality?
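
For example, a sketch of that structure with the dimensions included (the @timestamp field is added for completeness and the label values are reused from the earlier example; purely illustrative):

{
    "@timestamp": "2022-11-21T10:00:00.000Z",
    "metric_name": "node_scrape_collector_duration_seconds",
    "value": 3.007e-05,
    "labels": {
        "instance": "localhost:9100",
        "job": "prometheus",
        "collector": "boottime"
    }
}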

Can you elaborate more on dense vs sparse (with examples) and what the related challenges are in this context?

Actually this used to be problematic in Lucene, but not today. A dense field is a field that every document in an index has a value for. A sparse field is a field that not every document has a value for. Sparse fields used to take as much disk space as dense fields, and as much CPU time during merging. This is no longer the case. So I don't think we need to worry about this.
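
As a small illustration, reusing metrics from the examples above: with one metric per document, each metric field only has a value in a fraction of the documents (sparse), while a field like @timestamp is dense:

{ "@timestamp": "2022-11-21T10:00:00Z", "metrics": { "node_scrape_collector_duration_seconds": 3.007e-05 } }
{ "@timestamp": "2022-11-21T10:00:00Z", "metrics": { "node_filesystem_avail_bytes": 6.62352285696e+11 } }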

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

felixbarny commented 1 year ago

For a time series database, I think it's appropriate to not have an identity or even a dedicated document for the individual data points within a time series.

Could we optimize time series data streams further if we didn't need to support operations that rely on the document's identity, such as the single document APIs?

For metric values, we only need them in doc_values. The rest is time series dimensions or tags which need to be indexed and added to doc_values. I don't know enough about what the overhead of a document is in Lucene but could we optimize by only storing the metadata for a time series once and by only adding metrics to doc_values instead of creating dedicated documents for them?

Maybe that's effectively what we're doing with doc_value only mappings for metrics, synthetic source, and _id-less indices. But maybe there's more we can do by not thinking of individual data points as documents.
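
For reference, a rough sketch of how parts of that could be expressed with today's mappings (the template name, index pattern and field names are illustrative, and the exact settings may differ by version): dimensions marked as time series dimensions, the metric kept as a doc-values-only field via "index": false, and synthetic source enabled.

PUT _index_template/metrics-sketch
{
    "index_patterns": ["metrics-sketch-*"],
    "data_stream": {},
    "template": {
        "settings": {
            "index.mode": "time_series",
            "index.routing_path": ["labels.instance", "labels.job"]
        },
        "mappings": {
            "_source": { "mode": "synthetic" },
            "properties": {
                "@timestamp": { "type": "date" },
                "labels": {
                    "properties": {
                        "instance": { "type": "keyword", "time_series_dimension": true },
                        "job": { "type": "keyword", "time_series_dimension": true }
                    }
                },
                "metrics": {
                    "properties": {
                        "node_scrape_collector_duration_seconds": {
                            "type": "double",
                            "index": false
                        }
                    }
                }
            }
        }
    }
}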

jpountz commented 1 year ago

For the record, here are the downsides of having one metric per document that I can think of:

  * Higher disk footprint, because fields like _id, _seq_no, @timestamp and dimensions need to be stored multiple times (once per metric) instead of just once.
  * Plus metric fields become sparse, and Lucene requires additional data structures to keep track of which docs have a field vs. which docs don't.

One interesting property that doesn't solve the problem is that all these issues would disappear with a downsampling operation: the downsampled index of an index that stored one metric per document or the same index that stored multiple metrics per document would be exactly the same.

For the record, another approach could consist of having one data stream per metric. This would remove issues around sparse fields and query performance, but it still has the same issue with the higher disk and indexing overhead, and may introduce more issues with high numbers of indexes that we'd like to avoid. One metric per document feels like it's a better option for us than one metric per data stream.

Being a bit creative, it may be possible for Elasticsearch to automatically "fix" sparseness by hooking into the merging process: when some input segments have sparse metrics, Elasticsearch could create a view around the incoming segments for the merge that would collapse adjacent documents (the fact that TSDB uses index sorting is important here) with the same dimensions and timestamps but different sets of metrics. This would not address the issue around slower indexing, but it would address issues with storage efficiency and query performance.
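
To illustrate the idea with made-up values (the metric names are borrowed from node_exporter; timestamps and dimensions are invented), the merge could see two adjacent single-metric docs like:

{ "@timestamp": "2022-11-21T10:00:00.000Z", "labels": { "instance": "localhost:9100", "job": "prometheus" }, "metrics": { "node_memory_MemFree_bytes": 4294967296 } }
{ "@timestamp": "2022-11-21T10:00:00.000Z", "labels": { "instance": "localhost:9100", "job": "prometheus" }, "metrics": { "node_load1": 0.42 } }

and the view would present them to the merge as a single collapsed document:

{ "@timestamp": "2022-11-21T10:00:00.000Z", "labels": { "instance": "localhost:9100", "job": "prometheus" }, "metrics": { "node_memory_MemFree_bytes": 4294967296, "node_load1": 0.42 } }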

ruflin commented 1 year ago

One interesting property that doesn't solve the problem is that all these issues would disappear with a downsampling operation: the downsampled index of an index that stored one metric per document or the same index that stored multiple metrics per document would be exactly the same.

This is interesting. It would be nice if Elasticsearch could do something like this in real time during ingestion, for example grouping all the metrics with the same dimensions within the same second?

For the record, another approach could consist of having one data stream per metric.

I prefer to think of it as one index per metric behind a data stream, so that from a user perspective it is an implementation detail how the data is persisted. But as there is a big chance that many metrics will only show up a few times, I expect we would end up with too many small indices.

"fix" sparseness by hooking into the merging process

Sounds similar to the approach mentioned before, like having 1s downsampling, but here it happens after ingestion instead of during ingestion. One important bit is that we would need to ensure on the edge that all separate metric docs that could potentially be merged have an identical timestamp.

StephanErb commented 1 year ago

If we follow an approach of only 1 metric per document, there is also the option to fundamentally change the structure to something like: { "metric_name": "node_scrape_collector_duration_seconds", "value": "3.007e-05" }

This would have the nice property that I can query time series via metric names. For example, I can easily and simultaneously query all metrics starting with node_*. This is super helpful when troubleshooting, where one has to explore what metrics are available for a target. Prometheus also works similarly underneath, as it maps the metric name to a __name__ label which is then indexed like any other label.
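
For example, a sketch of such an exploratory query, assuming the metric_name field from the structure quoted above is indexed as a keyword (the index pattern is made up):

GET metrics-*/_search
{
    "size": 0,
    "query": {
        "prefix": { "metric_name": "node_" }
    },
    "aggs": {
        "metric_names": {
            "terms": { "field": "metric_name", "size": 100 }
        }
    }
}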

But I guess this 1 metric per document approach could pose some cost challenges.

If you look at Prometheus's TSDB or Facebook's Gorilla, they are able to compress a metric sample (8-byte timestamp + 8-byte value) down to 1-2 bytes of storage space. They need roughly 1 bit per timestamp (via double-delta encoding) and the rest for the compressed metric value and index structures.

In comparison: in our Metricbeat index with a mixture of Prometheus and Kubernetes metrics, we are seeing ~10 metrics per document with a size of ~180 bytes. We are not using TSDB yet. I would assume TSDB could bring us down to 60 bytes per document, so about 6 bytes per metric. So even with TSDB we'd still be a few times less efficient than these other time series databases. This gap would probably widen if we moved to one metric per document.

felixbarny commented 1 year ago

Higher disk footprint, because fields like _id, _seq_no, @timestamp and dimensions need to be stored multiple times (once per metric) instead of just once.

With https://github.com/elastic/elasticsearch/pull/92045, the overhead of the @timestamp is kept very low due to double delta compression. In the future, we'd ideally not even create _id and _seq_no for time series data (https://github.com/elastic/elasticsearch/issues/48699). I think that will be the game changer that mitigates the challenges described in this issue.

Plus metric fields become sparse, and Lucene requires additional data structures to keep track of which docs have a field vs. which docs don't.

What's the overhead of sparse fields, and what does the on-disk structure look like for sparse vs dense doc values?

One interesting property that doesn't solve the problem is that all these issues would disappear with a downsampling operation: the downsampled index of an index that stored one metric per document or the same index that stored multiple metrics per document would be exactly the same. [...] Elasticsearch could create a view around the incoming segments for the merge that would collapse adjacent documents (the fact that TSDB uses index sorting is important here) with the same dimensions and timestamps but different sets of metrics.

During ingest, APM Server is already doing that for OTLP: Metrics with the same set of dimensions are stored in the same document. The issue is that the metrics make more use of dimensions which implies that not as many metrics can be packed into the same document. I don't think downsampling or merging documents in ES would help.

jpountz commented 1 year ago

In the future, we'd ideally not even create _id and _seq_no for time series data

That would be fantastic.

What's the overhead of sparse fields, and what does the on-disk structure look like for sparse vs dense doc values?

Currently, sparse doc-values fields encode the set of documents that have a field in a similar way to roaring bitmaps. See the javadocs of https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java, which give a bit more detail. Dense doc-values fields just store a flag in the field's metadata that says that the field is dense. At search time, such fields get a specialized implementation that skips the "does this doc have a value" checks, and some queries go a bit further, e.g. exists queries may rewrite to match_all queries, which may in turn enable further simplifications of queries.

So space-wise, the overhead of sparse fields can be very low (hardly noticeable) if docs that have the same fields are grouped together, and up to one bit per field per doc in the shard. I don't have a good intuition for the search-time CPU overhead of sparse fields, however; it probably depends highly on the queries.

felixbarny commented 1 year ago

the overhead of sparse fields can be very low (hardly noticeable) if docs that have the same fields are grouped together

That's good news. Due to the default index sorting in TSDB, docs with the same fields are almost always grouped together (except for cases when a new metric gets added). I think this means that they would be stored in a dense range in Lucene even though the docs across the whole index may have sparse values, right?

jpountz commented 1 year ago

If documents that have the same dimension values also have the same metric fields, then the TSDB index sorting should help keep the overhead low indeed. Lucene has a nightly benchmark that verifies that: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_size. It indexes the same dataset both in a dense way, and in a sparse way by giving yellow cabs and green cabs different field names, and checks index sizes of dense vs. sparse vs. sparse sorted by cab color. As you can see the space efficiency is the same in the dense and sparse sorted cases.

wchaparro commented 1 year ago

Feedback provided, closing.

martijnvg commented 1 year ago

@ruflin @felixbarny Is there anything that we need to do now for this issue? I think removing the _id / seq_no is covered in another issue.

ruflin commented 1 year ago

Feedback provided, closing.

What is the conclusion? Would be great if it could be summarised here as a final comment.

@martijnvg We need a summary on what the recommendation is. Also if there are follow up issues, can you link them here?

Most important from my side is that the Elasticsearch team is aware that more and more single metrics will be ingested and that TSDB needs to be able to deal with this in a good way.

martijnvg commented 1 year ago

Summary of the current downsides of the one metric per document approach:

  1. Slower indexing rate, because each metric will have its own _id, _seq_no and @timestamp fields.
  2. Higher disk space usage, for the same reason as point 1.
  3. Queries would be slower, because documents of all metrics would match, so an additional filter clause is required to select only the documents for the requested metric(s).

Trimming or removing the _id / _seq_no fields only solves part of challenges 1 & 2. The @timestamp field will still be indexed individually for each metric, but due to the delta-of-delta encoding, the storage overhead on disk would be small.

Moving to a data stream per metric could help with challenge 3, but we would likely run into another scaling issue: the number of shards/indices/data streams. Although each data stream would then have a much smaller mapping than today.

felixbarny commented 1 year ago

Queries would be slower, because documents of all metrics would match. So an additional filter clause is required to filter out the documents for the requested metric(s).

Maybe this is where a dedicated metrics query language (ESQL, PromQL) that automatically adds a filter on the metric name could help.

Moving to a data stream per [metric] could help

When creating that many different data streams, it would also make data management tasks, such as managing retention and permissions, very difficult. Especially when you want to manage that not on a per-metric basis but on a per-service or per-team basis.

martijnvg commented 1 year ago

Maybe this is where a dedicated metrics query language (ESQL, PromQL) that automatically adds a filter on the metric name could help.

True, but we still need an additional filter that we wouldn't need otherwise.

When creating that many different data streams, it would also make data management tasks, such as managing retention and permission very difficult. Especially when you don't want to manage that on a per-metric but on a per-service or per-team basis.

Good point. This would make authorisation much more complex. Another reason against a data stream per metric.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)