elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.61k stars 24.64k forks source link

Support array values for dimension fields #110387

Open felixbarny opened 2 months ago

felixbarny commented 2 months ago

At the moment, TSDB dimension fields must be single-valued, as documented for the keyword, ip and numeric field types. When providing an array, the following error is returned:

Error extracting routing: Routing values must be strings but found [START_ARRAY]

https://github.com/elastic/elasticsearch/blob/55476041d9daa4e1ccad2b732c3e4e5b0a785194/server/src/main/java/org/elasticsearch/cluster/routing/IndexRouting.java#L371-L375

This limitation is problematic for OpenTelemetry mappings, where all attributes are mapped as dimensions. The problematic thing is that attributes can be multi-valued (for example host.ip).

https://github.com/elastic/elasticsearch/blob/b7e1d5593b42f03aecc387160af6f452c4d25351/x-pack/plugin/otel-data/src/main/resources/component-templates/otel%40mappings.yaml#L20-L24

As a workaround, we could leverage the type information in OTel's data model to create a comma separated list for array-typed attributes, like host.ip, while retaining the type for scalar attributes. However, we then can't create a dynamic template to map *.ip fields to the IP field type. As there's no native type for ip fields in OTel, we therefore don't have a way to map IP fields for OTel metrics.

Alternatively, we could just send the first value but that would lead to silent data loss and we may lose information that uniquely identifies the time series.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)

kkrik-es commented 2 months ago

Stepping back, mapping all attributes as dimensions seems problematic, since they're also used in tsid calculations etc. What's the reasoning behind storing all these IPs along with metric values? Do we anticipate filtering for a particular IP, as opposed to hostname etc? More so, how much value do we get by storing an array of IPs, instead of the first one as suggested, without additional info on what each IP represents?

felixbarny commented 2 months ago

OTel's metrics data model define a time series by an entity consisting of several metadata properties, which includes all attributes.

We can't make a generic assumption that the time series can still be uniquely identified when we just store the first value of an attribute array.

I do agree that in the specific case of the host.ip field, it may not always be an identifying attribute. But at this time, there's no concept of a non-identifying attribute in OpenTelemetry. This may change in the future and in that case we can store these non-identifying attributes in a namespace that is not a container for dimensions (time_series_dimension: false). However, that doesn't change the fact that there can still be multi-valued identifying attributes. It's something that the OTel data model allows for but the TSDS data model doesn't, which creates some kind of impedance mismatch when trying to map OTel metrics to TSDS.

While I do think that we need to allow for multi-valued dimensions at some point, I also think that we can live with the workaround of storing a comma-separated string and having no support for automatically mapping IP fields as the ip field type.

kkrik-es commented 2 months ago

Makes sense, thanks. While we do want to eventually support all corner cases, having concrete data about how much of an issue this is in practice can help us prioritize it.

felixbarny commented 2 months ago

I don't have data on that at the moment but I don't expect this to be a critical issue. We'll probably learn more along the way and maybe users can chime in here when and if this becomes a bigger problem than anticipated. As long as we're aligned that we want to fix this eventually, I'm good with putting this on the shelf for the time being.