Feature request: structural object type matching in dynamic templates

axw commented 4 years ago

Background

Over in Elastic APM we're investigating extending our support for application metrics. We already support recording simple, single-value metrics. We're now looking at adding support for histograms, which would be indexed using the histogram field type. Later we may want to add support for ingesting simpler summary metrics, indexed using the proposed aggregate_metric field type.

Problem

Metric names are not known (to us) up front, so we rely on dynamic mapping to give the fields the appropriate field type. Until now we've gotten away without any dynamic templates, as the metrics we support now are all simple single-value ones (counters/gauges).

Because histograms are indexed as an object with a conventional structure ({"values": [...], "counts": [...]}), the field mapping must first be explicitly defined or we'll end up with two distinct numeric fields.

Proposed solution

Extend dynamic templates with a means of matching objects (i.e. "match_mapping_type": "object") only when their fields match some criteria. Something along these lines:

PUT metrics
{
  "mappings": {
    "dynamic_templates": [
      {
        "histograms": {
          "match_mapping_type": "object",
          "match_mapping_fields": {
            "values": {"match_mapping_type": "double"},
            "counts": {"match_mapping_type": "long"}
          },
          "mapping": {
            "type": "histogram"
          }
        }
      }
    ]
  }
}

Ideally there would be a way of specifying that the set of fields is complete, along the lines of "dynamic": "strict" in mappings. For aggregate_metric fields we might

Alternatives

We could impose a path naming scheme, where histogram metrics are given a final field name of "histogram", and then use path_match: "*.histogram" and match_mapping_type: object. This would be a bit of an ugly artefact, particularly as it would end up in the UI through search completions, Metric Explorer dropdowns, etc. Ideally the name would be exactly what the user defined in the first place, which is what we store today.

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Mapping)

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/Features)

jpountz commented 4 years ago

We discussed this issue and were a bit on the fence about the fact that it might accidentally map an object as a histogram if it happens to have two sub fields that are called values and counts even though it's not a histogram.

Furthermore, my understanding is that the client that sends documents to Elasticsearch knows that these are histograms. If this is the case, why can't it also configure the mappings correctly when setting up the index or index template?

axw commented 4 years ago

The client (APM Server) knows that the field is a histogram, but the field name is not known up front - it is dynamically controlled by applications that send data to APM. To take care of this in a static mapping, APM users would have to know which metrics their applications will send, and then configure the index mapping accordingly.

In the case of APM we are in control of the object structure - it's just the field names we don't control. So misuse/accidental mapping will not be an issue in our case.

jpountz commented 4 years ago

Would it be an option for you to send a mapping update alongside the bulk request?

axw commented 4 years ago

I had a chat with @graphaelli about this. That could potentially work, and is attractive in that it would also enable us to record per-field metadata. We don't need that any time soon though.

There are some significant drawbacks:

We would have to send an update on every bulk request to account for rollovers, doubling the number of requests we make.
Alternatively we could update the index template only upon seeing new fields, but this would cause the template to be ever-growing. This could be problematic if metric fields change over time.
Either way, APM Server (as with all Beats) supports sending data to targets other than Elasticsearch, such as Logstash. If we were sending to Logstash and from there to Elasticsearch, then we would have to manage a separate connection to ES to perform the mapping updates first -- adding configuration/deployment complexity for users.

Another option that @graphaelli thought of is extending Ingest Node, so that it could specify the field type of a field. I think this has the same outcome as the original proposal though.

graphaelli commented 4 years ago

extending Ingest Node, so that it could specify the field type of a field. I think this has the same outcome as the original proposal though.

I think the outcome is different in that the field name could be controlled by the user. Perhaps ingest node uses the original field name to set the type and then renames it before it is finally ingested. This might look like:

processor input:

      "_source": {
        "system": {
          "disk": {
            "free": 9491243008,
            "total": 12565708800,
            "iolat.histogram": {
              "values": [1,3,7,15,31,63],
              "counts": [0,12,80,2,0,1]
            }
          }
        }
      }

processor output:

      "_source": {
        "system": {
          "disk": {
            "free": 9491243008,
            "total": 12565708800,
            "iolat": {
              "_type": "histogram",
              "values": [1,3,7,15,31,63],
              "counts": [0,12,80,2,0,1]
            }
          }
        }
      }

graphaelli commented 4 years ago

Revisiting this and wanted to make it clear that the original proposal is also my preference, I fear I may have muddled things with the alternative proposal.

might accidentally map an object as a histogram if it happens to have two sub fields that are called values and counts even though it's not a histogram.

In this case the user (APM) is opting in to this functionality by setting up the dynamic mapping, I think that avoids the accident situation.

markharwood commented 4 years ago

I think there's a broader issue here.

JSON value types are a poor predictor of elasticsearch type

The proposal is reliant on the shape of the JSON value dictating the type of the new field. Adrien's concern that values+counts might be presented for values other than histograms may be an edge case but there is a much more common precedent for JSON values being a poor predictor of type - the humble numeric fields are often mis-mapped. Numerics can represent:

Quantities of something (time, bytes, clicks etc). Typically queried by range queries and aggregated using histograms to display in bar charts.
Identifiers e.g. customerIDs or department codes. Typically queried for by exact-match queries and aggregated using aggs like terms or adjacency_matrix.

Our templates (and many users) assume that a JSON long should be mapped as an elasticsearch long but this is optimised for Quantities not Identifiers. This type imposes a 4x speed slow-down on exact-match lookups.

This is why I think it's generally problematic to trigger mapping decisions purely from value type.

Field-naming conventions and ingest pre-processors

JSON only has field names and values so if we don't want to use the value as the key for determining a new field's type we have to consider the name and using a naming convention.

As has been suggested - an ingest pipeline could more intelligently pre-process the names and offer new type guidance directly to elasticsearch without the worry of passing this through various intermediaries (the beats->logstash processing pipeline concerns raised above ).

axw commented 4 years ago

A bunch of us in the Elastic Observability team met yesterday and discussed this issue, and the broader issue we're setting out to address: to standardise how we index metrics across Metrics, APM, etc.

@exekias put forward an alternative proposal, which would index metrics using a naming convention like:

metric_name.value (gauge/point-in-time value metric)
metric_name.counter (counter/monotonically-increasing value metric)
metric_name.rate (rate/delta value metric)
metric_name.histogram (histogram metric)

The metric type informs Elasticsearch how it should be stored, and informs the UI how it should be interpreted. Value/counter/rate metrics would have the same field type, but would be interpreted differently by the UI.

We would use a dynamic template with match to set the field type, and set metric type in field metadata. Using a naming convention like this is nice in that it works today, and is fairly straightforward, but has a couple of downsides:

it exposes internal details to users
it hard-codes the metric type, whereas we might want to be able to switch metric types dynamically, to fix up coding/configuration errors – for a subset of metric types with the same fundamental data type, i.e. counters and gauges

Alternatively we could not suffix the fields, and only record the metric type in field metadata. We also want to be able to set other field metadata, such as the metric's unit: bytes, seconds, etc. This approach requires an alternative means of dynamically setting the field type.

We're thinking that it would be better to be more explicit about setting the field type and field metadata. Ideally, we would have some way of sending, along with a field value:

the Elasticsearch field type
field metadata to set, with an option of ignoring it if it's already set (so users can change field metadata and not have it overwritten)

e.g. index a document like:

{
  "latency": {
    "type": "histogram",
    "unit": "seconds",
    "counts": [1,2,3],
    "values": [4,5,6]
  }
}

Then we would have a script ingest processor that would process all fields with the above structure, and:

remove the "type" field
- set the field type to "histogram"
- if it doesn't already exist in the mapping, add the field metadata "type": "histogram"
remove the "unit" field
- add the field metadata "unit": "seconds"

Is this within the realms of possibility?

graphaelli commented 4 years ago

@exekias @roncohen This is the issue we need to resolve to move forward with a comprehensive metrics strategy IMO.

@markharwood are you the right assignee on this issue? How can we move this forward?

jpountz commented 4 years ago

@graphaelli Sorry for not being more responsive on this one, please use me as a point of contact for this issue for now. I probably won't end up implementing it but I'll own moving it forward.

Some more thoughts on this issue:

the original proposal is also my preference

I understand it would work for histograms, but how would it works for other types? I understand from above comments that you also need this for values, counters and rates. How would this work if you needed a different field type from counters and rates, wouldn't you be stuck due to the fact that both are numbers in JSON documents?

we could update the index template only upon seeing new fields, but this would cause the template to be ever-growing. This could be problematic if metric fields change over time.

This might be worth digging. If the list of fields across all indices that match the pattern might grow indefinitely, we might have other problems too.

[ingest node idea]

Ingest is currently side-effect free, it only takes bulk requests and turns them into new bulk requests. So we'd likely need to add something to _bulk requests as well if we went that route.

I'll set up a meeting to discuss it, I think it'll help us better undestand the problem we're trying to fix here.

jpountz commented 4 years ago

We discussed the challenge of creating dynamic mappings for histograms earlier today. This problem arises for histograms because there is no reliable way to distinguish a histogram from an object in a JSON document, but the general problem is actually not specific to histograms. More generally, configuring fields in an index template is challenging for APM and some Metricbeat modules such as Prometheus and Statsd.

For instance, for Prometheus metrics, Metricbeat polls metrics from Prometheus and then sends them to Elasticsearch. But new metrics could be added anytime, and there is no easy way for Metricbeat to keep state in order to know whether a metric is seen for the first time. It makes it hard to maintain an index template. So APM and Metricbeat have been relying on dynamic mappings until now, which can't easily work with histograms since histograms are JSON objects. By the way, dynamic mappings are not a fully satisfactory approach for numeric fields either, as relying on dynamic mappings prevents setting metadata on fields such as units.

The fact that fields are not under control might sound worrisome due to the fact that Elasticsearch doesn't perform well with thousands of fields, but systems that we pull metrics from such as Prometheus have similar constraints, so we feel that concerns around the number of fields in a given index can be safely ignored.

One workaround that would work would be to adopt naming conventions that make it easy to configure dynamic templates, such as mapping all fields suffixed by ".histogram" as histograms. However this makes field names artificially long, and arguably a bit ugly. It will also likely prove too limited once we need to not only configure the type via a naming convention, but also units and metric type (gauge/counter). So we would like to look into other ways that this problem could be addressed.

One idea we discussed that would work too would be to introduce a way to provide dynamic templates along with bulk requests. This would allow atomicity, which is important because of time-based indices (soon, data streams) as we want to make sure that the mappings that get appended to are those of the index where documents get appended, and not a different one because a rollover happened after the mapping update but before documents got indexed. Dynamic templates are appealing because by nature they can only add new fields to mappings, unlike a generic mapping update which can also modify existing mappings. These dynamic templates would only apply to a specific bulk request, they wouldn't be added to the mappings.

I'll soon close this issue in favor of a new issue that better reflects the problem that we are trying to solve.

exekias commented 4 years ago

ey @jpountz is there any update on this?

jpountz commented 4 years ago

@exekias No, it's been added to the list of things to look into next and hasn't been picked up yet. Cc @colings86 who I was discussing this with recently.

jpountz commented 4 years ago

Closing in favor of #61939.

elastic / elasticsearch