Support multi-key fields from SemConv

gregkalapos commented 2 months ago

Summary

(Not sure if multi-key fields is the right term; if there is a better short name describing it, let's update the title.)

OTel SemConv defines fields that can have multiple keys. Examples:

- HTTP request headers: http.request.header.<key> (this is already stable)
- Database query parameters: db.query.parameter.<key>

There are two aspects to this:

1. When the OTel SemConv <-> ECS merge happens, how do such fields get into ECS?
2. What should the Elasticsearch mapping for such fields look like?

We discussed this briefly with @felixbarny. Regarding point 2:

- We could use the flattened field type.
- We could set enabled to false.

In APM we have a field called labels with similar dynamic keys, currently with this mapping:

    {
      "labels": {
        "path_match": "labels.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }

The issue with the above is that every distinct key becomes its own mapped field, which leads to field explosion.
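
To illustrate the explosion, here is a minimal sketch using the same dynamic template. The index name labels-explosion-test and the label keys are hypothetical, chosen only for this illustration.

```
# Hypothetical test index using the labels dynamic template from above
PUT labels-explosion-test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

# Each previously unseen key under labels adds a new field to the mapping
POST labels-explosion-test/_doc?refresh
{
  "labels": {
    "team": "search",
    "region": "eu"
  }
}

# The mapping should now list labels.team and labels.region as separate keyword fields
GET labels-explosion-test/_mapping
```

With an unbounded key space, the mapping keeps growing and can eventually run into index.mapping.total_fields.limit (1000 by default).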

gregkalapos commented 2 months ago

Another similar case is http.request.headers and http.response.headers, which already exist in the traces-apm mapping (excerpt from its dynamic templates):

    {
      "http.request.headers": {
        "path_match": "http.request.headers.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    },
    {
      "http.response.headers": {
        "path_match": "http.response.headers.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }

felixbarny commented 2 months ago

> Other similar case is http.request.headers and http.response.headers already existing in the traces-apm mapping:

That's interesting. I wasn't aware that we were adding all HTTP headers to the mapping. @axw did you hear of situations where this led to mapping explosions? It seems a bit dangerous to me, as anyone can send a bunch of requests with unique HTTP headers and force the backend into a field explosion.

> We could use flattened field type

This would also work when subobjects is set to false.


```
DELETE felixbarny-test?ignore_unavailable=true

PUT felixbarny-test
{
  "mappings": {
    "properties": {
      "attributes": {
        "subobjects": false,
        "type": "object",
        "properties": {
          "http.request.header": {
            "type": "flattened"
          }
        }
      }
    }
  }
}

POST felixbarny-test/_doc?refresh
{
  "attributes": {
    "http.request.header.foo": "bar",
    "http.request.header.bar": "baz"
  }
}

GET felixbarny-test/_mapping

POST felixbarny-test/_search

# DELETE felixbarny-test?ignore_unavailable=true 200 OK
{
  "acknowledged": true
}

# PUT felixbarny-test 200 OK
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "felixbarny-test"
}

# POST felixbarny-test/_doc?refresh 201 Created
{
  "_index": "felixbarny-test",
  "_id": "7C2P8I4BRVt7x5oqUVes",
  "_version": 1,
  "result": "created",
  "forced_refresh": true,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

# GET felixbarny-test/_mapping 200 OK
{
  "felixbarny-test": {
    "mappings": {
      "properties": {
        "attributes": {
          "subobjects": false,
          "properties": {
            "http.request.header": {
              "type": "flattened"
            }
          }
        }
      }
    }
  }
}

# POST felixbarny-test/_search 200 OK
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "felixbarny-test",
        "_id": "7C2P8I4BRVt7x5oqUVes",
        "_score": 1,
        "_source": {
          "attributes": {
            "http.request.header.foo": "bar",
            "http.request.header.bar": "baz"
          }
        }
      }
    ]
  }
}
```

> We could set enabled to false.

This doesn't work when subobjects is set to false, because we can't add an object mapper for http.request.header with enabled: false to the mapping in a context where objects are disabled.


```
DELETE felixbarny-test?ignore_unavailable=true

PUT felixbarny-test
{
  "mappings": {
    "properties": {
      "attributes": {
        "subobjects": false,
        "type": "object",
        "properties": {
          "http.request.header": {
            "type": "object",
            "enabled": false
          }
        }
      }
    }
  }
}

# DELETE felixbarny-test?ignore_unavailable=true 200 OK
{
  "acknowledged": true
}

# PUT felixbarny-test 400 Bad Request
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Failed to parse mapping: Object mapper [http.request.header] was found in a context where subobjects is set to false. Auto-flattening [http.request.header] failed because the value of [enabled] is [false]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping: Object mapper [http.request.header] was found in a context where subobjects is set to false. Auto-flattening [http.request.header] failed because the value of [enabled] is [false]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Object mapper [http.request.header] was found in a context where subobjects is set to false. Auto-flattening [http.request.header] failed because the value of [enabled] is [false]"
    }
  },
  "status": 400
}
```

axw commented 2 months ago

> That's interesting. I wasn't aware that we were adding all HTTP headers to the mapping. @axw did you hear of situations where this led to mapping explosions?

I haven't.

> It seems a bit dangerous to me as anyone can just create a bunch of requests with unique HTTP headers and force the backend into a field explosion.

That's a good point; I don't think anyone considered this.

trisch-me commented 2 months ago

@felixbarny should I bring your thoughts about field explosion to the next SemConv meeting? Or, if you prefer, you can create an issue here.

felixbarny commented 2 months ago

I'm not sure if this is an issue with semantic conventions per se. It depends on how backends can deal with namespaces that don't have a bounded number of fields. Some may be more resilient than others when it comes to the total number of fields. For example, some vendors may only store fields of certain namespaces for retrieval but not maintain in-memory metadata about them. What we can do when it comes to mapping http.request.headers.* is to store this as a flattened field type, which avoids creating a field in the mapping for each key, so that there's no risk of a mapping explosion. That field type does come with certain tradeoffs, but they seem reasonable in this case.

Still, I think it would be interesting to get some insight into how other backends intend to deal with these multi-key fields. So if you could bring that up in the next semconv meeting, that would be highly appreciated.
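
The flattened approach described above for http.request.headers.* could look roughly like the following. This is only a sketch: the index name flattened-headers-test, the header names, and their values are hypothetical, and it assumes a regular object context (subobjects left at its default).

```
# Hypothetical test index: all request headers live under one flattened field
PUT flattened-headers-test
{
  "mappings": {
    "properties": {
      "http": {
        "properties": {
          "request": {
            "properties": {
              "headers": {
                "type": "flattened"
              }
            }
          }
        }
      }
    }
  }
}

# Arbitrary header names are accepted without adding new fields to the mapping
POST flattened-headers-test/_doc?refresh
{
  "http": {
    "request": {
      "headers": {
        "accept": "application/json",
        "x-request-id": "abc123"
      }
    }
  }
}

# Individual keys can still be queried using dot notation
POST flattened-headers-test/_search
{
  "query": {
    "term": {
      "http.request.headers.x-request-id": "abc123"
    }
  }
}
```

With this mapping, only http.request.headers appears in the mapping regardless of how many distinct header names are indexed. One of the tradeoffs is that all values inside a flattened field are indexed as keywords.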

elastic / ecs

Support multi-key fields from SemConv #2333