Add support for `subobjects: false`

ruflin commented 2 years ago

In https://github.com/elastic/elasticsearch/pull/86166 Elasticsearch introduced the subobjects: false parameter. This allows ingest flat key/value pairs that had in the past conflicts like host and host.name.

package-spec and Fleet should support this option in templates.

At first, the feature should be used with care in packages as the objects must be flattened before ingestion (see also https://github.com/elastic/apm/issues/347 for more details). At the moment ingestion tools like Beats often follow the opposite approach and expand all objects. There are potential future improvements that solve this problem: https://github.com/elastic/apm/issues/347#issuecomment-1140947927

From https://github.com/elastic/package-spec/issues/425, we need to ensure that at least one of the following mappings produce expected results, so this is supported.

- name: prometheus.labels.*
  type: object
  object_type: keyword
  subobjects: false

Should produce:

A dynamic mapping for prometheus.labels.*, with type keyword.
A static mapping for prometheus.labels, with type: object and subobjects: false.
name: prometheus.labels type: object subobjects: false

Should produce a mapping for prometheus.labels with type: object and subobjects: false, independently of the existence of other related dynamic mappings.

Maybe this already works, but we would need to check it.

### Tasks
- [ ] https://github.com/elastic/package-spec/pull/573
- [ ] https://github.com/elastic/kibana/pull/171826
- [ ] https://github.com/elastic/package-spec/pull/727

ruflin commented 2 years ago

Having subobject: false could have a broader impact on how we do things. Few thoughts:

For fields.yml, should we flatten these by default?
Should we flatten fields for component templates in Elasticsearch loaded by fleet for every object that does not have special settings on the object level? @kpollich
Should we ship data as mostly key/value pairs to Elasticsearch from Elastic Agent instead of the nested objects @ph @cmacknz
Can we remove most of the common.MapStr magic inside Beats / Agent?

We don't need to necessarily have all these discussions here but worth referencing. I'm excited about the potential future opportunities this offers. (@joshdover )

jsoriano commented 2 years ago

This is great, and I think that in many of the cases where we would take more advantage of this, basically where we dedot now, the field is already flattened, because we send it as is read from origin.

I think we could add support for this, and introduce it eventually in a case per case basis.

At first, the feature should be used with care in packages as the objects must be flattened before ingestion

We can add a check for this in elastic-package tests, as we have for other field values.

In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?

ph commented 2 years ago

This is indeed interesting, so that would also simply all the Schema in metricbeat.

kpollich commented 2 years ago

Should we flatten fields for component templates in Elasticsearch loaded by fleet for every object that does not have special settings on the object level?

I'd like to just solidify my understanding of the relation between the package spec and component templates generated by Fleet. The fields YML files in a given integration control the mappings Fleet generates for component templates, correct?

e.g. nginx/data_stream/access/fields/agent.yml defines a host mapping here on the logs-nginx.access@package eventually generated by Fleet.

The host fields correspond as such:

Show code blocks

```yml # agent.yml - name: host title: Host group: 2 description: 'A host is defined as a general computing instance. ECS host.* fields should be populated with details about the host on which the event happened, or from which the measurement was taken. Host types include hardware, virtual machines, Docker containers, and Kubernetes nodes.' type: group fields: - name: architecture level: core type: keyword ignore_above: 1024 description: Operating system architecture. example: x86_64 - name: domain level: extended type: keyword ignore_above: 1024 description: 'Name of the domain of which the host is a member. For example, on Windows this could be the host''s Active Directory domain or NetBIOS domain name. For Linux this could be the domain of the host''s LDAP provider.' example: CONTOSO default_field: false - name: hostname level: core type: keyword ignore_above: 1024 description: 'Hostname of the host. It normally contains what the `hostname` command returns on the host machine.' - name: id level: core type: keyword ignore_above: 1024 description: 'Unique host id. As hostname is not always unique, use values that are meaningful in your environment. Example: The current usage of `beat.name`.' - name: ip level: core type: ip description: Host ip addresses. - name: mac level: core type: keyword ignore_above: 1024 description: Host mac addresses. - name: name level: core type: keyword ignore_above: 1024 description: 'Name of the host. It can contain what `hostname` returns on Unix systems, the fully qualified domain name, or a name specified by the user. The sender decides which value to use.' - name: os.family level: extended type: keyword ignore_above: 1024 description: OS family (such as redhat, debian, freebsd, windows). example: debian - name: os.kernel level: extended type: keyword ignore_above: 1024 description: Operating system kernel version as a raw string. example: 4.4.0-112-generic - name: os.name level: extended type: keyword ignore_above: 1024 multi_fields: - name: text type: text norms: false default_field: false description: Operating system name, without the version. example: Mac OS X - name: os.platform level: extended type: keyword ignore_above: 1024 description: Operating system platform (such centos, ubuntu, windows). example: darwin - name: os.version level: extended type: keyword ignore_above: 1024 description: Operating system version as a raw string. example: 10.14.1 - name: type level: core type: keyword ignore_above: 1024 description: 'Type of host. For Cloud providers this can be the machine type like `t2.medium`. If vm, this could be the container, for example, or other information meaningful in your environment.' - name: containerized type: boolean description: > If the host is a container. - name: os.build type: keyword example: "18D109" description: > OS build information. - name: os.codename type: keyword example: "stretch" description: > OS codename, if any. ``` ```jsonc // logs-nginx.access@package component template mappings "host": { "properties": { "hostname": { "ignore_above": 1024, "type": "keyword" }, "os": { "properties": { "build": { "ignore_above": 1024, "type": "keyword" }, "kernel": { "ignore_above": 1024, "type": "keyword" }, "codename": { "ignore_above": 1024, "type": "keyword" }, "name": { "ignore_above": 1024, "type": "keyword", "fields": { "text": { "type": "text" } } }, "family": { "ignore_above": 1024, "type": "keyword" }, "version": { "ignore_above": 1024, "type": "keyword" }, "platform": { "ignore_above": 1024, "type": "keyword" } } }, "domain": { "ignore_above": 1024, "type": "keyword" }, "ip": { "type": "ip" }, "containerized": { "type": "boolean" }, "name": { "ignore_above": 1024, "type": "keyword" }, "id": { "ignore_above": 1024, "type": "keyword" }, "type": { "ignore_above": 1024, "type": "keyword" }, "mac": { "ignore_above": 1024, "type": "keyword" }, "architecture": { "ignore_above": 1024, "type": "keyword" } } } ```

The suggestion here to flatten all fields generated by Fleet on these component templates would alter our logic here to convert the host mappings generated to something along the lines of this:

// subset of mappings shown for brevity's sake
{
  "host": {
    "type": "object",
    "subobjects": false,
    "properties": {
      "hostname": {
        "ignore_above": 1034,
        "type": "keyword"
      },
      "os.build": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.kernel": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.codename": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.name": {
        "ignore_above": 1024,
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}

Is my understanding of how subobjects: false is intended to interact with the mappings Fleet generates for component templates correct? I'm basing this example off what I'm seeing in the docs added in the referenced Elasticsearch PR: https://github.com/elastic/elasticsearch/pull/86166/files#diff-7fc27842bfad1be51be5b340e2dc4fed9048f72d6129c069fdca9c587aa97ed8R54-R76

I understand the flattening here being valuable for preventing collisions between fields like host and host.name (or the metric.value and metric.value.max fields described in https://github.com/elastic/elasticsearch/issues/63530) as mentioned, and I think it'd be a fairly straightforward implementation to apply this flattening logic when we parse object fields during Fleet's component template installation step.

I'd echo @jsoriano's question above as well, just out of general curiosity:

In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?

ruflin commented 2 years ago

@kpollich The initial idea I was proposing above was not directly related to subobject: false but it got triggered by it. The idea was to have:

{
   "host.hostname":{
      "ignore_above":1034,
      "type":"keyword"
   },
   "host.os.build":{
      "ignore_above":1024,
      "type":"keyword"
   },
   "host.os.kernel":{
      "ignore_above":1024,
      "type":"keyword"
   }
}

But you bring up additional interesting points.

javanna commented 2 years ago

I think the question here is taking the current solution to resolve the original collision problem to the next level, and apply it more broadly. We are reasoning about how useful intermediate objects are in the mappings and whether it would not be simpler to map leaf fields only using the dot notation.

In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?

No data on this, I believe what we are after is simplifying mappings, but @ruflin can correct me. It would have an impact on the total number of fields, because objects would no longer be mapped.

ruflin commented 2 years ago

No data on this, I believe what we are after is simplifying mappings, but @ruflin can correct me. It would have an impact on the total number of fields, because objects would no longer be mapped.

++, focus on simplification. It could have a positive side effect on # of fields mapped but I would expect it to be very minimal.

jsoriano commented 2 years ago

No data on this, I believe what we are after is simplifying mappings, but @ruflin can correct me. It would have an impact on the total number of fields, because objects would no longer be mapped.

++, focus on simplification. It could have a positive side effect on # of fields mapped but I would expect it to be very minimal.

I agree on simplification of the fields.yml files, and and on applying this to solve the collision problem.

But flattening everything we ship opposes to what we have been doing so far, in general we are sending unflattened data. This can be a big change in beats and integrations. I was asking to see if there are more advantages to help justifying this effort.

felixbarny commented 2 years ago

This can be a big change in beats and integrations.

If Elasticsearch is automatically handling the flattening of objects, then there's no work that needs to be done in Beats/integrations and using subobjects: false on the root level should be a drop-in replacement.

javanna commented 2 years ago

if Elasticsearch is automatically handling the flattening of objects, then there's no work that needs to be done in Beats/integrations and using subobjects: false on the root level should be a drop-in replacement.

I believe this is not correct. Or maybe I am referring to a problem that I saw on ECS that is not present in the mappings you are referring to. Elasticsearch is considering automatically flattening objects in incoming documents when subobjects is set to false, but some mappings do send objects and that will not be accepted. If you want to set subobjects to false, then the mappings should be flattened. From previous conversation, this change can be implemented already today: intermediate objects don't need to be mapped, you can already use the dot notation which Elasticsearch will expand to objects automatically, which it will no longer do as soon you switch to subobjects: false. Ping me if you have questions around this.

felixbarny commented 2 years ago

Right, good point. The mappings potentially need some changes but most importantly, the documents and queries don't, right?

the dot notation which Elasticsearch will expand to objects automatically

There's one instance where this isn't the case, though. Not sure if that's a bug or a feature. If dynamic is set to false or runtime, and when using dynamic templates with a path match containing dots, ES won't automatically create object mappings, which leads to an exception on ingest.

Example:

{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "log_level": {
            "path_match": "log.level",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "dynamic": "runtime",
      "properties": {
        "log": {
          "type": "object"
        }
      }
    }
  }
}

This doesn't work without explicitly mapping log as an object. It's not strictly related to subobjects: false but kinda related. And I needed to dump it somewhere 😅

javanna commented 2 years ago

@felixbarny dynamic:runtime maps every unknown field as a runtime field, hence within the runtime section. The runtime section does not support objects, and the behaviour will be then very similar to setting subobjects false to the root then. In this example you'd like to have everything mapped as runtime besides log.level, do I understand correctly?

Can you share more info, like what exception on ingest you get, maybe better on slack as this is unrelated to this issue?

In short, though, that objects are not dynamically not mapped is a feature in this case, I would like to dig further on why the object creation is necessary to make path_match work. Ping me, please?

felixbarny commented 1 year ago

During 8.10, we're working on removing all known blockers for subobjects: false adoption across different teams. @jlind23 @juliaElastic, is there a chance this issue could also be picked up in the 8.10 time frame? If that's unreasonable, could the observability contribute the necessary changes?

From my perspective, these are the concrete things that need to be implemented:

Instead of creating nested mappings, Fleet should always generate flattened mappings. This works both with subobjects set to true and false, because Elasticsearch will internally create nested object mappings if subobjects are enabled. See also this flat mapping as an example, and contrast it with this nested mapping. Both are equivalent but the former supports both subobjects to be set to true and false, whereas the latter only supports subobjects: false (the default). This will need to be changed in the the Fleet plugin in Kibana in the _generateMappings function.
Support setting subobjects: false at the root of the mapping (see the last section of the subobjects docs). This will need changes in the package spec and in Fleet (I suppose also in _generateMappings).
Support setting subobjects: false at a per-field level. This needs similar changes to the above.

In the future, we'll want to change the default for many of the data stream types so that subobjects: false is the default. But integrations should have the ability to opt-out on a per-data stream level.

For more background on why we're adopting subobjects: false, see this issue: https://github.com/elastic/logs-dev/issues/105

jlind23 commented 1 year ago

@felixbarny looking at our current roadmap there is no chance that we will be on time for 8.10, I am more than happy to find someone in our team that will help review this contribution.

felixbarny commented 1 year ago

Instead of creating nested mappings, Fleet should always generate flattened mappings. This works both with subobjects set to true and false, because Elasticsearch will internally create nested object mappings if subobjects are enabled.

We're currently discussing whether this can be done directly in Elasticsearch. That would also help if users have added custom mappings in the <package>@custom component template that contain nested mappings.

zmoog commented 10 months ago

I skimmed through the existing field definitions with type: object in the https://github.com/elastic/integrations repo to find patterns in the current usage.

Here are a few recurring cases:

- name: docker.container.labels.*
  type: object
  release: ga
  description: |
    Container labels

- name: labels
  level: extended
  type: object
  object_type: keyword
  description: Image labels.

- name: forgerock.request.detail.*
  type: object
  object_type: keyword
  object_type_mapping_type: '*'
  description: Details around the response status.

- name: input
  type: object

- name: payload
  type: object
  enabled: false

- name: labels.*
  type: object
  description: |
    Image labels.

- name: percpu
  type: object
  description: |
    CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.

From these cases, I tried to identify mapping that would use the subobjects option and their outcome.

Here are four cases with field definition and expected mapping I am currently focusing on. I am using these cases to write the tests.

Case A

Definition

- name: prometheus.a.labels
  type: object
  subobjects: false

Mapping

{
    "properties": {
        "prometheus": {
            "properties": {
                "a": {
                    "properties": {
                        "labels": {
                            "type": "object",
                            "subobjects": false
                        }
                    }
                }
            }
        }
    }
}

Case B

Definition

- name: prometheus.b.labels.*
  type: object
  object_type: keyword
  subobjects: false

Mapping

{
    "dynamic_templates": [
        {
            "prometheus.b.labels.*": {
                "path_match": "prometheus.b.labels.*",
                "match_mapping_type": "string",
                "mapping": {
                    "type": "keyword"
                }
            }
        }
    ],
    "properties": {
        "prometheus": {
            "properties": {
                "b": {
                    "properties": {
                        "labels": {
                            "type": "object",
                            "subobjects": false
                        }
                    }
                }
            }
        }
    }
}

Case C

Definition

- name: prometheus.c.labels.*
  type: object
  subobjects: false

Mapping

This is the result I get from the current in-progress implementation. It doesn't feels right.

{
    "properties": {
        "prometheus": {
            "properties": {
                "c": {
                    "properties": {
                        "labels": {
                            "properties": {
                                "*": {
                                    "type": "object",
                                    "subobjects": false
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Case D

Definition

- name: prometheus.d.labels
  type: object
  object_type: keyword
  subobjects: false

Mapping

{
    "dynamic_templates": [
        {
            "prometheus.d.labels": {
                "path_match": "prometheus.d.labels.*",
                "match_mapping_type": "string",
                "mapping": {
                    "type": "keyword"
                }
            }
        }
    ],
    "properties": {
        "prometheus": {
            "properties": {
                "d": {
                    "type": "object",
                    "subobjects": false
                }
            }
        }
    }
}

jsoriano commented 10 months ago

@zmoog are the packages defining these mappings using Package Spec v3? In principle this version doesn't allow the definition of objects without children (or without object_type), so for example in principle cases A and C would not be possible on recent packages, so maybe it is not worth supporting these cases.

Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?

Case C ... This is the result I get from the current in-progress implementation. It doesn't feels right.

Indeed this is not right, packages using mappings like this one should be reviewed. This is a mapping that in principle is not allowed by recent Package Specs.

Some cases that were still generating mappings with "*" members will be fixed in 8.12 after https://github.com/elastic/kibana/pull/169981.

zmoog commented 9 months ago

are the packages defining these mappings using Package Spec v3?

No, they are using package spec version < v3. I confirm that packages using spec v3 don't allow this case.

Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?

While troubleshooting an SDH, I tested a potential solution for a user trying to move the content of a field from a flattened to an object mapping.

On my test cluster, I was able to make it work by updating the logs-azure.platformlogs@custom component template with this request:

PUT _component_template/logs-azure.platformlogs@custom
{
  "template": {
    "mappings": {
      "_source": {
        "excludes": [],
        "includes": [],
        "enabled": true
      },
      "_routing": {
        "required": false
      },
      "dynamic": true,
      "numeric_detection": false,
      "date_detection": true,
      "dynamic_date_formats": [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "dynamic_templates": [],
      "properties": {
        "azure.platformlogs.k8s": {
          "type": "object",
          "subobjects": false
        }
      }
    }
  },
  "_meta": {
    "package": {
      "name": "azure"
    },
    "managed_by": "fleet",
    "managed": true
  }
}

I can share more about this case with you on a separate channel and report the summary here.

Indeed this is not right, packages using mappings like this one should be reviewed. This is a mapping that in principle is not allowed by recent Package Specs.

Okay, this is invalid and not something we intend to support. It will resolve as more packages switch to spec v3.

jsoriano commented 9 months ago

Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?

While troubleshooting an SDH, I tested a potential solution for a user trying to move the content of a field from a flattened to an object mapping.

Out of curiosity, why did they want to move from flattened to object?

On my test cluster, I was able to make it work by updating the logs-azure.platformlogs@custom component template with this request:
PUT _component_template/logs-azure.platformlogs@custom
{
  ...
      "dynamic_templates": [],
      "properties": {
        "azure.platformlogs.k8s": {
          "type": "object",
          "subobjects": false
        }
      }
    }
  ...
}

With this mapping we are not defining the type for the properties of azure.platformlogs.k8s, so there may be conflicts if they are mixed (an integer is sent first, but later the same field can also have floats, or keywords). We have been trying to remove this kind of ambiguous mappings.

For this case we could have a definition like the following one (would be Case D then):

- name: "azure.platformlogs.k8s"
  type: object
  object_type: keyword
  subobjects: false

And ensure that we generate for it the following mappings:

{
  ...
      "dynamic_templates": [
        {
           "azure.platformlogs.k8s": {
             "path_match": "azure.platformlogs.k8s.*",
             "mapping": { "type": "keyword" }
           }
        }     
      ],
      "properties": {
        "azure.platformlogs.k8s": {
          "type": "object",
          "subobjects": false
        }
      }
    }
  ...
}

Btw, cases B and D are in principle equivalent, Fleet adds the wildcard at the end for objects whose name doesn't have any wildcard. But would be good to keep them as testing cases. Another equivalent case would be the following one:

- name: prometheus.d.labels.*
  type: keyword
  subobjects: false

When there are wildcards in the name for non-objects, Fleet translates the definition to type: object using the original type as object_type.

zmoog commented 9 months ago

Out of curiosity, why did they want to move from flattened to object?

They mentioned (my apologies for the link to a private repo):

The reason for this change was to be able to select the different azure.platformlogs.properties using Kibana discover module, which is not possible when the field is "flattened".

It's probably worth mentioning how Azure logs work in general:

Here's an example of Azure logs:

{
    "category": "kube-audit",
    "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read",
    "properties": {
        ...
    },
    "resourceId": "/SUBSCRIPTIONS/123RESOURCEGROUPS/abc/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/abc",
    "time": "2023-11-29T09:41:00.720070863Z"
}

The log event contains a handful of standard fields:

category the log category
operationName
resourceId is the ID of the Azure resource emitting the event
time the timestamp of when the described event occurred

In addition to these well-defined and documented fields, there's also the properties field that Azure uses for the service-specific part of the event.

The content of the properties field varies A LOT. Depending on the originating service, properties can contain strings, numbers objects, embedded JSON objects, or a combination. It's wild.

In this scenario, the original designer of the integration opted to map the properties using the flattened, which makes sense.

This user is ingesting kube-audit events that are pretty large objects. You can see an example (I apologize again for the private repo) of this even type unpacked from the Azure log event.

zmoog commented 9 months ago

With this mapping we are not defining the type for the properties of azure.platformlogs.k8s, so there may be conflicts if they are mixed (an integer is sent first, but later the same field can also have floats, or keywords). We have been trying to remove this kind of ambiguous mappings.

I see the risk here.

However, users want to tap into the treasure trove of significant and valuable events like kube-audit to get the most out of it.

jsoriano commented 9 months ago

The reason for this change was to be able to select the different azure.platformlogs.properties using Kibana discover module, which is not possible when the field is "flattened".

Yep, this is the most frequent issue with flattened :slightly_frowning_face: https://github.com/elastic/kibana/issues/25820

subobjects: false will be useful in cases where there is no risk of fields explosion.

zmoog commented 9 months ago

Support for per-field subobjects: false merged in package-spec and kibana.

felixbarny commented 8 months ago

Support for per data stream subobjects: false is blocked by https://github.com/elastic/elasticsearch/issues/99860

zmoog commented 5 months ago

The subobjects option is now available at two levels:

per field (8.13)
per data stream (8.14)

elastic / package-spec