Closed ruflin closed 5 months ago
Having subobjects: false could have a broader impact on how we do things. A few thoughts:
common.MapStr magic inside Beats / Agent?
We don't necessarily need to have all these discussions here, but they are worth referencing. I'm excited about the potential future opportunities this offers. (@joshdover)
This is great, and I think that in many of the cases where we would take more advantage of this, basically where we dedot now, the field is already flattened, because we send it as is read from origin.
I think we could add support for this, and introduce it eventually on a case-by-case basis.
At first, the feature should be used with care in packages as the objects must be flattened before ingestion
We can add a check for this in elastic-package tests, as we have for other field values.
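A rough sketch of what such a check could look like. This is not the elastic-package implementation; the helper name find_nested_objects and the way the subobjects: false paths are passed in are assumptions made for illustration. It reports every nested object found below a path that is mapped with subobjects: false.

```python
from __future__ import annotations

from typing import Any


def find_nested_objects(doc: dict[str, Any], flat_paths: set[str]) -> list[str]:
    """Return dotted paths of nested objects found below any path in flat_paths.

    flat_paths holds the fields mapped with subobjects: false; use "" for the root.
    """
    violations: list[str] = []

    def walk(node: dict[str, Any], prefix: str, inside_flat: bool) -> None:
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                if inside_flat:
                    # An object below a subobjects: false field was not flattened.
                    violations.append(path)
                walk(value, path, inside_flat or path in flat_paths)

    walk(doc, "", "" in flat_paths)
    return violations
```

For example, with metrics assumed to be mapped with subobjects: false, a document such as {"metrics": {"time": {"max": 5}}} would be reported, while {"metrics": {"time.max": 5}} would pass.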
In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?
This is indeed interesting, so that would also simplify all the Schema in metricbeat.
Should we flatten fields for component templates in Elasticsearch loaded by fleet for every object that does not have special settings on the object level?
I'd like to just solidify my understanding of the relation between the package spec and component templates generated by Fleet. The fields YML files in a given integration control the mappings Fleet generates for component templates, correct?
e.g. nginx/data_stream/access/fields/agent.yml defines a host mapping here on the logs-nginx.access@package eventually generated by Fleet.
The host fields correspond as such:
The suggestion here to flatten all fields generated by Fleet on these component templates would alter our logic here to convert the host mappings generated to something along the lines of this:
// subset of mappings shown for brevity's sake
{
  "host": {
    "type": "object",
    "subobjects": false,
    "properties": {
      "hostname": {
        "ignore_above": 1034,
        "type": "keyword"
      },
      "os.build": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.kernel": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.codename": {
        "ignore_above": 1024,
        "type": "keyword"
      },
      "os.name": {
        "ignore_above": 1024,
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}
Is my understanding of how subobjects: false is intended to interact with the mappings Fleet generates for component templates correct? I'm basing this example off what I'm seeing in the docs added in the referenced Elasticsearch PR: https://github.com/elastic/elasticsearch/pull/86166/files#diff-7fc27842bfad1be51be5b340e2dc4fed9048f72d6129c069fdca9c587aa97ed8R54-R76
I understand the flattening here being valuable for preventing collisions between fields like host and host.name (or the metric.value and metric.value.max fields described in https://github.com/elastic/elasticsearch/issues/63530) as mentioned, and I think it'd be a fairly straightforward implementation to apply this flattening logic when we parse object fields during Fleet's component template installation step.
I'd echo @jsoriano's question above as well, just out of general curiosity:
In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?
@kpollich The initial idea I was proposing above was not directly related to subobjects: false, but it was triggered by it. The idea was to have:
{
  "host.hostname": {
    "ignore_above": 1034,
    "type": "keyword"
  },
  "host.os.build": {
    "ignore_above": 1024,
    "type": "keyword"
  },
  "host.os.kernel": {
    "ignore_above": 1024,
    "type": "keyword"
  }
}
But you bring up additional interesting points.
I think the question here is taking the current solution to resolve the original collision problem to the next level, and apply it more broadly. We are reasoning about how useful intermediate objects are in the mappings and whether it would not be simpler to map leaf fields only using the dot notation.
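To make the "leaf fields only" idea concrete, here is a minimal sketch, assuming the mappings are available as a plain dictionary of properties. The function name flatten_properties is hypothetical and this is not Fleet's actual code. Intermediate objects that carry nothing but type and properties are dropped; objects with extra settings (for example enabled: false) are kept as they are, since flattening them would lose those settings.

```python
from __future__ import annotations

from typing import Any


def flatten_properties(properties: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Map leaf fields only, using dot notation for their full path."""
    flat: dict[str, Any] = {}
    for name, mapping in properties.items():
        path = f"{prefix}{name}"
        children = mapping.get("properties")
        is_plain_object = mapping.get("type", "object") == "object"
        # Objects with special settings are left untouched (assumption).
        only_structure = set(mapping) <= {"type", "properties"}
        if children and is_plain_object and only_structure:
            flat.update(flatten_properties(children, prefix=f"{path}."))
        else:
            flat[path] = mapping
    return flat


nested = {
    "host": {
        "properties": {
            "hostname": {"type": "keyword", "ignore_above": 1024},
            "os": {"properties": {"kernel": {"type": "keyword", "ignore_above": 1024}}},
        }
    }
}
flat = flatten_properties(nested)  # keys: "host.hostname", "host.os.kernel"
```

Multi-fields (the fields key) are part of the leaf mapping and are carried over unchanged.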
In a more general sense, do we know if storing everything flattened would have other advantages? as disk space savings or ingestion performance?
No data on this, I believe what we are after is simplifying mappings, but @ruflin can correct me. It would have an impact on the total number of fields, because objects would no longer be mapped.
++, focus on simplification. It could have a positive side effect on # of fields mapped but I would expect it to be very minimal.
I agree on simplification of the fields.yml files, and on applying this to solve the collision problem.
But flattening everything we ship runs contrary to what we have been doing so far; in general we are sending unflattened data. This can be a big change in beats and integrations. I was asking to see if there are more advantages that would help justify this effort.
This can be a big change in beats and integrations.
If Elasticsearch is automatically handling the flattening of objects, then there's no work that needs to be done in Beats/integrations and using subobjects: false on the root level should be a drop-in replacement.
if Elasticsearch is automatically handling the flattening of objects, then there's no work that needs to be done in Beats/integrations and using subobjects: false on the root level should be a drop-in replacement.
I believe this is not correct. Or maybe I am referring to a problem that I saw on ECS that is not present in the mappings you are referring to. Elasticsearch is considering automatically flattening objects in incoming documents when subobjects is set to false, but some mappings do send objects and that will not be accepted. If you want to set subobjects to false, then the mappings should be flattened. From previous conversation, this change can already be implemented today: intermediate objects don't need to be mapped, you can already use the dot notation, which Elasticsearch will expand to objects automatically, and which it will no longer do as soon as you switch to subobjects: false. Ping me if you have questions around this.
Right, good point. The mappings potentially need some changes but most importantly, the documents and queries don't, right?
the dot notation which Elasticsearch will expand to objects automatically
There's one instance where this isn't the case, though. Not sure if that's a bug or a feature. If dynamic is set to false or runtime, and when using dynamic templates with a path match containing dots, ES won't automatically create object mappings, which leads to an exception on ingest.
Example:
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "log_level": {
            "path_match": "log.level",
            "mapping": { "type": "keyword" }
          }
        }
      ],
      "dynamic": "runtime",
      "properties": {
        "log": {
          "type": "object"
        }
      }
    }
  }
}
This doesn't work without explicitly mapping log as an object. It's not strictly related to subobjects: false but kinda related. And I needed to dump it somewhere 😅
@felixbarny dynamic:runtime maps every unknown field as a runtime field, hence within the runtime section. The runtime section does not support objects, and the behaviour will be then very similar to setting subobjects false to the root then. In this example you'd like to have everything mapped as runtime besides log.level, do I understand correctly?
Can you share more info, like what exception on ingest you get, maybe better on slack as this is unrelated to this issue?
In short, though, the fact that objects are not dynamically mapped is a feature in this case. I would like to dig further into why the object creation is necessary to make path_match work. Ping me, please?
During 8.10, we're working on removing all known blockers for subobjects: false adoption across different teams. @jlind23 @juliaElastic, is there a chance this issue could also be picked up in the 8.10 time frame? If that's unreasonable, could Observability contribute the necessary changes?
From my perspective, these are the concrete things that need to be implemented:
- Instead of creating nested mappings, Fleet should always generate flattened mappings. This works both with subobjects set to true and false, because Elasticsearch will internally create nested object mappings if subobjects are enabled. See also this flat mapping as an example, and contrast it with this nested mapping. Both are equivalent, but the former supports subobjects being set to either true or false, whereas the latter only supports subobjects: true (the default). This will need to be changed in the Fleet plugin in Kibana in the _generateMappings function.
- subobjects: false at the root of the mapping (see the last section of the subobjects docs). This will need changes in the package spec and in Fleet (I suppose also in _generateMappings).
- subobjects: false at a per-field level. This needs similar changes to the above.
In the future, we'll want to change the default for many of the data stream types so that subobjects: false is the default. But integrations should have the ability to opt out on a per-data stream level.
For more background on why we're adopting subobjects: false, see this issue: https://github.com/elastic/logs-dev/issues/105
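To illustrate why flat and nested mappings are equivalent when subobjects are enabled, here is a small sketch of the expansion from dotted names to nested objects. It only mimics the idea described above and is not Elasticsearch's implementation; the function name expand_properties is made up for this example. It also shows where the classic host versus host.name collision comes from.

```python
from __future__ import annotations

from typing import Any


def expand_properties(flat: dict[str, Any]) -> dict[str, Any]:
    """Expand dotted field names into nested object mappings."""
    root: dict[str, Any] = {}
    for path, mapping in flat.items():
        *parents, leaf = path.split(".")
        properties = root
        for parent in parents:
            node = properties.setdefault(parent, {})
            if node.get("type", "object") != "object":
                # e.g. "host" is a keyword but "host.name" needs it to be an object
                raise ValueError(f"{path} collides with non-object field {parent}")
            properties = node.setdefault("properties", {})
        if leaf in properties:
            raise ValueError(f"{path} collides with an existing mapping")
        properties[leaf] = mapping
    return root


flat = {
    "host.hostname": {"type": "keyword"},
    "host.os.name": {"type": "keyword"},
}
nested = expand_properties(flat)
```

With subobjects: false no such expansion happens, so the flat form is the only one that works in both modes.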
@felixbarny looking at our current roadmap there is no chance that we will be on time for 8.10, I am more than happy to find someone in our team that will help review this contribution.
Instead of creating nested mappings, Fleet should always generate flattened mappings. This works both with subobjects set to true and false, because Elasticsearch will internally create nested object mappings if subobjects are enabled.
We're currently discussing whether this can be done directly in Elasticsearch. That would also help if users have added custom mappings in the <package>@custom component template that contain nested mappings.
I skimmed through the existing field definitions with type: object in the https://github.com/elastic/integrations repo to find patterns in the current usage.
Here are a few recurring cases:
- name: docker.container.labels.*
  type: object
  release: ga
  description: |
    Container labels
- name: labels
  level: extended
  type: object
  object_type: keyword
  description: Image labels.
- name: forgerock.request.detail.*
  type: object
  object_type: keyword
  object_type_mapping_type: '*'
  description: Details around the response status.
- name: input
  type: object
- name: payload
  type: object
  enabled: false
- name: labels.*
  type: object
  description: |
    Image labels.
- name: percpu
  type: object
  description: |
    CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.
From these cases, I tried to identify mappings that would use the subobjects option and their outcome.
Here are four cases, each with a field definition and the expected mapping, that I am currently focusing on. I am using these cases to write the tests.
Case A
Definition
- name: prometheus.a.labels
  type: object
  subobjects: false
Mapping
{
  "properties": {
    "prometheus": {
      "properties": {
        "a": {
          "properties": {
            "labels": {
              "type": "object",
              "subobjects": false
            }
          }
        }
      }
    }
  }
}
Case B
Definition
- name: prometheus.b.labels.*
  type: object
  object_type: keyword
  subobjects: false
Mapping
{
  "dynamic_templates": [
    {
      "prometheus.b.labels.*": {
        "path_match": "prometheus.b.labels.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ],
  "properties": {
    "prometheus": {
      "properties": {
        "b": {
          "properties": {
            "labels": {
              "type": "object",
              "subobjects": false
            }
          }
        }
      }
    }
  }
}
Case C
Definition
- name: prometheus.c.labels.*
  type: object
  subobjects: false
Mapping
This is the result I get from the current in-progress implementation. It doesn't feel right.
{
  "properties": {
    "prometheus": {
      "properties": {
        "c": {
          "properties": {
            "labels": {
              "properties": {
                "*": {
                  "type": "object",
                  "subobjects": false
                }
              }
            }
          }
        }
      }
    }
  }
}
Case D
Definition
- name: prometheus.d.labels
  type: object
  object_type: keyword
  subobjects: false
Mapping
{
  "dynamic_templates": [
    {
      "prometheus.d.labels": {
        "path_match": "prometheus.d.labels.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ],
  "properties": {
    "prometheus": {
      "properties": {
        "d": {
          "type": "object",
          "subobjects": false
        }
      }
    }
  }
}
@zmoog are the packages defining these mappings using Package Spec v3? In principle this version doesn't allow the definition of objects without children (or without object_type), so for example in principle cases A and C would not be possible on recent packages, so maybe it is not worth supporting these cases.
Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?
Case C ... This is the result I get from the current in-progress implementation. It doesn't feels right.
Indeed this is not right, packages using mappings like this one should be reviewed. This is a mapping that in principle is not allowed by recent Package Specs.
Some cases that were still generating mappings with "*" members will be fixed in 8.12 after https://github.com/elastic/kibana/pull/169981.
are the packages defining these mappings using Package Spec v3?
No, they are using package spec version < v3. I confirm that packages using spec v3 don't allow this case.
Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?
While troubleshooting an SDH, I tested a potential solution for a user trying to move the content of a field from a flattened to an object mapping.
On my test cluster, I was able to make it work by updating the logs-azure.platformlogs@custom component template with this request:
PUT _component_template/logs-azure.platformlogs@custom
{
  "template": {
    "mappings": {
      "_source": {
        "excludes": [],
        "includes": [],
        "enabled": true
      },
      "_routing": {
        "required": false
      },
      "dynamic": true,
      "numeric_detection": false,
      "date_detection": true,
      "dynamic_date_formats": [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "dynamic_templates": [],
      "properties": {
        "azure.platformlogs.k8s": {
          "type": "object",
          "subobjects": false
        }
      }
    }
  },
  "_meta": {
    "package": {
      "name": "azure"
    },
    "managed_by": "fleet",
    "managed": true
  }
}
I can share more about this case with you on a separate channel and report the summary here.
Indeed this is not right, packages using mappings like this one should be reviewed. This is a mapping that in principle is not allowed by recent Package Specs.
Okay, this is invalid and not something we intend to support. It will resolve as more packages switch to spec v3.
Maybe we need to reconsider this, to allow specifying subobjects: false, but what would be the use case for such mappings?
While troubleshooting an SDH, I tested a potential solution for a user trying to move the content of a field from a flattened to an object mapping.
Out of curiosity, why did they want to move from flattened to object?
On my test cluster, I was able to make it work by updating the logs-azure.platformlogs@custom component template with this request: PUT _component_template/logs-azure.platformlogs@custom { ... "dynamic_templates": [], "properties": { "azure.platformlogs.k8s": { "type": "object", "subobjects": false } } } ... }
With this mapping we are not defining the type for the properties of azure.platformlogs.k8s, so there may be conflicts if they are mixed (an integer is sent first, but later the same field can also have floats, or keywords). We have been trying to remove this kind of ambiguous mappings.
For this case we could have a definition like the following one (would be Case D then):
- name: "azure.platformlogs.k8s"
  type: object
  object_type: keyword
  subobjects: false
And ensure that we generate for it the following mappings:
{
  ...
    "dynamic_templates": [
      {
        "azure.platformlogs.k8s": {
          "path_match": "azure.platformlogs.k8s.*",
          "mapping": { "type": "keyword" }
        }
      }
    ],
    "properties": {
      "azure.platformlogs.k8s": {
        "type": "object",
        "subobjects": false
      }
    }
  }
  ...
}
Btw, cases B and D are in principle equivalent, Fleet adds the wildcard at the end for objects whose name doesn't have any wildcard. But would be good to keep them as testing cases. Another equivalent case would be the following one:
- name: prometheus.d.labels.*
  type: keyword
  subobjects: false
When there are wildcards in the name for non-objects, Fleet translates the definition to type: object using the original type as object_type.
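The two rules described here (wildcard names on non-object types become objects with an object_type, and object names without a wildcard get one appended for the path match) can be sketched as follows. This is a simplified illustration and not Fleet's _generateMappings; the function names normalize_field and build_mappings are hypothetical, and only definitions that end up with an object_type are handled.

```python
from __future__ import annotations

from typing import Any


def normalize_field(field: dict[str, Any]) -> dict[str, Any]:
    """Rewrite a wildcard non-object definition as an object with object_type."""
    normalized = dict(field)
    if "*" in field["name"] and field.get("type") != "object":
        normalized["object_type"] = field["type"]
        normalized["type"] = "object"
    return normalized


def build_mappings(field: dict[str, Any]) -> dict[str, Any]:
    """Build a dynamic template plus a static object mapping for one field."""
    field = normalize_field(field)
    name = field["name"]
    # Objects without a wildcard get ".*" appended for the path match.
    path_match = name if "*" in name else f"{name}.*"
    static_name = name[:-2] if name.endswith(".*") else name
    static_mapping: dict[str, Any] = {"type": "object"}
    if "subobjects" in field:
        static_mapping["subobjects"] = field["subobjects"]
    return {
        "dynamic_templates": [
            {name: {"path_match": path_match, "mapping": {"type": field["object_type"]}}}
        ],
        "properties": {static_name: static_mapping},
    }


k8s = {
    "name": "azure.platformlogs.k8s",
    "type": "object",
    "object_type": "keyword",
    "subobjects": False,
}
mappings = build_mappings(k8s)
```

For the azure.platformlogs.k8s definition this yields the dynamic template and the static subobjects: false object shown above, and the wildcard keyword definition produces the same static mapping as its object_type equivalent.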
Out of curiosity, why did they want to move from flattened to object?
They mentioned (my apologies for the link to a private repo):
The reason for this change was to be able to select the different azure.platformlogs.properties using Kibana discover module, which is not possible when the field is "flattened".
It's probably worth mentioning how Azure logs work in general:
Here's an example of Azure logs:
{
  "category": "kube-audit",
  "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read",
  "properties": {
    ...
  },
  "resourceId": "/SUBSCRIPTIONS/123RESOURCEGROUPS/abc/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/abc",
  "time": "2023-11-29T09:41:00.720070863Z"
}
The log event contains a handful of standard fields:
- category: the log category
- operationName
- resourceId: the ID of the Azure resource emitting the event
- time: the timestamp of when the described event occurred
In addition to these well-defined and documented fields, there's also the properties field that Azure uses for the service-specific part of the event.
The content of the properties field varies A LOT. Depending on the originating service, properties can contain strings, numbers, objects, embedded JSON objects, or a combination. It's wild.
In this scenario, the original designer of the integration opted to map the properties field using the flattened type, which makes sense.
This user is ingesting kube-audit events that are pretty large objects. You can see an example (I apologize again for the private repo) of this event type unpacked from the Azure log event.
With this mapping we are not defining the type for the properties of azure.platformlogs.k8s, so there may be conflicts if they are mixed (an integer is sent first, but later the same field can also have floats, or keywords). We have been trying to remove this kind of ambiguous mappings.
I see the risk here.
However, users want to tap into the treasure trove of significant and valuable events like kube-audit to get the most out of it.
The reason for this change was to be able to select the different azure.platformlogs.properties using Kibana discover module, which is not possible when the field is "flattened".
Yep, this is the most frequent issue with flattened 🙁 https://github.com/elastic/kibana/issues/25820
subobjects: false will be useful in cases where there is no risk of fields explosion.
Support for per-field subobjects: false merged in package-spec and kibana.
Support for per data stream subobjects: false is blocked by https://github.com/elastic/elasticsearch/issues/99860
The subobjects option is now available at two levels:
In https://github.com/elastic/elasticsearch/pull/86166 Elasticsearch introduced the subobjects: false parameter. This allows ingesting flat key/value pairs that in the past had conflicts, like host and host.name. package-spec and Fleet should support this option in templates.
At first, the feature should be used with care in packages as the objects must be flattened before ingestion (see also https://github.com/elastic/apm/issues/347 for more details). At the moment ingestion tools like Beats often follow the opposite approach and expand all objects. There are potential future improvements that solve this problem: https://github.com/elastic/apm/issues/347#issuecomment-1140947927
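As a minimal sketch of what "flattened before ingestion" means for a document, the following helper collapses nested objects into dotted keys. The name flatten_document is made up for this example and it is not how Beats or Agent process events; arrays (including arrays of objects) and empty objects are left untouched, which is a simplifying assumption.

```python
from __future__ import annotations

from typing import Any


def flatten_document(doc: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Collapse nested objects into flat key/value pairs with dotted keys."""
    flat: dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            flat.update(flatten_document(value, prefix=f"{path}."))
        else:
            # Scalars, arrays and empty objects are kept as they are.
            flat[path] = value
    return flat


event = {"host": {"name": "web-1", "os": {"name": "linux"}}, "message": "hi"}
flat_event = flatten_document(event)
```

Here flat_event holds the keys host.name, host.os.name and message, which is the shape a subobjects: false mapping expects.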
From https://github.com/elastic/package-spec/issues/425, we need to ensure that at least one of the following mappings produces the expected results, so this is supported.
Should produce:
- A dynamic mapping for prometheus.labels.*, with type keyword.
- A static mapping for prometheus.labels, with type: object and subobjects: false.
Should produce a mapping for prometheus.labels with type: object and subobjects: false, independently of the existence of other related dynamic mappings.
Maybe this already works, but we would need to check it.