Closed by qcorporation 1 month ago
In my work on this so far, I've seen that fields can be grouped into four categories according to whether order and duplicates need to be preserved in lists.
`cmdline` will be meaningless if the argument order is rearranged.
"Correlated events" -- Some integrations pack multiple events/records into separate lists. For example:
    events:
      id:
        - 1
        - 2
      name:
        - one
        - two
      desc:
        - ABC
        - def
instead of
    events:
      - event:
          id: 1
          name: one
          desc: ABC
      - event:
          id: 2
          name: two
          desc: def
Order needs to be maintained so that event contents are not rearranged and meaning is lost. Integrations could be updated to avoid this and remove the requirement that order is maintained in these fields.
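A minimal sketch of why the parallel-list layout depends on order and duplicates. It assumes a set-like normalization that sorts and de-duplicates each array independently; the `normalize` helper is a hypothetical stand-in, not the actual LogsDB implementation, and the sample values are extended from the example above for illustration.

```python
def normalize(values):
    """Hypothetical stand-in for set-like array storage: sort and de-duplicate."""
    return sorted(set(values))


def regroup(columns):
    """Rebuild per-event records from parallel lists, matching values by position."""
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*(columns[k] for k in keys))]


# Parallel-list layout: position i in every list describes event i.
events = {
    "id": [1, 2, 3],
    "name": ["one", "two", "three"],
    "desc": ["def", "ABC", "def"],
}

# With the original order and duplicates intact, regrouping is correct.
original = regroup(events)

# After independent normalization, the lists no longer line up:
# "desc" loses its duplicate and "name" is re-sorted.
normalized = {key: normalize(values) for key, values in events.items()}
damaged = regroup(normalized)

print(original)
print(damaged)
```

In this sketch the third event disappears and the remaining records pair values that never belonged together, which is the loss of meaning described above. Storing each event as its own object avoids the dependency on positional alignment.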
I've updated the spreadsheet to have these categories for "Order/dup important" instead of just true/false.
If this inventory effort is about LogsDB, does it make sense to investigate integrations that return metrics, which would NOT be stored in LogsDB (but more likely in TSDB)?
Looking at the tracker spreadsheet for @elastic/stack-monitoring, I'm wondering specifically about the following:
.elasticsearch.cluster.stats.nodes.versions[]
.elasticsearch.cluster.stats.state.nodes.Unf_fjGESWun1vbtRzJK9w.roles[]
.elasticsearch.node.roles[]
.kibana.task_manager_metrics.metrics.task_claim.value.duration.counts[]
.kibana.task_manager_metrics.metrics.task_claim.value.duration.values[]
.logstash.node.stats.logstash.pipelines[]
Thanks for your input
M365 Defender (`m365_defender.{alert,incident}`) sort of relies on the order, though there are already issues.
There's a list of JSON objects in the original event that gets flattened by `dot_expander` into arrays of values under `m365_defender.incident.alert.evidence.*`.
The order must be preserved to determine which evidence item each value in an array belongs to.
See Alert Evidence under #9050 for some examples. I think this is one of the cases where changing the pipeline would be better. I planned to work on that myself this month but things did not work out.
@consulthys @jvalente-salemstate, thank you for the feedback and questions on your respective code bases. We've put this work on hold because logsdb development has potentially changed its implementation from opt-out to opt-in. By default it would then focus on adoption and minimize breakage, rather than the reverse, which would require more effort from all integration teams to assess ordering and de-duplication dependencies.
The tracker will eventually be updated to reflect the decision made, and this issue can potentially be closed if deemed unnecessary. cc'ing @andrewkroh
The approach has changed, and now logsdb will store `_source` for arrays by default. To get the optimization for array fields that are treated as unordered sets, we can opt in by setting `synthetic_source_keep: "none"` in the mapping. Adding this option prevents the array field from being stored in `_source`.
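As a rough sketch of what such an opt-in could look like in a mapping body, written here as a Python dict. The field names `example.tags` and `example.args` and the `opted_in_fields` helper are hypothetical; only the `synthetic_source_keep` parameter and its `"none"` value come from the comment above, and how this would be expressed in `fields.yml` is not settled in this thread.

```python
import json

# Hypothetical fields for illustration only.
mappings = {
    "properties": {
        "example": {
            "properties": {
                # Treated as an unordered set: opt in to the optimization,
                # so the array is not kept in _source.
                "tags": {"type": "keyword", "synthetic_source_keep": "none"},
                # Order and duplicates matter: left at the default.
                "args": {"type": "keyword"},
            }
        }
    }
}


def opted_in_fields(properties, prefix=""):
    """Return dotted paths of fields that set synthetic_source_keep to "none"."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("synthetic_source_keep") == "none":
            found.append(path)
        # Recurse into object fields.
        found.extend(opted_in_fields(spec.get("properties", {}), path + "."))
    return found


print(json.dumps({"mappings": mappings}, indent=2))
print(opted_in_fields(mappings["properties"]))
```

A helper like this could be used to review which fields in a package have been declared order-insensitive before the option is rolled out.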
Support for this mapping parameter is in progress in Elasticsearch and will first be available in 8.16. But before integrations can use it, package-spec and Fleet need to be updated to allow it in fields.yml. Until those things happen there won't be anything to do, so let's close this. We'll revisit making the optimizations once support is added in the stack.
Description
With the upcoming LogsDB release, data is stored without the original `_source`, and this means that arrays can be reordered and de-duplicated. There are fields where order matters to end users, such as `process.args`. Some fields do not function if de-duplication occurs within the array. A similar issue has been created for standard ECS fields to identify and flag fields that might be affected by relying on ordering or deduplication.
Our goal is to identify integration fields that rely on ordering or duplication. The integrations that could potentially be affected by this limitation within LogsDB are listed below. Code owners have been added as a checklist; if you are listed below, please:
1) Update the tracker for the appropriate team, confirming that you have reviewed the fields and signed off that the fields under the specific integration are not affected by the LogsDB release.
2) If any lists/arrays are affected by the LogsDB release, work with the integration teams to set array normalization for the affected fields.
@elastic/security-service-integrations
@elastic/sec-deployment-and-devices
@elastic/sec-linux-platform
@elastic/obs-infraobs-integrations
@elastic/obs-ds-hosted-services
@elastic/obs-cloudnative-monitoring
@elastic/stack-monitoring
@elastic/elastic-agent-data-plane
@elastic/ecosystem
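To illustrate the concern described above for a field such as `process.args`, here is a small sketch. It assumes the stored array is sorted and de-duplicated; `normalize` is a hypothetical stand-in for that behavior, and the command line is made up for the example.

```python
def normalize(values):
    """Hypothetical stand-in for an array stored as a sorted, de-duplicated set."""
    return sorted(set(values))


# A made-up command line where both order and the repeated "-e" flag matter.
args = ["grep", "-e", "error", "-e", "warn", "app.log"]
stored = normalize(args)

print(" ".join(args))
print(" ".join(stored))
```

The normalized list drops the second `-e` and no longer starts with the executable name, so the original command cannot be reconstructed from it.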