elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations

[Pause] Identify integration fields lists that rely on ordering and duplication #10900

Closed qcorporation closed 1 month ago

qcorporation commented 2 months ago

Description

With the upcoming LogsDB release, data is stored without the original _source, and this means that arrays can be reordered and de-duplicated. There are fields where order matters to end-users, such as process.args. Some fields do not function if de-duplication occurs within the array. A similar issue has been created for standard ECS fields to identify and flag fields that might be affected by relying on ordering or deduplication.
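To make the concern concrete, here is a minimal Python sketch of what sorting and de-duplicating an array does to a field like process.args. It only simulates the effect described above; the helper name `simulate_synthetic_array` and the sample command line are made up for illustration, and the real behaviour depends on the field type and Elasticsearch version.

```python
def simulate_synthetic_array(values):
    """Approximate an array rebuilt without the original _source:
    values come back sorted and with duplicates dropped."""
    return sorted(set(values))


# Hypothetical command line: both the order and the repeated "-e" flag matter.
process_args = ["grep", "-r", "-e", "foo", "-e", "bar", "/var/log"]

rebuilt = simulate_synthetic_array(process_args)

print("original:", process_args)
print("rebuilt: ", rebuilt)
# The second "-e" is gone and the arguments no longer form the command that ran.
```

For a field that is really an unordered set (a list of tags, for example), the same transformation would lose nothing, which is why each field has to be reviewed individually.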

Our goal is to identify integration fields that rely on ordering or duplication. The integrations that could potentially be affected by this limitation within LogsDB are listed below. Code owners have been added as a checklist; if you are listed below, please:

1) Update the tracker for the appropriate team and confirm that you have reviewed the fields under your integration and signed off that they are not affected by the LogsDB release.

2) If the list/arrays are affected by the LogsDB release, work with the integration teams to set array normalization for the affected fields.


@elastic/security-service-integrations

@elastic/sec-deployment-and-devices

@elastic/sec-linux-platform

@elastic/obs-infraobs-integrations

@elastic/obs-ds-hosted-services

@elastic/obs-cloudnative-monitoring

@elastic/stack-monitoring

@elastic/elastic-agent-data-plane

@elastic/ecosystem

mjwolf commented 2 months ago

In my work on this so far, I've seen that fields can be grouped into four categories regarding whether order and duplication need to be maintained for lists.

I've updated the spreadsheet to have these categories for "Order/dup important" instead of just true/false.

consulthys commented 2 months ago

If this inventory effort is about LogsDB, does it make sense to investigate integrations that return metrics, which would NOT be stored in LogsDB (but more likely in TSDB)?

Looking at the tracker spreadsheet for @elastic/stack-monitoring, I'm wondering specifically about the following:

Thanks for your input

jvalente-salemstate commented 2 months ago

M365 Defender (m365_defender.{alert,incident}) sort of relies on the order, though there are already issues.

There's a list of JSON objects in the original event that gets flattened by dot_expander into arrays of values under m365_defender.incident.alert.evidence.*. The order must be preserved to determine which evidence item a given value in an array belongs to.
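A rough Python stand-in for that flattening, to show why position matters. This is not the actual ingest pipeline: the helper `flatten_objects` and the evidence keys (`entity_type`, `verdict`) are hypothetical, chosen only to illustrate parallel arrays.

```python
from collections import defaultdict


def flatten_objects(items):
    """Turn a list of objects into one array of values per key,
    keeping the original item order in every array."""
    columns = defaultdict(list)
    for item in items:
        for key, value in item.items():
            columns[key].append(value)
    return dict(columns)


# Hypothetical evidence items; real events carry many more keys.
evidence = [
    {"entity_type": "User", "verdict": "Suspicious"},
    {"entity_type": "File", "verdict": "Malicious"},
    {"entity_type": "User", "verdict": "Malicious"},
]

flat = flatten_objects(evidence)

# While order and duplicates are kept, index i in every array is item i.
pairs = list(zip(flat["entity_type"], flat["verdict"]))

# After sorting and de-duplicating each array independently,
# the arrays shrink and the positions no longer line up.
deduped = {key: sorted(set(values)) for key, values in flat.items()}

print(pairs)
print(deduped)
```

Once each array is reduced to two distinct values, there is no way to tell which verdict belonged to which of the three original items, which is the argument for restructuring the pipeline output rather than relying on array order.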

See Alert Evidence under #9050 for some examples. I think this is one of the cases where changing the pipeline would be better. I planned to work on that myself this month, but things did not work out.

qcorporation commented 1 month ago

@consulthys @jvalente-salemstate, thank you for the feedback and questions on your respective code bases. We've put this work on hold because the logsdb development may have changed its implementation from opt-out to opt-in. With opt-in, the default focuses on adoption and minimizes breakages; the reverse would require more effort from all integration teams to assess ordering and deduplication dependencies.

The tracker will eventually be updated to reflect the decision made, and this issue can potentially be closed if deemed unnecessary. cc @andrewkroh

andrewkroh commented 1 month ago

The approach has changed, and now logsdb will store _source for arrays by default. To get the optimization for array fields that are treated as unordered sets, we can opt in by setting synthetic_source_keep: "none" in the mapping. Adding this option prevents the array field from being stored in _source.
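As a sketch of what that opt-in might look like, the Python snippet below builds a mapping body with the parameter set on a single keyword field. The field name `example_labels` and the helper `keyword_array_mapping` are hypothetical, and the exact placement Elasticsearch accepts should be checked against the mapping documentation for the target version.

```python
import json


def keyword_array_mapping(field_name, keep_source_for_arrays):
    """Build a mapping body for one keyword field.

    When keep_source_for_arrays is False, the field opts in to the
    optimization by setting synthetic_source_keep to "none".
    """
    field = {"type": "keyword"}
    if not keep_source_for_arrays:
        field["synthetic_source_keep"] = "none"
    return {"mappings": {"properties": {field_name: field}}}


# Hypothetical field whose values are treated as an unordered set.
body = keyword_array_mapping("example_labels", keep_source_for_arrays=False)

print(json.dumps(body, indent=2))
```

Fields where order or duplicates matter would simply leave the parameter unset and keep the default behaviour.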

Support for this mapping parameter is in progress in Elasticsearch and will first be available in 8.16. But before integrations can use it, package-spec and Fleet need to be updated to allow it in fields.yml. Until those things happen, there won't be anything to do, so let's close this. We'll revisit making the optimizations once support is added in the stack.
