[Pause] Identify integration fields lists that rely on ordering and duplication

qcorporation commented 2 months ago

Description

With the upcoming LogsDB release, data is stored without the original _source, and this means that arrays can be reordered and de-duplicated. There are fields where order matters to end-users, such as process.args. Some fields do not function if de-duplication occurs within the array. A similar issue has been created for standard ECS fields to identify and flag fields that might be affected by relying on ordering or deduplication.
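As a rough illustration of the concern, the sketch below mimics what happens to an array when its values are reconstructed in sorted order with duplicates removed. It assumes keyword-like behavior; the helper name `simulate_synthetic_keyword_array` and the sample arguments are hypothetical and are not taken from Elasticsearch.

```python
def simulate_synthetic_keyword_array(values):
    """Approximate an array rebuilt without _source: sorted, duplicates dropped."""
    return sorted(set(values))


# A command line where both the order and the repeated "-e" flag carry meaning.
process_args = ["grep", "-e", "foo", "-e", "bar", "file.txt"]

stored = simulate_synthetic_keyword_array(process_args)
print(stored)  # ['-e', 'bar', 'file.txt', 'foo', 'grep']
```

The original command can no longer be recovered from the stored values, which is the failure mode this review is meant to catch.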

Our goal is to identify integration fields that rely on ordering or duplication. The integrations that could potentially be affected by this limitation within LogsDB are listed below. Code owners have been added as a checklist; if you are listed below, please:

1) Update the tracker for the appropriate team, confirming that you have reviewed the fields under each integration and signed off that they are not affected by the LogsDB release.

Note: Column A within the tracker has a drop-down for CODEOWNERS to mark a review as complete.

2) If the list/arrays are affected by the LogsDB release, work with the integration teams to set array normalization for the affected fields.

@elastic/security-service-integrations

[ ] abnormal_security
[ ] akamai
[ ] anomaly
[ ] auth0
[ ] aws
[ ] azure
[ ] bbot
[ ] bitdefender
[ ] box
[ ] canva
[ ] carbon_black_cloud
[ ] cisco_meraki
[ ] cisco.secure_endpoint
[ ] cisco.umbrella
[ ] cloudflare
[ ] crowdstrike
[ ] cybereason
[ ] darktrace
[ ] entityanalytics_ad
[ ] eset
[ ] f5_big
[ ] falco
[ ] forgerock
[ ] gcp
[ ] github
[ ] google_scc
[ ] google_workspace
[ ] infoblox
[ ] jamf_compliance_reporter
[ ] jumpcloud
[ ] lastpass
[ ] m365_defender
[ ] microsoft_defender_cloud
[ ] mimecast
[ ] netskope
[ ] o365
[ ] ocsf
[ ] okta
[ ] opencti
[ ] otx
[ ] panw_cortex
[ ] ping_one
[ ] prisma_cloud
[ ] proofpoint_on_demand
[ ] proofpoint_tap
[ ] qualys_vmdr
[ ] rapid7
[ ] recordedfuture
[ ] sentinel_one
[ ] ses
[ ] slack
[ ] snyk
[ ] spycloud
[ ] tanium
[ ] teleport
[ ] tenable_io
[ ] tenable_sc
[ ] threatq
[ ] ti_crowdstrike
[ ] trellix_edr
[ ] trellix_epo
[ ] trend_micro_vision
[ ] vectra_detect
[ ] wiz
[ ] zeronetworks
[ ] zscaler_zia
[ ] zscaler_zpa

@elastic/sec-deployment-and-devices

[ ] checkpoint
[ ] cisco.ftd
[ ] cisco_ise
[ ] fortinet_fortimail
[ ] hashicorp_vault
[ ] iptables
[ ] modsec
[ ] netflow
[ ] panw
[ ] pfsense
[ ] rsa
[ ] sophos
[ ] stormshield
[ ] suricata
[ ] watchguard
[ ] zeek

@elastic/sec-linux-platform

[ ] auditd
[ ] dhcpv4
[ ] dns
[ ] memcache
[ ] rpc
[ ] sip

@elastic/obs-infraobs-integrations

[ ] apache.access
[ ] aws
[ ] ceph
[ ] cockroachdb
[ ] golang
[ ] haproxy
[ ] ibmmq
[ ] influxdb
[ ] mongodb_atlas
[ ] nginx.access
[ ] salesforce

@elastic/obs-ds-hosted-services

[ ] aws

@elastic/obs-cloudnative-monitoring

[ ] docker
[ ] istio
[ ] kubernetes
[ ] nginx_ingress_controller

@elastic/stack-monitoring

[ ] elasticsearch
[ ] kibana
[ ] logstash

@elastic/elastic-agent-data-plane

[ ] log

@elastic/ecosystem

[ ] package_registry

mjwolf commented 2 months ago

In my work on this so far, I've seen that fields can be grouped into four categories according to whether order and duplication need to be maintained for lists.

"None" -- No dependence on order or duplication in lists, traditional "sets".
"Strong" dependence -- For fields where the meaning will be lost if order/duplication is changed, for example, process cmdline will be meaningless if the argument order is rearranged.
"Weak" dependence -- The order/duplication is important to the actual implementation related to these fields, but is less important to logging. For example, DNS answer records: the order of records can affect how the actual implementation uses them, but it is less likely to affect logging. Still, log users might implicitly expect the order to match the implementation order. This could probably be addressed with documentation stating that order is not guaranteed to be maintained in the field.
"Correlated events" -- Some integrations pack multiple events/records into separate lists. For example:
```
events:
  id:
    - 1
    - 2
  name:
    - one
    - two
  desc:
    - ABC
    - def
```
instead of
```
events:
  - event:
      id: 1
      name: one
      desc: ABC

  - event:
      id: 2
      name: two
      desc: def
```
Order needs to be maintained so that event contents are not rearranged and meaning is lost. Integrations could be updated to avoid this and remove the requirement that order is maintained in these fields.
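One way an integration could make that change is sketched below: pair up the parallel lists into one record per event before indexing, so that nothing depends on array order afterwards. This is only a minimal sketch; the function name `unpack_parallel_lists` is hypothetical, and it produces flat records rather than the `event:` wrapper shown above.

```python
def unpack_parallel_lists(events):
    """Turn {"id": [1, 2], "name": ["one", "two"]} into one dict per event."""
    keys = list(events)
    lengths = {len(events[key]) for key in keys}
    if len(lengths) > 1:
        # Unequal lengths mean the lists cannot be paired up safely.
        raise ValueError("parallel lists differ in length")
    return [dict(zip(keys, row)) for row in zip(*(events[key] for key in keys))]


packed = {"id": [1, 2], "name": ["one", "two"], "desc": ["ABC", "def"]}
records = unpack_parallel_lists(packed)
print(records)
```

The conversion has to happen while the original order is still intact, for example in the ingest pipeline, because the pairing cannot be rebuilt once the lists have been reordered or de-duplicated.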

I've updated the spreadsheet to have these categories for "Order/dup important" instead of just true/false.

consulthys commented 2 months ago

If this inventory effort is about LogsDB, does it make sense to investigate integrations that return metrics, which would NOT be stored in LogsDB (but more likely in TSDB)?

Looking at the tracker spreadsheet for @elastic/stack-monitoring, I'm wondering specifically about the following:

.elasticsearch.cluster.stats.nodes.versions[]
.elasticsearch.cluster.stats.state.nodes.Unf_fjGESWun1vbtRzJK9w.roles[]
.elasticsearch.node.roles[]
.kibana.task_manager_metrics.metrics.task_claim.value.duration.counts[]
.kibana.task_manager_metrics.metrics.task_claim.value.duration.values[]
.logstash.node.stats.logstash.pipelines[]

Thanks for your input

jvalente-salemstate commented 2 months ago

M365 Defender ( m365_defender.{alert,incident} ) sort of relies on the order, though there are already issues.

There's a list of JSON objects in the original event that gets flattened by dot_expander into arrays of values under m365_defender.incident.alert.evidence.* Preserving the order is needed to determine which evidence item a value in the array belongs to.

See Alert Evidence under #9050 for some examples. I think this is one of the cases where changing the pipeline would be better. I planned to work on that myself this month but things did not work out.

qcorporation commented 1 month ago

@consulthys @jvalente-salemstate, thank you for the feedback and questions on your respective code bases. We've put this work on hold because the logsdb developers have potentially changed the implementation from opt-out to opt-in. By default, it would then focus on adoption and minimize breakages, rather than the reverse, which would require more effort from all integration teams to assess ordering and deduplication dependencies.

The tracker will eventually be updated to reflect the decision made, and this issue can potentially be closed if deemed unnecessary. cc @andrewkroh

andrewkroh commented 1 month ago

The approach has changed, and now logsdb will store _source for arrays by default. To get the optimization for array fields that are treated as unordered sets, we can opt in by setting synthetic_source_keep: "none" in the mapping. Adding this option prevents the array field from being stored in _source.
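For illustration only, here is roughly what such an opt-in could look like as an index mapping body, built as a Python dict and printed as JSON. The field name `tags`, the `keyword` type, and the helper `keyword_set_field` are assumptions made for the example; this is not the fields.yml syntax, which is not supported yet.

```python
import json


def keyword_set_field(name):
    """Mapping for a keyword field treated as an unordered set (hypothetical helper)."""
    return {name: {"type": "keyword", "synthetic_source_keep": "none"}}


# "tags" is a made-up field name; any array field with set semantics would do.
mapping = {"mappings": {"properties": keyword_set_field("tags")}}
print(json.dumps(mapping, indent=2))
```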

Support for this mapping parameter is in progress in Elasticsearch and will first be available in 8.16. But before integrations can use it, package-spec and Fleet need to be updated to allow it in fields.yml. Until those things happen, there won't be anything to do, so let's close this. We'll revisit making the optimizations once support is added in the stack.

https://github.com/elastic/ecs/issues/2376

elastic / integrations

[Pause] Identify integration fields lists that rely on ordering and duplication #10900