elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Cloud Posture] Deprecate Elasticsearch transform in Cloud security posture plugin #153875

Closed opauloh closed 1 year ago

opauloh commented 1 year ago

Summary

This is a proposal to deprecate the use of the Elasticsearch transform in the Cloud security posture plugin. Currently, the transform is used to generate the latest finding for each resource.id + rule.id pair, which is then stored in the logs-cloud_security_posture.findings_latest-* index.

However, the use of transforms adds a layer of complexity to test and maintain. In addition, we have been facing issues where the transform doesn't recover on its own when upgrading Elastic Stack versions.

Our transform has a max_age of 26h, with resource.id and rule.id as unique keys. However, we can achieve the same results by querying the logs-cloud_security_posture.findings-* index directly, using an @timestamp filter plus an aggregation that groups findings by resource.id and rule.id and retrieves the latest finding for each group.
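
As a rough sketch, the transform's output could be reproduced with a query along these lines (the multi_terms grouping is one option; a composite aggregation or a single combined field would work as well, so treat the exact shape as an assumption rather than the final design):

  {
    size: 0,
    query: {
      bool: {
        filter: [{ range: { '@timestamp': { gte: 'now-26h' } } }],
      },
    },
    aggs: {
      by_resource_and_rule: {
        // Group by the resource.id / rule.id pair.
        multi_terms: {
          terms: [{ field: 'resource.id' }, { field: 'rule.id' }],
        },
        aggs: {
          // Keep only the most recent finding per group.
          latest_finding: {
            top_hits: {
              size: 1,
              sort: [{ '@timestamp': 'desc' }],
            },
          },
        },
      },
    },
  }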

Benefits:

Approach

There's one solution that wasn't explored yet: using a single hash field for rule.id + resource.id combined with the collapse option in Elasticsearch queries. The benefit of collapse is that, unlike grouping via aggregations, it doesn't affect sorting or other aggregations, so we get no regression in the experience we provide in the dashboard or the findings table.

Back in the AWP team, we used collapse in the session viewer plugin, as can be seen here, to aggregate Linux events by unique session, and it was a performant solution that worked properly even with millions of records in the index.

The only caveat is that collapse works on a single field, which is why we would need a new field to group on (in this case, a hash of rule.id + resource.id).

Suggestion: we can use event.code as the unique field.
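
For illustration, a minimal sketch of how such a field could be computed before indexing; the helper names are hypothetical, and both the plain concatenation from the task list below and a hashed variant are shown:

  // Hypothetical helpers, not actual Cloudbeat code.
  import { createHash } from 'crypto';

  // Plain concatenation, matching event.code = <resource.id>_<rule.id> below:
  const eventCode = (resourceId: string, ruleId: string): string =>
    `${resourceId}_${ruleId}`;

  // Fixed-length alternative: a hash of the same pair.
  const hashedEventCode = (resourceId: string, ruleId: string): string =>
    createHash('sha256').update(`${resourceId}_${ruleId}`).digest('hex');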

### Tasks
- [x] The misconfigurations table works well with all functionalities
- [x] The misconfigurations grouped by resource table works well with all functionalities
- [x] The misconfigurations resource internal table works well with all functionalities
- [ ] CSPM Dashboard works well with all functionalities
- [ ] KSPM Dashboard works well with all functionalities
- [ ] Performance test with big data (100k findings)
- [ ] Cloudbeat sends event.code to the findings index (event.code = <resource.id>_<rule.id>)
- [ ] Add mappings for the new event.code field
- [ ] Update FTR Tests
- [ ] Update Unit Tests
- [ ] Start a deprecation plan
elasticmachine commented 1 year ago

Pinging @elastic/kibana-cloud-security-posture (Team:Cloud Security)

CohenIdo commented 1 year ago

Hey @opauloh, we had a similar discussion recently; please go over the following issues:

opauloh commented 1 year ago

> Hey @opauloh, we had a similar discussion recently; please go over the following issues:

Thanks @CohenIdo. After reading each issue carefully, I identified one solution that wasn't explored yet: using a single hash field for rule.id + resource.id combined with the collapse option in Elasticsearch queries. The benefit of collapse is that, unlike grouping via aggregations, it doesn't affect sorting or other aggregations, so we get no regression in the experience we provide in the dashboard or the findings table.

Back in the AWP team, we used collapse in the session viewer plugin, as can be seen here, to aggregate Linux events by unique session, and it was a performant solution that worked properly even with millions of records in the index.

The only caveat is that collapse works on a single field, which is why we would need a new field to group on (in this case, a hash of rule.id + resource.id).

kfirpeled commented 1 year ago

@opauloh when working on this I would say we have 3 big components to examine besides the happy flow:

  1. Grouping by resource
  2. The Dashboard
  3. Score calculation

@CohenIdo, @JordanSh am I missing something in regard to this task?

eyalkraft commented 1 year ago

Very interesting, Paulo! Deprecating the transforms will indeed have a great benefit in terms of reduced complexity of our solution. It could help us with namespaces, for example.

Can't wait to see the results!

eyalkraft commented 1 year ago

Depending on when we ship this, it could solve a problem we have with ILMs on serverless.

kfirpeled commented 1 year ago

@opauloh can we also track backporting event.code creation to previous packages?

opauloh commented 1 year ago

I'm closing this ticket since we conducted a POC using collapse to query data directly from the data stream index and found a few issues with this approach.

Summary of our learnings:

The collapse API works great for tables, as it can collapse data by an identifier key:

Before collapse:

[screenshot: query results before collapse, with duplicated records per key]

After collapse:

  collapse: {
    field: 'event.code',
    inner_hits: {
      name: 'latest_result_evaluation',
      size: 1,
      sort: [{ '@timestamp': 'desc' }],
    },
  },
[screenshot: query results after collapse]
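
For context, that collapse option sits inside an otherwise ordinary search request. A minimal sketch of the full body (index name taken from this issue; the 26h range filter is an assumption):

  {
    index: 'logs-cloud_security_posture.findings-*',
    query: {
      bool: {
        filter: [{ range: { '@timestamp': { gte: 'now-26h' } } }],
      },
    },
    // Deduplicate hits by the unique key, keeping the latest doc per group.
    collapse: {
      field: 'event.code',
      inner_hits: {
        name: 'latest_result_evaluation',
        size: 1,
        sort: [{ '@timestamp': 'desc' }],
      },
    },
    sort: [{ '@timestamp': 'desc' }],
  }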

However, two problems were found:

Issue 1: Limit of aggregated data for dashboards and grouped table:

In order to have our Dashboard show the correct information, we need to perform an aggregation on the identifier key, and then a sub-aggregation on the top_hits of the latest event:

  unique_event_code: {
    terms: {
      field: 'event.code',
      size: 65000,
    },
    aggs: {
      latest_result_evaluation: {
        top_hits: {
          _source: ['result.evaluation'],
          size: 1,
          sort: [{ '@timestamp': 'desc' }],
        },
      },
    },
  },

That query uses a time range filter of now - 26 hours. However, here we hit the 65k-bucket limit on the first aggregation for event.code, and since we also need the latest hits for the dashboard to calculate the correct number of failed findings, we are limited to 65k findings in total (counting duplicated records).

This means that when we attempted to insert 70k findings records (with 51k unique findings), the ungrouped table worked as expected using collapse:

[screenshot: ungrouped findings table working with 70k records]

But the dashboard and the grouped-by-resource table didn't work:

[screenshot: dashboard and grouped-by-resource table not working]

Throwing the following error in the logs:

too_many_buckets_exception: Trying to create too many buckets. Must be less than or equal to: [65536] but this number of buckets was exceeded. This limit can be set by changing the [search.max_buckets] cluster level setting.
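
As an aside, the limit in that error is the search.max_buckets cluster setting. It can be raised, e.g. via the JS client (assuming a client instance named client), but that only trades the hard error for more memory pressure on the coordinating node, so it is not a real fix:

  // Sketch only: raising the bucket cap shifts the problem, it doesn't solve it.
  await client.cluster.putSettings({
    persistent: { 'search.max_buckets': 100000 },
  });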

Issue 2: filtering by result.evaluation would not guarantee showing the most up-to-date data:

If there are multiple findings that were remediated, or that moved from a passed state to a failed state, adding a filter could show stale data.

Example: The most up-to-date finding for this unique key is failed:

[screenshot: most recent finding for the unique key, with result.evaluation: failed]

But when filtering for result.evaluation: passed, since the query now filters out the failed findings, it incorrectly shows an older record for the same key with a passed result:

[screenshot: older passed finding returned when filtering by result.evaluation: passed]
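
Roughly, the failing combination looks like this (a sketch, not the exact POC code). Because the term filter runs before collapse, the newer failed document is excluded up front, and collapse then returns a stale passed document as the group's "latest":

  {
    query: {
      bool: {
        filter: [
          { range: { '@timestamp': { gte: 'now-26h' } } },
          // Removes the newer `failed` doc for this event.code, so the
          // stale `passed` doc becomes the newest remaining hit.
          { term: { 'result.evaluation': 'passed' } },
        ],
      },
    },
    collapse: {
      field: 'event.code',
      inner_hits: {
        name: 'latest_result_evaluation',
        size: 1,
        sort: [{ '@timestamp': 'desc' }],
      },
    },
  }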

Conclusion

These two issues are a showstopper for moving forward with the collapse approach. Even if we could come up with a solution for problem number 2, telemetry data already tells us that a limit of 65k findings in a 26-hour time range won't work for some users, and it would also prevent future enhancements such as adding more grouped-by visualizations to the findings.

The final conclusion is that, with the current model, we don't have a way to query directly from the data stream index without hitting memory limits for large data sets. This means the use of transforms is currently the best approach.

The code used during the attempt is in this PR, which is now closed.

kfirpeled commented 1 year ago

Thank you @opauloh for taking the time to summarize your conclusions!