elastic / integrations


[Amazon Security Lake] - OCSF v1.1 and how to move forward #11276

Open ShourieG opened 1 week ago

ShourieG commented 1 week ago

Current Scenario: For the last couple of months we've struggled to incorporate OCSF mappings into our traditional integration pipeline due to the nature of the OCSF schema and how deeply nested the OCSF mappings can get. We have limited fields available on our end at the moment, with the field limit per data stream being 2048. We have heavily refactored the whole integration, separated mappings into individual units for better maintainability, and added dynamic mappings to help out with explicit field mappings in some scenarios, but this is not enough to make the integration sustainable going into the future. If we choose to stay up to date with newer OCSF versions, the direction will probably need to shift.

Main issues:

  1. OCSF is highly nested, and if we go even to depth level 3 in v1.1 there's a mapping explosion (a toy counting sketch follows this list):
        all attributes, level 1: 345 fields
        all attributes, levels 1+2: 1563 fields
        all attributes, levels 1+2+3: 4266 fields

We can easily maintain mappings up to level 2, but beyond that it becomes increasingly difficult.

  2. The OCSF schema is growing fast as new fields, classes and entirely new categories are added. This kind of growth seems tricky to keep pace with if we stick to the traditional integration philosophy and are unable to remove some of the mapping limitations we currently have.

  3. The initial build of the integration did not follow a specific flattening rule, i.e. some objects have highly nested mappings while others flatten out at levels 1-2. This makes it difficult to follow a uniform level-based mapping, as we would need to break the integration to move to a more maintainable level 2 mapping. It also costs users some quality of life, since flattening objects makes the integration less desirable.
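
For illustration, here is a toy Python sketch of the kind of counting behind the numbers in point 1. The nested attribute dict is made up; the real OCSF schema is far larger, which is how the counts jump from 345 to 4266 between levels 1 and 3.

# Toy sketch: count leaf fields of a nested attribute tree up to a given depth.
# The schema fragment below is invented for illustration, not real OCSF.
schema = {
    "actor": {"user": {"name": {}, "email_addr": {}}, "process": {"name": {}, "pid": {}}},
    "severity_id": {},
    "metadata": {"product": {"name": {}, "vendor_name": {}}, "version": {}},
}

def count_fields(attrs, max_depth, depth=1):
    """Count leaf fields, ignoring anything nested deeper than max_depth."""
    if depth > max_depth:
        return 0
    total = 0
    for children in attrs.values():
        if children:  # object attribute: recurse into its children
            total += count_fields(children, max_depth, depth + 1)
        else:         # scalar attribute: one concrete field in the mapping
            total += 1
    return total

for level in (1, 2, 3):
    print(f"levels 1..{level}: {count_fields(schema, level)} fields")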

Steps we can take: We need to decide whether we want to carry on supporting this integration as is, or take a different approach altogether.

Some things that come to mind:

  1. Flatten every object to level 2 and live with the breaking-change consequences.
  2. Use a different approach to mapping, i.e. somehow support references in yml files so mappings are built off a defined schema structure that can be updated as the OCSF schema changes.
  3. Create something at the input level that uses the OCSF schema directly in the backend to produce mappings on the fly and output them to Elasticsearch (I don't know whether this is possible; a rough sketch follows below).
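
For option 3, a very rough sketch of the idea, assuming the OCSF class schema is available as a nested dict at runtime (the schema fragment, the default types and the level-2 cut-off are assumptions, not a worked-out design). The generated body could then be installed as a component template or index mapping:

import json

# Hypothetical fragment of an OCSF class schema; {} marks a scalar attribute.
ocsf_class = {
    "activity_id": {},
    "actor": {"user": {"name": {}, "uid": {}}},
    "metadata": {"version": {}},
}

def to_es_properties(attrs, depth=1, max_depth=2):
    """Translate schema attributes into an Elasticsearch properties block,
    mapping anything deeper than max_depth as a single flattened field."""
    props = {}
    for name, children in attrs.items():
        if not children:
            props[name] = {"type": "keyword"}    # naive default type for the sketch
        elif depth >= max_depth:
            props[name] = {"type": "flattened"}  # stop descending here
        else:
            props[name] = {"properties": to_es_properties(children, depth + 1, max_depth)}
    return props

mapping = {"mappings": {"properties": to_es_properties(ocsf_class)}}
# Could be wrapped as {"template": mapping} and applied via the _component_template API.
print(json.dumps(mapping, indent=2))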

I'm open to suggestions and would like to hear your thoughts on this.

Related Issues:

  1. https://github.com/elastic/integrations/pull/10405
  2. https://github.com/elastic/elastic-package/issues/1917
  3. https://github.com/elastic/kibana/issues/187951
  4. https://github.com/elastic/package-spec/issues/746
  5. https://github.com/elastic/package-spec/pull/294
elasticmachine commented 1 week ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

elasticmachine commented 1 week ago

Pinging @elastic/fleet (Team:Fleet)

chrisberkhout commented 1 week ago

The OCSF schema allows an infinite number of fields if you include all optional fields and go to infinite depth (this chart combines fields from all categories and profiles):

[Chart: combined field counts across all categories and profiles]

But when a specific source generates an OCSF document it will have a finite, much smaller number of fields.

It's the optional attributes that add depth. With required attributes only, I think you stop at 101 fields.

The good option: map all fields at levels 1 and 2, explicitly map any further required fields, and map optional fields beyond that as flattened. I think it's better to transition to this even if there's no smooth migration path from the current mappings. That's decent for a generic solution. If deeper fields are useful, then I think the documents should be re-indexed with mappings that are specific to the source.

A more ambitious option: try to use dynamic templates to match OCSF fields regardless of where they are nested. I'm not sure it's possible to write match conditions that correctly match all the OCSF fields. Even if it is, there's still the issue that indexing a wide range of document types may exceed the maximum number of fields for an index. So ideally the dynamic templates would cover all fields as necessary, but documents from different sources would be directed to different indexes, so that the number of actual fields per index stays limited.
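
A minimal sketch of what these two options could look like in a single mapping; the level-2 cut-off, the path pattern, and the example fields are assumptions for illustration, not a tested proposal:

mapping = {
    "dynamic_templates": [
        {
            # Ambitious option: any object that first appears three or more
            # levels deep (a dotted path with at least two dots) is stored as
            # a single flattened field instead of exploding into subfields.
            "deep_ocsf_objects": {
                "path_match": "*.*.*",
                "match_mapping_type": "object",
                "mapping": {"type": "flattened"},
            }
        }
    ],
    "properties": {
        # Good option: levels 1 and 2 (plus required deeper fields) stay
        # explicitly mapped as usual.
        "severity_id": {"type": "integer"},
        "actor": {"properties": {"user": {"properties": {"name": {"type": "keyword"}}}}},
    },
}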

ShourieG commented 1 week ago

The concern I have with "the good option" is that users will keep asking for more depth, because a lot of critical info is buried within those deeper levels. The schema also keeps expanding with every update; 1.3 added new categories and a bunch of new fields, so at some point we will hit the field limit even with level 2 mappings.

andrewkroh commented 1 week ago

I think we need to begin with some schema analysis to better understand the problem before we try to solve it.

I strongly believe that tooling should be used to create both the mappings and the pipelines. The tooling should codify any decisions we make. It's impractical to spend time building fields.yml files by hand, and then spend even more time reviewing them. Our time would be better spent building and reviewing the tooling. Each successive OCSF update gets much easier to apply if there is automation around it.

IIUC, the events we want to represent correspond to OCSF classes. OCSF conveniently has JSON Schema representations of these classes available, which likely makes analysis easier since it is a standardized format.
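
As a sketch of the kind of tooling meant here, assuming a heavily simplified JSON Schema for one class (the schema fragment and the type mapping are placeholders, not the real OCSF artifacts):

import json
import yaml  # PyYAML

# Hypothetical, heavily simplified JSON Schema for one OCSF class.
class_schema = json.loads("""
{
  "properties": {
    "activity_id": {"type": "integer"},
    "actor": {
      "type": "object",
      "properties": {
        "user": {"type": "object", "properties": {"name": {"type": "string"}}}
      }
    }
  }
}
""")

# Placeholder JSON Schema -> Elasticsearch type mapping.
TYPE_MAP = {"string": "keyword", "integer": "long", "number": "double", "boolean": "boolean"}

def to_fields(props):
    """Emit fields.yml-style entries, using 'group' for nested objects."""
    fields = []
    for name, spec in props.items():
        if spec.get("type") == "object" and "properties" in spec:
            fields.append({"name": name, "type": "group", "fields": to_fields(spec["properties"])})
        else:
            fields.append({"name": name, "type": TYPE_MAP.get(spec.get("type"), "keyword")})
    return fields

print(yaml.safe_dump(to_fields(class_schema["properties"]), sort_keys=False))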

The schema allows for circular references (e.g. a user has a manager, and the manager is a user; or a process has a parent_process, which is a process). When we translate this into concrete fields we get a tree of infinite depth, so we need to place limits on depth to control the number of fields in our mappings. This may not have much impact in practice, depending on whether real events actually have a high level of nesting. When we reach these cycles we could choose not to index the data beyond a certain level; if a user wanted to override this decision they could apply an @custom mapping to map the specific fields they need.
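
A small sketch of that cycle handling, assuming object types are resolvable by name from a dictionary (the two object definitions are made up):

# Walk object types and cut the traversal when a type recurs on the current path.
OBJECTS = {
    "user": {"name": "str", "manager": "user"},               # user -> manager -> user -> ...
    "process": {"name": "str", "parent_process": "process"},  # process -> parent_process -> ...
}

def leaf_fields(obj_name, prefix="", seen=()):
    """Return dotted leaf paths, stopping where a type already appears on the path."""
    if obj_name in seen:
        return []  # cycle: leave unmapped; users could add it back via @custom
    paths = []
    for attr, attr_type in OBJECTS[obj_name].items():
        dotted = prefix + attr
        if attr_type in OBJECTS:
            paths.extend(leaf_fields(attr_type, dotted + ".", seen + (obj_name,)))
        else:
            paths.append(dotted)
    return paths

print(leaf_fields("process"))  # ['name'] -- parent_process is cut at the cycle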

Moving the data stream to use subobjects: false will reduce the overall Elasticsearch field count by a small amount, because intermediate object fields will no longer exist. See https://github.com/elastic/integrations/blob/main/docs/subobjects_adoption_guide.md.
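
A minimal before/after sketch of that difference (the field path is illustrative):

# Default (subobjects: true): intermediate objects get their own mapping entries:
#   actor           -> object
#   actor.user      -> object
#   actor.user.name -> keyword   (3 entries)
#
# With subobjects: false, only the leaf exists, keyed by its dotted name:
mapping = {
    "mappings": {
        "subobjects": False,
        "properties": {
            "actor.user.name": {"type": "keyword"},  # 1 entry, no intermediate objects
        },
    }
}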

I'd like to see data on:

And I would like to see the code behind the analysis just so we can verify it.

With that data I think we would be in a better position to try to answer how we should structure data streams and the mappings to use OCSF in Elasticsearch.

chrisberkhout commented 6 days ago

I started looking at some similar data.

It was convenient to work with data from https://github.com/ocsf/ocsf-lib-py, since that has versions of the schema merged into a single file.

The script I wrote is here.

It generates simplified representations of the schema down to a given depth. For example, checking field counts for the Authentication class at depths 1-10:

# count leaf field paths for the authentication class at each depth
for depth in $(seq 1 10); do
  ./depth.py --by-class --all-fields-to-depth "$depth" \
    | jq -S .authentication \
    | jq 'paths|join(".")' \
    | wc -l
done
andrewkroh commented 2 days ago

Thanks for sharing, @chrisberkhout.

I also wanted to delve into the schema to gain a better understanding. I took a different approach to 'depth' and examined where circular references occurred, stopping my traversal at these cycles (like process.parent_process). My goal was to determine what the leaf field count would be if each OCSF class were separated into its own index. Additionally, I aimed to verify that there are no conflicting field naming conventions between classes. The analysis and code are at https://github.com/andrewkroh/go-examples/tree/main/ocsf.
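
The conflict check boils down to something like the following Python sketch (the per-class field lists are made up, and this is not the linked Go code):

from collections import defaultdict

# Made-up leaf fields per OCSF class; a conflict is the same path with two different types.
class_fields = {
    "authentication": {"severity_id": "long", "user.name": "keyword"},
    "file_activity":  {"severity_id": "long", "user.name": "keyword", "file.size": "long"},
}

types_seen = defaultdict(set)
for fields in class_fields.values():
    for path, es_type in fields.items():
        types_seen[path].add(es_type)

conflicts = {path: types for path, types in types_seen.items() if len(types) > 1}
print(conflicts or "no conflicting field definitions across classes")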