elastic / integrations


[Amazon Security Lake] - OCSF v1.1 and how to move forward #11276

Open ShourieG opened 1 week ago

ShourieG commented 1 week ago

Current Scenario: For the last couple of months we've struggled to incorporate OCSF mappings into our traditional integration pipeline due to the nature of the OCSF schema and how deeply nested the OCSF mappings can get. We have limited fields available on our end at the moment, with the field limit per data stream being 2048. We have heavily refactored the whole integration, separated mappings into individual units for better maintainability, and added dynamic mappings to help out with explicit field mappings in some scenarios, but this is not enough to make the integration sustainable going into the future. If we choose to stay up to date with newer OCSF versions, the direction will probably need to shift.

Main issues:

  1. OCSF is highly nested, and if we go even to depth level 3 in v1.1 there's a mapping explosion (a toy counting sketch follows this list):
        all attributes, level 1: 345 fields
        all attributes, levels 1+2: 1563 fields
        all attributes, levels 1+2+3: 4266 fields

We can easily maintain mappings up to level 2, but beyond that it becomes increasingly difficult.

  2. The OCSF schema is growing fast as new fields, classes and entirely new categories are added. This kind of growth seems tricky to keep pace with if we stick to the traditional integration philosophy and are unable to remove some of the mapping limitations we currently have.

  3. The initial build of the integration did not follow a specific flattening rule, i.e. some objects have highly nested mappings while others flatten out at levels 1-2. This makes it difficult to follow a uniform level-based mapping, as we would need to break the integration to move to a more maintainable level 2 mapping. It also costs users some quality of life, since flattening objects makes the integration less desirable.
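
For illustration, here is a toy Python sketch of the kind of counting behind the numbers in point 1. The nested attribute dict is made up; the real OCSF schema is far larger, which is how the counts jump from 345 to 4266 between levels 1 and 3.

# Toy sketch: count leaf fields of a nested attribute tree up to a given depth.
# The schema fragment below is invented for illustration, not real OCSF.
schema = {
    "actor": {"user": {"name": {}, "email_addr": {}}, "process": {"name": {}, "pid": {}}},
    "severity_id": {},
    "metadata": {"product": {"name": {}, "vendor_name": {}}, "version": {}},
}

def count_fields(attrs, max_depth, depth=1):
    """Count leaf fields, ignoring anything nested deeper than max_depth."""
    if depth > max_depth:
        return 0
    total = 0
    for children in attrs.values():
        if children:  # object attribute: recurse into its children
            total += count_fields(children, max_depth, depth + 1)
        else:         # scalar attribute: one concrete field in the mapping
            total += 1
    return total

for level in (1, 2, 3):
    print(f"levels 1..{level}: {count_fields(schema, level)} fields")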

Steps we can take: We need to decide whether we want to carry on supporting this integration as is, or take a different approach altogether.

Some things that come to mind:

  1. Flatten every object to level 2 and live with the breaking-change consequences.
  2. Use a different approach to mapping, i.e. somehow support references in yml files so mappings are built off a defined schema structure that can be updated as the OCSF schema changes.
  3. Create something at the input level that uses the OCSF schema directly in the backend to produce mappings on the fly and output them to Elasticsearch (I don't know whether this is possible; a rough sketch follows below).
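
For option 3, a very rough sketch of the idea, assuming the OCSF class schema is available as a nested dict at runtime (the schema fragment, the default types and the level-2 cut-off are assumptions, not a worked-out design). The generated body could then be installed as a component template or index mapping:

import json

# Hypothetical fragment of an OCSF class schema; {} marks a scalar attribute.
ocsf_class = {
    "activity_id": {},
    "actor": {"user": {"name": {}, "uid": {}}},
    "metadata": {"version": {}},
}

def to_es_properties(attrs, depth=1, max_depth=2):
    """Translate schema attributes into an Elasticsearch properties block,
    mapping anything deeper than max_depth as a single flattened field."""
    props = {}
    for name, children in attrs.items():
        if not children:
            props[name] = {"type": "keyword"}    # naive default type for the sketch
        elif depth >= max_depth:
            props[name] = {"type": "flattened"}  # stop descending here
        else:
            props[name] = {"properties": to_es_properties(children, depth + 1, max_depth)}
    return props

mapping = {"mappings": {"properties": to_es_properties(ocsf_class)}}
# Could be wrapped as {"template": mapping} and applied via the _component_template API.
print(json.dumps(mapping, indent=2))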

I'm open to suggestions and would like to hear your thoughts on this.

Related Issues:

  1. https://github.com/elastic/integrations/pull/10405
  2. https://github.com/elastic/elastic-package/issues/1917
  3. https://github.com/elastic/kibana/issues/187951
  4. https://github.com/elastic/package-spec/issues/746
  5. https://github.com/elastic/package-spec/pull/294
elasticmachine commented 1 week ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

elasticmachine commented 1 week ago

Pinging @elastic/fleet (Team:Fleet)

chrisberkhout commented 1 week ago

The OCSF schema allows an infinite number of fields if you include all optional fields and go to infinite depth (this chart combines fields from all categories and profiles):

[Chart: combined field counts across all categories and profiles]

But when a specific source generates an OCSF document it will have a finite, much smaller number of fields.

It's the optional attributes that add depth. With required attributes only, I think you stop at 101 fields.

The good option: map all fields at levels 1 and 2, explicitly map any further required fields, and map optional fields beyond that as flattened. I think it's better to transition to this even if there's no smooth migration path from the current mappings. That's decent for a generic solution. If deeper fields are useful, then I think the documents should be re-indexed with mappings that are specific to the source.

A more ambitious option: try to use dynamic templates to match OCSF fields regardless of where they are nested. I'm not sure it's possible to write match conditions that correctly match all the OCSF fields. Even if it is, there's still the issue that indexing a wide range of document types may exceed the maximum number of fields for an index. So ideally the dynamic templates would cover all fields as necessary, but documents from different sources would be directed to different indexes, so that the number of actual fields per index stays limited.
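
A minimal sketch of what these two options could look like in a single mapping; the level-2 cut-off, the path pattern, and the example fields are assumptions for illustration, not a tested proposal:

mapping = {
    "dynamic_templates": [
        {
            # Ambitious option: any object that first appears three or more
            # levels deep (a dotted path with at least two dots) is stored as
            # a single flattened field instead of exploding into subfields.
            "deep_ocsf_objects": {
                "path_match": "*.*.*",
                "match_mapping_type": "object",
                "mapping": {"type": "flattened"},
            }
        }
    ],
    "properties": {
        # Good option: levels 1 and 2 (plus required deeper fields) stay
        # explicitly mapped as usual.
        "severity_id": {"type": "integer"},
        "actor": {"properties": {"user": {"properties": {"name": {"type": "keyword"}}}}},
    },
}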

ShourieG commented 1 week ago

The concern I have with "the good option" is that users will keep asking for more depth, because a lot of critical info is buried within those deeper levels. The schema also keeps expanding with every update; 1.3 added new categories and a bunch of new fields, so at some point we will hit the field limit even with level 2 mappings.

andrewkroh commented 1 week ago

I think we need to begin with some schema analysis to better understand the problem before we try to solve it.

I strongly believe that tooling should be used to create both the mappings and the pipelines. The tooling should codify any decisions we make. It's impractical to spend time building fields.yml files by hand, and then spend even more time reviewing them. Our time would be better spent building and reviewing the tooling. Each successive OCSF update gets much easier to apply if there is automation around it.

IIUC, the events we want to represent correspond to OCSF classes. OCSF conveniently has JSON Schema representations of these classes available, which likely makes analysis easier since it is a standardized format.
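
As a sketch of the kind of tooling meant here, assuming a heavily simplified JSON Schema for one class (the schema fragment and the type mapping are placeholders, not the real OCSF artifacts):

import json
import yaml  # PyYAML

# Hypothetical, heavily simplified JSON Schema for one OCSF class.
class_schema = json.loads("""
{
  "properties": {
    "activity_id": {"type": "integer"},
    "actor": {
      "type": "object",
      "properties": {
        "user": {"type": "object", "properties": {"name": {"type": "string"}}}
      }
    }
  }
}
""")

# Placeholder JSON Schema -> Elasticsearch type mapping.
TYPE_MAP = {"string": "keyword", "integer": "long", "number": "double", "boolean": "boolean"}

def to_fields(props):
    """Emit fields.yml-style entries, using 'group' for nested objects."""
    fields = []
    for name, spec in props.items():
        if spec.get("type") == "object" and "properties" in spec:
            fields.append({"name": name, "type": "group", "fields": to_fields(spec["properties"])})
        else:
            fields.append({"name": name, "type": TYPE_MAP.get(spec.get("type"), "keyword")})
    return fields

print(yaml.safe_dump(to_fields(class_schema["properties"]), sort_keys=False))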

The schema allows for circular references (e.g. a user has a manager, and the manager is a user; or a process has a parent_process, which is a process). When we translate this into concrete fields we get a tree of infinite depth, so we need to place limits on depth to control the number of fields in our mappings. This may not have much impact in practice, depending on whether real events actually have a high level of nesting. When we reach these cycles we could choose not to index the data beyond a certain level; if a user wanted to override this decision they could apply an @custom mapping to map the specific fields they need.
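
A small sketch of that cycle handling, assuming object types are resolvable by name from a dictionary (the two object definitions are made up):

# Walk object types and cut the traversal when a type recurs on the current path.
OBJECTS = {
    "user": {"name": "str", "manager": "user"},               # user -> manager -> user -> ...
    "process": {"name": "str", "parent_process": "process"},  # process -> parent_process -> ...
}

def leaf_fields(obj_name, prefix="", seen=()):
    """Return dotted leaf paths, stopping where a type already appears on the path."""
    if obj_name in seen:
        return []  # cycle: leave unmapped; users could add it back via @custom
    paths = []
    for attr, attr_type in OBJECTS[obj_name].items():
        dotted = prefix + attr
        if attr_type in OBJECTS:
            paths.extend(leaf_fields(attr_type, dotted + ".", seen + (obj_name,)))
        else:
            paths.append(dotted)
    return paths

print(leaf_fields("process"))  # ['name'] -- parent_process is cut at the cycle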

Moving the data stream to use subobjects: false will reduce the overall Elasticsearch field count by a small amount, because intermediate object fields will no longer exist. See https://github.com/elastic/integrations/blob/main/docs/subobjects_adoption_guide.md.
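
A minimal before/after sketch of that difference (the field path is illustrative):

# Default (subobjects: true): intermediate objects get their own mapping entries:
#   actor           -> object
#   actor.user      -> object
#   actor.user.name -> keyword   (3 entries)
#
# With subobjects: false, only the leaf exists, keyed by its dotted name:
mapping = {
    "mappings": {
        "subobjects": False,
        "properties": {
            "actor.user.name": {"type": "keyword"},  # 1 entry, no intermediate objects
        },
    }
}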

I'd like to see data on:

And I would like to see the code behind the analysis just so we can verify it.

With that data I think we would be in a better position to try to answer how we should structure data streams and the mappings to use OCSF in Elasticsearch.

chrisberkhout commented 6 days ago

I started looking at some similar data.

It was convenient to work with data from https://github.com/ocsf/ocsf-lib-py, since that has versions of the schema merged into a single file.

The script I wrote is here.

It generates simplified representations of the schema down to a given depth. For example, checking field counts for the Authentication class at depths 1-10:

# count leaf field paths for the authentication class at each depth
for depth in $(seq 1 10); do
  ./depth.py --by-class --all-fields-to-depth "$depth" \
    | jq -S .authentication \
    | jq 'paths|join(".")' \
    | wc -l
done
andrewkroh commented 2 days ago

Thanks for sharing, @chrisberkhout.

I also wanted to delve into the schema to gain a better understanding. I took a different approach to 'depth' and examined where circular references occurred, stopping my traversal at these cycles (like process.parent_process). My goal was to determine what the leaf field count would be if each OCSF class were separated into its own index. Additionally, I aimed to verify that there are no conflicting field naming conventions between classes. The analysis and code are at https://github.com/andrewkroh/go-examples/tree/main/ocsf.
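
The conflict check boils down to something like the following Python sketch (the per-class field lists are made up, and this is not the linked Go code):

from collections import defaultdict

# Made-up leaf fields per OCSF class; a conflict is the same path with two different types.
class_fields = {
    "authentication": {"severity_id": "long", "user.name": "keyword"},
    "file_activity":  {"severity_id": "long", "user.name": "keyword", "file.size": "long"},
}

types_seen = defaultdict(set)
for fields in class_fields.values():
    for path, es_type in fields.items():
        types_seen[path].add(es_type)

conflicts = {path: types for path, types in types_seen.items() if len(types) > 1}
print(conflicts or "no conflicting field definitions across classes")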