elastic / cloudbeat

Analyzing Cloud Security Posture

[Discuss] Cloud resources as Elasticsearch documents #411

Open ari-aviran opened 2 years ago

ari-aviran commented 2 years ago

As we start implementing the CIS for AWS benchmark, we want to look into shipping cloud resources to Elasticsearch prior to evaluation. This will help future endeavors such as vulnerability management and an asset inventory. This task is for discussing how best to do that, what the data will look like, how the indices will be structured, etc.

eyalkraft commented 2 years ago

Some notes following a sync with @ruflin:

There are ongoing efforts to define a schema for a general entity model in Elasticsearch, to be used as the base for an asset inventory. The proposed design is also supposed to take into consideration challenges such as building the entities on top of time-series Elasticsearch capabilities, supporting relations between entities, and allowing for graphing and visualizations. See #entity-model.

The approach we'll probably want to take as a first step is to have cloudbeat ship the resources it collects, in the newly defined schema, to this new index.

Hopefully a first version of this schema will be introduced soon, so we can review it and provide feedback.

Additional thoughts I had regarding the collection side, in the context of a broad Cloud Asset Inventory:

tinnytintin10 commented 1 year ago

@eyalkraft @ari-aviran I had a great meeting with Jason Rhodes on the o11y team, and he has done a lot of work in this space (storing cloud assets as ES documents). It's worth chatting with him so we can share learnings. You can read up on his most recent updates on the asset inventory and topology work he is doing here: https://groups.google.com/a/elastic.co/g/entity-model/c/sPZy9Vyey5E

eyalkraft commented 1 year ago

Next steps for implementation (with some hints for external contributors):

Ship the collected assets (resources) to the logs-assets-*-* index (behind a configuration feature flag; see the config sketch after the references below).

  1. Index name reference
  2. General asset schema definition reference
  3. For specific schemas for some of the cloud resources, dive deeper into the linked repo ^
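
For the feature flag itself, here is a rough sketch of what it could look like in cloudbeat's config. The struct names, field names, and config tags are hypothetical, not the actual cloudbeat configuration:

    // Hypothetical config addition (names and tags are illustrative only).
    type AssetConfig struct {
        // Enabled turns on shipping collected resources to the assets index.
        Enabled bool `config:"enabled"`
    }

    type Config struct {
        // ... existing cloudbeat configuration fields ...

        // Asset gates the experimental asset-collection flow.
        Asset AssetConfig `config:"asset"`
    }

Everything downstream (the extra pipeline step, the asset shipper) would then be gated on config.Asset.Enabled.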

Hints

  1. Index definition (for shipping to logs-assets-*)
  2. Config flags (for the “enable asset collection” flag)
  3. Architecture - gives a good overview of the components in cloudbeat and how the pipeline works.
  4. Data collection happens using the fetchers. I think what I’d do for the asset event-shipping is copy the Transformer/Evaluator as a baseline, and then, based on the config flag, add the new AssetShipper to our pipeline before the evaluator (because afterwards every resource might create multiple events).
  5. How to develop cloudbeat - the README isn’t enough currently, but it mentions Hermit, which we use for installing dependencies (hermit install). Then go build builds the binary, make PackageAgent builds the agent with cloudbeat in it, and the Justfile is your friend for more complex tasks (like deploying the agent you built).
  6. Fetching asset data and applying the data structure is done on a per-resource basis in each fetcher (for example, the file resource). The functions responsible are GetData, GetElasticCommonData, and GetMetadata, which are then used in a few different places: raw data for policy evaluation, ECS data when building the event, and metadata in event building, policy evaluation, and more [1] [2]. A simplified sketch follows after this list.
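
To make the last point concrete, here is a simplified sketch of how a resource might expose those functions. The types and signatures below are approximations, not cloudbeat's real interface, which lives in the fetching package:

    // Simplified sketch of a fetcher resource. Types and signatures are
    // approximations; the real interface lives in cloudbeat's fetching package.
    type ResourceMetadata struct {
        ID   string
        Type string
        Name string
    }

    type FileResource struct {
        Path string
        Mode string
    }

    // GetData returns the raw resource handed to policy evaluation.
    func (r FileResource) GetData() any {
        return r
    }

    // GetElasticCommonData returns the ECS fields added when the event is built.
    func (r FileResource) GetElasticCommonData() (map[string]any, error) {
        return map[string]any{"file.path": r.Path, "file.mode": r.Mode}, nil
    }

    // GetMetadata identifies the resource and is reused in event building and
    // policy evaluation.
    func (r FileResource) GetMetadata() (ResourceMetadata, error) {
        return ResourceMetadata{ID: r.Path, Type: "file", Name: r.Path}, nil
    }
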
ruflin commented 1 year ago

This means with the above, we would end up with something similar to (not valid code):

    // Creating the data pipeline
    var findingsCh <-chan evaluator.EventData // channel type is illustrative
    if config.Asset.Enabled {
        assetsCh := pipeline.Step(bt.log, bt.resourceCh, bt.assetsEvaluator.Eval)
        findingsCh = pipeline.Step(bt.log, assetsCh, bt.evaluator.Eval)
    } else {
        findingsCh = pipeline.Step(bt.log, bt.resourceCh, bt.evaluator.Eval)
    }
    eventsCh := pipeline.Step(bt.log, findingsCh, bt.transformer.CreateBeatEvents)

And in CreateBeatEvents we could check for the asset data, or even have a separate channel to publish the data.

A nicer way of doing it would be to duplicate all the events in bt.resourceCh and then let it go through both pipeline steps separately.
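
A rough sketch of what that duplication could look like, assuming a small fan-out helper; fanOut and the assetShipper.Ship method are hypothetical names, not existing cloudbeat code:

    // fanOut duplicates every item onto two channels so the findings pipeline
    // and the asset pipeline consume independent copies.
    // Hypothetical helper, not part of cloudbeat's pipeline package.
    func fanOut[T any](in <-chan T) (<-chan T, <-chan T) {
        a := make(chan T)
        b := make(chan T)
        go func() {
            defer close(a)
            defer close(b)
            for v := range in {
                a <- v
                b <- v
            }
        }()
        return a, b
    }

    // Wiring sketch: each branch gets its own copy of the resources.
    findingsIn, assetsIn := fanOut(bt.resourceCh)
    findingsCh := pipeline.Step(bt.log, findingsIn, bt.evaluator.Eval)
    assetsCh := pipeline.Step(bt.log, assetsIn, bt.assetShipper.Ship) // Ship is a placeholder name

One thing to keep in mind with an unbuffered fan-out like this: a slow consumer on either branch backpressures the other, so some buffering or dropping strategy may be needed.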

eyalkraft commented 1 year ago

@ruflin

> A nicer way of doing it would be to duplicate all the events in bt.resourceCh and then let it go through both pipeline steps separately.

True. Actually, now looking deeper, I think it might even be a must, since otherwise we'd have to ignore/handle assets in the evaluator, which couples unrelated responsibilities.

ruflin commented 1 year ago

Based on the above, I did play around with the code a bit. Here is the commit: https://github.com/elastic/cloudbeat/commit/f037dc17ae2a2e9cf483aeb84678f105f8d12554. There are many things I don't like about it.

Currently it demos how we could get the Node asset out of the channel. But as we hook into a very generic stream of data, we have to cast each entry to figure out what type it is and whether we are interested. As for the assets, they don't have to go through all the pipeline steps; I wonder if we should have an additional flow that pushes them directly to the publisher in the fetcher. There we already know all the types, and we can also add the conversion logic to the assets model.
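
For context, the casting mentioned above looks roughly like this; the Node type and the publishAsset helper are illustrative placeholders, see the linked commit for the actual code:

    // Sketch: every consumer of the generic resource stream has to type-assert
    // to find the entries it cares about (here, the Node asset).
    for res := range resourceCh {
        switch node := res.GetData().(type) {
        case *corev1.Node: // illustrative; corev1 = k8s.io/api/core/v1
            publishAsset(node) // hypothetical conversion + publish to the assets index
        default:
            // not an asset we ship; it continues through the findings pipeline
        }
    }

Publishing from the fetcher instead would avoid this type switching, since the fetcher already knows the concrete type it collected.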