elastic / cloudbeat

Analyzing Cloud Security Posture

[Discuss] Define full side-scanning flow #609

Closed oren-zohar closed 1 year ago

oren-zohar commented 1 year ago

Summary

First approach:

Deploying an Agent on an EC2 instance is the naive way to get started with hosting our vulnerability management on the customer's cloud, assuming we leverage cloudbeat in that process:

Second approach:

Another approach is deploying on EKS:

Deploying an EKS cluster on the customer's cloud instead of an EC2 machine lets us come up with an architecture that works on every K8s cluster (including MKI and other providers such as GKE). It allows us to build a microservice architecture that leverages K8s capabilities such as cron jobs and better scaling. This is also the better fit when using MKI, since a cluster needs to be deployed there anyway.

EC2 Pros/Cons:

EKS Pros/Cons:

amirbenun commented 1 year ago

A possible suggestion for the second approach:

Deploying on a k8s cluster (existing cluster / new cluster / MKI)

Pros:

Cons:

amirbenun commented 1 year ago

My recommendation is to go with the first, more naive option of deploying on an EC2 instance. Deploying on a K8s cluster is only possible if we have one of the following:

  1. Full control of the customer K8s cluster (K8s operator).
  2. Ability to deploy customer resources in our cloud services (MKI).

A few open questions regarding the flow to be discussed with the product:

  1. Does an EC2 instance scan a single region or multiple regions? Is it configurable?
  2. What do we do with the snapshot after the scan? Deleting a resource from a customer's cloud is a risky operation; we could archive it instead, or leave it in place.

eyalkraft commented 1 year ago

> My recommendation is to go with the first, more naive option, of deploying on an EC2 instance.

I agree this seems more plausible for 8.8.

> Does an EC2 instance scan a single region or multiple regions? Is it configurable?

I'd recommend

  1. Definitely not configurable.
  2. Single region, because of data transfer costs. See my comment on the relevant issue.
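
A minimal sketch of what "single region, not configurable" could mean in practice: the scanner only picks up instances in the region it runs in, so no data crosses region boundaries. The `Instance` type and `inScannerRegion` function are hypothetical, and the region names in `main` are only examples.

```go
package main

import "fmt"

// Instance is a hypothetical, simplified view of a scannable machine.
type Instance struct {
	ID     string
	Region string
}

// inScannerRegion keeps only the instances located in the scanner's own region.
// The region is passed in by the caller and is deliberately not a user setting.
func inScannerRegion(scannerRegion string, all []Instance) []Instance {
	var out []Instance
	for _, inst := range all {
		if inst.Region == scannerRegion {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	all := []Instance{
		{ID: "i-example-a", Region: "us-east-1"},
		{ID: "i-example-b", Region: "eu-west-1"},
		{ID: "i-example-c", Region: "us-east-1"},
	}
	for _, inst := range inScannerRegion("us-east-1", all) {
		fmt.Println(inst.ID, inst.Region)
	}
}
```

Under this model, covering several regions means running one scanner per region rather than pulling data across regions.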

amitkanfer commented 1 year ago

Came across this issue... Just want to state that I find the Pros here more convincing than the Cons:

Pros:

  1. The entire infrastructure is cloud agnostic and will be reused for vulnerability management on GCP and Azure.
  2. Furthermore, the infrastructure can also be applied on MKI and operate agent-less in the near future.
  3. In case the cluster already exists, we won't have idle resources on it: once we finish scanning, all the jobs get cleaned up until the next iteration.

Cons:

  1. Managing elastic components on customers' clusters is much more difficult. In our case, for example, the right way to implement it is by creating a K8s operator that installs these resources together (which is definitely out of scope for 8.8).
  2. Maintaining another elastic-agent Kubernetes deployment (can be avoided by forcing cloudbeat to create the relevant cron job and die).

I would go the extra mile and build something for K8s and forget about plain EC2. Think about the cron services that you get for free, a secrets store, a unified way to mount volumes and provide permissions, being cloud agnostic (as you wrote), auto-scaling (!!), a single way to view and monitor the cluster of agents... You'll also contribute to the operator project (by using it). Happy to discuss further if you want.

amirbenun commented 1 year ago

Thanks @amitkanfer, your feedback is always welcome :) Managing users' resources with an elastic operator is definitely something we will explore in the future. Looking at the 8.8 release, the infrastructure does not seem ready for deploying and managing K8s resources on the user's cluster. That would force us to develop most of the logic ourselves, which would put the vulnerability management feature at risk. Having said that, I think we should keep the idea and consider developing the required features in the fleet operator for future releases, as it is definitely a more complete solution. How can we track the progress of the fleet operator? Is there a GitHub ticket or a mailing group that we can add ourselves to?

amitkanfer commented 1 year ago

https://github.com/elastic/ingest-dev/issues/1496 (very early)