[Discuss] Define full side-scanning flow

oren-zohar commented 1 year ago

Summary

First approach:

Deploying the agent on an EC2 instance is the naive way to get started with hosting our vuln mgmt in the customer's cloud. Assuming we leverage cloudbeat in that process:

Fetch the current region (can be the default in the config)
Collect (running) EC2 instances in the current region - Should be discussed with product
For each EC2 instance - snapshot its volume
Scan snapshot with Trivy
Archive/Delete the snapshot
Send findings to ES
Sleep until the next iteration
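
The steps above can be sketched roughly as follows. This is only an illustration in Python, not how cloudbeat implements it: every name here (`ScanDeps`, `run_iteration`, `run`, and the injected callables) is hypothetical, and the cloud, Trivy, and ES calls are passed in as plain functions so the control flow can be read without any real API. It also assumes one snapshot per instance, which is a simplification.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScanDeps:
    """Hypothetical seams for the real cloud, Trivy, and ES calls."""
    list_running_instances: Callable[[str], Iterable[str]]  # region -> instance IDs
    create_snapshot: Callable[[str], str]                    # instance ID -> snapshot ID
    scan_snapshot: Callable[[str], List[dict]]               # snapshot ID -> findings
    dispose_snapshot: Callable[[str], None]                  # archive or delete
    send_findings: Callable[[List[dict]], None]              # ship findings to ES


def run_iteration(region: str, deps: ScanDeps) -> int:
    """Run one pass over the region and return the number of findings sent."""
    sent = 0
    for instance_id in deps.list_running_instances(region):
        snapshot_id = deps.create_snapshot(instance_id)
        try:
            findings = deps.scan_snapshot(snapshot_id)
        finally:
            # Dispose of the snapshot even if the scan fails, so a failed
            # pass does not leave snapshots behind in the customer account.
            deps.dispose_snapshot(snapshot_id)
        deps.send_findings(findings)
        sent += len(findings)
    return sent


def run(region: str, deps: ScanDeps, interval_seconds: float,
        iterations: int, sleep: Callable[[float], None] = time.sleep) -> None:
    """Repeat the pass a fixed number of times, sleeping between passes."""
    for _ in range(iterations):
        run_iteration(region, deps)
        sleep(interval_seconds)
```

Whether `dispose_snapshot` archives or deletes is left to the caller here, since that choice is still an open question.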

Second approach:

Another approach is deploying on EKS:

Deploying an EKS cluster on the customer cloud instead of an EC2 machine will let us come up with an architecture that works on every K8s cluster (including MKI and other providers such as GKE). It will allow us to create a microservice architecture leveraging some of the k8s capabilities such as cron jobs, better scaling, and so on. This is a better solution when using MKI, since a cluster needs to be deployed anyway.

EC2 Pros/Cons:

TBD

EKS Pros/Cons:

TBD

amirbenun commented 1 year ago

A possible suggestion for the second approach:

Deploying on a k8s cluster (existing cluster / new cluster / MKI)

Scanning is triggered by a cron job that runs every 24 hours
Fetch the current region
Create a deployment that hosts a vulnerabilities DB server
Collect (running) EC2 instances in the current region
For each EC2 instance, initiate jobs in the cluster:
- A job to create a snapshot of that EC2 instance
- A job to run an elastic-agent container that leverages cloudbeat to analyze vulnerabilities on a specific EC2 instance, using the vulnerabilities DB server, and sends the results to ES
- A job to archive/delete the snapshot
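
As a rough sketch of what these resources could look like, the snippet below builds minimal `batch/v1` CronJob and Job manifests as plain Python dicts. The image names, resource names, container arguments, and helper functions are all hypothetical placeholders; only the manifest fields (`schedule`, `jobTemplate`, `restartPolicy`, and so on) come from the Kubernetes API. Note that Kubernetes does not order separate Jobs by itself, so the snapshot, scan, and cleanup sequencing would still have to be handled by whatever creates them.

```python
from typing import Dict, List

SCAN_SCHEDULE = "0 0 * * *"  # standard cron syntax: once a day at midnight


def job_manifest(name: str, image: str, args: List[str]) -> Dict:
    """Build a minimal batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image, "args": args}],
                }
            }
        },
    }


def jobs_for_instance(instance_id: str) -> List[Dict]:
    """The three per-instance jobs from the list above (images are placeholders)."""
    return [
        job_manifest(f"snapshot-{instance_id}", "example/snapshotter", ["create", instance_id]),
        job_manifest(f"scan-{instance_id}", "example/elastic-agent", ["scan", instance_id]),
        job_manifest(f"cleanup-{instance_id}", "example/snapshotter", ["dispose", instance_id]),
    ]


def trigger_cronjob(image: str) -> Dict:
    """A CronJob that starts the daily scan; its job template reuses the Job spec."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "vuln-scan-trigger"},
        "spec": {
            "schedule": SCAN_SCHEDULE,
            "jobTemplate": {"spec": job_manifest("trigger", image, [])["spec"]},
        },
    }
```

Serializing these dicts to YAML or JSON gives something that could be applied to a cluster, assuming real images and the necessary permissions exist.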

Pros:

The entire infrastructure is cloud agnostic and will be reused for vulnerability management on GCP and Azure
Furthermore, the infrastructure can also be applied on MKI and operate agent-less in the near future
If the cluster already exists, we won't leave idle resources on it: once we finish scanning, all the jobs get cleaned up until the next iteration.

Cons:

Managing Elastic components on customers' clusters is much more difficult. In our case, for example, the right way to implement this is by creating a K8s operator that installs these resources together (which is definitely out of scope for 8.8).
Maintaining another elastic-agent Kubernetes deployment (can be avoided by forcing cloudbeat to create the relevant cron job and die)

amirbenun commented 1 year ago

My recommendation is to go with the first, more naive option, of deploying on an EC2 instance. Deploying on a K8s cluster will only be possible if we have one of:

Full control of the customer K8s cluster (K8s operator).
Ability to deploy customer resources in our cloud services (MKI).

A few open questions regarding the flow to be discussed with the product:

Does an EC2 instance scan a single region or multiple regions? Is it configurable?
What do we do with the snapshot after the scan? Deleting a resource from a customer's cloud is a risky operation; we can archive it instead or leave it there.

eyalkraft commented 1 year ago

My recommendation is to go with the first, more naive option, of deploying on an EC2 instance.

I agree this seems more plausible for 8.8.

Does an EC2 instance scan a single region or multiple regions? Is it configurable?

I'd recommend

Definitely not configurable
Single region, because of data transfer costs. See my comment on the relevant issue

amitkanfer commented 1 year ago

Came across this issue... Just want to state that I find the Pros here more convincing than the Cons:

Pros:

The entire infrastructure is cloud agnostic and will be reused for vulnerability management on GCP and Azure
Furthermore, the infrastructure can also be applied on MKI and operate agent-less in the near future
If the cluster already exists, we won't leave idle resources on it: once we finish scanning, all the jobs get cleaned up until the next iteration.

Cons:

Managing Elastic components on customers' clusters is much more difficult. In our case, for example, the right way to implement this is by creating a K8s operator that installs these resources together (which is definitely out of scope for 8.8).
Maintaining another elastic-agent Kubernetes deployment (can be avoided by forcing cloudbeat to create the relevant cron job and die)

I would go the extra mile and build something for k8s and forget about plain EC2. Think about what you get for free: cron services, a secrets store, a unified way to mount volumes and provide permissions, cloud agnosticism (as you wrote), auto-scaling (!!), a single way to view and monitor the cluster of agents... You'll also contribute to the operator project (by using it). Happy to discuss further if you want

amirbenun commented 1 year ago

Thanks @amitkanfer, your feedback is always welcome :) Managing users' resources with an Elastic operator is definitely something we will explore in the future. Looking at the 8.8 release, the infrastructure does not seem ready for deploying and managing K8s resources on the user's cluster, so we would have to develop most of the logic ourselves, which would put the vulnerability management feature at risk. Having said that, I think we should keep the idea and consider developing the required features in the fleet operator for future releases, as it is definitely a more complete solution. How can we track the progress of the fleet operator? Is there a GitHub ticket or a mailing group that we can add ourselves to?

amitkanfer commented 1 year ago

https://github.com/elastic/ingest-dev/issues/1496 (very early)
