elastic / cloudbeat

Analyzing Cloud Security Posture

[CNVM] Improve verification time and queries for EBS snapshots #735

Open jeniawhite opened 1 year ago

jeniawhite commented 1 year ago

Every EC2 instance might have multiple volumes, and every volume is converted into a single EBS snapshot. Currently, we iterate over EC2 instances and create the snapshots in a batch per instance. After creating the snapshots, we wait for them to become ready and available. This waiting loop is quite naive right now: we keep cycling until all of the snapshots are available.

// TODO: Change this to variable
case <-time.After(15 * time.Second):
    sp, err := f.provider.DescribeSnapshots(ctx, snap)
    if err != nil {
        f.log.Errorf("VulnerabilityVerifier.verify.DescribeSnapshots failed: %v", err)
        continue
    }
    // TODO: Add a layer of "smart" cache to avoid checking and sending the same snapshot,
    // and to avoid waiting on all snapshots to be completed, sending them periodically instead
    allCompleted := true
    for _, i := range sp {
        if i.State != types.SnapshotStateCompleted {
            f.log.Info("VulnerabilityVerifier.verify.VerifySnapshot snapshot not completed yet - ", *i.SnapshotId)
            allCompleted = false
            break
        }
    }
    if allCompleted {
        for _, i := range sp {
            f.ch <- i
        }
        return
    }
}

I would like us to improve this: send completed snapshots via the channel as soon as they are ready, and skip them in subsequent availability queries. This requires keeping state of already-sent snapshots and managing it in addition to the batches, and we need to make sure we never send the same snapshot to the channel twice.

jeniawhite commented 1 year ago

This issue has drifted a bit from the code we eventually committed. Currently, the code creates snapshots for all volumes of a given EC2 instance in a single AWS call here:

sp, err := f.provider.CreateSnapshots(ctx, data)
if err != nil {
    f.log.Errorf("VulnerabilityReplicator.SnapshotInstance.CreateSnapshots failed: %v", err)
    continue
}

snapshots = append(snapshots, sp...)
for _, s := range sp {
    f.ch <- s
}

Notice that it sends them via the pipeline to the verifier one by one. This means that the verifier only works on a single snapshot at a time. The verifier then queries the status of the snapshot via a single API call that can query multiple snapshots:

sp, err := f.provider.DescribeSnapshots(ctx, snap)
if err != nil {
    f.log.Errorf("VulnerabilityVerifier.verify.DescribeSnapshots failed: %v", err)
    continue
}
// TODO: Add a layer of "smart" cache to avoid checking and sending the same snapshot,
// and to avoid waiting on all snapshots to be completed, sending them periodically instead
allCompleted := true
for _, i := range sp {
    if i.State != types.SnapshotStateCompleted {
        f.log.Infof("VulnerabilityVerifier.verify.VerifySnapshot snapshot not completed yet - %s", *i.SnapshotId)
        allCompleted = false
        break
    }
}
if allCompleted {
    for _, i := range sp {
        f.ch <- i
    }
    f.log.Info("VulnerabilityVerifier.verify.VerifySnapshot snapshot completed - ", *snap.SnapshotId)
    return
}

But essentially the whole code was written to support an optimization that queries all snapshots of a given EC2 instance in a single API call, to reduce round-trips and call volume. This is why I coded the mechanism that waits for all of them to be available before advancing. Today we only query one snapshot at a time, so this mechanism is basically irrelevant (it always loops over a single snapshot).

This issue is relevant to the case where we send all snapshots of a given EC2 instance to the verifier at once and then query them in a single API call. In that case we would like to memoize which snapshots have already been verified and sent to the next step of the pipeline, so we won't send them again.

There are three options:

  1. Recode the solution to verify all snapshots of a given EC2 instance at once, reducing the number of API calls (and implement the memoization so snapshots are sent to the next pipeline step as soon as they are ready). This makes the verifier middleware heavier, because it works on multiple snapshots at a time instead of being a stripped-down component that handles a single snapshot as it does today. It also means that a problem in this step could drop all snapshots of a given EC2 instance, unlike today, where the middleware is only aware of a single snapshot at a time. Note that the verifier middleware has a timeout, and having multiple volumes share the same timeout is problematic: if even one of them stalls, we drop all the others, whereas today we only drop a single snapshot.

  2. Leave the code as is without any changes.

  3. A hybrid of options 1 and 2: implement option 1, but send only a single snapshot at a time, as today.

Since Trivy only scans MBR and GPT partitions, I think we can keep what we have today for now and revisit this later on.

It would also be great to filter out non-MBR and non-GPT partitions in the snapshot creation logic so we don't do unnecessary work.