[PROPOSAL] Formalize container engine testing framework similar to the new kernel version testing framework

incertum commented 1 year ago

Motivation

The implementation of the formal kernel version testing framework (https://github.com/falcosecurity/libs/blob/master/proposals/20230530-driver-kernel-testing-framework.md) has had a highly positive impact on the overall progress and stability of The Falco Project.

I am proposing a similar testing framework for container engines, with a specific focus on maintaining the expected functionality of each engine for a particular container runtime.

This testing would be crucial not only for catching regressions but also for demonstrating how reliable the container engine is, since nobody expects it to be flawless. At the same time, we need to understand the scenarios and conditions under which we may fail to retrieve container information. That understanding would help establish a form of Service Level Objective (SLO) for adopters. For instance, we might offer weaker guarantees for edge-case race conditions than for a container that runs for 30 days without its information ever becoming available. The latter is an example of an opportunity to make the engine more robust. Since perfection is unattainable, a data-driven approach will also help set escalation thresholds for reported container engine issues.
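To make the SLO idea a bit more concrete, here is a minimal sketch of how missing enrichment could be split into "short-lived, likely a race" versus "long-lived, worth escalating". Everything in it is an assumption for illustration: the `Observation` record, the `summarize` helper, and the 5-second `RACE_WINDOW_S` cutoff are hypothetical and not part of libs or any existing test suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One container seen during a test run (hypothetical record)."""
    container_id: str
    lifetime_s: float   # how long the container was alive
    enriched: bool      # whether container info was ever available


# Hypothetical cutoff: misses on containers living shorter than this
# are treated as race conditions rather than robustness bugs.
RACE_WINDOW_S = 5.0


def summarize(observations, race_window_s=RACE_WINDOW_S):
    """Compute the enrichment ratio and split misses by container lifetime."""
    total = len(observations)
    missing = [o for o in observations if not o.enriched]
    short_lived = [o for o in missing if o.lifetime_s <= race_window_s]
    long_lived = [o for o in missing if o.lifetime_s > race_window_s]
    return {
        "total": total,
        "enriched_ratio": (total - len(missing)) / total if total else None,
        "missing_short_lived": len(short_lived),
        "missing_long_lived": len(long_lived),
    }


sample = [
    Observation("a", 30 * 24 * 3600.0, True),   # stable, enriched
    Observation("b", 0.3, False),               # very short-lived miss
    Observation("c", 3600.0, False),            # long-lived miss: escalate
    Observation("d", 120.0, True),
]
report = summarize(sample)
print(report)
```

The long-lived bucket is the one that would feed an escalation threshold; the short-lived bucket is where weaker guarantees might be acceptable.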

Feature

Set up a testbed to evaluate the following:

Test accurate and reliable container information enrichment in two scenarios: (1) the container was already running before the agent launched, and (2) the container launches after the agent starts.
The above shall include verifying the accuracy of each supported field, similar to the existing test/drivers unit tests.
Assess each officially supported container engine, prioritizing certain container runtimes as P1 (e.g., containerd, cri-o, docker), while others are labeled "best effort".
Perform semi-realistic tests on a Kubernetes server featuring multiple pods. These tests aim to observe continuous enrichment of container information over an extended period (e.g., several hours), encompassing stable pods as well as pods coming up and down. Apply upper limits as per https://kubernetes.io/docs/setup/best-practices/cluster-large/. However, reaching 110 pods per node with multiple containers within a pod is unlikely. A more realistic expectation would be a maximum of around 100-150 containers per node.

Note: Parallel testing may be applicable to certain runtimes, while for others, individual assessments are required.
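The test matrix described above (engines, the two launch scenarios, and a priority label) could be enumerated roughly as follows. This is only a sketch: the `EngineCase` type, the scenario names, and `build_matrix` are hypothetical, and `podman` is used purely as a placeholder for a "best effort" engine, not as a statement about how it should be prioritized.

```python
from dataclasses import dataclass
from itertools import product

# P1 examples taken from the proposal text.
P1_ENGINES = ("containerd", "cri-o", "docker")

# Hypothetical names for the two scenarios in the proposal.
SCENARIOS = (
    "container_started_before_agent",
    "container_started_after_agent",
)


@dataclass(frozen=True)
class EngineCase:
    """One cell of the test matrix (hypothetical type)."""
    engine: str
    scenario: str
    priority: str


def build_matrix(p1_engines, best_effort_engines, scenarios=SCENARIOS):
    """Cross every engine with every scenario and attach its priority."""
    cases = [
        EngineCase(engine, scenario, "P1")
        for engine, scenario in product(p1_engines, scenarios)
    ]
    cases += [
        EngineCase(engine, scenario, "best effort")
        for engine, scenario in product(best_effort_engines, scenarios)
    ]
    return cases


# "podman" is a placeholder for a best-effort engine (assumption).
matrix = build_matrix(P1_ENGINES, ("podman",))
for case in matrix:
    print(f"{case.priority:12} {case.engine:12} {case.scenario}")
```

Each case could then be scheduled in parallel or on its own, depending on what the runtime in question allows.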

CC @falcosecurity/core-maintainers

incertum commented 11 months ago

@jasondellaluce, @Andreagit97 and others: it may be time for better container engine testing. We keep breaking it; see the latest oversight (that was on me): https://github.com/falcosecurity/libs/pull/1535

leogr commented 11 months ago

Hey @incertum

This is really interesting. Have we already collected a list of regressions we have encountered? :thinking: It would be useful to understand which aspects to focus on more.

poiana commented 8 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum commented 8 months ago

/remove-lifecycle stale

Some new e2e test efforts are a WIP @therealbobo

poiana commented 5 months ago

/lifecycle stale

poiana commented 4 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

leogr commented 4 months ago

/remove-lifecycle stale

/remove-lifecycle rotten

poiana commented 1 month ago

/lifecycle stale

leogr commented 1 month ago

/remove-lifecycle stale

falcosecurity / libs

[PROPOSAL] Formalize container engine testing framework similar to the new kernel version testing framework #1298