cilium / design-cfps

Repo to store Cilium CFP design docs
Apache License 2.0
21 stars 19 forks source link

CFP: Cilium CLI connectivity tests speedup. #15

Open viktor-kurchenko opened 5 months ago

viktor-kurchenko commented 5 months ago

Proposal sounds good to me. As mentioned offline by Andre, it would be good to see a POC of how this would work. Namely, the aspect I have concerns about is that many of the connectivity tests configure the cluster in a specific way that may conflict with other test runs (such as policies, etc.). It would be good to understand how you propose to approach that problem.

Thank you, @christarazi !

Yeah, it should be challenging but I want to try it. Do you know what else except Cluster wide network policies can interfere with different namespaces?

christarazi commented 5 months ago

In general, any policy whether it's CNP or CCNP can interfere, especially if the workloads that they select are common amongst other policies. It seems like one approach could be to completely separate workloads via namespaces for each "group" of connectivity tests. This way the policies applied only have an effect within the namespaces that they are in, so therefore namespaces would be the separation barrier that allows parallelism.

viktor-kurchenko commented 5 months ago

It would be useful to people like me that the document includes a list, at a high level, of the strategies that have already been explored and why they are falling short.

Yeah, it would be useful for me as well.

viktor-kurchenko commented 5 months ago

I was thinking about PoC plan and realized that Cilium CLI can be used to run multiple tests in parallel (at least for testing). The --test-namespace and --test parameters were used to validate the idea.

I've selected 46 tests (from EKS CNI conformance test workflow) and used the attached bash script to run the tests in batches/parallel.

You can find results in the table: https://docs.google.com/spreadsheets/d/1csmszEtlohqPpgMV8N_aJUI4yCoW8mPke46k-rG-Uec/edit?usp=sharing

Conclusions:

Further steps (order might be different):

  1. Rename --test-namespace parameter to --test-namespace-prefix.
  2. Implement a new parameter: --test-parallel-runs with the default value: 1.
  3. Move test namespace/deployments creation and verification logic before the test run function.
  4. Implement tests grouping logic into batches with the --test-parallel-runs size.
  5. Think about how to collect and display output (considering GH runners that might have different behavior than a local terminal).

Also, I was thinking about implementing this as a new CLI command (e.g.: cilium tests ..., maybe even hidden). So, for some time we can have both old and new approaches with shared test sources and will be able to test and compare them without any impact.

CC: @aanm @christarazi @fgiloux @brlbil @michi-covalent.

christarazi commented 5 months ago

That sounds good to me.

Just one thing on

Cilium CLI has no tests that use CiliumClusterwideNetworkPolicy yet.

Soon I imagine https://github.com/cilium/design-cfps/pull/16 will get merged and we'll very likely have tests with CCNP, so it is something that we'll need to consider in this proposal.

brlbil commented 5 months ago

Sounds great,

Think about how to collect and display output (considering GH runners that might have different behavior than a local terminal).

One thing might be tricky, printing test logs correctly given the tests would be run concurrently. Also, the JUnit collection should be considered.

viktor-kurchenko commented 5 months ago

Sounds great,

Think about how to collect and display output (considering GH runners that might have different behavior than a local terminal).

One thing might be tricky, printing test logs correctly given the tests would be run concurrently. Also, the JUnit collection should be considered.

Thanks! I've already tested this: cilium/images/conn-tests-concurrent-output.gif