armadaproject / armada

A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
https://armadaproject.io
Apache License 2.0

Parallelize Armada integration tests by package, run on multiple machines #3100

Open richscott opened 9 months ago

richscott commented 9 months ago

Currently, the Golang integration tests job is the longest-running job in the standard Armada CI test run, taking approximately 12-15 minutes. Essentially, it invokes:

go run cmd/testsuite/main.go test --tests testsuite/testcases/basic/*

It may be possible to split this up to run on multiple machines simultaneously, but that would probably require assigning partitioned subsets of the testsuite/testcases/basic/* files to several jobs in the same workflow file (jobs run concurrently by default). Reconciling the results into a single test report might have to be done in a final job of the workflow.
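As a rough illustration of the partitioning idea, here is a minimal Go sketch that deterministically assigns test case files to one of several shards. The SHARD_INDEX and SHARD_TOTAL environment variables and the shard helper are hypothetical names chosen for this example, not part of Armada, and the round-robin split is just one possible strategy.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
)

// shard returns the files assigned to shard `index` (0-based) out of `total`
// shards. Files are sorted first so that every CI job computes the same
// assignment regardless of directory listing order.
func shard(files []string, index, total int) []string {
	sorted := append([]string(nil), files...)
	sort.Strings(sorted)
	var out []string
	for i, f := range sorted {
		if i%total == index {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// SHARD_INDEX and SHARD_TOTAL are hypothetical variables that each
	// parallel CI job would set; they default to a single shard.
	index, _ := strconv.Atoi(os.Getenv("SHARD_INDEX"))
	total, err := strconv.Atoi(os.Getenv("SHARD_TOTAL"))
	if err != nil || total < 1 {
		total = 1
	}
	if index < 0 || index >= total {
		fmt.Fprintln(os.Stderr, "SHARD_INDEX must be in [0, SHARD_TOTAL)")
		os.Exit(1)
	}

	files, err := filepath.Glob("testsuite/testcases/basic/*.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Print this shard's files, one per line.
	for _, f := range shard(files, index, total) {
		fmt.Println(f)
	}
}
```

Each job would run this with its own index and then feed the printed files to the test runner. How a file list is best handed to the --tests flag is left open here, since that depends on how the testsuite command parses its arguments.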

shashank-iitbhu commented 8 months ago

Hey @richscott, I would like to give this a try. One way is to use a GitHub Actions matrix build to run the jobs in parallel. Can you share your thoughts on how to specify the partitions? Also, if not all of the tests need to run every time, could we do selective testing and run only the required ones?

richscott commented 8 months ago

@shashank-iitbhu The current integration tests may be partitionable. Essentially, invoking mage testsuite causes it to run

go run cmd/testsuite/main.go test --tests testsuite/testcases/basic/* --junit junit.xml

(see magefiles/ci.go, lines 27-50). If you look at the YAML files in testsuite/testcases/basic/*.yaml, you'll see that each is an Armada batch submission job plus a clause that specifies the Armada events expected to be generated, etc. I think the YAML test cases are separate/idempotent, so it may be possible to split them into a few subsets and have a separate Armada instance run each subset. However, this will require the integration test CI (GitHub Actions) workflow to start up a separate Armada cluster for each parallelized subset (e.g. mage localdev full) before running its subset of the integration tests.

Then, some logic will be needed to coalesce the separate integration test subset results into a single unified result. See cmd/testsuite/{main,root,test}.go for the code that currently runs the tests (it uses the github.com/jstemmer/go-junit-report/v2/junit module for individual tests).