Better orchestration of integration test runners

rdner commented 5 months ago

Describe the enhancement:

Based on my observations over the last 2 months, the root cause of most of the issues we're experiencing with OGC while running our integration tests is the communication via SSH.

Some of the reports can be found here https://github.com/elastic/elastic-agent/issues/4356

SSH is not very resilient when it comes to connection issues/interruptions and we cannot simply add a retry because we run commands via SSH which, in most cases, are meant to be executed only once.

The goal of this enhancement should be minimizing interactions via SSH.

We could implement the following improvements to make our integration tests more stable:

1. It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands. Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.
2. VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).
3. The main script/orchestrator should:
   3.1. watch the artifact storage for the results to appear without any communication with the VMs.
   3.2. watch states of the VMs via OGC and fail the tests if one of the machines has a wrong state.
4. All the read-only communication via OGC should have retries.
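
As an illustration of the "watch the artifact storage" idea, here is a minimal Go sketch of a polling loop. It is not taken from the test framework: `waitFor`, `errTimeout`, and the `results.done` marker file are hypothetical names, and a local temporary directory stands in for S3 or whatever artifact storage would actually be used. The check is read-only, so repeating it is harmless.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// errTimeout is returned when the results do not appear in time.
var errTimeout = errors.New("timed out waiting for results")

// waitFor calls check every interval until it reports true, returns an
// error, or the timeout elapses. check must be read-only so that calling
// it repeatedly has no side effects.
func waitFor(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Stop if the next check would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// A temporary directory stands in for the artifact storage (assumption).
	dir, err := os.MkdirTemp("", "results")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	marker := filepath.Join(dir, "results.done")

	// Simulate a runner uploading its results a little later.
	go func() {
		time.Sleep(50 * time.Millisecond)
		_ = os.WriteFile(marker, []byte("ok"), 0o644)
	}()

	err = waitFor(2*time.Second, 10*time.Millisecond, func() (bool, error) {
		_, statErr := os.Stat(marker)
		if statErr == nil {
			return true, nil
		}
		if errors.Is(statErr, os.ErrNotExist) {
			return false, nil // not there yet, keep polling
		}
		return false, statErr
	})
	fmt.Println("results ready:", err == nil)
}
```

The same loop could also cover item 3.2 if the check returned an error when a VM reports an unexpected state, which would make the wait fail early instead of running into the timeout.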

Describe a specific use case for the enhancement or feature:

Our integration tests are occasionally failing because of orchestration of the VMs or while communicating with the VMs:

What is the definition of done?

- While running integration tests we use SSH only once for delivering and running a script.
- All read-only orchestration operations should have a retry.
- Runners should deliver their artifacts directly to a storage.

elasticmachine commented 5 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

blakerouse commented 5 months ago

> It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands. Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.

Not a fan of using a script. Script programming sucks; we should never change to using scripts. I would be fine with changing to Go code that is run.

> VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).

Don't see why we need to add S3 as another dependency. We can add retries on pulling the content.

leehinman commented 5 months ago

So things like CFEngine, Chef, Puppet, Ansible, and Salt are all solutions to this kind of problem. I don't want any of those in our testing framework, but I think it is worth looking into how they solve the problem. There are lots of subtle edge cases and I'd rather learn from others than discover all those sharp edges on our own.

cachedout commented 5 months ago

Drive-by comment: have you tried just tuning SSH to improve reliability? Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
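
For reference, a rough Go sketch of what that tuning could look like if the runner shelled out to the OpenSSH client. Whether the test framework does so is not stated in this thread, so treat that as an assumption. `ServerAliveInterval`, `ServerAliveCountMax`, `ControlMaster`, `ControlPath`, and `ControlPersist` are standard `ssh_config` options; the values, the `sshArgs` helper, the control path, and the host name are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// sshArgs builds an argument list for the OpenSSH client with keepalives
// and connection multiplexing enabled. The option values are illustrative,
// not tuned recommendations.
func sshArgs(controlPath, target string, command ...string) []string {
	args := []string{
		// Send a keepalive probe every 15s; give up after 4 missed replies.
		"-o", "ServerAliveInterval=15",
		"-o", "ServerAliveCountMax=4",
		// Reuse one TCP connection for multiple sessions to the same host.
		"-o", "ControlMaster=auto",
		"-o", "ControlPath=" + controlPath,
		"-o", "ControlPersist=60s",
		target,
	}
	return append(args, command...)
}

func main() {
	// Hypothetical host and command, printed rather than executed.
	args := sshArgs("/tmp/ssh-%r@%h:%p", "user@test-vm.example", "uname", "-a")
	fmt.Println("ssh " + strings.Join(args, " "))
}
```

The resulting slice could be passed to `exec.Command("ssh", args...)`; it is only printed here so the sketch runs without a remote host.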

blakerouse commented 5 months ago

It is also safe to retry most of the commands that fail. Just being more defensive in the execution of the SSH commands can also improve the stability.
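
A minimal sketch of that kind of defensive execution, assuming the wrapped command is safe to repeat. `retry` and `errTransient` are hypothetical names rather than existing helpers in the repository, and the failing operation in `main` only simulates a dropped connection.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient simulates a connection-level failure.
var errTransient = errors.New("connection reset")

// retry runs op up to attempts times, doubling the delay after each
// failure. It should only wrap operations that are safe to repeat.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Do not sleep after the final attempt.
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // first two calls fail
		}
		return nil
	})
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

A real version would probably also distinguish connection errors from a non-zero exit status of the remote command, since only the former is clearly worth retrying.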

rdner commented 5 months ago

Some improvements to the SSH connection management (reconnect and TCP keep alive) were made in https://github.com/elastic/elastic-agent/pull/4498

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elastic / elastic-agent

Better orchestration of integration test runners #4410