rdner closed this issue 2 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
- It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands. Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.
Not a fan of using a script. Script programming sucks, and we should never switch to using scripts. I would be fine with switching to Go code that is run on the VM.
- VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).
I don't see why we need to add S3 as another dependency. We can add retries when pulling the content.
So things like cfengine, Chef, Puppet, Ansible & Salt stack are all solutions to this kind of problem. I don't want any of those in our testing framework, but I think it is worth looking into how they solve the problem. There are lots of subtle edge cases and I'd rather learn from others than discover all those sharp edges on our own.
Drive-by comment: have you tried just tuning SSH to improve reliability? Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
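To make the suggestion concrete, here is a minimal sketch of how the OpenSSH client options for multiplexing and keep-alives could be passed when the test runner shells out to `ssh`. The helper name `sshArgs`, the host, and the specific values (10 minute persistence, 15 second interval) are assumptions for illustration, not what OGC or the test framework actually uses; only the option names (`ControlMaster`, `ControlPath`, `ControlPersist`, `ServerAliveInterval`, `ServerAliveCountMax`) are standard OpenSSH client settings.

```go
package main

import (
	"fmt"
	"strings"
)

// sshArgs is a hypothetical helper that builds the argument list for the
// OpenSSH client. The values below are illustrative assumptions.
func sshArgs(host, command string) []string {
	return []string{
		// Reuse one TCP connection for many commands (multiplexing).
		"-o", "ControlMaster=auto",
		"-o", "ControlPath=~/.ssh/cm-%r@%h:%p",
		"-o", "ControlPersist=10m",
		// Send a keep-alive probe every 15s; give up after 4 missed replies.
		"-o", "ServerAliveInterval=15",
		"-o", "ServerAliveCountMax=4",
		host,
		command,
	}
}

func main() {
	// Print the command line instead of executing it, so the sketch runs
	// without a reachable VM.
	args := sshArgs("user@vm.example", "uname -a")
	fmt.Println("ssh " + strings.Join(args, " "))
}
```

In a real runner the slice would be handed to `exec.Command("ssh", args...)`. The same options could equally live in an `ssh_config` file on the machine that drives the tests.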
It is also safe to retry most of the commands that fail. Just being more defensive in the execution of the SSH commands can also improve the stability.
Some improvements to the SSH connection management (reconnect and TCP keep alive) were made in https://github.com/elastic/elastic-agent/pull/4498
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Closing since we achieved sufficient stability in the runners by adding retries.
Describe the enhancement:
Based on my observations over the last 2 months, the root cause of most of the issues we're experiencing with OGC while running our integration tests is the communication via SSH.
Some of the reports can be found here https://github.com/elastic/elastic-agent/issues/4356
SSH is not very resilient when it comes to connection issues/interruptions, and we cannot simply add a retry because most of the commands we run via SSH are meant to be executed only once.
The goal of this enhancement should be to minimize interactions via SSH.
We could implement the following improvements to make our integration tests more stable:
Describe a specific use case for the enhancement or feature:
Our integration tests are occasionally failing because of orchestration of the VMs or while communicating with the VMs:
What is the definition of done?