Closed rdner closed 7 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Another build got stuck (> 10h) https://buildkite.com/elastic/elastic-agent/builds/7946#018e72ea-209d-411e-8881-e83fe768fdc9
In the first 15 hour log and the most recent log https://github.com/elastic/elastic-agent/issues/4475#issuecomment-2017699066 we never see the SLES runner get past this point:
_bk;t=1711326854275>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /opt/buildkite-agent/builds/bk-agent-prod-gcp-1711324950210306583/elastic/elastic-agent/.integration-cache/id_rsa sles@34.67.132.10`
_bk;t=1711326854906>>> (linux-amd64-sles-15-rpm) Connected over SSH
_bk;t=1711326854906>>> (linux-amd64-sles-15-rpm) Preparing instance
_bk;t=1711326854906>>> (linux-amd64-sles-15-rpm) install devel_basis
What is interesting is that this operation has a timeout on the context that should have timed out:
Following the context here the call stack is:
So the context we are using is the default one mage provides, which only has a timeout if mage was run with -t
per https://magefile.org/targets/:
A default context is passed into any target with a context argument. This context will have a timeout if mage was run with -t, and thus will cancel the running targets and dependencies at that time.
We don't use mage -t
in CI (or have a timeout on the integration test buildkite step at all apparently): https://github.com/elastic/elastic-agent/blob/c9bb1643c6817879545c9200e1d3d00c02cd3287/.buildkite/scripts/steps/integration_tests.sh#L26
So that's why this is hanging. It doesn't tell us why it's hanging, but it does look like we are getting stuck trying to do "sudo", []string{"zypper", "install", "-y ", "-t", "pattern", "devel_basis"}
.
zypper
does have a --non-interactive
we aren't using, that might be contributing to this, not sure why this wouldn't time out on every test run other wise though
By default, Zypper asks for a confirmation before installing or removing a selected package, or when a problem occurs. You can override this behavior using the --non-interactive option. This option must be given before the actual command (install, remove, and patch), as can be seen in the following:
It looks like -y
should be similar:
-y, --no-confirm Don’t require user interaction. It’s recommended to use the --non-interactive global option instead. Global options are passed before the command (zypper --non-interactive COMMAND ...). Unlike the no-confirm command option, the global option can be used together with any zypper command.
There's also a --verbose
we could turn on to try to get more details on why it might be stuck.
I'm looking at it, and I'm pretty sure one of the http connections that zypper uses to download packages & repo updates has been broken and we aren't detecting it. Hopefully I can get either a TCP or HTTP keepalive set.
But this does bring up a design question. We probably shouldn't be going out to the Internet to update our Linux boxes every time we run an integration test. One possible solution is to run a local deb & rpm repos, we can update those periodically, while all the integration tests pull from those local ones. Another solution is to push all this traffic through some kind of caching proxy.
Small PR to try to prevent the tests from hanging indefinitely when this happens by always specifying a top level context timeout: https://github.com/elastic/elastic-agent/pull/4478
I'm re-assigning this to @leehinman since he's already looking into this and the new SLES runner introduced in https://github.com/elastic/elastic-agent/pull/4461 most likely caused this.
Finally got this reproduce this behavior, and when I ssh onto the sles host, the SSH connection was still there, but the shell for the connection wasn't. And if I killed the ssh process on the sles host, the integration framework isn't detecting the TCP connection is gone.
>>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 2f6022c1423c4eff803a497b3d7df08b]
>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /Users/hinman/src/elastic-agent/.integration-cache/id_rsa sles@34.171.38.222`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis
So I think this our SSH implementation not detecting the connection is broken, it is clearly gone from the OS point of view on both ends of the connection.
golang SSH doesn't do keep alive by default.
https://github.com/golang/go/issues/21478 https://github.com/golang/go/issues/19338
I'm going to try some of the work arounds in the above issues and see if that helps.
golang SSH doesn't do keep alive by default.
@leehinman There were some related suggestions here too https://github.com/elastic/elastic-agent/issues/4410#issuecomment-2001985922
Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
golang ssh doesn't have that, that is what those issues point to. :-) I have something that is sending ssh keepalives now, and it looks good. I'm going to let it run over night to see how it holds up.
just to share some info. So sometimes ssh was having i/o timeouts and adding keepalives and reconnects helped with that.
But there is still an issue where sometimes the ogc sles image comes up without any rpm repositories defined. This means that the zypper
command would fail, and because we blindly try over and over until the overall context times out that would cause us to "hang" at the zypper install step.
https://github.com/elastic/elastic-agent/pull/4498 with the second try, and it has the additional logging and TCP keepalives for SSH
Need to carefully inspect the logs, improve logging, if necessary, and find the root cause. Could be related to https://github.com/elastic/elastic-agent/issues/4356
This problem started to appear after https://github.com/elastic/elastic-agent/pull/4461 got merged. I reverted the change in https://github.com/elastic/elastic-agent/pull/4474