elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Flaky Test]: Integration tests keep running forever until manually cancelled #4475

Closed: rdner closed this issue 7 months ago

rdner commented 7 months ago

Need to carefully inspect the logs, improve logging if necessary, and find the root cause. Could be related to https://github.com/elastic/elastic-agent/issues/4356

This problem started to appear after https://github.com/elastic/elastic-agent/pull/4461 got merged. I reverted the change in https://github.com/elastic/elastic-agent/pull/4474

elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

rdner commented 7 months ago

Another build got stuck (> 10h) https://buildkite.com/elastic/elastic-agent/builds/7946#018e72ea-209d-411e-8881-e83fe768fdc9

cmacknz commented 7 months ago

In both the first 15-hour log and the most recent log https://github.com/elastic/elastic-agent/issues/4475#issuecomment-2017699066 we never see the SLES runner get past this point:

```
>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /opt/buildkite-agent/builds/bk-agent-prod-gcp-1711324950210306583/elastic/elastic-agent/.integration-cache/id_rsa sles@34.67.132.10`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis
```

What is interesting is that this operation has a timeout on the context that should have timed out:

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/pkg/testing/runner/sles.go#L20-L26

Following the context from here, the call stack is:

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/pkg/testing/runner/runner.go#L347-L355

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/pkg/testing/runner/runner.go#L309

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/pkg/testing/runner/runner.go#L251

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/magefile.go#L2125

https://github.com/elastic/elastic-agent/blob/d55869484332108b060ebe31bbca779706e77fc4/magefile.go#L1562-L1576

So the context we are using is the default one mage provides, which only has a timeout if mage was run with -t, per https://magefile.org/targets/:

A default context is passed into any target with a context argument. This context will have a timeout if mage was run with -t, and thus will cancel the running targets and dependencies at that time.
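
For illustration, a minimal magefile target (hypothetical name) showing the two cases:

```go
//go:build mage

package main

import (
	"context"
	"fmt"
	"time"
)

// CheckDeadline is a hypothetical target showing what mage's default context
// carries: a deadline exists only when mage is invoked as `mage -t 2h checkdeadline`.
func CheckDeadline(ctx context.Context) error {
	if deadline, ok := ctx.Deadline(); ok {
		fmt.Printf("context times out in %s\n", time.Until(deadline))
		return nil
	}
	fmt.Println("no -t flag: this context never expires on its own")
	return nil
}
```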

We don't use mage -t in CI (and apparently don't have a timeout on the integration test Buildkite step at all): https://github.com/elastic/elastic-agent/blob/c9bb1643c6817879545c9200e1d3d00c02cd3287/.buildkite/scripts/steps/integration_tests.sh#L26

So that's why this hangs forever instead of timing out. It doesn't tell us why it gets stuck in the first place, but it does look like we are getting stuck trying to run "sudo", []string{"zypper", "install", "-y ", "-t", "pattern", "devel_basis"}.

cmacknz commented 7 months ago

zypper does have a --non-interactive option we aren't using, which might be contributing to this. I'm not sure why this wouldn't time out on every test run otherwise, though.

https://doc.opensuse.org/documentation/leap/reference/html/book-reference/cha-sw-cl.html#sec-zypper-script

By default, Zypper asks for a confirmation before installing or removing a selected package, or when a problem occurs. You can override this behavior using the --non-interactive option. This option must be given before the actual command (install, remove, and patch), as can be seen in the following:

It looks like -y should be similar:

-y, --no-confirm Don’t require user interaction. It’s recommended to use the --non-interactive global option instead. Global options are passed before the command (zypper --non-interactive COMMAND ...). Unlike the no-confirm command option, the global option can be used together with any zypper command.
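
Applied to the exec-args form quoted above, the corrected ordering would look roughly like this (a sketch, not the actual runner change):

```go
// zypperInstallArgs is a hypothetical helper: --non-interactive is a global
// option, so it must precede the "install" subcommand.
func zypperInstallArgs() []string {
	return []string{"zypper", "--non-interactive", "install", "-t", "pattern", "devel_basis"}
}
```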

There's also a --verbose flag we could turn on to try to get more detail on why it might be stuck.

leehinman commented 7 months ago

I'm looking at it, and I'm pretty sure one of the HTTP connections that zypper uses to download packages and repo updates has broken and we aren't detecting it. Hopefully I can get either a TCP or HTTP keepalive set.

But this does bring up a design question. We probably shouldn't be going out to the Internet to update our Linux boxes every time we run an integration test. One possible solution is to run local deb and rpm repos that we update periodically, with all the integration tests pulling from those. Another is to push all this traffic through some kind of caching proxy.

cmacknz commented 7 months ago

Small PR to try to prevent the tests from hanging indefinitely when this happens, by always specifying a top-level context timeout: https://github.com/elastic/elastic-agent/pull/4478
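
The shape of that fix, sketched with an illustrative duration (not necessarily what the PR uses):

```go
//go:build mage

package main

import (
	"context"
	"time"
)

// Integration sketches the fix: wrap the mage-provided context in a top-level
// timeout so a hung remote command cannot block the build forever. The
// two-hour figure is illustrative only.
func Integration(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Hour)
	defer cancel()
	// ... run the suite with ctx; every SSH/zypper call derived from it
	// now inherits the deadline.
	return nil
}
```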

rdner commented 7 months ago

I'm re-assigning this to @leehinman since he's already looking into it, and the new SLES runner introduced in https://github.com/elastic/elastic-agent/pull/4461 most likely caused this.

leehinman commented 7 months ago

Finally got to reproduce this behavior. When I sshed onto the SLES host, the SSH connection was still there, but the shell for the connection wasn't. And when I killed the ssh process on the SLES host, the integration framework didn't detect that the TCP connection was gone.

```
>>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 2f6022c1423c4eff803a497b3d7df08b]
>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /Users/hinman/src/elastic-agent/.integration-cache/id_rsa sles@34.171.38.222`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis
```

So I think this is our SSH implementation not detecting that the connection is broken; from the OS point of view it is clearly gone on both ends of the connection.

leehinman commented 7 months ago

golang SSH doesn't do keepalives by default.

https://github.com/golang/go/issues/21478 https://github.com/golang/go/issues/19338

I'm going to try some of the workarounds in the above issues and see if that helps.
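
For reference, the usual workaround shape with golang.org/x/crypto/ssh looks something like this (a sketch, names hypothetical):

```go
import (
	"time"

	"golang.org/x/crypto/ssh"
)

// keepAlive pings the server until done is closed. SendRequest with
// wantReply=true forces a round trip, so a dead connection surfaces as an
// error here instead of reads blocking forever.
func keepAlive(client *ssh.Client, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				return // the connection is gone; callers will see the client fail
			}
		case <-done:
			return
		}
	}
}
```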

rdner commented 7 months ago

golang SSH doesn't do keepalives by default.

@leehinman There were some related suggestions here too: https://github.com/elastic/elastic-agent/issues/4410#issuecomment-2001985922

Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

leehinman commented 7 months ago

Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

golang ssh doesn't have that; that's what those issues point to. :-) I have something that sends ssh keepalives now, and it looks good. I'm going to let it run overnight to see how it holds up.

leehinman commented 7 months ago

Just to share some info: sometimes ssh was hitting i/o timeouts, and adding keepalives and reconnects helped with that.

But there is still an issue where sometimes the OGC SLES image comes up without any rpm repositories defined. This means the zypper command fails every time, and because we blindly retry over and over until the overall context times out, we appear to "hang" at the zypper install step.
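
That combination is easy to see in a sketch (hypothetical names): when the only exit from a retry loop is the context, a command that can never succeed looks exactly like a hang once the context has no deadline:

```go
import (
	"context"
	"time"
)

// installDevelBasis sketches the failure mode: if run can never succeed
// (e.g. no rpm repositories configured) and ctx has no deadline, this loop
// spins forever and the build appears hung.
func installDevelBasis(ctx context.Context, run func(context.Context) error) error {
	for {
		if err := run(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // illustrative backoff
		}
	}
}
```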

leehinman commented 7 months ago

https://github.com/elastic/elastic-agent/pull/4498 is the second try; it has the additional logging and TCP keepalives for SSH.
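
For the TCP side, keepalives can be enabled on the dialer before handing the connection to the SSH client, roughly like this (a sketch of the approach, not the PR's exact code):

```go
import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepAlive dials over TCP with OS-level keepalives enabled, so a
// silently broken connection is eventually detected by the kernel instead of
// hanging reads forever. The 30s interval is illustrative.
func dialWithKeepAlive(addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	d := net.Dialer{Timeout: config.Timeout, KeepAlive: 30 * time.Second}
	conn, err := d.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs), nil
}
```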

rdner commented 7 months ago

Fixed by https://github.com/elastic/elastic-agent/pull/4498