rdner closed this issue 6 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Another OGC failure https://buildkite.com/elastic/elastic-agent/builds/7651#018e1f0f-c712-4721-baf7-f13f8ba8477e
Error: error running test: failed to prepare instance ogc-windows-amd64-2022-fleet-e3f0: failed to install curl: could not run "choco install -y curl" though SSH: Process exited with status 1 (stdout: , stderr: 'choco' is not recognized as an internal or external command, operable program or batch file.
Not OGC's fault; that is the integration testing framework preparing the instance. OGC doesn't do that.
And another one https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2
Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: <REDACTED>] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/<REDACTED>": context deadline exceeded
Not OGC, OGC doesn't create or prepare any stack.
Another OGC-related failure https://buildkite.com/elastic/elastic-agent/builds/7744#018e36cf-934b-4a2f-aaca-de800860be5e
Failed to execute tests on instance: error running sudo tests: failed to fetched test output at %home%\agent\build\TEST-go-remote-windows-amd64-2022-upgrade-sudo.integration.out
Not OGC, OGC doesn't run the tests or fetch the results.
Just to be clear, OGC only creates the instance with the cloud provider, nothing else. Everything else is done by the integration testing framework, which is our code.
@blakerouse would "VM orchestration" be a better term? I will rename this issue then.
When it comes to OGC failures, sometimes we have something like this:
https://buildkite.com/elastic/elastic-agent/builds/8091#018ea9bd-ee66-4a10-b1bc-b8f8030d80bc
libcloud.common.google.GoogleBaseError: {'message': "Internal error. Please try again or contact Google Support. (Code: '
')", 'domain': 'global', 'reason': 'backendError'}
Not sure we can do anything about it.
For this one, yeah not sure we can do anything.
I updated the description to organize known failures by categories and clean up my comments on this issue.
I believe this is a new VM orchestration issue:
Error: error running test: failed to connect to instance ogc-linux-amd64-ubuntu-2204-default-f3e5: error NewClientConn for ssh to "34.41.144.218:22" :ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
https://buildkite.com/elastic/elastic-agent/builds/8793#018f58b8-f8c9-4c73-9b40-6f2da4a73974
It is from a backport PR: https://github.com/elastic/elastic-agent/pull/4709. I'll try re-running it.
I moved all the failures that we can actually recover from to https://github.com/elastic/elastic-agent/issues/4794
Since we have not had new errors for a while now and there is nothing new to report here, I'm closing this issue in favor of the new one.
The failures can be categorized into the following groups:
Firewall resource not found or already exists (quite often; should be fixed by https://github.com/elastic/elastic-agent/pull/4740)
This has been reported in the OGC repository: https://github.com/adam-stokes/ogc/issues/28
Examples:
I believe it might be some kind of race condition; we should investigate further.
Networking issues
Tracked by https://github.com/elastic/elastic-agent/issues/4794
Permission errors (serverless)
Examples:
SQL error
Examples:
GCP just fails with 500 (rare)
Examples:
Job did not complete in 180 seconds
Examples:
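For triage, the categories above could be matched against failure output mechanically. The following is only a minimal sketch: the category names come from the list in this description, but the regular expressions, the `CATEGORY_PATTERNS` table, and the `classify_failure` helper are hypothetical and not part of OGC or the integration testing framework. Mapping the `backendError` message quoted earlier in this thread to the "GCP just fails with 500" group is also an assumption.

```python
import re

# Category names follow the list in this issue description.
# The patterns are illustrative guesses, not taken from real tooling.
CATEGORY_PATTERNS = [
    (
        "Firewall resource not found or already exists",
        re.compile(r"firewall.*(not found|already exists)", re.IGNORECASE),
    ),
    (
        # Assumption: the GoogleBaseError 'backendError' belongs to this group.
        "GCP just fails with 500",
        re.compile(r"backendError|Internal error\. Please try again", re.IGNORECASE),
    ),
    (
        "Job did not complete in 180 seconds",
        re.compile(r"did not complete in 180 seconds", re.IGNORECASE),
    ),
    (
        "SQL error",
        re.compile(r"\bsql\b", re.IGNORECASE),
    ),
]


def classify_failure(message: str) -> str:
    """Return the first category whose pattern matches, else 'uncategorized'."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            return name
    return "uncategorized"
```

Anything that falls through to `uncategorized` would still need a human to look at it, so a helper like this would at most shorten the routine cases.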