elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Flaky Test] VM orchestration is unstable in integration tests #4356

Closed rdner closed 3 months ago

rdner commented 6 months ago

The failures fall into the following groups:

Firewall resource not found or already exists (quite often) (should be fixed by https://github.com/elastic/elastic-agent/pull/4740)

This has been reported in the OGC repository https://github.com/adam-stokes/ogc/issues/28

libcloud.common.google.ResourceNotFoundError: {'message': "The resource 'projects/elastic-platform-ingest/global/firewalls/linux-amd64-ubuntu-2204-upgrade' was not found", 'domain': 'global', 'reason': 'notFound'}

libcloud.common.google.ResourceExistsError: {'message': "The resource 'projects/elastic-platform-ingest/zones/us-central1-a/instances/ogc-linux-amd64-ubuntu-2204-fleet-airgapped-2315' already exists", 'domain': 'global', 'reason': 'alreadyExists'}

Examples:

I believe this might be some kind of race condition; we should investigate further.
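If the cause is two jobs racing to create or delete the same resource, one possible mitigation is to make those calls idempotent. The following is a minimal sketch, not OGC's actual code: `ResourceExistsError` and `ResourceNotFoundError` are local stand-ins for the libcloud exceptions in the logs above, and `FakeCloud`, `ensure_created`, and `ensure_deleted` are hypothetical names used only for illustration.

```python
class ResourceExistsError(Exception):
    """Local stand-in for libcloud.common.google.ResourceExistsError."""


class ResourceNotFoundError(Exception):
    """Local stand-in for libcloud.common.google.ResourceNotFoundError."""


class FakeCloud:
    """Hypothetical in-memory provider, used only to exercise the helpers."""

    def __init__(self):
        self.firewalls = set()

    def create_firewall(self, name):
        if name in self.firewalls:
            raise ResourceExistsError(name)
        self.firewalls.add(name)

    def delete_firewall(self, name):
        if name not in self.firewalls:
            raise ResourceNotFoundError(name)
        self.firewalls.remove(name)


def ensure_created(create):
    """Run create(); treat 'already exists' as success.

    Returns True if this call created the resource, False if it was
    already there (for example, created by a concurrent job).
    """
    try:
        create()
        return True
    except ResourceExistsError:
        return False


def ensure_deleted(delete):
    """Run delete(); treat 'not found' as success.

    Returns True if this call deleted the resource, False if it was
    already gone.
    """
    try:
        delete()
        return True
    except ResourceNotFoundError:
        return False
```

This only hides the symptom. If two jobs really share a firewall name, one of them can still delete a rule the other is using, so the underlying race would still need investigating.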

Networking issues

Tracked by https://github.com/elastic/elastic-agent/issues/4794

Permission errors (serverless)

Error: error running clean: got unexpected response code [403] from deployment shutdown API: {
   "errors": [
       {
           "message": "To access the resource [u:/deployments/cc41c0a61a474f3aa6d890df111925d5], the user must have the required authorization.",
           "code": "root.permission_denied"
       }
   ]
}

Examples:

SQL error

sqlite3.OperationalError: no such table: layouts

Examples:

GCP just fails with 500 (rare)

libcloud.common.google.GoogleBaseError: {'message': "Internal error. Please try again or contact Google Support. (Code: '')", 'domain': 'global', 'reason': 'backendError'}

Examples:

Job did not complete in 180 seconds

libcloud.common.types.LibcloudError: <LibcloudError in None 'Job did not complete in 180 seconds'>

Examples:

elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

blakerouse commented 5 months ago

Another OGC failure https://buildkite.com/elastic/elastic-agent/builds/7651#018e1f0f-c712-4721-baf7-f13f8ba8477e

Error: error running test: failed to prepare instance ogc-windows-amd64-2022-fleet-e3f0: failed to install curl: could not run "choco install -y curl" though SSH: Process exited with status 1 (stdout: , stderr: 'choco' is not recognized as an internal or external command, operable program or batch file.

Not OGC's fault, that is the integration testing framework preparing the instance. OGC doesn't do that.

blakerouse commented 5 months ago

And another one https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2

Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: <REDACTED>] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/<REDACTED>": context deadline exceeded

Not OGC, OGC doesn't create or prepare any stack.

blakerouse commented 5 months ago

Another OGC-related failure https://buildkite.com/elastic/elastic-agent/builds/7744#018e36cf-934b-4a2f-aaca-de800860be5e

Failed to execute tests on instance: error running sudo tests: failed to fetched test output at %home%\agent\build\TEST-go-remote-windows-amd64-2022-upgrade-sudo.integration.out

Not OGC, OGC doesn't run the tests or fetch the results.

blakerouse commented 5 months ago

Just to be clear, OGC only creates the instance with the cloud provider, nothing else. Everything else is done by the integration testing framework, which is our code.

rdner commented 5 months ago

@blakerouse would "VM orchestration" be a better term? I will rename this issue then.

rdner commented 5 months ago

When it comes to OGC failures, sometimes we have something like this:

https://buildkite.com/elastic/elastic-agent/builds/8091#018ea9bd-ee66-4a10-b1bc-b8f8030d80bc

libcloud.common.google.GoogleBaseError: {'message': "Internal error. Please try again or contact Google Support. (Code: '')", 'domain': 'global', 'reason': 'backendError'}

Not sure we can do anything about it.
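Since the error text itself says "Please try again", one option would be to retry the provisioning call with a backoff. The following is a minimal sketch under stated assumptions: `TransientCloudError` is a hypothetical stand-in for errors such as `GoogleBaseError` with reason `backendError`, and `retry` is an illustrative helper, not something OGC or the testing framework provides.

```python
import time


class TransientCloudError(Exception):
    """Hypothetical stand-in for a retryable provider error."""


def retry(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation() up to `attempts` times with exponential backoff.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. The last failure is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientCloudError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting. Whether retrying is safe depends on the call: a retried create can run into "already exists" if the first attempt partly succeeded.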

pierrehilbert commented 5 months ago

For this one, yeah not sure we can do anything.

rdner commented 5 months ago

I updated the description to organize known failures by categories and clean up my comments on this issue.
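For triaging new build logs against these categories, a small matcher could look like the sketch below. The marker strings are copied from the log excerpts in this issue; the category labels, `CATEGORIES`, and `classify` are hypothetical names chosen for illustration and are not part of any existing tooling.

```python
# (category label, substrings that identify it), checked in order.
CATEGORIES = [
    ("resource not found or already exists",
     ("'reason': 'notFound'", "'reason': 'alreadyExists'")),
    ("permission error", ("root.permission_denied",)),
    ("SQL error", ("sqlite3.OperationalError",)),
    ("GCP internal error", ("'reason': 'backendError'",)),
    ("job timeout", ("Job did not complete in",)),
]


def classify(log_text):
    """Return the first category whose marker appears in log_text."""
    for category, markers in CATEGORIES:
        if any(marker in log_text for marker in markers):
            return category
    return "uncategorized"
```

Plain substring matching is deliberately simple. Anything that matches no marker is reported as "uncategorized" so that new failure modes stand out.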

belimawr commented 4 months ago

I believe this is a new VM orchestration issue:

Error: error running test: failed to connect to instance ogc-linux-amd64-ubuntu-2204-default-f3e5: error NewClientConn for ssh to "34.41.144.218:22" :ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

https://buildkite.com/elastic/elastic-agent/builds/8793#018f58b8-f8c9-4c73-9b40-6f2da4a73974

It is from a backport PR, https://github.com/elastic/elastic-agent/pull/4709. I'll try re-running it.

rdner commented 3 months ago

I moved all the failures that we can actually recover from to https://github.com/elastic/elastic-agent/issues/4794

Since we have not had new errors for a while now and there is nothing new to report here, I'm closing this issue in favor of the new one.