elastic / kibana


Failing test: X-Pack EPM API Integration Tests.x-pack/test/fleet_api_integration/apis/epm/setup·ts - Fleet Endpoints EPM Endpoints setup api setup performs upgrades upgrades the endpoint package from 0.13.0 to the latest version available #118479

Open kibanamachine opened 3 years ago

kibanamachine commented 3 years ago

A test failed on a tracked branch

Error: expected '0.13.0' to sort of equal '1.3.0-dev.0'
    at Assertion.assert (/opt/local-ssd/buildkite/builds/kb-n2-4-d03412ba5a172dc0/elastic/kibana-hourly/kibana/node_modules/@kbn/expect/expect.js:100:11)
    at Assertion.eql (/opt/local-ssd/buildkite/builds/kb-n2-4-d03412ba5a172dc0/elastic/kibana-hourly/kibana/node_modules/@kbn/expect/expect.js:244:8)
    at Context.<anonymous> (test/fleet_api_integration/apis/epm/setup.ts:49:88)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.apply (/opt/local-ssd/buildkite/builds/kb-n2-4-d03412ba5a172dc0/elastic/kibana-hourly/kibana/node_modules/@kbn/test/target_node/functional_test_runner/lib/mocha/wrap_function.js:87:16) {
  actual: '0.13.0',
  expected: '1.3.0-dev.0',
  showDiff: true
}

First failure: CI Build - 8.0

elasticmachine commented 3 years ago

Pinging @elastic/fleet (Team:Fleet)

criamico commented 3 years ago

I tried to reproduce it locally and couldn't. I also checked the subsequent runs on Buildkite, and it didn't fail again after this occurrence (Sat, Nov 13th). I think this was a one-off.

joshdover commented 3 years ago

We also just made some changes in #117552 that may have impacted this. This was merged after this test failure. If we don't get any more failures in a week, let's close it.

kibanamachine commented 2 years ago

New failure: CI Build - main

joshdover commented 2 years ago

Logs from most recent failure:

[2022-02-05T23:56:42.166+00:00][WARN ][plugins.fleet] Failed installing package [endpoint] due to error: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] Encountered non fatal errors during Fleet setup
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] {"name":"ResponseError","message":"validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;"}
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] Fleet setup completed

@paul-tavares Could this be related to Endpoint's getPackagePolicyUpdateCallback that runs after policy upgrades?

paul-tavares commented 2 years ago

@joshdover, I don't think so. Our Fleet server extension for policy updates is used to validate that the policy being updated (via the Fleet API) is not attempting to use features of the endpoint security policy that are not supported under the current Elastic license, so it never really runs on package installs, which is what this test focuses on. I looked at the test and it seems to be OK, although I don't see this output in the log: log.info(`Endpoint package latest version: ${latestEndpointVersion}`);

Looks like maybe the install/upgrade failed, or maybe there's a race condition between concurrent runs?

paul-tavares commented 2 years ago

Wondering if this line:

await supertest.post(`/api/fleet/setup`).set('kbn-xsrf', 'xxxx').expect(200);

could actually return 200, but have some "warning"/"errors" in the body?
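
If that's the case, one way to surface it would be to assert on the response body as well as the status code. A minimal sketch, reusing the supertest agent from the line quoted above and assuming the setup response exposes its non-fatal errors in a `nonFatalErrors` field (that field name is inferred from the "Encountered non fatal errors during Fleet setup" log line, not verified against the API):

```ts
// Sketch only: in addition to asserting on the status code, inspect the setup
// response body for errors that Fleet reports as non-fatal.
// `nonFatalErrors` is an assumption about the response shape.
const { body } = await supertest
  .post('/api/fleet/setup')
  .set('kbn-xsrf', 'xxxx')
  .expect(200);

if (Array.isArray(body.nonFatalErrors) && body.nonFatalErrors.length > 0) {
  throw new Error(
    `Fleet setup returned 200 but reported errors: ${JSON.stringify(body.nonFatalErrors)}`
  );
}
```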

kibanamachine commented 2 years ago

New failure: CI Build - main

mistic commented 2 years ago

Skipped.

main: 1a215ba

hop-dev commented 2 years ago

latest failure:

└- ✖ fail: Fleet Endpoints EPM Endpoints setup api setup performs upgrades upgrades the endpoint package from 0.13.0 to the latest version available
  Error: expected '0.13.0' to sort of equal '1.5.0'
  + expected - actual
  -0.13.0
  +1.5.0

Kibana error:

[WARN][plugins.fleet] Failed installing package [endpoint] due to error: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[DEBUG][plugins.fleet] Running required package policies upgrades for managed policies
[DEBUG][plugins.fleet] Setting up Fleet enrollment keys
[INFO][plugins.fleet] Encountered non fatal errors during Fleet setup
[INFO][plugins.fleet] {"name":"ResponseError","message":"validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;"}

hop-dev commented 2 years ago

@joshdover I'm not sure how to proceed here. I've run a couple of flaky test runner jobs for this test and they passed fine.

I've been trying to find where the error comes from; it looks like it originates in Elasticsearch, based on another issue I found with the same message (here is the search I used). So it may be something to do with cluster health? In the logs we do see the health go from yellow to green after the error:

[00:01:08]             │ info [o.e.c.m.MetadataCreateIndexService] [ftr] [metrics-endpoint.metadata_current_default] creating index, cause [api], templates [metrics-metadata-current], shards [1]/[1]
[00:01:08]             │ info [o.e.c.r.a.AllocationService] [ftr] updating number_of_replicas to [0] for indices [metrics-endpoint.metadata_current_default]
[00:01:08]             │ proc [kibana] [2022-03-20T04:56:14.358+00:00][WARN ][plugins.fleet] Failure to install package [endpoint]: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[00:01:08]             │ info [o.e.c.r.a.AllocationService] [ftr] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[metrics-endpoint.metadata_current_default][0]]])." previous.health="YELLOW" reason="shards started [[metrics-endpoint.metadata_current_default][0]]"

I can't think of any debug logging we'd want to add to get more info, and if we re-enable the test I think it's likely to fail again eventually.

joshdover commented 2 years ago

> So it looks like it may be something to do with the cluster health? In the logs we do see the health go from yellow to green after the error:

One thing that would at least help improve the UX here is to have a more specific error message so we know this is failing during transform creation. We could also check the cluster health before starting the install process and bail early if it's yellow?
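
Not suggesting this is how Fleet would implement it, but as a sketch, a pre-flight check using the Elasticsearch JS client could look roughly like this (the `ensureClusterReady` helper name and the 30s timeout are arbitrary, and the v8 client's unwrapped response shape is assumed):

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: wait for the cluster to reach at least yellow health before starting
// the package install, and bail out with a clearer error message if it doesn't.
async function ensureClusterReady(esClient: Client): Promise<void> {
  const health = await esClient.cluster.health({
    wait_for_status: 'yellow', // or 'green' if we want to be stricter
    timeout: '30s',
  });
  if (health.timed_out) {
    throw new Error(
      `Aborting package install: cluster health is still "${health.status}" after 30s`
    );
  }
}
```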

hop-dev commented 2 years ago

@sophiec20 Hi, we just mentioned an error we're seeing on the Fleet side, potentially while installing a transform. There's more detail in my comment above, but here is the error we see:

ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE

I only suspected that it was from transform creation based on a slightly similar SDH I found. The endpoint package creates two transforms, defined here.

One has source index "metrics-endpoint.metadata-*" and dest index "metrics-endpoint.metadata_current_default".

In the logs above I can see that metrics-endpoint.metadata_current_default goes from yellow to green after we receive the error:

current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[metrics-endpoint.metadata_current_default][0]]])." previous.health="YELLOW" reason="shards started [[metrics-endpoint.metadata_current_default][0]]"

Would the index being yellow/not having shards started potentially cause this error?

paul-tavares commented 2 years ago

@pzl interesting comments above re: transforms ☝️

sophiec20 commented 2 years ago

@hop-dev The SERVICE_UNAVAILABLE validation exception is caused by the composite aggregation failing to execute on the source indices. It is unfortunately too generic a message and hides the underlying failure reason, which could have several likely causes.

The best way to troubleshoot this would be to check the output from calling GET _transform/<transform_id>/_preview, but I appreciate that might not be possible or useful if the error is transient.

I suspect that using a source index wildcard might avoid issues with timing if the source index is not yet available. Can you try this?
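
For reference, a sketch of that troubleshooting step via the JS client, assuming an `esClient` instance of @elastic/elasticsearch is available and a client/ES version that supports previewing an existing transform by id; the transform id is a placeholder:

```ts
// Preview the installed transform to surface the underlying query/aggregation
// failure that the generic SERVICE_UNAVAILABLE message hides.
// '<transform_id>' is a placeholder, e.g. the endpoint metadata transform id.
const preview = await esClient.transform.previewTransform({
  transform_id: '<transform_id>',
});
console.log(JSON.stringify(preview, null, 2));
```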

sophiec20 commented 2 years ago

Also, is it possible to get the whole error response? On reflection, I would think that there is more information returned in caused_by.reason.
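
If the failure surfaces as a ResponseError from the Elasticsearch client, the fuller detail could be pulled out along these lines (sketch only; `installStep` is a placeholder for whatever call actually performs the install):

```ts
import { errors } from '@elastic/elasticsearch';

// Placeholder for the actual install/transform call that throws the ResponseError.
declare function installStep(): Promise<void>;

async function logInstallFailureDetail(): Promise<void> {
  try {
    await installStep();
  } catch (err) {
    if (err instanceof errors.ResponseError) {
      // The parsed Elasticsearch error body usually carries the root cause
      // under error.caused_by.reason.
      const esError = (err.meta.body as any)?.error;
      console.warn(
        `Install failed: ${esError?.reason}; caused_by: ${JSON.stringify(esError?.caused_by)}`
      );
    }
    throw err;
  }
}
```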

jen-huang commented 2 years ago

Closing due to inactivity / lack of additional failures.

kibanamachine commented 2 weeks ago

New failure: kibana-es-forward-compatibility-testing - 7.17