kibanamachine opened 3 years ago
Pinging @elastic/fleet (Team:Fleet)
I tried to reproduce this locally and couldn't. I also checked the subsequent runs on Buildkite and it hasn't failed again since this occurrence (Sat, Nov 13th), so I think this was a one-off.
We also just made some changes in #117552 that may have affected this; they were merged after this test failure. If we don't get any more failures within a week, let's close it.
New failure: CI Build - main
Logs from most recent failure:
[2022-02-05T23:56:42.166+00:00][WARN ][plugins.fleet] Failed installing package [endpoint] due to error: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] Encountered non fatal errors during Fleet setup
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] {"name":"ResponseError","message":"validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;"}
[2022-02-05T23:56:42.212+00:00][INFO ][plugins.fleet] Fleet setup completed
@paul-tavares Could this be related to Endpoint's getPackagePolicyUpdateCallback that runs after policy upgrades?
@joshdover,
I don't think so. Our Fleet server extension for policy updates is used to validate that a policy being updated (via the Fleet API) is not attempting to use features of the Endpoint security policy that are not supported under the current Elastic license, so it never really runs on package installs, which is what this test seems to focus on. I looked at the test and it seems to be OK (although I don't see this output in the log: log.info(`Endpoint package latest version: ${latestEndpointVersion}`);).
Looks like maybe the install/upgrade failed, or maybe a race condition between concurrent runs?
Wondering if this line:
await supertest.post(`/api/fleet/setup`).set('kbn-xsrf', 'xxxx').expect(200);
could actually return 200 but have some "warnings"/"errors" in the body?
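For illustration, a minimal sketch of that check, assuming the setup endpoint reports problems in a nonFatalErrors field of the response body (that field name is an assumption and may vary by Kibana version):

// Hedged sketch: assert on the body as well as the status code, since the call
// can return 200 while still reporting errors. `nonFatalErrors` is assumed here,
// and `log` is the FTR tooling log available in these tests.
const setupResponse = await supertest
  .post('/api/fleet/setup')
  .set('kbn-xsrf', 'xxxx')
  .expect(200);

if (setupResponse.body.nonFatalErrors?.length) {
  log.warning(`Fleet setup reported errors: ${JSON.stringify(setupResponse.body.nonFatalErrors)}`);
}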
New failure: CI Build - main
Latest failure:
└- ✖ fail: Fleet Endpoints EPM Endpoints setup api setup performs upgrades upgrades the endpoint package from 0.13.0 to the latest version available
Error: expected '0.13.0' to sort of equal '1.5.0'
+ expected - actual
-0.13.0
+1.5.0
Kibana error:
[WARN][plugins.fleet] Failed installing package [endpoint] due to error: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[DEBUG][plugins.fleet] Running required package policies upgrades for managed policies
[DEBUG][plugins.fleet] Setting up Fleet enrollment keys
[INFO][plugins.fleet] Encountered non fatal errors during Fleet setup
[INFO][plugins.fleet] {"name":"ResponseError","message":"validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;"}
@joshdover I'm not sure how to proceed here. I've run a couple of flaky test runner jobs for this test and they passed fine.
I've been trying to find where the error comes from, and it looks like it's coming from Elasticsearch, based on another issue I found with the same message (here is the search I used). So it may be something to do with the cluster health? In the logs we do see the health go from yellow to green after the error:
[00:01:08] │ info [o.e.c.m.MetadataCreateIndexService] [ftr] [metrics-endpoint.metadata_current_default] creating index, cause [api], templates [metrics-metadata-current], shards [1]/[1]
[00:01:08] │ info [o.e.c.r.a.AllocationService] [ftr] updating number_of_replicas to [0] for indices [metrics-endpoint.metadata_current_default]
[00:01:08] │ proc [kibana] [2022-03-20T04:56:14.358+00:00][WARN ][plugins.fleet] Failure to install package [endpoint]: [ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE;]
[00:01:08] │ info [o.e.c.r.a.AllocationService] [ftr] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[metrics-endpoint.metadata_current_default][0]]])." previous.health="YELLOW" reason="shards started [[metrics-endpoint.metadata_current_default][0]]"
I can't think of any debug logging we would want to add to get more info, and if we re-enable the test I think it's likely to eventually fail again.
One thing that would at least help improve the UX here is a more specific error message, so we know this is failing during transform creation. We could also check the cluster health before starting the install process and bail early if it's yellow? (rough sketch below)
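As a rough sketch of that pre-check (not the actual Fleet implementation; assumes the v8 @elastic/elasticsearch client, where the response body is returned directly):

import { Client } from '@elastic/elasticsearch';

// Hedged sketch: wait for the cluster to reach green before starting the package
// install, and bail early if it does not get there in time. Names and the timeout
// value are illustrative, not Fleet's real code.
async function ensureClusterReadyForInstall(esClient: Client): Promise<void> {
  const health = await esClient.cluster.health({
    wait_for_status: 'green',
    timeout: '30s',
  });
  if (health.timed_out) {
    throw new Error(`Cluster health is still ${health.status}; skipping package install for now`);
  }
}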
@sophiec20 Hi, we just mentioned an error we were seeing on the Fleet side, potentially while installing a transform; there's more detail in my comment above, but here is the error we see:
ResponseError: validation_exception: [validation_exception] Reason: Validation Failed: 1: Failed to test query, received status: SERVICE_UNAVAILABLE
I only suspected that it was from transform creation based on a slightly similar SDH I found. The endpoint package creates 2 transforms, defined here.
One has source index "metrics-endpoint.metadata-*" and dest index "metrics-endpoint.metadata_current_default".
In the logs above I can see that metrics-endpoint.metadata_current_default goes from yellow to green after we receive the error:
current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[metrics-endpoint.metadata_current_default][0]]])." previous.health="YELLOW" reason="shards started [[metrics-endpoint.metadata_current_default][0]]"
Would the index being yellow/not having shards started potentially cause this error?
@pzl interesting comments above re: transform ☝️
@hop-dev The SERVICE_UNAVAILABLE validation exception is caused by the composite aggregation failing to execute on the source indices. It is unfortunately too generic a message and hides the underlying failure reason; the most likely cause could be:
The best way to troubleshoot this would be to check the output from calling GET _transform/<transform_id>/_preview, but I appreciate that might not be possible or useful if the error is transient.
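For example, a hedged sketch using the @elastic/elasticsearch client (the transform id below is a placeholder, not the real endpoint package transform id):

// Preview an existing transform to surface the underlying query/aggregation error
// instead of the generic validation_exception.
const preview = await esClient.transform.previewTransform({
  transform_id: 'endpoint.metadata_current-default-1.5.0', // placeholder id
});
console.log(JSON.stringify(preview, null, 2));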
I suspect that using a source index wildcard might avoid issues with timing if the source index is not yet available. Can you try this?
Also, is it possible to get the whole error response? On reflection, I would think that there is more information returned in caused_by.reason.
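A hedged sketch of pulling that out on the Kibana side (assumes the standard Elasticsearch error body shape and the ResponseError type exposed by @elastic/elasticsearch; installTransforms and logger are placeholders, not real Fleet code):

import { errors } from '@elastic/elasticsearch';

try {
  await installTransforms(); // placeholder for the package transform install step
} catch (err) {
  if (err instanceof errors.ResponseError) {
    // caused_by.reason usually carries the underlying failure behind the generic
    // validation_exception message; the exact body shape is an assumption here.
    const causedBy = (err.body as any)?.error?.caused_by?.reason;
    logger.warn(`Transform creation failed: ${err.message}; caused_by: ${causedBy ?? 'unknown'}`);
  }
  throw err;
}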
Closing due to inactivity / lack of additional failures.
New failure: kibana-es-forward-compatibility-testing - 7.17
A test failed on a tracked branch
First failure: CI Build - 8.0