Closed kibanamachine closed 3 months ago
Pinging @elastic/kibana-core (Team:Core)
Caused by the same reason as https://github.com/elastic/kibana/issues/163273. Closing.
New failure: CI Build - main
Actual error from the last CI build:
FAIL src/core/server/integration_tests/saved_objects/migrations/group3/split_failed_to_clone.test.ts (63.424 s)
--
| ● when splitting .kibana into multiple indices and one clone fails › after resolving the problem and retrying the migration completes successfully
|
| expect(received).rejects.toThrowError(expected)
|
| Expected pattern: /cluster_shard_limit_exceeded/
| Received message: "Unable to complete saved object migrations for the [.kibana_task_manager] index. Please check the health of your Elasticsearch cluster and try again. Unexpected Elasticsearch ResponseError: statusCode: 404, method: GET, url: /_tasks/VpXIdVnGRjishpLCm0CrzA%3A431?wait_for_completion=true&timeout=120s error: [resource_not_found_exception]: task [VpXIdVnGRjishpLCm0CrzA:431] isn't running and hasn't stored its results,"
/skip
Started to break on PRs and main: https://github.com/elastic/kibana/commit/960b1a1fbe774215bcc457e83cb5f5ba78c99a83
New failure: kibana-on-merge - main
This is a new error:
Error: expect(received).rejects.toThrowError(expected)
Expected pattern: /cluster_shard_limit_exceeded/
Received message: "Unable to complete saved object migrations for the [.kibana_task_manager] index. Please check the health of your Elasticsearch cluster and try again. Unexpected Elasticsearch ResponseError: statusCode: 404, method: GET, url: /_tasks/VpXIdVnGRjishpLCm0CrzA%3A458?wait_for_completion=true&timeout=120s error: [resource_not_found_exception]: task [VpXIdVnGRjishpLCm0CrzA:458] isn't running and hasn't stored its results,"
I'll take a look at what could cause them.
UPDATE: It looks like the same error as reported in https://github.com/elastic/kibana/issues/163253#issuecomment-1853378582
I have identified the place where the error happens: it's the CLEANUP_UNKNOWN_AND_EXCLUDED_WAIT_FOR_TASK step.
When performing a compatible migration, we update documents in place, but we also clean up those that are unknown or excluded. A deleteByQuery is issued as part of the CLEANUP_UNKNOWN_AND_EXCLUDED step, and the following step waits for that task to complete using its taskId. That wait is the request that fails:
method: GET, url: /_tasks/VpXIdVnGRjishpLCm0CrzA%3A458?wait_for_completion=true&timeout=120s
error: [resource_not_found_exception]: task [VpXIdVnGRjishpLCm0CrzA:458] isn't running and hasn't stored its results
Even though we have a retry mechanism to keep trying until the task completes, this time we're getting a 404 Not Found.
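To make the distinction concrete, here is a minimal sketch of how a wait-for-task step could treat a timed-out wait differently from this 404. The types and the decideNextAction function are hypothetical names for illustration, not Kibana's actual migration code, and treating the 404 as non-retryable is an assumption based on the error message above.

```typescript
// Hypothetical result shapes for one poll of GET /_tasks/<taskId>?wait_for_completion=true.
// These are illustrative only and do not mirror Kibana's real types.
type TaskPollResult =
  | { kind: 'completed' }
  | { kind: 'timeout' } // the wait timed out while the task was still running
  | { kind: 'error'; statusCode: number; errorType: string };

type NextAction = 'done' | 'retry' | 'fail';

function decideNextAction(result: TaskPollResult): NextAction {
  // The task finished: the step can move on.
  if (result.kind === 'completed') return 'done';
  // The task is still running: waiting again is expected to help.
  if (result.kind === 'timeout') return 'retry';
  // Assumption: any error response, including the 404 resource_not_found_exception
  // ("isn't running and hasn't stored its results"), is not fixed by polling again.
  return 'fail';
}
```

Under these assumptions, a retry loop built on this function keeps polling on timeouts but surfaces the 404 immediately, which matches the failure seen in the test.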
I've searched for the error message, and ES throws this error in 2 places in this file (one of them when false == response.isExists()).
If we take a step back and look at the test, Rudolf is trying to make migrations fail on purpose, by doing:
// cause a failure when cloning .kibana_slow_clone_* indices
await client.cluster.putSettings({ persistent: { 'cluster.max_shards_per_node': 15 } });
after which, we expect
await expect(runMigrationsWhichFailsWhenCloning()).rejects.toThrowError(
/cluster_shard_limit_exceeded/
);
However, it seems that we might be failing before we attempt to create enough SO indices to cause the expected failure.
My theory is that this max_shards_per_node setting may be interfering with ES, preventing it from creating its internal .tasks index (where task results are stored).
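A rough way to reason about this theory: ES rejects index creation when the new shards would push the cluster past cluster.max_shards_per_node multiplied by the number of non-frozen data nodes, counting primaries and replicas of open indices. The sketch below only does that arithmetic. ShardBudget, remainingShards and canCreateIndex are hypothetical helpers, and the node and shard counts in the usage note are assumptions, not values taken from the test.

```typescript
// Hypothetical helper types for illustration; not part of Kibana or the ES client.
interface ShardBudget {
  maxShardsPerNode: number; // value of cluster.max_shards_per_node
  dataNodes: number; // number of non-frozen data nodes
  openShards: number; // primaries + replicas of all open indices
}

// Shards that can still be added before the cluster-wide limit is reached.
function remainingShards(budget: ShardBudget): number {
  return budget.maxShardsPerNode * budget.dataNodes - budget.openShards;
}

// An index with `primaries` primary shards and `replicas` replicas per primary
// needs primaries * (1 + replicas) shards in total.
function canCreateIndex(budget: ShardBudget, primaries: number, replicas: number): boolean {
  return primaries * (1 + replicas) <= remainingShards(budget);
}
```

For example, assuming a single data node with the test's limit of 15 and 15 shards already open, canCreateIndex returns false even for a one-shard index, so any index ES tried to create at that point would be rejected too.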
A test failed on a tracked branch
First failure: CI Build - main