Open droberts195 opened 3 years ago
Pinging @elastic/es-ql (Team:QL)
There are 4 test classes with the name RestSqlIT
All have had failures in the last 7 days apart from org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT
https://build-stats.elastic.co/app/kibana#/discover?_g=h@33d9122&_a=h@19cd33a
The test is muted in the base class in #80126
The failures won't reproduce locally, but all CI failures contain the warning "There are still tasks running after this test that might break subsequent tests" (also not reproducible locally) after every test preceding the failing one, which I suppose is related to the test failing.
The leftover task is indices:data/read/sql/async/get.
What might be happening is: "all shards failed" because the async search results index exists but is not ready to search yet, i.e. the problem of #65846.
If you agree, then the fix could be to wait for pending tasks to complete at the end of the previous test. If you search for waitForPendingTasks in IntelliJ you'll find a few places where ML does this in order to stop spillover from one test to the next.
Please also upvote #65846.
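For illustration, the idea of waiting for leftover tasks before the next test starts can be sketched as a plain polling loop. This is a minimal, self-contained sketch and not the Elasticsearch test framework's waitForPendingTasks: the class PendingTaskWaiter, the method waitForTasksToDrain, and the listTasks supplier (standing in for a call to the tasks API) are all hypothetical names. In this sketch the predicate selects the tasks to wait for; the predicate semantics of the real helper should be checked against its source before relying on it.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not part of Elasticsearch.
class PendingTaskWaiter {

    /**
     * Polls listTasks until none of the returned task names matches
     * shouldWaitFor, or until the timeout elapses.
     *
     * @return true if no matching task is left, false on timeout or interrupt
     */
    static boolean waitForTasksToDrain(
            Supplier<List<String>> listTasks,   // assumed stand-in for a tasks API call
            Predicate<String> shouldWaitFor,    // selects the tasks we care about
            Duration timeout,
            Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            boolean anyPending = listTasks.get().stream().anyMatch(shouldWaitFor);
            if (!anyPending) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test teardown could call this with a predicate such as name -> name.startsWith("indices:data/read/sql/async/get") and fail the test if it returns false, so that a leftover async get task is reported against the test that left it behind rather than the next one.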
because the async search results index exists but is not ready to search yet
Maybe this scenario could happen: there are failures occurring when fetching the async results. But there are also failures triggered just by fetching the async task's status (like here and here), or even by starting the async job (like here). Also, all failures of this test are 404s rather than 503s (as in #65846), and the reason, when available, is "No search context found for id".
It also seems these types of failures for this test started after the 27th of last month (there are a couple of other failures, but those are timeout-related).
So it might still be the suggested root cause, but I'd like to poke around a bit more.
Pinging @elastic/es-analytical-engine (Team:Analytics)
I tested the failures reported in https://github.com/elastic/elasticsearch/issues/76785 as well as those reported here, on both main (8.15) and 7.17, and all passed locally. However, as pointed out in one of the comments above, these failures only occur in CI, not locally, and they appear to be caused by unclosed previous async queries. So I'm making a PR that unmutes the test and adds the suggested call: waitForPendingTasks(adminClient(), taskName -> taskName.startsWith("indices:data/read/sql/async/get"));
To see both 8.15 and 7.17 running in CI, I made two PRs:
This issue has been closed because it has been open for too long with no activity.
Any muted tests that were associated with this issue have been unmuted.
If the tests begin failing again, a new issue will be opened, and they may be muted again.
This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.
Although the same test, this failure differs from #76785, because that was a timeout whereas this one is "all shards failed". I wonder if this is yet another manifestation of #65846?
Build scan: https://gradle-enterprise.elastic.co/s/xehunhrnkfhhi/tests/:x-pack:plugin:sql:qa:server:security:with-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated
Reproduction line:
./gradlew ':x-pack:plugin:sql:qa:server:security:with-ssl:integTest' --tests "org.elasticsearch.xpack.sql.qa.security.RestSqlIT.testAsyncTextPaginated" -Dtests.seed=B1D5C515E1E4486D -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=17
Applicable branches: master, 8.0, 7.16
Reproduces locally?: No
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.sql.qa.security.RestSqlIT&tests.test=testAsyncTextPaginated
Failure excerpt: