elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.48k stars 24.88k forks source link

[CI] RestSqlIT testAsyncTextPaginated failing #80089

Open droberts195 opened 3 years ago

droberts195 commented 3 years ago

Although the same test, this failure differs from #76785, because that was a timeout whereas this one is "all shards failed". I wonder if this is yet another manifestation of #65846?

Build scan: https://gradle-enterprise.elastic.co/s/xehunhrnkfhhi/tests/:x-pack:plugin:sql:qa:server:security:with-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated

Reproduction line: ./gradlew ':x-pack:plugin:sql:qa:server:security:with-ssl:integTest' --tests "org.elasticsearch.xpack.sql.qa.security.RestSqlIT.testAsyncTextPaginated" -Dtests.seed=B1D5C515E1E4486D -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=17

Applicable branches: master, 8.0, 7.16

Reproduces locally?: No

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.sql.qa.security.RestSqlIT&tests.test=testAsyncTextPaginated

Failure excerpt:

org.elasticsearch.client.ResponseException: method [GET], host [https://[::1]:45613], URI [/_sql/async/FkZhZEVPa2VyUmphUlRzQW5mMWYtX2cbb0U0bzB2UGlSWnVGZ2RRV2ZlU3dnQTo0MDMz?format=txt&wait_for_completion_timeout=1d], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}},"status":404}

  at __randomizedtesting.SeedInfo.seed([B1D5C515E1E4486D:FB65C4A44320DED2]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:335)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:276)
  at org.elasticsearch.xpack.sql.qa.rest.RestSqlTestCase.testAsyncTextPaginated(RestSqlTestCase.java:1344)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)
elasticmachine commented 3 years ago

Pinging @elastic/es-ql (Team:QL)

droberts195 commented 3 years ago

Same thing also happened on 8.0 - see https://gradle-enterprise.elastic.co/s/4sq2l2sm6y5yq/tests/:x-pack:plugin:sql:qa:server:security:without-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated

davidkyle commented 3 years ago

There are 4 test classes with the name RestSqlIT

All have had failures in the last 7 days apart from org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT

https://build-stats.elastic.co/app/kibana#/discover?_g=h@33d9122&_a=h@19cd33a

The test is muted in the base class in #80126

bpintea commented 3 years ago

The failures won't reproduce locally, but all CI failures contain the warning There are still tasks running after this test that might break subsequent tests (also not reproducible locally) after every test preceding this failing one, which I suppose has to do with the test failing.

droberts195 commented 3 years ago

The leftover task is indices:data/read/sql/async/get.

What might be happening is:

  1. The previous test leaves an async search in progress when it completes
  2. The cleanup that runs between tests deletes all indices
  3. The leftover async search causes the async search results index to be created
  4. This test runs a search that then suffers all shards failed because the async search results index exists but is not ready to search yet, i.e. the problem of #65846

If you agree then the fix could be to wait for pending tasks to complete at the end of the previous test. If you search for waitForPendingTasks in IntelliJ you'll find a few places where ML does this in order to stop spillover from one test to the next.

Please also upvote #65846.

bpintea commented 3 years ago

because the async search results index exists but is not ready to search yet

Maybe this scenario could happen, there are failures occurring when fetching the async results; but there are also failures triggered even by fetching the async task's status (like here and here), or even for starting the async job (like here). Then all failures of this test are 404s vs 503s (as in #65846) and the reason, when available is No search context found for id.

And it also seems these types of failures for this test started past the 27th last month (there are a couple of other failures, but time out-related).

So it might still be the suggested root cause, but I'd like to poke around a bit more.

elasticsearchmachine commented 10 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

craigtaverner commented 7 months ago

I tested both the failures reported in https://github.com/elastic/elasticsearch/issues/76785 as well as reported here for both main (8.15) and 7.17 and all passed locally. However, in one of the above comments it was pointed out that these failures only occur in CI, not locally, and they appear to be unclosed previous async queries. So I'm making a PR that unmutes the test as well as adds the suggested waitForPendingTasks(adminClient(), taskName -> taskName.startsWith("indices:data/read/sql/async/get"));.

craigtaverner commented 7 months ago

To see both 8.15 and 7.17 running in CI, I made two PRs:

elasticsearchmachine commented 3 weeks ago

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.

elasticsearchmachine commented 2 weeks ago

This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.