elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.07k stars 24.83k forks source link

[CI] DocsClientYamlTestSuiteIT test {yaml=reference/sql/endpoints/rest/line_574} failing #106158

Closed jfreden closed 4 days ago

jfreden commented 8 months ago

Build scan: https://gradle-enterprise.elastic.co/s/u3ln5vofvnbtw/tests/:docs:yamlRestTest/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fsql%2Fendpoints%2Frest%2Fline_574%7D

Reproduction line:

./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/sql/endpoints/rest/line_574}" -Dtests.seed=461A50DBB2EAB7FF -Dtests.locale=ar-SD -Dtests.timezone=SystemV/EST5 -Druntime.java=21

Applicable branches: main

Reproduces locally?: Yes

Failure history: Failure dashboard for org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT#test {yaml=reference/sql/endpoints/rest/line_574}&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('test%20%7Byaml%3Dreference/sql/endpoints/rest/line_574%7D'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))

Failure excerpt:

org.elasticsearch.client.WarningFailureException: method [DELETE], host [http://127.0.0.1:39763], URI [*,-.ds-ilm-history-*,-.ds-.slm-history-*?expand_wildcards=open%2Cclosed%2Chidden], status line [HTTP/1.1 200 OK]
Warnings: [this request accesses system indices: [.async-search], but in a future major version, direct access to system indices will be prevented by default]
{"acknowledged":true}

  at __randomizedtesting.SeedInfo.seed([461A50DBB2EAB7FF:CE4E6F011C16DA07]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.test.rest.ESRestTestCase.wipeAllIndices(ESRestTestCase.java:1113)
  at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:853)
  at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:539)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
elasticsearchmachine commented 8 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

joegallo commented 8 months ago

I've assigned this one to myself, it's caused by https://github.com/elastic/elasticsearch/pull/106140.

joegallo commented 8 months ago

https://gradle-enterprise.elastic.co/s/yocwlc3kri3jg/tests/task/:docs:yamlRestTest/details/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fsql%2Fendpoints%2Frest%2Fline_574%7D

joegallo commented 8 months ago

I had unmuted these tests on https://github.com/elastic/elasticsearch/pull/106140, they were originally muted via https://github.com/elastic/elasticsearch/pull/75237. I have re-muted them via https://github.com/elastic/elasticsearch/pull/106196.

Since https://github.com/elastic/elasticsearch/pull/94191 has been merged, the async search feature is being reset when we clean up the cluster between rest tests. However, I think the .async-search index is being re-created as a race condition because of the async nature of writing to that index (or something 🤷).

So the flow in the cleanup code ends up being:

  1. The test runs, it will create an async-search
  2. The async-search feature is reset, the .async-search index will be deleted if it exists
  3. The async-search is actually created and the .async-search index is created
  4. We get to wipeAllIndices(...) and it complains that .async-search is a system index

/cc @williamrandolph since we talked about exactly this sort of possibility

joegallo commented 8 months ago

The failures reproduce really easily, btw.

joegallo@simulacron:~/Code/elastic/elasticsearch $ git reset --hard upstream/main
HEAD is now at 0dd6ce2df6c Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git log --oneline -1
0dd6ce2df6c (HEAD -> main, upstream/main, upstream/HEAD) Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git revert --no-edit 8be3fb229d8c29d7a3931ffcb5e03e0bb0476c0f
[main 27c07af19e4] Revert "Mute failing snippet tests (#106196)"
 Date: Tue Mar 12 09:52:24 2024 -0400
 1 file changed, 6 deletions(-)

And then it's just a matter of:

./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/sql/endpoints/rest/*}"

That doesn't fail 100% of the time, but it's pretty reliable (it failed 2 out of the 4 times I tried this on my laptop just now).

joegallo commented 8 months ago

I've unassigned myself from this issue because this seems to be a legitimate issue and none of it's actually my area.

joegallo commented 8 months ago

I suppose something like the following diff would square this away, but that seems contrary to the point of https://github.com/elastic/elasticsearch/pull/94191:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 307daddd17c..24f6424e71e 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -1109,7 +1109,7 @@ public abstract class ESRestTestCase extends ESTestCase {
         try {
             // remove all indices except some history indices which can pop up after deleting all data streams but shouldn't interfere
             final List<String> indexPatterns = new ArrayList<>(
-                List.of("*", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
+                List.of("*", "-.async-search", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
             );
             if (preserveSecurityIndices) {
                 indexPatterns.add("-.security-*");
elasticsearchmachine commented 8 months ago

Pinging @elastic/es-search (Team:Search)

joegallo commented 8 months ago

Given the similarity to #104013, I'm labeling this one as low-risk instead of blocker (that is, labeled the same as that issue already is).

nik9000 commented 8 months ago

I think this one is similar to this:

REPRODUCE WITH: ./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/esql/esql-rest/line_297}" -Dtests.seed=53ABCE485C630DC4 -Dtests.locale=hu -Dtests.timezone=Asia/Irkutsk -Druntime.java=21
elasticsearchmachine commented 8 months ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

benwtrent commented 8 months ago

Since this has to do with system indices and those being cleaned up, I am gonna tag core infra as well.

I don't know what the focus would be for search, since .async_search is a system index, its management, etc. should be pretty transparent to us.

@williamrandolph would we have to add some logic on waiting for tasks or something in the async search cleanup?

davidkyle commented 7 months ago

Also failing on 8.13 https://gradle-enterprise.elastic.co/s/35culdxu6ttss/tests/:docs:yamlRestTest/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fesql%2Fesql-rest%2Fline_297%7D

williamrandolph commented 7 months ago

It would be ideal if I could implement some cleanup logic by overriding SystemIndexPlugin#cleanUpFeature in AsyncResultsIndexPlugin. I'll take a look at doing that.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

ldematte commented 3 months ago

I dug into this a bit. What I found confirms @joegallo hypothesis:

[2024-08-08T15:27:14,717][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T15:27:14,718][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T15:27:14,786][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [54]
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T15:27:14,790][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]

What is happening is that the snippet starts a async query, but does not wait for it (the test actually replaces the wait_for_completion_timeout with 0). The test suite proceeds to cleanup, calling SystemIndexPlugin#cleanUpFeature; this would cleanup .async-search (and it does, in some rare cases, when the query completes faster than we reach cleanUpFeature).

For example, these are the logs from a successful execution:

[2024-08-08T16:48:29,600][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [52]
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T16:48:29,604][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-08-08T16:48:29,605][INFO ][o.e.c.r.a.AllocationService] [node-0] updating number_of_replicas to [0] for indices [.async-search]
[2024-08-08T16:48:31,581][INFO ][o.e.x.m.MachineLearning  ] [node-0] Starting machine learning feature reset
[2024-08-08T16:48:31,584][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T16:48:31,586][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T16:48:31,612][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [.async-search/922JLz6jRga6h4Xa2euk4w] deleting index
[2024-08-08T16:48:31,732][INFO ][o.e.x.m.MachineLearning  ] [node-0] Finished machine learning feature reset

Notice how .async-search is creater before AsyncResultsIndexPlugin#cleanUpFeature is called here.

This means that this has little to do with how system indices are being cleaned up; it's more a test issue -- these tests launch a query, then forget about it, creating a race between the system index creation (when the query eventually completes) and cleanup. Overriding AsyncResultsIndexPlugin#cleanUpFeature does no good either -- the test fails when creation happens after cleanUpFeature and before wipeAllIndices (which generates the deprecation warning, as it will find a system index to delete).

ldematte commented 3 months ago

I see 4 ways of fixing this:

  1. what joe suggested: https://github.com/elastic/elasticsearch/issues/106158#issuecomment-1991738483 do not delete .async-search. Not my preferred option TBH
  2. in all async search tests, know that a deprecation warning can be generated and add TEST[warning:] (see docs). Not sure about this either, the deprecation warning could eventually become an error.
  3. Wait for tasks completion? Here https://github.com/elastic/elasticsearch/blob/9496fa37469d543d73e558bdcf0d710d63d5833b/docs/src/yamlRestTest/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java#L144 we already wait for ML tests.
  4. Skip execution on all these tests (transform mute to // TEST[skip:])
ldematte commented 3 months ago

In any case, I think this is more an issue with how the test is written, not a system index problem, so I'm going to untag Core/Infra. But let me know if you think this is not right and still need our input.

elasticsearchmachine commented 4 days ago

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.

nik9000 commented 3 days ago

Wait for tasks completion? Here

I'd wait for all async searches/ESQL queries to complete at the end of the test. That's the kind of thing we used to do with these tests. To try to make them as "clean" between tests as possible without stopping and restarting.