Closed jfreden closed 4 days ago
Pinging @elastic/es-analytical-engine (Team:Analytics)
I've assigned this one to myself, it's caused by https://github.com/elastic/elasticsearch/pull/106140.
I had unmuted these tests on https://github.com/elastic/elasticsearch/pull/106140, they were originally muted via https://github.com/elastic/elasticsearch/pull/75237. I have re-muted them via https://github.com/elastic/elasticsearch/pull/106196.
Since https://github.com/elastic/elasticsearch/pull/94191 has been merged, the async search feature is being reset when we clean up the cluster between rest tests. However, I think the .async-search
index is being re-created as a race condition because of the async nature of writing to that index (or something 🤷).
So the flow in the cleanup code ends up being:
.async-search
index will be deleted if it exists.async-search
index is createdwipeAllIndices(...)
and it complains that .async-search
is a system index/cc @williamrandolph since we talked about exactly this sort of possibility
The failures reproduce really easily, btw.
joegallo@simulacron:~/Code/elastic/elasticsearch $ git reset --hard upstream/main
HEAD is now at 0dd6ce2df6c Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git log --oneline -1
0dd6ce2df6c (HEAD -> main, upstream/main, upstream/HEAD) Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git revert --no-edit 8be3fb229d8c29d7a3931ffcb5e03e0bb0476c0f
[main 27c07af19e4] Revert "Mute failing snippet tests (#106196)"
Date: Tue Mar 12 09:52:24 2024 -0400
1 file changed, 6 deletions(-)
And then it's just a matter of:
./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/sql/endpoints/rest/*}"
That doesn't fail 100% of the time, but it's pretty reliable (it failed 2 out of the 4 times I tried this on my laptop just now).
I've unassigned myself from this issue because this seems to be a legitimate issue and none of it's actually my area.
I suppose something like the following diff would square this away, but that seems contrary to the point of https://github.com/elastic/elasticsearch/pull/94191:
diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 307daddd17c..24f6424e71e 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -1109,7 +1109,7 @@ public abstract class ESRestTestCase extends ESTestCase {
try {
// remove all indices except some history indices which can pop up after deleting all data streams but shouldn't interfere
final List<String> indexPatterns = new ArrayList<>(
- List.of("*", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
+ List.of("*", "-.async-search", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
);
if (preserveSecurityIndices) {
indexPatterns.add("-.security-*");
Pinging @elastic/es-search (Team:Search)
Given the similarity to #104013, I'm labeling this one as low-risk
instead of blocker
(that is, labeled the same as that issue already is).
I think this one is similar to this:
REPRODUCE WITH: ./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/esql/esql-rest/line_297}" -Dtests.seed=53ABCE485C630DC4 -Dtests.locale=hu -Dtests.timezone=Asia/Irkutsk -Druntime.java=21
Pinging @elastic/es-core-infra (Team:Core/Infra)
Since this has to do with system indices and those being cleaned up, I am gonna tag core infra as well.
I don't know what the focus would be for search, since .async_search
is a system index, its management, etc. should be pretty transparent to us.
@williamrandolph would we have to add some logic on waiting for tasks or something in the async search cleanup?
It would be ideal if I could implement some cleanup logic by overriding SystemIndexPlugin#cleanUpFeature
in AsyncResultsIndexPlugin
. I'll take a look at doing that.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
I dug into this a bit. What I found confirms @joegallo hypothesis:
[2024-08-08T15:27:14,717][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T15:27:14,718][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T15:27:14,786][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [54]
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T15:27:14,790][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
What is happening is that the snippet starts a async query, but does not wait for it (the test actually replaces the wait_for_completion_timeout
with 0).
The test suite proceeds to cleanup, calling SystemIndexPlugin#cleanUpFeature
; this would cleanup .async-search
(and it does, in some rare cases, when the query completes faster than we reach cleanUpFeature
).
For example, these are the logs from a successful execution:
[2024-08-08T16:48:29,600][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [52]
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T16:48:29,604][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-08-08T16:48:29,605][INFO ][o.e.c.r.a.AllocationService] [node-0] updating number_of_replicas to [0] for indices [.async-search]
[2024-08-08T16:48:31,581][INFO ][o.e.x.m.MachineLearning ] [node-0] Starting machine learning feature reset
[2024-08-08T16:48:31,584][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T16:48:31,586][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T16:48:31,612][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [.async-search/922JLz6jRga6h4Xa2euk4w] deleting index
[2024-08-08T16:48:31,732][INFO ][o.e.x.m.MachineLearning ] [node-0] Finished machine learning feature reset
Notice how .async-search
is creater before AsyncResultsIndexPlugin#cleanUpFeature
is called here.
This means that this has little to do with how system indices are being cleaned up; it's more a test issue -- these tests launch a query, then forget about it, creating a race between the system index creation (when the query eventually completes) and cleanup. Overriding AsyncResultsIndexPlugin#cleanUpFeature
does no good either -- the test fails when creation happens after cleanUpFeature
and before wipeAllIndices
(which generates the deprecation warning, as it will find a system index to delete).
I see 4 ways of fixing this:
.async-search
. Not my preferred option TBHTEST[warning:]
(see docs). Not sure about this either, the deprecation warning could eventually become an error.// TEST[skip:]
)In any case, I think this is more an issue with how the test is written, not a system index problem, so I'm going to untag Core/Infra. But let me know if you think this is not right and still need our input.
This issue has been closed because it has been open for too long with no activity.
Any muted tests that were associated with this issue have been unmuted.
If the tests begin failing again, a new issue will be opened, and they may be muted again.
Wait for tasks completion? Here
I'd wait for all async searches/ESQL queries to complete at the end of the test. That's the kind of thing we used to do with these tests. To try to make them as "clean" between tests as possible without stopping and restarting.
Build scan: https://gradle-enterprise.elastic.co/s/u3ln5vofvnbtw/tests/:docs:yamlRestTest/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fsql%2Fendpoints%2Frest%2Fline_574%7D
Reproduction line:
Applicable branches: main
Reproduces locally?: Yes
Failure history: Failure dashboard for
org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT#test {yaml=reference/sql/endpoints/rest/line_574}
&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('test%20%7Byaml%3Dreference/sql/endpoints/rest/line_574%7D'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))Failure excerpt: