elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
68.8k stars 24.45k forks source link

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

Open alex-spies opened 5 months ago

alex-spies commented 5 months ago

All of these tests fail with an assertion error thrown in clearMlState (DatafeedJobsRestIT, line 1705):

In all cases, there are 2 pending tasks

    java.lang.AssertionError: 2 active tasks found:
    cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
    cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1 

The tasks seem to be exactly the same in all the tests; they could be from a preceding test.

Build scan: https://gradle-enterprise.elastic.co/s/iyd5l5a5elixk/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT/testLookbackOnlyWithKeywordMultiField

Reproduction line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.testLookbackOnlyWithKeywordMultiField" -Dtests.seed=8BE1A8358268DF09 -Dtests.locale=ar-SA -Dtests.timezone=Etc/GMT+10 -Druntime.java=21

Applicable branches: main

Reproduces locally?: No

Failure history: Failure dashboard for org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT#testLookbackOnlyWithKeywordMultiField&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testLookbackOnlyWithKeywordMultiField'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))

Failure excerpt:

java.lang.AssertionError: 2 active tasks found:
cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
 expected:<0> but was:<2>

  at __randomizedtesting.SeedInfo.seed([8BE1A8358268DF09:A3EF966D97D472]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.failNotEquals(Assert.java:835)
  at org.junit.Assert.assertEquals(Assert.java:647)
  at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForPendingTasks$5(ESRestTestCase.java:625)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1255)
  at org.elasticsearch.test.rest.ESRestTestCase.waitForPendingTasks(ESRestTestCase.java:596)
  at org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.clearMlState(DatafeedJobsRestIT.java:1705)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
elasticsearchmachine commented 5 months ago

Pinging @elastic/ml-core (Team:ML)

danielmitterdorfer commented 5 months ago

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

danielmitterdorfer commented 5 months ago

I've muted the entire suite in https://github.com/elastic/elasticsearch/pull/105387.

droberts195 commented 5 months ago

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

For this particular failure, the first failed test is testInsufficientSearchPrivilegesOnPutWithJob. It's the tasks left over from this that cause all the subsequent failures.

The tasks that are left behind are "Close Job" tasks from the feature reset at the end of the test.

The logs from that test are:

[2024-02-12T06:05:18,268][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] before test   
[2024-02-12T06:05:18,271][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] initializing REST clients against [http://[::1]:33405, http://127.0.0.1:43365/, http://[::1]:44817, http://127.0.0.1:35533/, http://[::1]:33307, http://127.0.0.1:34623/] 
[2024-02-12T06:05:18,326][WARN ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] This test is running on the legacy test framework; historical features from production code will not be available. You need to port the test to the new test plugins in order to use historical features from production code. If this is a legacy feature used only in tests, you can add it to a test-only FeatureSpecification such as org.elasticsearch.test.rest.RestTestLegacyFeatures. 
[2024-02-12T06:05:53,485][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] There are still tasks running after this test that might break subsequent tests [cluster:admin/xpack/ml/job/close, cluster:admin/xpack/ml/job/close[n], health-node[c]].  
[2024-02-12T06:05:53,486][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] after test

The corresponding server-side logs are:

[2024-02-12T10:05:20,144][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:20,163][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.security-7/wN4X6LmhQ5aRUiTplHSv6Q] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-annotations-000001/TZRK7zHRTh29VCkufU6rkw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-notifications-000002/tlCYHYjtRbuULe3IyJHeLw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-anomalies-shared/zEHUsV2VQPigXP1Bg6ePdA] deleting index
[2024-02-12T10:05:20,268][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-config/jkJoq62lQB-igtiqWsGzIQ] deleting index
[2024-02-12T10:05:20,306][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset
[2024-02-12T10:05:53,292][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:53,371][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset

It's interesting that there are actually two feature resets there.

Of the 3 nodes in the native multi-node tests, the node with the outstanding tasks is iov4A-EPQ2-2Qwgemj_22w, which is javaRestTest-2. The coordinating node for the feature resets in testInsufficientSearchPrivilegesOnPutWithJob is javaRestTest-1.

One thing that's strange about testInsufficientSearchPrivilegesOnPutWithJob having a "Close Job" call hang is that the whole point of the test is that it doesn't successfully create a job. So there should be nothing to close. The likely explanation is that the job that hangs "Close Job" actually comes from the previous test and somehow wasn't cleaned up correctly after that test.

The job that's likely to blame is job-for-start-datafeed-timeout which comes from DatafeedJobsIT.testStartDatafeed_GivenTimeout_Returns408. That is a test from a different suite to the one that failed here. DatafeedJobsIT is using the transport Java classes and DatafeedJobsRestIT is using the REST API. The differing cleanup was the subject of https://github.com/elastic/elasticsearch/issues/49582, and may be part of the problem here.

Another thing I've just noticed is that the suite that was muted was DatafeedJobsIT, and not DatafeedJobsRestIT which had the failures. Given that the suspicion is that a test from DatafeedJobsIT is the root cause of the failures here that might actually be fortuitous. It will be interesting to see if muting DatafeedJobsIT stops further failures in DatafeedJobsRestIT.

maxhniebergall commented 1 week ago

We haven't seen any more failures in DatafeedJobsRestIT since DatafeedJobsIT was muted, indicating that Dave R's theory is probably correct. I'm going to unmute DatafeedJobsIT and just mute the single test Dave R mentions above. Then we will be able to narrow down the problem.

maxhniebergall commented 6 days ago

related failures with mutes https://github.com/elastic/elasticsearch/issues/111319 https://github.com/elastic/elasticsearch/issues/111383 https://github.com/elastic/elasticsearch/issues/111384