Open alex-spies opened 5 months ago
Pinging @elastic/ml-core (Team:ML)
Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk
I've muted the entire suite in https://github.com/elastic/elasticsearch/pull/105387.
For this particular failure, the first failed test is testInsufficientSearchPrivilegesOnPutWithJob. It's the tasks left over from this test that cause all the subsequent failures.
The tasks that are left behind are "Close Job" tasks from the feature reset at the end of the test.
The logs from that test are:
[2024-02-12T06:05:18,268][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] before test
[2024-02-12T06:05:18,271][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] initializing REST clients against [http://[::1]:33405, http://127.0.0.1:43365/, http://[::1]:44817, http://127.0.0.1:35533/, http://[::1]:33307, http://127.0.0.1:34623/]
[2024-02-12T06:05:18,326][WARN ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] This test is running on the legacy test framework; historical features from production code will not be available. You need to port the test to the new test plugins in order to use historical features from production code. If this is a legacy feature used only in tests, you can add it to a test-only FeatureSpecification such as org.elasticsearch.test.rest.RestTestLegacyFeatures.
[2024-02-12T06:05:53,485][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] There are still tasks running after this test that might break subsequent tests [cluster:admin/xpack/ml/job/close, cluster:admin/xpack/ml/job/close[n], health-node[c]].
[2024-02-12T06:05:53,486][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] after test
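When comparing several build scans, it can help to pull the lingering task actions out of the warning line mechanically. The following is a minimal sketch, assuming the warning always has the wording shown in the log above; the class and method names (LingeringTasks, parseActions, closeJobActions) are hypothetical and not part of the Elasticsearch test framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triaging test logs; not part of Elasticsearch.
class LingeringTasks {
    // Captures the bracketed, comma-separated action list that ends the warning.
    // Assumption: the warning text matches the log line quoted in this issue.
    private static final Pattern WARNING = Pattern.compile(
        "There are still tasks running after this test that might break subsequent tests \\[(.*)\\]\\.");

    /** Returns the action names listed in the warning, or an empty list if the line is not that warning. */
    static List<String> parseActions(String logLine) {
        List<String> actions = new ArrayList<>();
        Matcher m = WARNING.matcher(logLine);
        if (m.find()) {
            for (String action : m.group(1).split(",\\s*")) {
                actions.add(action.trim());
            }
        }
        return actions;
    }

    /** Keeps only the ML "Close Job" actions, including variants with a bracketed suffix such as [n]. */
    static List<String> closeJobActions(List<String> actions) {
        List<String> result = new ArrayList<>();
        for (String action : actions) {
            if (action.startsWith("cluster:admin/xpack/ml/job/close")) {
                result.add(action);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "[2024-02-12T06:05:53,485][INFO ][o.e.x.m.i.DatafeedJobsRestIT] "
            + "[testInsufficientSearchPrivilegesOnPutWithJob] There are still tasks running after this test "
            + "that might break subsequent tests [cluster:admin/xpack/ml/job/close, "
            + "cluster:admin/xpack/ml/job/close[n], health-node[c]].";
        System.out.println(closeJobActions(parseActions(line)));
    }
}
```

For the warning above, parseActions yields three actions and closeJobActions keeps the two "Close Job" entries, leaving out health-node[c].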
The corresponding server-side logs are:
[2024-02-12T10:05:20,144][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:20,163][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.security-7/wN4X6LmhQ5aRUiTplHSv6Q] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-annotations-000001/TZRK7zHRTh29VCkufU6rkw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-notifications-000002/tlCYHYjtRbuULe3IyJHeLw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-anomalies-shared/zEHUsV2VQPigXP1Bg6ePdA] deleting index
[2024-02-12T10:05:20,268][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-config/jkJoq62lQB-igtiqWsGzIQ] deleting index
[2024-02-12T10:05:20,306][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-1] Finished machine learning feature reset
[2024-02-12T10:05:53,292][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:53,371][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-1] Finished machine learning feature reset
It's interesting that there are actually two feature resets there.
Of the 3 nodes in the native multi-node tests, the node with the outstanding tasks is iov4A-EPQ2-2Qwgemj_22w, which is javaRestTest-2. The coordinating node for the feature resets in testInsufficientSearchPrivilegesOnPutWithJob is javaRestTest-1.
One thing that's strange about testInsufficientSearchPrivilegesOnPutWithJob having a "Close Job" call hang is that the whole point of the test is that it doesn't successfully create a job, so there should be nothing to close. The likely explanation is that the job whose "Close Job" call hangs actually comes from the previous test and somehow wasn't cleaned up correctly after that test.
The job that's likely to blame is job-for-start-datafeed-timeout, which comes from DatafeedJobsIT.testStartDatafeed_GivenTimeout_Returns408. That is a test from a different suite than the one that failed here: DatafeedJobsIT uses the transport Java classes, whereas DatafeedJobsRestIT uses the REST API. The differing cleanup was the subject of https://github.com/elastic/elasticsearch/issues/49582, and may be part of the problem here.
Another thing I've just noticed is that the suite that was muted was DatafeedJobsIT, and not DatafeedJobsRestIT, which had the failures. Given the suspicion that a test from DatafeedJobsIT is the root cause of the failures here, that might actually be fortuitous. It will be interesting to see if muting DatafeedJobsIT stops further failures in DatafeedJobsRestIT.
We haven't seen any more failures in DatafeedJobsRestIT since DatafeedJobsIT was muted, indicating that Dave R's theory is probably correct. I'm going to unmute DatafeedJobsIT and mute only the single test Dave R mentions above. Then we will be able to narrow down the problem.
All of these tests fail with an assertion error thrown in clearMlState (DatafeedJobsRestIT, line 1705).
In all cases, there are 2 pending tasks.
The tasks seem to be exactly the same in all the tests; they could be from a preceding test.
Build scan: https://gradle-enterprise.elastic.co/s/iyd5l5a5elixk/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT/testLookbackOnlyWithKeywordMultiField
Reproduction line:
Applicable branches: main
Reproduces locally?: No
Failure history: Failure dashboard for org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT#testLookbackOnlyWithKeywordMultiField
Failure excerpt: