elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.15k stars 24.84k forks source link

[CI] InferenceIT testInferClassificationModel failing #108778

Closed cbuescher closed 1 week ago

cbuescher commented 5 months ago

Build scan: https://gradle-enterprise.elastic.co/s/5ornffogc35je/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.InferenceIT/testInferClassificationModel

Reproduction line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.InferenceIT.testInferClassificationModel" -Dtests.seed=5FC4DC30BB6AF63 -Dtests.locale=fi -Dtests.timezone=Atlantic/Canary -Druntime.java=22

Applicable branches: main

Reproduces locally?: No

Failure history: Failure dashboard for org.elasticsearch.xpack.ml.integration.InferenceIT#testInferClassificationModel&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testInferClassificationModel'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.xpack.ml.integration.InferenceIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))

Failure excerpt:

org.elasticsearch.client.ResponseException: method [PUT], host [http://[::1]:36353], URI [_ml/trained_models/test_classification], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"node_not_connected_exception","reason":"[external_15][127.0.0.1:17711] Node not connected"}],"type":"status_exception","reason":"Unable to retrieve task information for model id [test_classification]","caused_by":{"type":"failed_node_exception","reason":"Failed node [Ko8fkv__T_-pW2knAAAAAA]","node_id":"Ko8fkv__T_-pW2knAAAAAA","caused_by":{"type":"node_not_connected_exception","reason":"[external_15][127.0.0.1:17711] Node not connected"}}},"status":500}

  at __randomizedtesting.SeedInfo.seed([5FC4DC30BB6AF63:9BAA81402FF58B6A]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.xpack.ml.integration.InferenceIT.putModel(InferenceIT.java:326)
  at org.elasticsearch.xpack.ml.integration.InferenceIT.testInferClassificationModel(InferenceIT.java:93)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1570)
elasticsearchmachine commented 5 months ago

Pinging @elastic/ml-core (Team:ML)

davidkyle commented 5 months ago

In a 3 node cluster 2 of the nodes can't connect to the 3rd. Checking the logs the disconnected node appears to have gone in feature reset loop.

I can't see any obvious reason for the reset loop but one concern is that reset code is tying up a network thread, hence the node fell out of the cluster.

I've assigned medium risk based on the facts that this code is long established, has not been modified recently and this is the first failure of this kind I have found.

[2024-05-17T13:15:02,544][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:02,622][DEBUG][o.e.x.m.a.TransportCancelJobModelSnapshotUpgradeAction] [javaRestTest-0] [_all] cancel model snapshot [_all] upgrades
[2024-05-17T13:15:02,623][DEBUG][o.e.x.m.a.TransportKillProcessAction] [javaRestTest-0] [*] Cannot kill the process because job(s) are not open
[2024-05-17T13:15:02,625][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Received request to stop data frame analytics [_all]
[2024-05-17T13:15:02,625][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Resolved data frame analytics to stop: []
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/CfE6_bXRTXG_PQVrSWjV6A] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-notifications-000002/FQgd2g8cQI6mp2S7AuugEw] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-native-000002/cWhGVDKdQcmkkTuc5s6XDQ] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/brlBcFhGTPOzrQN5jzoHgw] deleting index
[2024-05-17T13:15:02,693][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:02,716][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:02,762][DEBUG][o.e.x.m.a.TransportCancelJobModelSnapshotUpgradeAction] [javaRestTest-0] [_all] cancel model snapshot [_all] upgrades
[2024-05-17T13:15:02,763][DEBUG][o.e.x.m.a.TransportKillProcessAction] [javaRestTest-0] [*] Cannot kill the process because job(s) are not open
[2024-05-17T13:15:02,764][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Received request to stop data frame analytics [_all]
[2024-05-17T13:15:02,765][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Resolved data frame analytics to stop: []
[2024-05-17T13:15:02,804][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:02,806][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [text-expansion-pipeline-test-index/xhXxcR54TTeH_Cz-qy3jEA] deleting index
[2024-05-17T13:15:02,959][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:03,007][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:03,231][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-annotations-000001][0]]"
[2024-05-17T13:15:03,248][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,321][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/B1nXfB8wTj26mKIZZy5bAg] deleting index
[2024-05-17T13:15:03,321][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/OrAILs_nSR6LFCB2I3VwPg] deleting index
[2024-05-17T13:15:03,383][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,409][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,499][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,528][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:03,573][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:03,770][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-annotations-000001][0]]"
[2024-05-17T13:15:03,770][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,866][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/XqfInErASiu-OZOwRLxnKA] deleting index
[2024-05-17T13:15:03,866][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/Ar9ILsa3QwS_JBu-cJ1PXQ] deleting index
[2024-05-17T13:15:03,908][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,932][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,019][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:04,046][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:04,090][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:04,235][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-native-000002] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:04,417][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,525][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-inference-native-000002][0]]])." previous.health="YELLOW" reason="shards started [[.ml-inference-native-000002][0]]"
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/zNhk6U9tSVKecPGrfApFIg] deleting index
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/6nF42COrSAS4Uw4EOv9mxQ] deleting index
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-native-000002/h6wlOfd0Qy60JFPzBiT7GQ] deleting index
[2024-05-17T13:15:04,612][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:04,636][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,732][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,040][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,128][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,151][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,223][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,271][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:05,316][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:05,510][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-inference-000005][0], [.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-inference-000005][0], [.ml-annotations-000001][0]]"
[2024-05-17T13:15:05,531][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,532][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-notifications-000002] creating index, cause [auto(bulk api)], templates [.ml-notifications-000002], shards [1]/[1]
[2024-05-17T13:15:05,674][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/mtyJd8vPRyC_1_kxdtcVQg] deleting index
[2024-05-17T13:15:05,674][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-notifications-000002/CEkMpqBLQI2rARXXrSVJwQ] deleting index
[2024-05-17T13:15:05,680][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/YQKHwEr-TCi82dFf_rIG4A] deleting index
[2024-05-17T13:15:05,751][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,775][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,862][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-0] Finished machine learning feature reset
elasticsearchmachine commented 1 week ago

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.