Closed cbuescher closed 1 week ago
Pinging @elastic/ml-core (Team:ML)
In a 3 node cluster 2 of the nodes can't connect to the 3rd. Checking the logs the disconnected node appears to have gone in feature reset loop.
I can't see any obvious reason for the reset loop but one concern is that reset code is tying up a network thread, hence the node fell out of the cluster.
I've assigned medium risk based on the facts that this code is long established, has not been modified recently and this is the first failure of this kind I have found.
[2024-05-17T13:15:02,544][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:02,622][DEBUG][o.e.x.m.a.TransportCancelJobModelSnapshotUpgradeAction] [javaRestTest-0] [_all] cancel model snapshot [_all] upgrades
[2024-05-17T13:15:02,623][DEBUG][o.e.x.m.a.TransportKillProcessAction] [javaRestTest-0] [*] Cannot kill the process because job(s) are not open
[2024-05-17T13:15:02,625][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Received request to stop data frame analytics [_all]
[2024-05-17T13:15:02,625][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Resolved data frame analytics to stop: []
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/CfE6_bXRTXG_PQVrSWjV6A] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-notifications-000002/FQgd2g8cQI6mp2S7AuugEw] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-native-000002/cWhGVDKdQcmkkTuc5s6XDQ] deleting index
[2024-05-17T13:15:02,647][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/brlBcFhGTPOzrQN5jzoHgw] deleting index
[2024-05-17T13:15:02,693][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:02,716][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:02,762][DEBUG][o.e.x.m.a.TransportCancelJobModelSnapshotUpgradeAction] [javaRestTest-0] [_all] cancel model snapshot [_all] upgrades
[2024-05-17T13:15:02,763][DEBUG][o.e.x.m.a.TransportKillProcessAction] [javaRestTest-0] [*] Cannot kill the process because job(s) are not open
[2024-05-17T13:15:02,764][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Received request to stop data frame analytics [_all]
[2024-05-17T13:15:02,765][DEBUG][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [javaRestTest-0] Resolved data frame analytics to stop: []
[2024-05-17T13:15:02,804][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:02,806][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [text-expansion-pipeline-test-index/xhXxcR54TTeH_Cz-qy3jEA] deleting index
[2024-05-17T13:15:02,959][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:03,007][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:03,231][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-annotations-000001][0]]"
[2024-05-17T13:15:03,248][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,321][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/B1nXfB8wTj26mKIZZy5bAg] deleting index
[2024-05-17T13:15:03,321][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/OrAILs_nSR6LFCB2I3VwPg] deleting index
[2024-05-17T13:15:03,383][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,409][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,499][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,528][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:03,573][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:03,770][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-annotations-000001][0]]"
[2024-05-17T13:15:03,770][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:03,866][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/XqfInErASiu-OZOwRLxnKA] deleting index
[2024-05-17T13:15:03,866][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/Ar9ILsa3QwS_JBu-cJ1PXQ] deleting index
[2024-05-17T13:15:03,908][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:03,932][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,019][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:04,046][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:04,090][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:04,235][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-native-000002] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:04,417][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,525][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-inference-native-000002][0]]])." previous.health="YELLOW" reason="shards started [[.ml-inference-native-000002][0]]"
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/zNhk6U9tSVKecPGrfApFIg] deleting index
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/6nF42COrSAS4Uw4EOv9mxQ] deleting index
[2024-05-17T13:15:04,567][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-native-000002/h6wlOfd0Qy60JFPzBiT7GQ] deleting index
[2024-05-17T13:15:04,612][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:04,636][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:04,732][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,040][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,128][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,151][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,223][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,271][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-inference-000005] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-05-17T13:15:05,316][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-annotations-000001] creating index, cause [api], templates [], shards [1]/[1]
[2024-05-17T13:15:05,510][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-inference-000005][0], [.ml-annotations-000001][0]]])." previous.health="YELLOW" reason="shards started [[.ml-inference-000005][0], [.ml-annotations-000001][0]]"
[2024-05-17T13:15:05,531][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,532][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.ml-notifications-000002] creating index, cause [auto(bulk api)], templates [.ml-notifications-000002], shards [1]/[1]
[2024-05-17T13:15:05,674][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-annotations-000001/mtyJd8vPRyC_1_kxdtcVQg] deleting index
[2024-05-17T13:15:05,674][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-notifications-000002/CEkMpqBLQI2rARXXrSVJwQ] deleting index
[2024-05-17T13:15:05,680][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.ml-inference-000005/YQKHwEr-TCi82dFf_rIG4A] deleting index
[2024-05-17T13:15:05,751][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
[2024-05-17T13:15:05,775][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Starting machine learning feature reset
[2024-05-17T13:15:05,862][INFO ][o.e.x.m.MachineLearning ] [javaRestTest-0] Finished machine learning feature reset
This issue has been closed because it has been open for too long with no activity.
Any muted tests that were associated with this issue have been unmuted.
If the tests begin failing again, a new issue will be opened, and they may be muted again.
Build scan: https://gradle-enterprise.elastic.co/s/5ornffogc35je/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.InferenceIT/testInferClassificationModel
Reproduction line:
Applicable branches: main
Reproduces locally?: No
Failure history: Failure dashboard for
org.elasticsearch.xpack.ml.integration.InferenceIT#testInferClassificationModel
&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testInferClassificationModel'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.xpack.ml.integration.InferenceIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))Failure excerpt: