elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.53k stars 24.61k forks source link

[CI] InferenceIngestIT testPipelineIngestWithModelAliases failing #105955

Open fcofdez opened 6 months ago

fcofdez commented 6 months ago

The test failed in the single-processor-node-tests pipeline

Build scan: https://gradle-enterprise.elastic.co/s/g2ys6irgpclr4/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.InferenceIngestIT/testPipelineIngestWithModelAliases

Reproduction line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPipelineIngestWithModelAliases" -Dtests.seed=F41853BAD2E190FF -Dtests.configure_test_clusters_with_one_processor=true -Dtests.locale=et -Dtests.timezone=America/Asuncion -Druntime.java=21

Applicable branches: main

Reproduces locally?: Didn't try

Failure history: Failure dashboard for org.elasticsearch.xpack.ml.integration.InferenceIngestIT#testPipelineIngestWithModelAliases&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testPipelineIngestWithModelAliases'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.xpack.ml.integration.InferenceIngestIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))

Failure excerpt:

org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:44003], URI [/_ml/trained_models/_stats?size=10000], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[],"type":"exception","reason":"Searching for stats for models [lang_ident_model_1,test_regression_2,test_regression_1] failed","caused_by":{"type":"search_phase_execution_exception","reason":"","phase":"query","grouped":true,"failed_shards":[],"caused_by":{"type":"search_phase_execution_exception","reason":"Search rejected due to missing shards [[.ml-stats-000001][0]]. Consider using `allow_partial_search_results` setting to bypass this error.","phase":"query","grouped":true,"failed_shards":[]}}},"status":500}

  at __randomizedtesting.SeedInfo.seed([F41853BAD2E190FF:609C4D14DE135688]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.deleteAllTrainedModelIngestPipelines(MlRestTestStateCleaner.java:42)
  at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.resetFeatures(MlRestTestStateCleaner.java:33)
  at org.elasticsearch.xpack.ml.integration.InferenceIngestIT.cleanUpData(InferenceIngestIT.java:78)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
elasticsearchmachine commented 6 months ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 6 months ago

This is failing during post-test cleanup. The call that's failing is: https://github.com/elastic/elasticsearch/blob/7be2df922c5d3c0c1eca5da170d1352f7f6d7326/x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ml/integration/MlRestTestStateCleaner.java#L40

Probably what we should do is have our stats endpoints wait at least a couple of seconds for yellow status on the stats indices they want to search. If those indices are still unavailable after the short wait then these endpoints can fail like they do now. But waiting a small amount of time might help avoid these occasional failures.

Marking as low-risk as the code involved here hasn't changed for years.

alex-spies commented 4 months ago

I have more failures that look just like this:

alex-spies commented 4 months ago

Two more:

alex-spies commented 4 months ago

Muting as these two started failing repeatedly: https://github.com/elastic/elasticsearch/pull/107771