elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] XPackRestIT test {p0=ml/jobs_crud/Test reopen job resets the finished time} failing #86877

Closed: droberts195 closed this issue 2 years ago

droberts195 commented 2 years ago

The problem is not related to any particular test. The autodetect process couldn't run because starting a thread failed with "Permission denied". The same thing happened every time the process was started in this test suite:

[2022-05-18T07:11:32,959][ERROR][o.e.x.m.p.AbstractNativeProcess] [yamlRestTest-0] [jobs-crud-reset-finished-time] autodetect/303389 process stopped unexpectedly: Cannot create thread: Permission denied
Error joining thread: No such process
Fatal error: 'terminate called after throwing an instance of 'std::system_error'', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)
Fatal error: '  what():  Permission denied', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)
Fatal error: 'si_signo 11, si_code: 128, si_errno: 0, address: 0x7fbe8b800898, library: /lib/x86_64-linux-gnu/libc.so.6, base: 0x7fbe8b7d8000, normalized address: 0x28898', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)

Build scan: https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fjobs_crud%2FTest%20reopen%20job%20resets%20the%20finished%20time%7D

Reproduction line: ./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/jobs_crud/Test reopen job resets the finished time}" -Dtests.seed=37660F157587F87B -Dtests.locale=da -Dtests.timezone=Asia/Thimphu -Druntime.java=18

Applicable branches: 8.2

Reproduces locally?: No

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/jobs_crud/Test%20reopen%20job%20resets%20the%20finished%20time%7D

Failure excerpt:

java.lang.AssertionError: Failure at [ml/jobs_crud:1632]: expected [2xx] status code but api [ml.close_job] returned [409 Conflict] [{"error":{"root_cause":[{"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat 
org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}],"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat 
org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat 
org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"},"status":409}]

  at __randomizedtesting.SeedInfo.seed([37660F157587F87B:BF3230CFDB7B9583]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:503)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:472)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:577)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: java.lang.AssertionError: expected [2xx] status code but api [ml.close_job] returned [409 Conflict] [{"error":{"root_cause":[{"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat 
org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}],"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat 
org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat 
org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"},"status":409}]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:373)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:492)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:472)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.lang.reflect.Method.invoke(Method.java:577)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
    at java.lang.Thread.run(Thread.java:833)

elasticmachine commented 2 years ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 2 years ago

This happened on an Ubuntu 22.04 worker. It almost certainly means the system call filter in the ML native processes needs adjusting for a new kernel version.

/cc @bytebilly please don't add Ubuntu 22.04 to the Elasticsearch support matrix until this issue is fixed. It seems that until this is fixed ML is completely broken on this distribution. I will aim for 7.17.5/8.2.2/8.3.0.

mark-vieira commented 2 years ago

@droberts195 are we now good to add Ubuntu 22.04 to the general rotation as well as the testing matrix for 7.17?

droberts195 commented 2 years ago

The ML native processes will now work on Ubuntu 22.04 starting with 8.3.0, 8.2.2 and 7.17.5. But they'll never work for older versions. This is going to be problematic with the BWC tests. Any build that runs the X-Pack BWC tests against versions older than 8.3.0/8.2.2/7.17.5 on Ubuntu 22.04 is going to fail, and since we can't re-release those old versions that's going to be a problem forever.

Therefore we should probably do two things:

  1. Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from ldconfig --version) is 2.35 or above
  2. Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic builds
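The skip condition in item 1 can be sketched as a small predicate. This is only an illustration under stated assumptions: the function names are hypothetical, the fixed versions (7.17.5, 8.2.2, 8.3.0) are taken from this thread, and the glibc threshold defaults to 2.35 as in item 1 (a later comment mentions 2.34, so it is left as a parameter).

```python
from typing import Tuple

# Assumed threshold from item 1 above; adjust if 2.34 is the right cut-off.
DEFAULT_MIN_AFFECTED_GLIBC = (2, 35)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '8.2.1' or '8.2.1-SNAPSHOT' into (8, 2, 1)."""
    numeric = text.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def old_version_has_fix(old_version: str) -> bool:
    """True if the old version is 7.17.5+, 8.2.2+ or 8.3.0+ (per this thread)."""
    v = parse_version(old_version)
    if v >= (8, 3, 0):
        return True
    if v[:2] == (8, 2):
        return v >= (8, 2, 2)
    if v[0] == 7:
        return v >= (7, 17, 5)
    # 8.0.x, 8.1.x and anything before 7.x never received the fix.
    return False


def should_skip_ml_bwc(
    old_version: str,
    glibc_version: str,
    min_affected_glibc: Tuple[int, int] = DEFAULT_MIN_AFFECTED_GLIBC,
) -> bool:
    """Skip ML BWC tests when the OS is too new for the old version."""
    glibc_is_new = parse_version(glibc_version) >= min_affected_glibc
    return glibc_is_new and not old_version_has_fix(old_version)
```

For example, `should_skip_ml_bwc("7.17.4", "2.35")` is true, while the same old version on glibc 2.31 is not skipped.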

It's interesting that this has come about because our system call filtering (which was added to improve security/reduce attack surface in the event of a breach) has also defeated the Linux developers' BWC efforts. You'd expect a recent version of Linux to run all the software that older versions from the previous few years could run, and usually this would be the case with Ubuntu 22.04 and Ubuntu 20.04, but our system call filter prevents it. If we keep the system call filter then this is going to happen again in the future.

bytebilly commented 2 years ago

@droberts195 are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?

droberts195 commented 2 years ago

are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?

8.3 is fine. 7.17 needs to specifically say 7.17.5 and above. 7.17.0-7.17.4 will never work.

bytebilly commented 2 years ago

The matrix doesn't have this granularity, so I added a footnote to mention that.

mark-vieira commented 2 years ago

  • Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from ldconfig --version) is 2.35 or above

What's the best way to do this? Can we do this in the tests themselves with assertions?

  • Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic builds

This actually isn't a problem for PR builds since we only test snapshot versions there and those will include the fix. Only the periodic BWC builds are an issue.

droberts195 commented 2 years ago

What's the best way to do this? Can we do this in the tests themselves with assertions?

Most of the BWC tests are YAML tests.

I think the best way to skip those would be to conditionally add an entry to tests.rest.blacklist that is */*_ml_*/* if the glibc version is 2.34 or above and the old version being upgraded from is < 7.17.5, or >= 8.0.0 and < 8.2.2.

So to do that we'd somehow need to get Gradle to know the glibc version. On Linux it can be obtained by running ldd --version | grep '^ldd' | sed 's/.* \([1-9]\.[0-9]*\).*/\1/'. Or, if it's easier, just ldd --version can be run as an external command and the text processing done in the Gradle script.
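As a rough illustration of doing the text processing outside the shell, the same extraction can be sketched as below. The function names are hypothetical, and the sample line is only an illustrative example of what `ldd --version` prints, not output captured from the failing worker.

```python
import re
import subprocess
from typing import Optional


def glibc_version_from_ldd(output: str) -> Optional[str]:
    """Mirror grep '^ldd' | sed 's/.* \\([1-9]\\.[0-9]*\\).*/\\1/'."""
    for line in output.splitlines():
        if not line.startswith("ldd"):
            continue
        # The greedy '.* ' picks the last space-prefixed version on the
        # line, which is what the sed expression does as well.
        match = re.match(r".* ([1-9]\.[0-9]*)", line)
        if match:
            return match.group(1)
    return None


def detect_glibc_version() -> Optional[str]:
    """Run ldd --version; returns None if ldd is missing or unparseable."""
    try:
        result = subprocess.run(
            ["ldd", "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return glibc_version_from_ldd(result.stdout)


# Illustrative sample of the first line of output (an assumption).
SAMPLE = "ldd (Ubuntu GLIBC 2.35-0ubuntu3) 2.35\n"
print(glibc_version_from_ldd(SAMPLE))
```

Keeping the parsing separate from the process invocation means the regular expression can be checked against sample strings without needing a Linux machine.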

Is it possible to make Gradle run an external command during the configuration phase rather than as a task?

Then there are also a few BWC tests that are written in Java rather than YAML. As you say, those can call assumeFalse on the glibc version if it can be made available to them. So maybe we just have Gradle set a system property containing it to pass it through.

I don't think it will be too hard if you could just recommend the best way to get Gradle to run ldd --version early enough that the configuration of the test tasks can know the answer.

mark-vieira commented 2 years ago

Is it possible to make Gradle run an external command during the configuration phase rather than as a task?

It is, but it's highly discouraged since it's expensive to do so and adds overhead to every build invocation. That was my thought behind doing this in the test itself, since we'd only do it when attempting to execute the test. I'm wondering if we could implement such a filter in JUnit, even for the YAML tests. I'll have a look at this.

Alternatively, since this only applies to the BWC jobs, maybe we could inject the glibc version as an environment variable or something so we don't have to shell out to ldd during build configuration.

droberts195 commented 2 years ago

maybe we could inject the glibc version as an environment variable

Yes, that's a good idea. We could potentially add it to the per-worker Jenkins configuration for Linux workers. Then both the build.gradle for the YAML tests and the Java test classes would be able to access it.

Another thing we could potentially do is have the early bootstrap of the Java code (before installing system call filters) call this function using JNA and store the result in a variable that's available to other code later on. That would work nicely for the Java tests. But for the YAML tests we'd need to implement a new type of skip rule that could consider both glibc version and old cluster version. And that is problematic because all the client test harnesses have to understand the YAML syntax.

So, overall, adding a worker-specific environment variable is probably best.
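A minimal sketch of how a build script or test harness might consume such a variable follows. The variable name `GLIBC_VERSION` and both function names are hypothetical; only the `*/*_ml_*/*` pattern and the idea of a per-worker environment variable come from this thread, and the 2.35 default threshold is an assumption.

```python
import os
from typing import List, Mapping, Optional, Tuple

GLIBC_ENV_VAR = "GLIBC_VERSION"  # hypothetical name, not an existing setting
ML_BWC_PATTERN = "*/*_ml_*/*"    # blacklist entry suggested earlier in the thread


def glibc_from_env(env: Mapping[str, str] = os.environ) -> Optional[Tuple[int, int]]:
    """Read 'major.minor' from the environment; None if absent or malformed."""
    raw = env.get(GLIBC_ENV_VAR, "").strip()
    parts = raw.split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def extra_blacklist(
    old_version_has_fix: bool,
    env: Mapping[str, str] = os.environ,
    threshold: Tuple[int, int] = (2, 35),  # assumed cut-off
) -> List[str]:
    """Return the extra blacklist entries for this worker, possibly empty."""
    glibc = glibc_from_env(env)
    if glibc is None or old_version_has_fix:
        # Unknown glibc (e.g. non-Linux worker) or a fixed old version:
        # run the ML BWC tests as usual.
        return []
    return [ML_BWC_PATTERN] if glibc >= threshold else []
```

Treating a missing or malformed variable as "do not skip" keeps workers without the setting on the existing behaviour.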

mark-vieira commented 2 years ago

@droberts195 Do we have an exhaustive list of all the tests we should mute in this scenario? I notice that not all ML tests fail: https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/overview?class=org.elasticsearch.xpack.test.rest.XPackRestIT&test=test%20%7Bp0%3Dml/*

Should we skip all ML tests in BWC scenarios, or only individual ones? I'm leaning towards the former so we don't find ourselves in a whack-a-mole situation.

droberts195 commented 2 years ago

@mark-vieira yes, I agree we should mute all the ML BWC YAML tests when we detect the OS is too new for the old version to work. Otherwise, like you say, almost every newly added test is likely to need another iteration of observing failures, opening issues and adding to the list of tests to mute.

The ones that work currently will be the ones that don't use any ML C++ functionality. But those ones are unlikely to fail in platform-specific ways, so there's not much point adding extra complexity to test them on a distribution where the rest of ML doesn't work.