elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] FullClusterRestartIT classMethod failing #94126

Closed: benwtrent closed this issue 9 months ago

benwtrent commented 1 year ago

This is a weird one: in the shutdown plugin, for some reason, we attempt to clean up indices before the cluster state is even set?

Maybe the test is executing too quickly in the class set up with the new testing framework?

Maybe related to: https://github.com/elastic/elasticsearch/pull/93477

Looking at the Shutdown plugin, that is the most recent change that could cause this.

Build scan: https://gradle-enterprise.elastic.co/s/cqm7svnjudyac/tests/:x-pack:qa:full-cluster-restart:v8.6.3%23bwcTest/org.elasticsearch.xpack.restart.FullClusterRestartIT

Reproduction line:

null

Applicable branches: 8.7

Reproduces locally?: No

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.restart.FullClusterRestartIT&tests.test=classMethod

Failure excerpt:

java.lang.RuntimeException: An error occurred orchestrating test cluster.

  at __randomizedtesting.SeedInfo.seed([FA80D8BB0D0C7909]:0)
  at org.elasticsearch.test.cluster.local.LocalClusterHandle.execute(LocalClusterHandle.java:225)
  at org.elasticsearch.test.cluster.local.LocalClusterHandle.writeUnicastHostsFile(LocalClusterHandle.java:206)
  at org.elasticsearch.test.cluster.local.LocalClusterHandle.waitUntilReady(LocalClusterHandle.java:149)
  at org.elasticsearch.test.cluster.local.LocalClusterHandle.start(LocalClusterHandle.java:70)
  at org.elasticsearch.test.cluster.local.LocalElasticsearchCluster$1.evaluate(LocalElasticsearchCluster.java:38)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: java.lang.RuntimeException: Elasticsearch process died while waiting for ports file. See console output for details.

    at org.elasticsearch.test.cluster.local.LocalClusterFactory$Node.lambda$waitUntilReady$0(LocalClusterFactory.java:200)
    at org.elasticsearch.test.cluster.util.Retry.lambda$retryUntilTrue$0(Retry.java:33)
    at org.elasticsearch.test.cluster.util.Retry.lambda$getValueWithTimeout$1(Retry.java:47)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
    at java.lang.Thread.run(Thread.java:833)
elasticsearchmachine commented 1 year ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-delivery (Team:Delivery)

mark-vieira commented 1 year ago

> Maybe the test is executing too quickly in the class set up with the new testing framework?

No tests have even executed at this point, this is just starting the cluster so I don't think it's test setup related unless something went wonky with actually creating the test cluster itself.

DaveCTurner commented 1 year ago

"Elasticsearch process died while waiting for ports file" usually means it hit an AssertionError while starting up, and indeed that seems to be the case here:

$ cat x-pack/plugin/shutdown/qa/full-cluster-restart/build/testrun/v8.6.3_bwcTest/temp/test-cluster9392893976202056573/test-cluster-1/logs/elasticsearch_server.json | grep -e 'AssertionError' | jq '."error.stack_trace"' -cMr
java.lang.AssertionError: initial cluster state not set yet
        at org.elasticsearch.server@8.6.3-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.state(ClusterApplierService.java:181)
        at org.elasticsearch.server@8.6.3-SNAPSHOT/org.elasticsearch.cluster.service.ClusterService.state(ClusterService.java:141)
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalExporter.onCleanUpIndices(LocalExporter.java:598)
        at org.elasticsearch.xpack.monitoring.cleaner.CleanerService$IndicesCleaner.doRunInLifecycle(CleanerService.java:164)
        at org.elasticsearch.server@8.6.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractLifecycleRunnable.doRun(AbstractLifecycleRunnable.java:56)
        at org.elasticsearch.server@8.6.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
        at org.elasticsearch.server@8.6.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1589)
elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

mark-vieira commented 1 year ago

> Elasticsearch process died while waiting for ports file usually means it hit an AssertionError while starting up, and indeed that seems to be the case here:

FYI, this error is available in the build scans directly as well: https://gradle-enterprise.elastic.co/s/cqm7svnjudyac/tests/:x-pack:qa:full-cluster-restart:v8.6.3%23bwcTest/org.elasticsearch.xpack.restart.FullClusterRestartIT?focused-execution=1&page=eyJvdXRwdXQiOnsiMCI6MX19&top-execution=1#L199

rjernst commented 1 year ago

Shouldn't the unicast hosts file have been created?

[2023-02-25T01:00:00,022][WARN ][o.e.d.FileBasedSeedHostsProvider] [test-cluster-0] expected, but did not find, a dynamic hosts list at [/dev/shm/elastic+elasticsearch+8.7+intake+multijob+bwc-snapshots/x-pack/qa/full-cluster-restart/build/testrun/v8.6.3_bwcTest/temp/test-cluster11922099425716326952/test-cluster-0/config/unicast_hosts.txt]

DaveCTurner commented 1 year ago

> Shouldn't the unicast hosts file have been created?

I'd only expect that to happen after all the nodes have written their ports files, and "died while waiting for ports file" indicates that one of the nodes failed before it even got that far.

mark-vieira commented 1 year ago

> I'd only expect that to happen after all the nodes have written their ports files

Correct, we start the nodes, and then when they all successfully come up we write the unicast hosts file. So that warning message is expected.
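That orchestration can be sketched as follows. This is an illustrative sketch with a hypothetical `writeIfAllReady` helper, not the real test-cluster orchestrator: each node writes a ports file once it is up, and only after every ports file exists does the orchestrator write `unicast_hosts.txt`. If any node dies first, the wait times out, and the "did not find a dynamic hosts list" warning on the surviving nodes is expected.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch, not the actual LocalClusterHandle code: the unicast
// hosts file is written only once every node's ports file exists.
public class UnicastHostsWriter {
    static boolean writeIfAllReady(List<Path> portsFiles, Path unicastHosts) throws IOException {
        for (Path p : portsFiles) {
            if (!Files.exists(p)) {
                return false; // a node has not come up yet; do not write the hosts file
            }
        }
        StringBuilder hosts = new StringBuilder();
        for (Path p : portsFiles) {
            hosts.append(Files.readString(p).trim()).append('\n');
        }
        Files.writeString(unicastHosts, hosts.toString());
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        Path ports0 = dir.resolve("ports-0");
        Path ports1 = dir.resolve("ports-1");
        Files.writeString(ports0, "127.0.0.1:9300\n");
        // Only one node is up: nothing is written yet, mirroring the WARN log above.
        System.out.println(writeIfAllReady(List.of(ports0, ports1), dir.resolve("unicast_hosts.txt"))); // false
        Files.writeString(ports1, "127.0.0.1:9301\n");
        System.out.println(writeIfAllReady(List.of(ports0, ports1), dir.resolve("unicast_hosts.txt"))); // true
    }
}
```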

DaveCTurner commented 1 year ago

https://gradle-enterprise.elastic.co/s/uu46t6srqlywi is the same thing.

masseyke commented 1 year ago

I think I see what's going on here: We start all the plugin components (including monitoring's CleanerService) here: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/node/Node.java#L1407. That's before we set the initial cluster state here: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/node/Node.java#L1490. So the CleanerService schedules a task that uses the cluster state before the cluster state has been set. Ordinarily that's not a problem because it only kicks off once a day -- at 1:00 am. These two tests both fail at 1:00 am.

I'm not sure what the best way to fix this is though. It seems odd that we're starting plugin components before the cluster state exists, but I assume there's a reason. I could change CleanerService to catch the exception, just in case someone starts their server at 1:00 am, but that seems a little clunky.
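The 1:00 am coincidence can be sketched as follows. This is an illustrative sketch with a hypothetical `delayUntilNextRun` helper, not the actual CleanerService code: a fixed time-of-day schedule yields a zero delay when the node starts at exactly that instant, so the cleanup task fires immediately and races node startup, before the initial cluster state is set.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Illustrative sketch: a once-a-day schedule anchored to a wall-clock time.
public class DailySchedule {
    static Duration delayUntilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (next.isBefore(now)) {
            next = next.plusDays(1); // already past today's slot; run tomorrow
        }
        return Duration.between(now, next);
    }

    public static void main(String[] args) {
        LocalDateTime atOneAm = LocalDateTime.of(2023, 2, 25, 1, 0, 0);
        // Starting exactly at 01:00 yields a zero delay: the cleanup task runs
        // immediately, while the node is still starting up.
        System.out.println(delayUntilNextRun(atOneAm, LocalTime.of(1, 0))); // PT0S
        // Starting a minute later pushes the run to the next day.
        System.out.println(delayUntilNextRun(atOneAm.plusMinutes(1), LocalTime.of(1, 0))); // PT23H59M
    }
}
```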

masseyke commented 1 year ago

Oh we actually already check for this condition when assertions are disabled:

        ClusterState clusterState = clusterService.state();
        if (clusterService.localNode() == null
            || clusterState == null
            || clusterState.blocks().hasGlobalBlockWithLevel(ClusterBlockLevel.METADATA_WRITE)) {
            logger.debug("exporter not ready");
            return;
        }

But since the check for the cluster state being null comes after the call to localNode(), I think we'd get an NPE in production.
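Reordered, the guard would check the state for null before anything that could dereference it. A hedged sketch with simplified stand-in types (not the real Elasticsearch classes): a call during startup then returns "not ready" instead of throwing.

```java
// Hedged sketch with stand-in types: the null check on the cluster state
// comes first, so nothing dereferences a not-yet-published state.
public class ExporterGuard {
    record Node(String name) {}
    record ClusterState(Node localNode, boolean metadataWriteBlocked) {}

    /** Returns true only when it is safe for the exporter to proceed. */
    static boolean ready(ClusterState state) {
        if (state == null                       // initial cluster state not set yet
            || state.localNode() == null        // node not fully started
            || state.metadataWriteBlocked()) {  // global METADATA_WRITE block
            return false;                       // "exporter not ready"
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ready(null));                                    // false: no NPE at startup
        System.out.println(ready(new ClusterState(new Node("n1"), false))); // true
    }
}
```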

masseyke commented 1 year ago

SearchableSnapshotsUsageTracker has a similar problem, but avoids this by waiting 15 minutes before running: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/xpack/searchablesnapshots/SearchableSnapshots.java#L369. It looks like I can have the Monitoring plugin implement ClusterPlugin, and start the CleanerService schedule in onNodeStarted. I'm not sure if this is the intended use of ClusterPlugin, but it seems like it will work.

rjernst commented 1 year ago

SearchableSnapshotsUsageTracker does not explicitly try to avoid this issue. The 15 minutes there is the frequency at which we poll for usage.

> It seems odd that we're starting plugin components before the cluster state exists, but I assume there's a reason.

Lifecycle components added by plugins are the first to be started when starting the node. Loading of initial state happens near the end of start(). I agree it is odd, in that the core system has not yet been started, but plugin components are started. Perhaps this is something that could change, though I don't know the ramifications (eg to security components starting later).

> and start the CleanerService schedule in onNodeStarted

As you guessed, that is not the intention of onNodeStarted. That method is meant for notification, not active logic (it was originally added to notify downstream services, namely systemd, that the node was finished starting up).

I think this can be remedied by using cluster state updates instead of asking the cluster service for the current state. LocalExporter is already a ClusterStateListener. Capture the cluster state there (or whatever portion is needed), and operate on that within the exporter. The ClusterService dependency is then only needed in the ctor, to add itself as a listener.
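The listener-based pattern described here can be sketched with stand-in types (this is a hedged illustration, not the actual LocalExporter code): the component captures each applied cluster state in a reference, and the periodic task simply skips its run while no state has arrived yet.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hedged sketch: capture cluster state via listener callbacks instead of
// pulling it from the cluster service on demand.
public class CapturingExporter {
    record ClusterState(long version) {} // stand-in for the real ClusterState

    private final AtomicReference<ClusterState> lastState = new AtomicReference<>();

    // ClusterStateListener callback, invoked on every applied state update.
    void clusterChanged(ClusterState newState) {
        lastState.set(newState);
    }

    /** The scheduled cleanup: a no-op until the first state is captured. */
    boolean cleanUpIndices() {
        ClusterState state = lastState.get();
        if (state == null) {
            return false; // not ready yet; never throws or trips an assertion
        }
        // ... operate on the captured state ...
        return true;
    }

    public static void main(String[] args) {
        CapturingExporter exporter = new CapturingExporter();
        System.out.println(exporter.cleanUpIndices()); // false: no state yet
        exporter.clusterChanged(new ClusterState(1));
        System.out.println(exporter.cleanUpIndices()); // true
    }
}
```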

masseyke commented 1 year ago

> SearchableSnapshotsUsageTracker does not explicitly try to avoid this issue. The 15 minutes there is the frequency at which we poll for usage.

Yeah I wasn't saying there was any intent there -- just that that's why we don't see problems with it (I accidentally caused it to fail when I left my debugger paused for more than 15 minutes).

> I think this can be remedied by using cluster state updates instead of asking the cluster service for the current state.

That makes sense. I think I need to go better understand LocalExporter first though -- there are quite a few places where we're currently getting the cluster state from the cluster service (and 3 places where we pass the cluster service outside of the object). I don't want to accidentally make things worse.

craigtaverner commented 11 months ago

This failure has occurred in all 19 of the 19 runs in the last week: https://es-delivery-stats.elastic.dev/app/dashboards#/view/dcec9e60-72ac-11ee-8f39-55975ded9e63?_g=(refreshInterval:(pause:!t,value:60000),time:(from:now-7d%2Fd,to:now)) So it would seem this is no longer low risk?

The latest failure has a very similar, but not identical, error message at https://gradle-enterprise.elastic.co/s/2vyruoe3ko26o/tests/task/:x-pack:qa:full-cluster-restart:v8.11.2%23bwcTest/details/org.elasticsearch.xpack.restart.FullClusterRestartIT?top-execution=1:

java.lang.RuntimeException: An error occurred orchestrating test cluster.

  at __randomizedtesting.SeedInfo.seed([E5793B190DFD6819]:0)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.execute(DefaultLocalClusterHandle.java:259)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.writeUnicastHostsFile(DefaultLocalClusterHandle.java:240)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.waitUntilReady(DefaultLocalClusterHandle.java:183)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.start(DefaultLocalClusterHandle.java:74)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:38)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: java.lang.RuntimeException: Timed out after PT2M waiting for ports files for: { cluster: 'test-cluster', node: 'test-cluster-1' }

    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.waitUntilReady(AbstractLocalClusterFactory.java:285)
    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.getTransportEndpoint(AbstractLocalClusterFactory.java:204)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.util.AbstractList$RandomAccessSpliterator.forEachRemaining(AbstractList.java:720)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:686)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.lambda$writeUnicastHostsFile$12(DefaultLocalClusterHandle.java:240)
    at java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1428)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
astefan commented 11 months ago

Good point. Another failure here: https://gradle-enterprise.elastic.co/s/a4v5sqwk6tm5u

  2> REPRODUCE WITH: ./gradlew ':x-pack:qa:full-cluster-restart:v7.17.13#bwcTest' -Dtests.class="org.elasticsearch.xpack.restart.MlConfigIndexMappingsFullClusterRestartIT" -Dtests.method="testMlConfigIndexMappingsAfterMigration {cluster=UPGRADED}" -Dtests.seed=C7889275F6BC65BA -Dtests.bwc=true -Dtests.locale=es-UY -Dtests.timezone=Asia/Dubai -Druntime.java=21
  2> java.lang.RuntimeException: An error occurred orchestrating test cluster.
        at __randomizedtesting.SeedInfo.seed([C7889275F6BC65BA:C3E99C049D79D2D7]:0)
        at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.execute(DefaultLocalClusterHandle.java:259)
        at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.writeUnicastHostsFile(DefaultLocalClusterHandle.java:240)
        at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.waitUntilReady(DefaultLocalClusterHandle.java:183)
        at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.upgradeToVersion(DefaultLocalClusterHandle.java:161)
        at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster.upgradeToVersion(DefaultLocalElasticsearchCluster.java:140)
        at org.elasticsearch.upgrades.ParameterizedFullClusterRestartTestCase.maybeUpgrade(ParameterizedFullClusterRestartTestCase.java:93)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:39)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base/java.lang.Thread.run(Thread.java:1583)

        Caused by:
        java.lang.RuntimeException: Timed out after PT2M waiting for ports files for: { cluster: 'test-cluster', node: 'test-cluster-1' }
            at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.waitUntilReady(AbstractLocalClusterFactory.java:285)
            at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.getTransportEndpoint(AbstractLocalClusterFactory.java:204)
            at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
            at java.base/java.util.AbstractList$RandomAccessSpliterator.forEachRemaining(AbstractList.java:722)
            at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
            at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
            at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
            at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
            at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
            at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
            at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
            at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:667)
            at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
            at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
            at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
            at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.lambda$writeUnicastHostsFile$12(DefaultLocalClusterHandle.java:240)
            at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1456)
            at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
            at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
            at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
mark-vieira commented 10 months ago

> This failure has occurred in all 19 of the 19 runs in the last week: https://es-delivery-stats.elastic.dev/app/dashboards#/view/dcec9e60-72ac-11ee-8f39-55975ded9e63?_g=(refreshInterval:(pause:!t,value:60000),time:(from:now-7d%2Fd,to:now)) So it would seem this is no longer low risk?

I don't think this is accurate. This test runs thousands of times a week. Likely what you are seeing is that if you filter that dashboard for classMethod you are effectively only seeing failures, since the "classMethod" test isn't really a thing; it's just how failures in test suite setup are reported.

iverase commented 9 months ago

Another one: https://gradle-enterprise.elastic.co/s/c7u532g6k2zxg

kingherc commented 9 months ago

Similar errors at:

joegallo commented 9 months ago

A smattering of these from today:

mark-vieira commented 9 months ago

I've opened https://github.com/elastic/elasticsearch/pull/104166 to help with this.

masseyke commented 9 months ago

It looks like @tvernum fixed this in #100565 (the original issue, not the CI issue that began being discussed here on Nov 21).