Closed by cbuescher 5 years ago
Pinging @elastic/es-distributed
This test failed again in exactly the same way in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+artifactory/440/console
It still doesn't reproduce locally using:
./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes" \
-Dtests.seed=2521069C2420ECE5 \
-Dtests.security.manager=true \
-Dtests.locale=es-SV \
-Dtests.timezone=America/Los_Angeles \
-Dcompiler.java=12 \
-Druntime.java=8
It has failed 6 times now, although only ever in 7.0 and 7.x. I have muted it in these branches:
7.0: 1966347618c44006e8bed87c9bac9ea142b74347
7.x: d8a2970fa40d34676242d520502fcd00a2a2fbfa
Also failed on master, muted in d184056bf5cf9968db95c16cb27f6f2ec505bc3e
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:712)
at org.junit.Assert.assertNotNull(Assert.java:722)
at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes(RemoteClusterServiceTests.java:714)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.base/java.lang.Thread.run(Thread.java:834)
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=140, name=elasticsearch[transport_worker][T#1], state=RUNNABLE, group=TGRP-RemoteClusterServiceTests]
Caused by: java.lang.AssertionError: should not be called
at __randomizedtesting.SeedInfo.seed([AE9F4D8CEB8B632C]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:698)
at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:694)
at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:501)
at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:494)
at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:222)
at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:210)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1098)
at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216)
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141)
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105)
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:645)
at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:669)
at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:242)
at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:213)
at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:47)
at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:119)
at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:121)
at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:397)
at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246)
at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
at java.base/java.lang.Thread.run(Thread.java:834)
./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes" \
-Dtests.seed=AE9F4D8CEB8B632C \
-Dtests.security.manager=true \
-Dtests.locale=os-GE \
-Dtests.timezone=Asia/Jakarta \
-Dcompiler.java=12 \
-Druntime.java=11
As we are missing the detailed logs here I've unmuted the test on master.
Good news: I have a failure for you to look at :)
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/448/console
Thanks @hub-cap. Unfortunately there's not enough logging yet. I've pushed 8a1a000da03 to increase the log levels. If there's another failure, please report it here.
Finally, we have TRACE-level logs: https://scans.gradle.com/s/yepruddzyxi3s/console-log/raw and https://scans.gradle.com/s/ouz4wjgdwui54/console-log/raw as well as an explanation for the failure.
The test fails because the RemoteClusterService in this test manages to successfully establish connections after the transport service for the nodes it should be connecting to has been closed. The reason for this is that the test is interacting with other tests that run in parallel, and is establishing connections to nodes from those other tests.
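To illustrate the mechanism, here is a standalone Java sketch (not Elasticsearch code): the test infrastructure uses fixed port numbers (10300 and 10302 in the logs below), so the moment one suite's transport closes, a parallel suite can bind the same address and answer connections intended for the first. Binding to port 0 instead lets the OS hand out distinct ephemeral ports.

```java
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class PortCollisionDemo {
    public static void main(String[] args) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();

        // Binding to port 0 lets the OS pick a free ephemeral port, so two
        // concurrently running "suites" can never share an address.
        try (ServerSocket suiteA = new ServerSocket(0, 50, loopback);
             ServerSocket suiteB = new ServerSocket(0, 50, loopback)) {
            System.out.println("distinct ephemeral ports: "
                    + (suiteA.getLocalPort() != suiteB.getLocalPort()));
        }

        // A fixed port behaves differently: while one suite holds it, a
        // second bind fails; but as soon as the first suite closes, another
        // suite can take over the address and answer connections that were
        // meant for the closed one -- the cross-test interaction seen here.
        try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
            int fixedPort = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(fixedPort, 50, loopback)) {
                System.out.println("unexpected: second bind on " + fixedPort + " succeeded");
            } catch (BindException expected) {
                System.out.println("second bind on an in-use fixed port: BindException");
            }
        }
    }
}
```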
Relevant log lines (from the first failure):
[2019-07-03T00:46:21,884][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectNodes] all source nodes are closed
...
[2019-07-03T00:46:21,904][DEBUG][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterServiceTests] [cluster_2] opening connection to seed node: [{cluster_2#127.0.0.1:10302}{yE0gNKf8TzSSNFh1qFF3Tw}{127.0.0.1}{127.0.0.1:10302}{dim}] proxy address: [null]
[2019-07-03T00:46:21,904][DEBUG][o.e.t.ConnectionManager ] [org.elasticsearch.transport.RemoteClusterServiceTests] connected to node [{node_remote4}{1hDgz5PSTvW7RUqr660YZA}{5AP317rDQYGEjD1siYBQiA}{127.0.0.1}{127.0.0.1:10300}{dim}]
...
[2019-07-03T00:46:21,909][TRACE][o.e.t.T.tracer ] [[elasticsearch[transport_worker][T#1]]] [21][cluster:monitor/state] sent to [{cluster_1#127.0.0.1:10300}{NHyXpgPsRgWuL58zCl9QGQ}{127.0.0.1}{127.0.0.1:10300}{dim}] (timeout: [null])
[2019-07-03T00:46:21,909][TRACE][o.e.t.TransportLogger ] [[elasticsearch[transport_worker][T#1]]] NioSocketChannel{localAddress=/127.0.0.1:60974, remoteAddress=/127.0.0.1:10300} [length: 274, request id: 21, type: response, version: 8.0.0] READ: 274B
[2019-07-03T00:46:21,910][TRACE][o.e.t.T.tracer ] [[elasticsearch[transport_worker][T#1]]] [21][cluster:monitor/state] received response from [{cluster_1#127.0.0.1:10300}{NHyXpgPsRgWuL58zCl9QGQ}{127.0.0.1}{127.0.0.1:10300}{dim}]
This shows a connection to a node named node_remote4. A quick grep shows that TransportSearchActionTests.startTransport() uses "node_remote" + i as the node name. TransportSearchActionTests.testCollectSearchShards seems to have run at about the same time as testCollectNodes.
The second failure shows something similar (a connection to a node discoverable_node_added6), which suggests a concurrent race with RemoteClusterConnectionTests.testConnectedNodesConcurrentAccess:
[2019-07-02T13:39:31,679][DEBUG][o.e.t.ConnectionManager ] [org.elasticsearch.transport.RemoteClusterServiceTests] connected to node [{discoverable_node_added6}{GYyBCC8CQ-6xot8_AkVCVw}{tYVyxrxUThS_TgDCsrPOaw}{127.0.0.1}{127.0.0.1:10302}{dim}]
That test started at
[2019-07-02T03:39:31,391][INFO ][o.e.t.RemoteClusterConnectionTests] [testConnectedNodesConcurrentAccess] before test
[2019-07-02T03:39:31,655][INFO ][o.e.t.TransportService ] [[Thread-96]] publish_address {127.0.0.1:10302}, bound_addresses {[::1]:10322}, {127.0.0.1:10302}
and finished at
[2019-07-02T03:39:31,794][INFO ][o.e.t.RemoteClusterConnectionTests] [testConnectedNodesConcurrentAccess] after test
so it perfectly explains the interaction between the two tests.
I'm not sure yet how to proceed on this: in the presence of other nodes popping up under the same address, we can't make the assertions the test currently makes.
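One possible direction, sketched below as a hypothetical standalone example (the helper, the node names, and the prefix convention are all invented, not the actual fix): instead of asserting that every looked-up node is non-null, which a stray node from a parallel suite can satisfy, the test could accept only nodes whose names match the ones it started itself.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterStrayNodes {
    // Hypothetical helper: keep only the nodes this test itself started,
    // identified here by a name prefix (an invented convention).
    static Set<String> ownNodes(List<String> collected, Predicate<String> startedByThisTest) {
        return collected.stream().filter(startedByThisTest).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // "node_remote4" plays the stray node from the parallel suite.
        List<String> collected = List.of("cluster_1_node0", "node_remote4", "cluster_2_node1");
        Set<String> own = ownNodes(collected, name -> name.startsWith("cluster_"));
        System.out.println(own.contains("node_remote4") ? "stray node accepted" : "stray node filtered");
        System.out.println("own nodes: " + own.size());
    }
}
```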
On 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+multijob-unix-compatibility/os=sles-12/99/console
Could not reproduce locally so far: