elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.56k stars 24.62k forks source link

RemoteClusterServiceTests#testCollectNodes failure #41067

Closed cbuescher closed 5 years ago

cbuescher commented 5 years ago

On 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+multijob-unix-compatibility/os=sles-12/99/console

Could not reproduce locally so far:

./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes" \
  -Dtests.seed=BCAFFE09C8A7A4A3 \
  -Dtests.security.manager=true \
  -Dtests.locale=th \
  -Dtests.timezone=Africa/Nouakchott \
  -Dcompiler.java=12 \
  -Druntime.java=8
java.lang.AssertionError
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertNotNull(Assert.java:712)
    at org.junit.Assert.assertNotNull(Assert.java:722)
    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes(RemoteClusterServiceTests.java:721)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)

and

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1095, name=elasticsearch[transport_worker][T#2], state=RUNNABLE, group=TGRP-RemoteClusterServiceTests]
Caused by: java.lang.AssertionError: should not be called
    at __randomizedtesting.SeedInfo.seed([BCAFFE09C8A7A4A3]:0)
    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:705)
    at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:701)
    at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:500)
    at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:493)
    at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:222)
    at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:210)
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1114)
    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
    at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:967)
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:942)
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753)
    at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:780)
    at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:241)
    at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:213)
    at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:47)
    at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:119)
    at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:398)
    at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246)
    at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
    at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
    at java.lang.Thread.run(Thread.java:748)
elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

droberts195 commented 5 years ago

This test failed again in exactly the same way in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+artifactory/440/console

It still doesn't reproduce locally using:

./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes" \
  -Dtests.seed=2521069C2420ECE5 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-SV \
  -Dtests.timezone=America/Los_Angeles \
  -Dcompiler.java=12 \
  -Druntime.java=8

It has failed 6 times now&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'test:testCollectNodes'),sort:!(time,desc))), although only ever in 7.0 and 7.x. I have muted it in these branches:

davidkyle commented 5 years ago

Also failed on master, muted in d184056bf5cf9968db95c16cb27f6f2ec505bc3e

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-6/364/console

java.lang.AssertionError
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertNotNull(Assert.java:712)
    at org.junit.Assert.assertNotNull(Assert.java:722)
    at org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes(RemoteClusterServiceTests.java:714)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.base/java.lang.Thread.run(Thread.java:834)
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=140, name=elasticsearch[transport_worker][T#1], state=RUNNABLE, group=TGRP-RemoteClusterServiceTests]
Caused by: java.lang.AssertionError: should not be called
    at __randomizedtesting.SeedInfo.seed([AE9F4D8CEB8B632C]:0)
    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:698)
    at org.elasticsearch.transport.RemoteClusterServiceTests$3.onResponse(RemoteClusterServiceTests.java:694)
    at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:501)
    at org.elasticsearch.transport.RemoteClusterService$2.onResponse(RemoteClusterService.java:494)
    at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:222)
    at org.elasticsearch.transport.RemoteClusterConnection$1.handleResponse(RemoteClusterConnection.java:210)
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1098)
    at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192)
    at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216)
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141)
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105)
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:645)
    at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:669)
    at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:242)
    at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:213)
    at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:47)
    at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:119)
    at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:121)
    at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:397)
    at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246)
    at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
    at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
    at java.base/java.lang.Thread.run(Thread.java:834)
./gradlew :server:test --tests "org.elasticsearch.transport.RemoteClusterServiceTests.testCollectNodes" \
  -Dtests.seed=AE9F4D8CEB8B632C \
  -Dtests.security.manager=true \
  -Dtests.locale=os-GE \
  -Dtests.timezone=Asia/Jakarta \
  -Dcompiler.java=12 \
  -Druntime.java=11
ywelsch commented 5 years ago

As we are missing the detailed logs here I've unmuted the test on master.

hub-cap commented 5 years ago

good news, I have a failure for you to look at :)

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/448/console

ywelsch commented 5 years ago

thanks @hub-cap. Unfortunately it's not enough logging. I've pushed 8a1a000da03 to increase log levels. If there's another failure, please report here.

ywelsch commented 5 years ago

Finally, we have TRACE-level logs: https://scans.gradle.com/s/yepruddzyxi3s/console-log/raw and https://scans.gradle.com/s/ouz4wjgdwui54/console-log/raw as well as an explanation for the failure.

The test fails because the RemoteClusterService in this test manages to successfully establish connections after closing the transport service for the nodes that it should be connecting to.

The reason for this is that this test is interacting with other tests that run in parallel, and establishing connections to nodes from different tests.

Relevant log lines (from the first failure):

[2019-07-03T00:46:21,884][INFO ][o.e.t.RemoteClusterServiceTests] [testCollectNodes] all source nodes are closed
...
[2019-07-03T00:46:21,904][DEBUG][o.e.t.RemoteClusterConnection] [org.elasticsearch.transport.RemoteClusterServiceTests] [cluster_2] opening connection to seed node: [{cluster_2#127.0.0.1:10302}{yE0gNKf8TzSSNFh1qFF3Tw}{127.0.0.1}{127.0.0.1:10302}{dim}] proxy address: [null]
[2019-07-03T00:46:21,904][DEBUG][o.e.t.ConnectionManager  ] [org.elasticsearch.transport.RemoteClusterServiceTests] connected to node [{node_remote4}{1hDgz5PSTvW7RUqr660YZA}{5AP317rDQYGEjD1siYBQiA}{127.0.0.1}{127.0.0.1:10300}{dim}]
...
[2019-07-03T00:46:21,909][TRACE][o.e.t.T.tracer           ] [[elasticsearch[transport_worker][T#1]]] [21][cluster:monitor/state] sent to [{cluster_1#127.0.0.1:10300}{NHyXpgPsRgWuL58zCl9QGQ}{127.0.0.1}{127.0.0.1:10300}{dim}] (timeout: [null])
[2019-07-03T00:46:21,909][TRACE][o.e.t.TransportLogger    ] [[elasticsearch[transport_worker][T#1]]] NioSocketChannel{localAddress=/127.0.0.1:60974, remoteAddress=/127.0.0.1:10300} [length: 274, request id: 21, type: response, version: 8.0.0] READ: 274B
[2019-07-03T00:46:21,910][TRACE][o.e.t.T.tracer           ] [[elasticsearch[transport_worker][T#1]]] [21][cluster:monitor/state] received response from [{cluster_1#127.0.0.1:10300}{NHyXpgPsRgWuL58zCl9QGQ}{127.0.0.1}{127.0.0.1:10300}{dim}]

This shows connection to a node named node_remote4. A quick grep shows that TransportSearchActionTests.startTransport() is using "node_remote" + i as node name.

TransportSearchActionTests.testCollectSearchShards seems to have run about the same time as testCollectNodes

The second failure shows something similar (connection to a node discoverable_node_added6), which suggests a concurrent race with RemoteClusterConnectionTests.testConnectedNodesConcurrentAccess:

[2019-07-02T13:39:31,679][DEBUG][o.e.t.ConnectionManager  ] [org.elasticsearch.transport.RemoteClusterServiceTests] connected to node [{discoverable_node_added6}{GYyBCC8CQ-6xot8_AkVCVw}{tYVyxrxUThS_TgDCsrPOaw}{127.0.0.1}{127.0.0.1:10302}{dim}]

That test started at

[2019-07-02T03:39:31,391][INFO ][o.e.t.RemoteClusterConnectionTests] [testConnectedNodesConcurrentAccess] before test
[2019-07-02T03:39:31,655][INFO ][o.e.t.TransportService   ] [[Thread-96]] publish_address {127.0.0.1:10302}, bound_addresses {[::1]:10322}, {127.0.0.1:10302}

and finished at

[2019-07-02T03:39:31,794][INFO ][o.e.t.RemoteClusterConnectionTests] [testConnectedNodesConcurrentAccess] after test

so it perfectly explains the interaction between the two tests.

I'm not sure yet how to proceed on this, as in the presence of other nodes popping up under the same address we can't assert what we currently do in the test.

ywelsch commented 5 years ago

Closed by https://github.com/elastic/elasticsearch/pull/43983