elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.24k stars 24.85k forks source link

JDK/JRE 7 update 72 breaks multicast in 1.4.0.Beta1 #8225

Closed costin closed 10 years ago

costin commented 10 years ago

I've just updated to JDK 7 update 72 and 1.4.0.Beta1 doesn't work any more (win 8.1).

# java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) Client VM (build 24.72-b04, mixed mode, sharing)
[2014-10-25 00:37:58,821][INFO ][node                     ] [Thog] version[1.4.0.Beta1], pid[6224], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-25 00:37:58,822][INFO ][node                     ] [Thog] initializing ...
[2014-10-25 00:37:58,827][INFO ][plugins                  ] [Thog] loaded [], sites []
[2014-10-25 00:38:01,206][INFO ][node                     ] [Thog] initialized
[2014-10-25 00:38:01,206][INFO ][node                     ] [Thog] starting ...
[2014-10-25 00:38:01,308][INFO ][transport                ] [Thog] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.50:9300]}
[2014-10-25 00:38:01,351][INFO ][discovery.zen.ping.multicast] [Thog] multicast failed to start [SocketException[An invalid argument was supplied]], disabling
[2014-10-25 00:38:01,353][INFO ][discovery                ] [Thog] elasticsearch/Sk3GjLbeTduVCehGYnfdRg
[2014-10-25 00:38:01,359][WARN ][discovery.zen.ping.multicast] [Thog] failed to send multicast ping request: NullPointerException[null]
[2014-10-25 00:38:02,860][WARN ][discovery.zen.ping.multicast] [Thog] failed to send multicast ping request: NullPointerException[null]
[2014-10-25 00:38:04,372][INFO ][cluster.service          ] [Thog] new_master [Thog][Sk3GjLbeTduVCehGYnfdRg][cerberus][inet[/192.168.1.50:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-25 00:38:04,415][INFO ][http                     ] [Thog] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.50:9200]}
[2014-10-25 00:38:04,416][INFO ][node                     ] [Thog] started
[2014-10-25 00:38:04,926][INFO ][gateway                  ] [Thog] recovered [2] indices into cluster_state

1.4.0.Beta1 works fine with 7u67 and 8u25. It looks like only 7u72 is problematic (potentially 7u71). Tested 1.3.4 on 7u72 and it seems to be working fine. Verified the java version and the same problem occurs - clearly something has changed in the JDK.

# java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) Client VM (build 24.72-b04, mixed mode, sharing)

# elasticsearch
[2014-10-25 00:50:20,068][INFO ][node                     ] [Wendell Vaughn] version[1.3.4], pid[7220], build[a70f3cc/2014-09-30T09:07:17Z]
[2014-10-25 00:50:20,070][INFO ][node                     ] [Wendell Vaughn] initializing ...
[2014-10-25 00:50:20,074][INFO ][plugins                  ] [Wendell Vaughn] loaded [], sites []
[2014-10-25 00:50:22,367][INFO ][node                     ] [Wendell Vaughn] initialized
[2014-10-25 00:50:22,367][INFO ][node                     ] [Wendell Vaughn] starting ...
[2014-10-25 00:50:22,472][INFO ][transport                ] [Wendell Vaughn] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.50:9300]}
[2014-10-25 00:50:22,570][INFO ][discovery.zen.ping.multicast] [Wendell Vaughn] multicast failed to start [SocketException[An invalid argument was supplied]], disabling
[2014-10-25 00:50:22,571][INFO ][discovery                ] [Wendell Vaughn] elasticsearch/iROZFsnQQb-2OUFP6bA_bA
[2014-10-25 00:50:22,600][WARN ][discovery.zen.ping.multicast] [Wendell Vaughn] failed to send multicast ping request: NullPointerException[null]
[2014-10-25 00:50:24,102][WARN ][discovery.zen.ping.multicast] [Wendell Vaughn] failed to send multicast ping request: NullPointerException[null]
[2014-10-25 00:50:25,605][INFO ][cluster.service          ] [Wendell Vaughn] new_master [Wendell Vaughn][iROZFsnQQb-2OUFP6bA_bA][cerberus][inet[/192.168.1.50:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-25 00:50:25,638][INFO ][gateway                  ] [Wendell Vaughn] recovered [0] indices into cluster_state
[2014-10-25 00:50:25,649][INFO ][http                     ] [Wendell Vaughn] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.50:9200]}
[2014-10-25 00:50:25,650][INFO ][node                     ] [Wendell Vaughn] started
costin commented 10 years ago

Looking at the JDK changelog it might be that update 71 (a critical security update) introduced the breaking behaviour, namely through 8021372: 8021372 core-libs java.net NetworkInterface.getNetworkInterfaces() returns duplicate hardware address

nik9000 commented 10 years ago

Yay! We're in the process of upgrading to 7u72 right now. Happy times.

If it needs an assignee I can look at it on Monday. I need an excuse to go poking around in this code. OTOH if someone has time to work on it please don't let me stop you. Your Monday probably starts before mine anyway.

costin commented 10 years ago

@nik9000 Note that I've currently tested it on Windows - there's a slim chance, the bug doesn't occur on *nix machines...

nik9000 commented 10 years ago

Ah. Well, I'll at least be able to test that on Monday. I don't think my wife would appreciate me commandeering her machine to reproduce on Windows. So if it doesn't reproduce on unix I'd probably be a bad person to work on it.

Mpdreamz commented 10 years ago

I'm on Windows 8.1 but I can't reproduce:

c:\>"%JAVA_HOME%\bin\java.exe" -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) Client VM (build 24.72-b04, mixed mode, sharing)
c:\Data\elasticsearch-1.4.0.Beta1>bin\elasticsearch
[2014-10-27 11:25:44,236][WARN ][bootstrap                ] jvm uses the client vm, make sure to run `java` with the server vm for best performance by adding `-server` to the command line
[2014-10-27 11:25:44,299][INFO ][node                     ] [Pete Wisdom] version[1.4.0.Beta1], pid[4656], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-27 11:25:44,299][INFO ][node                     ] [Pete Wisdom] initializing ...
[2014-10-27 11:25:44,314][INFO ][plugins                  ] [Pete Wisdom] loaded [], sites []
[2014-10-27 11:25:46,317][INFO ][node                     ] [Pete Wisdom] initialized
[2014-10-27 11:25:46,317][INFO ][node                     ] [Pete Wisdom] starting ...
[2014-10-27 11:25:46,458][INFO ][transport                ] [Pete Wisdom] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.194.169:9300]}
[2014-10-27 11:25:46,520][INFO ][discovery                ] [Pete Wisdom] elasticsearch/oVt8wl4-S9usPM0xWva_TA
[2014-10-27 11:25:49,567][INFO ][cluster.service          ] [Pete Wisdom] new_master [Pete Wisdom][oVt8wl4-S9usPM0xWva_TA][WIN-QIFLG7L0DOO][inet[/192.168.194.169:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-27 11:25:49,598][INFO ][gateway                  ] [Pete Wisdom] recovered [0] indices into cluster_state
[2014-10-27 11:25:49,614][INFO ][http                     ] [Pete Wisdom] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.194.169:9200]}
[2014-10-27 11:25:49,614][INFO ][node                     ] [Pete Wisdom] started

Am I doing something wrong here @costin ?

costin commented 10 years ago

Looks like the issue can be reproduced only on Win 8.1 SP1 with both u71 and u72. Elasticsearch ran fine on Win 7 and Win 8.1 (no SP1). In my opinion, this is not a blocker any more however there is work on the way to provide a fix for it.

gmarz commented 10 years ago

@costin I'm also not able to reproduce this running Win 8.1.

What is considered SP1? I can't seem to find any updates being referred to as SP1. The latest update I have installed is KB2995388.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

[2014-10-27 10:42:15,525][INFO ][node                     ] [Persuasion] version[1.4.0.Beta1], pid[4416], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-27 10:42:15,525][INFO ][node                     ] [Persuasion] initializing ...
[2014-10-27 10:42:15,541][INFO ][plugins                  ] [Persuasion] loaded [], sites []
[2014-10-27 10:42:18,119][INFO ][node                     ] [Persuasion] initialized
[2014-10-27 10:42:18,119][INFO ][node                     ] [Persuasion] starting ...
[2014-10-27 10:42:18,259][INFO ][transport                ] [Persuasion] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.211.55.3:9300]}
[2014-10-27 10:42:18,291][INFO ][discovery                ] [Persuasion] elasticsearch/Mwp7dFlMQW6kVg-xsdU87w 
[2014-10-27 10:42:21,338][INFO ][cluster.service          ] [Persuasion] new_master [Persuasion][Mwp7dFlMQW6kVg-xsdU87w][GREGMARZOUK4138][inet[/10.211.55.3:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-27 10:42:21,370][INFO ][http                     ] [Persuasion] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.211.55.3:9200]}
[2014-10-27 10:42:21,370][INFO ][node                     ] [Persuasion] started
costin commented 10 years ago

Based on various feedback it seems that I am the only one running into this issue which suggests there's some configuration issue with my system. The win 7, win 8.1 (with and without Update 1/ SP1) that we have tested, do not seem to be affected.

@clintongormley Hence why I removed the two labels above (critical and blocker).

costin commented 10 years ago

@nik9000 see the comment above - I don't think you're going to be affected and anyway, there's a fix already in the works (see the PR).

mrsolo commented 10 years ago

Jenkins seeing similar, if the the same failure on window 2012 r2 server

1> [2014-10-27 17:59:39,689][DEBUG][discovery.zen.ping.multicast] [node_t0] failed to send multicast ping request 1> java.lang.NullPointerException 1> at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing.sendPingRequest(MulticastZenPing.java:256) 1> at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing.ping(MulticastZenPing.java:188) 1> at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146) 1> at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:932) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:341) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.access$5600(ZenDiscovery.java:83) 1> at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1299) 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 1> at java.lang.Thread.run(Thread.java:745) 1> [2014-10-27 17:59:39,705][DEBUG][discovery.zen.ping.multicast] [node_t1] multicast failed to start [SocketException[Unrecognized Windows Sockets error: 0: Cannot bind]], disabling 1> java.net.SocketException: Unrecognized Windows Sockets error: 0: Cannot bind 1> at java.net.TwoStacksPlainDatagramSocketImpl.bind0(Native Method) 1> at java.net.TwoStacksPlainDatagramSocketImpl.bind0(TwoStacksPlainDatagramSocketImpl.java:107) 1> at java.net.AbstractPlainDatagramSocketImpl.bind(AbstractPlainDatagramSocketImpl.java:96) 1> at java.net.TwoStacksPlainDatagramSocketImpl.bind(TwoStacksPlainDatagramSocketImpl.java:97) 1> at java.net.DatagramSocket.bind(DatagramSocket.java:396) 1> at java.net.MulticastSocket.(MulticastSocket.java:172) 1> at java.net.MulticastSocket.(MulticastSocket.java:137) 1> at org.elasticsearch.common.network.MulticastChannel$Plain.buildMulticastSocket(MulticastChannel.java:283) 1> at org.elasticsearch.common.network.MulticastChannel$Plain.(MulticastChannel.java:276) 1> at org.elasticsearch.common.network.MulticastChannel.getChannel(MulticastChannel.java:50) 1> at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing.doStart(MulticastZenPing.java:127) 1> at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85) 1> at org.elasticsearch.discovery.zen.ping.ZenPingService.doStart(ZenPingService.java:103) 1> at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.doStart(ZenDiscovery.java:215) 1> at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85) 1> at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) 1> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 1> at java.lang.reflect.Method.invoke(Method.java:606) 1> at org.elasticsearch.common.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:110) 1> at com.sun.proxy.$Proxy35.start(Unknown Source) 1> at org.elasticsearch.discovery.DiscoveryService.doStart(DiscoveryService.java:84) 1> at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85) 1> at org.elasticsearch.node.internal.InternalNode.start(InternalNode.java:244) 1> at org.elasticsearch.test.InternalTestCluster$5.run(InternalTestCluster.java:1341) 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 1> at java.lang.Thread.run(Thread.java:745) 1> [2014-10-27 17:59:39,705][INFO ][discovery ] [node_t1] TEST-ElasticSearchAdTest-CHILD_VM=[0]-CLUSTER_SEED=[-2825708668188973463]-HASH=[12BD0B6F224490]/D3vQTSdQQlWVzREaEwf0AA 1> [2014-10-27 17:59:39,705][TRACE][discovery.zen.ping.unicast] [node_t0] [1] connecting to [node_t0][1HPXOGEJT46CaUCbPEkDNg][ElasticSearchAdTest][local[54]]{mode=local} 1> [2014-10-27 17:59:39,705][TRACE][discovery.zen.ping.unicast] [node_t0] [1] connected to [node_t0][1HPXOGEJT46CaUCbPEkDNg][ElasticSearchAdTest][local[54]]{mode=local} 1> [2014-10-27 17:59:39,705][TRACE][discovery.zen.ping.unicast] [node_t0] [1] sending to [node_t0][1HPXOGEJT46CaUCbPEkDNg][ElasticSearchAdTest][local[54]]{mode=local} 1> [2014-10-27 17:59:39,705][DEBUG][cluster.service ] [node_t1] processing [initial_join]: execute 1> [2014-10-27 17:59:39,720][DEBUG][cluster.service ] [node_t1] processing [initial_join]: no change in cluster_state 1> [2014-10-27 17:59:39,720][TRACE][discovery.zen ] [node_t1] starting to ping 1> [2014-10-27 17:59:39,720][DEBUG][discovery.zen.ping.multicast] [node_t1] failed to send multicast ping request 1> java.lang.NullPointerException 1> at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing.sendPingRequest(MulticastZenPing.java:256) 1> at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing.ping(MulticastZenPing.java:188) 1> at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:146) 1> at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:124) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:932) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:341) 1> at org.elasticsearch.discovery.zen.ZenDiscovery.access$5600(ZenDiscovery.java:83) 1> at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1299) 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

costin commented 10 years ago

Notice the NullPointerException have been fixed in the PR. However the core problem (multicast that suddenly fails) is still present. On my machine the error occurs when calling java.net.MulticastSocket#setInterface while here it seems to occur when calling MulticastSocket(SocketAddress)

costin commented 10 years ago

Pushed the fix in 1.3, 1.4, 1.x and master. In case of a socket error (for whatever reason) a nicer message is displayed to the user and the zen code is skipped (otherwise it causes its own exceptions which only add to the confusion). This plus the ES_USE_IPV4 flag seem to counter the error reproducible so far only on my windows machines...