dos-group / AURA

Distributed Execution Engine
Apache License 2.0

Connection to ZooKeeper lost on OS X #16

Open lauritzthamsen opened 10 years ago

lauritzthamsen commented 10 years ago

running the example clients currently fails on OS X with the following output:

2014-07-04 13:47:15,339 |  INFO [main] (LocalClusterSimulator.java:98) - CREATE TMP DIRECTORY: '/var/folders/5g/lk8wz6sd62b63m831_rh3h1w0000gn/T/zookeeper'
2014-07-04 13:47:15,820 |  INFO [nioEventLoopGroup-2-1] (DataReader.java:96) - network server bound to address /141.23.83.200:55283
2014-07-04 13:47:15,824 |  INFO [localEventLoopGroup-4-1] (DataReader.java:96) - network server bound to address local:5d47babf-0251-48ce-8bac-415aa2980314
2014-07-04 13:47:15,827 |  INFO [nioEventLoopGroup-6-1] (IOManager.java:312) - network server bound to address /141.23.83.200:55283
2014-07-04 13:47:21,044 | ERROR [main] (NIOServerCnxnFactory.java:44) - Thread Thread[main,5,main] died
java.lang.IllegalStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aura
    at de.tuberlin.aura.workloadmanager.InfrastructureManager.<init>(InfrastructureManager.java:116)
    at de.tuberlin.aura.workloadmanager.InfrastructureManager.getInstance(InfrastructureManager.java:133)
    at de.tuberlin.aura.workloadmanager.WorkloadManager.<init>(WorkloadManager.java:77)
    at de.tuberlin.aura.workloadmanager.WorkloadManager.<init>(WorkloadManager.java:57)
    at de.tuberlin.aura.client.executors.LocalClusterSimulator.<init>(LocalClusterSimulator.java:128)
    at de.tuberlin.aura.client.executors.LocalClusterSimulator.<init>(LocalClusterSimulator.java:63)
    at de.tuberlin.aura.demo.examples.IntegrationTests.main(IntegrationTests.java:480)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

this is the case for both the state on master (e.g. SimpleClient at 90147c369a45142dd1a66db0524788397dc1d4f2) and develop (e.g. IntegrationTests at 87451d6d219452e513e376e783e44858a80b668e).

stepping through these clients sometimes leads to successful runs, which might suggest a timing issue and not a general problem with OS X.

lauritzthamsen commented 10 years ago

The problem seems to be that we use the zookeeper-object without making sure that a connection to zookeeper has been established.

logging

LOG.info(String.valueOf(zookeeper.getState()));

before calling

ZookeeperHelper.initDirectories(this.zookeeper);

shows that the zookeeper-object is still in the CONNECTING state just before

java.lang.IllegalStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aura

lauritzthamsen commented 10 years ago

waiting for state CONNECTED resolves this issue.

i also found Apache Curator. it's a framework built on top of ZooKeeper and provides a higher-level API as well as connection guarantees. i think it might be a good idea for us to use Curator.
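A minimal sketch of what "waiting for state CONNECTED" could look like, using only the JDK so it runs without a ZooKeeper server. `ConnectionGate` and `ConnectionGateDemo` are hypothetical names, not AURA or ZooKeeper classes. In real code, the assumption is that the Watcher passed to the ZooKeeper constructor would call `markConnected()` when its event reports `KeeperState.SyncConnected`; here a background thread plays that role.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of AURA or ZooKeeper): blocks callers until a connection is reported. */
final class ConnectionGate {
    private final CountDownLatch connected = new CountDownLatch(1);

    /** Intended to be called from the Watcher callback when the connected state arrives. */
    void markConnected() {
        connected.countDown();
    }

    /** Returns true if the connection was reported in time, false on timeout or interruption. */
    boolean awaitConnected(long timeout, TimeUnit unit) {
        try {
            return connected.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}

final class ConnectionGateDemo {
    public static void main(String[] args) {
        ConnectionGate gate = new ConnectionGate();

        // Stand-in for the asynchronous session setup: reports "connected" after 200 ms.
        Thread fakeSession = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            gate.markConnected();
        });
        fakeSession.start();

        // Only touch the client (e.g. create directories) once the gate has opened.
        if (gate.awaitConnected(5, TimeUnit.SECONDS)) {
            System.out.println("connected, safe to initialize directories");
        } else {
            System.out.println("gave up waiting for the connection");
        }
    }
}
```

The timeout is a design choice in this sketch: without it, a ZooKeeper server that never comes up would hang the caller instead of producing an error.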

Teots commented 10 years ago

From the Javadoc of the ZooKeeper constructor: "Session establishment is asynchronous. This constructor will initiate connection to the server and return immediately - potentially (usually) before the session is fully established. The watcher argument specifies the watcher that will be notified of any changes in state. This notification can come at any point before or after the constructor call has returned."

Apparently, it can rarely happen that the connection setup lasts longer than the execution of the constructor. But this can be solved easily by adding a new case to the switch in the Watcher that executes the initDirectories method after receiving the connected state.
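The callback-driven idea can be sketched as follows, again with plain JDK types so it runs standalone. `SessionState` is a hypothetical stand-in for ZooKeeper's `KeeperState` values, and `InitOnConnectHandler` is a hypothetical name; the assumption is that the real Watcher would switch on the state of the incoming event in the same way and that the action passed in would wrap the initDirectories call. Running the action only once is an added assumption, since the connected state is reported again after a reconnect.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for ZooKeeper's KeeperState values. */
enum SessionState { DISCONNECTED, SYNC_CONNECTED, EXPIRED }

/** Hypothetical watcher-style handler that runs a setup action on the first connect. */
final class InitOnConnectHandler {
    private final Runnable initAction;
    private final AtomicBoolean initialized = new AtomicBoolean(false);

    InitOnConnectHandler(Runnable initAction) {
        this.initAction = initAction;
    }

    /** Mirrors the switch in a Watcher callback; the state would come from the event. */
    void process(SessionState state) {
        switch (state) {
            case SYNC_CONNECTED:
                // Reconnects report this state again, so guard against repeated setup.
                if (initialized.compareAndSet(false, true)) {
                    initAction.run();
                }
                break;
            case DISCONNECTED:
            case EXPIRED:
            default:
                // Nothing to initialize; real code would log or trigger recovery here.
                break;
        }
    }

    /** True once the setup action has been triggered. */
    boolean isInitialized() {
        return initialized.get();
    }
}
```

This only covers work that can live inside the callback. Code that uses the client right after construction would still need its own way to wait for the connection.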

lauritzthamsen commented 10 years ago

well, all further interactions with the ZooKeeper files need the connection to be established, not just initDirectories(). all these interactions would have to take place in the Watcher's event callback, but the TaskManager's setupZookeeper() method even returns the zookeeper object for further interactions with the ZooKeeper server... i think it's easiest to explicitly wait for the connection to be established as a fix for now.

lauritzthamsen commented 10 years ago

i'll also have a look at Curator in the next few days. would just be cool to have it take care of connection establishment and failures for us.