ap0n opened this issue 7 years ago
SoftNet cluster ZooKeeper version: 3.4.6.2.2.4.2-2
SUH cluster ZooKeeper version: 3.4.6
@eichelbe I ran into this thread; we also get a similar error (stormconf.ser does not exist). Is it possible that you mail me the '/etc/hosts' file of your cluster so I can cross-reference it with ours?
In order to see what's going on when submitting multiple pipelines, I sorted the logs by timestamp. I first submitted FocusPip and, after some time, TransferPip.
Below is the first problem from each log file (worker), sorted by time of appearance.
2016-12-12T16:59:41.505+0200 [clu04-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-5-1481552990/stormconf.ser' does not exist
2016-12-12T17:14:49.771+0200 [clu02-w6704] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null
2016-12-12T17:15:06.166+0200 [clu16-w6700] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null
2016-12-12T17:18:53.646+0200 [clu06-w6701] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist
2016-12-12T17:19:26.047+0200 [clu07-w6701] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist
2016-12-12T17:19:54.275+0200 [clu04-w6702] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null
2016-12-12T17:20:51.782+0200 [clu02-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist
2016-12-12T17:22:04.637+0200 [clu03-w6702] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist
2016-12-12T17:23:42.720+0200 [clu04-w6704] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist
(Mon Dec 12 17:27:26 EET 2016, [clu03-w6703] org.apache.hadoop.hbase.client.RpcRetryingCaller@3abe110a, java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129) ~[stormjar.jar:na])
2016-12-12T17:29:04.992+0200 [clu03-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist
There are also lots of entries like:
java.lang.RuntimeException: Returned channel was actually not established
I'm also attaching the log files of each worker for the corresponding time period: logs.zip
We can send you the hosts file, but I think it will not really help you, as it just contains local IP addresses (on our mini cluster we do not even run DHCP).
It seems that for some reason Storm is not storing the pipeline on the worker machines; no wonder the worker is dying. As the previously submitted pipeline is working, file permissions should not be the problem. Silly question: do you have enough HD space? Do you also have the supervisor logs? The supervisor is responsible for receiving the pipeline and for starting the workers.
HD space can't be the problem. Each node has at least 200GB of free space. I'm uploading the supervisor logs (here) but I'll examine them later this afternoon. I'll report back if I find something.
We tried a few more things today.
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-1-1481640573/stormconf.ser' does not exist
exception. So, @antoine-tran @tahoangl3s, can you please check out the replay sink exception? @eichelbe, can you run both pipelines simultaneously on your cluster (you can find them here if you want to skip generation) to see what happens?
Thanks @ap0n for the updates. We inspected the merged.txt file and found the error message. Actually, it is a warning that the HBase table (used to store pairwise correlation scores) is not named properly, and the function returns a proper name. Nevertheless, we will clear up the issue soon. TA
I forwarded your idea to @cuiqin, not sure whether our cluster is sufficiently large. But we may try on LUH cluster ;)
Hi all @eichelbe @tahoangl3s @ap0n: I have fixed the exception regarding HBase table naming. Please let me know if you still notice the issue.
I don't think our cluster is able to handle both the Transfer and Focus pipelines, as we have a rather small cluster. We can try it on the LUH cluster once we have access.
That would be nice. @tahoangl3s any news from your side regarding the exception?
@ap0n We have fixed the issue. Please refer to my comment and Tuan's comment above.
You are right. I somehow missed that comment... We'll check it out and let you know.
Now the Twitter source for the FocusPip throws the following exception:
2016-12-19T12:39:06.950+0200 STDIO [ERROR] Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/storm/eventset_16_8_until_19_8/symbols.sym
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Has something changed with the HDFS files?
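As a quick sanity check, something like the following can verify whether the path is still present on HDFS (a minimal sketch using the standard Hadoop FileSystem API; the namenode URI is an assumption and needs to be adjusted to the actual cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of the project: checks whether the input
// file the Twitter source complains about still exists on HDFS.
public class CheckHdfsPath {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; adjust to the actual cluster setup.
        conf.set("fs.defaultFS", "hdfs://clu01.softnet.tuc.gr:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path p = new Path("/user/storm/eventset_16_8_until_19_8/symbols.sym");
            System.out.println(p + (fs.exists(p) ? " exists" : " does not exist"));
        }
    }
}
```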
While running both Focus and Transfer pipelines we get this error:
clu02,w6704,2016-12-21T13:21:41.299+0200 STDIO [INFO] Reconfiguring infrastructure settings: {eventBus.port=14000, pipelines.ports=22000-22500, zookeeper.host=clu01.softnet.tuc.gr,clu02.softnet.tuc.gr,clu03.softnet.tuc.gr, monitoring.volume.enabled=true, zookeeper.retry.times=5, zookeeper.port=2181, eventBus.disableLogging=eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent, eventBus.host=clu01.softnet.tuc.gr, zookeeper.retry.interval=1000} from {"storm.id" "FocusPip-1-1482319272", "dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "storm.messaging.netty.client_worker_threads" 1, "ui.childopts" "-Xmx768m -Djava.net.preferIPv4Stack=true", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "storm.messaging.netty.flush.check.interval.ms" 10, "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m", "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib", "topology.executor.send.buffer.size" 16384, "storm.local.dir" "/app/storm", "storm.messaging.netty.buffer_size" 5242880, "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "storm.meta.serialization.delegate" "backtype.storm.serialization.DefaultSerializationDelegate", "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 16384, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000, "supervisor.enable" true, "storm.messaging.netty.server_worker_threads" 1, "storm.zookeeper.servers" ["clu01.softnet.tuc.gr" "clu02.softnet.tuc.gr" "clu03.softnet.tuc.gr"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" nil, "eventBus.port" "14000", "topology.kryo.decorators" (), "topology.name" "FocusPip", "topology.transfer.buffer.size" 32, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx6g -Djava.net.preferIPv4Stack=true -agentpath:/home/softnet/libyjpagent.so", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m", "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, "topology.tasks" nil, "eventBus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.max_retries" 300, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "nimbus.thrift.max_buffer_size" 1048576, "windowSize" 30, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "windowAdvance" 1, "topology.sleep.spout.wait.strategy.time.ms" 1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703 6704], "monitoring.volume.enabled" "true", 
"topology.environment" nil, "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.kryo.register" nil, "pipelines.ports" "22000-22500", "topology.message.timeout.secs" 100, "task.refresh.poll.secs" 10, "topology.workers" 9, "supervisor.childopts" "-Djava.net.preferIPv4Stack=true", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "SUBPIPELINE.NAME" "FocusPip", "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "logviewer.port" 8000, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.worker.receiver.thread.count" 1, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "logviewer.appender.name" "A1", "storm.messaging.netty.max_wait_ms" 1000, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8746, "confModel.initMode" "STATIC", "nimbus.childopts" "-Xmx1024m -Djava.net.preferIPv4Stack=true", "storm.cluster.mode" "distributed", "topology.max.task.parallelism" nil, "storm.messaging.netty.transfer.batch.size" 262144, "topology.classpath" nil, "eventBus.disableLogging" "eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent"}
(The same "Reconfiguring infrastructure settings" entry is logged again, verbatim, at 13:21:41.300 and 13:21:41.301.)
clu05,w6701,2016-12-21T13:21:41.304+0200 e.q.c.s.SignalMechanism [INFO] Changing namespace state: FocusPip ENABLE
clu05,w6701,2016-12-21T13:21:41.304+0200 e.q.c.s.SignalMechanism [INFO] Changing namespace state: FocusPip ENABLE
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.c.s.SignalMechanism [INFO] Changed namespace state: FocusPip ENABLE signal count 0
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.c.s.SignalMechanism [INFO] Changed namespace state: FocusPip ENABLE signal count 0
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.a.i.c.s.HayashiYoshidaBolt [INFO] Prepared--basesignalbolt.... FocusPip/TopoSoftwareCorrelationFinancialHayashiYoshidaBolt
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.F.t.PipelineVar_7_FamilyElement5FamilyElement [INFO] Prepared--basesignalbolt.... FocusPip/PipelineVar_7_FamilyElement5
clu05,w6701,2016-12-21T13:21:41.311+0200 b.s.d.executor [INFO] Prepared bolt TopoSoftwareCorrelationFinancialHayashiYoshidaBolt:(24)
clu05,w6701,2016-12-21T13:21:41.316+0200 e.q.a.i.c.s.HayashiYoshidaBolt [ERROR] Error SignalConnection:KeeperErrorCode = NodeExists for /qm/pipelines/FocusPip/TopoSoftwareCorrelationFinancialHayashiYoshidaBolt
It's related to the signal mechanism; @eichelbe, do you have any ideas on this one? Generally, when we run the pipelines one by one, everything goes smoothly. The same goes for running multiple non-QM Storm topologies concurrently. Therefore, it seems that the problem lies somewhere in the QM infrastructure. Is it possible that you have a look at this issue from the infrastructure point of view?
Worker log? Seems to be a double creation, probably not problematic as long as the node is there at all.
Here it is: worker-6701.log.txt (line 327). And here is the merged log from the whole cluster for this run: merged.log.txt.zip
There may be a problem due to combined exception handling, in case some other component creates the same node between the check for its existence and the subsequent operations on it. We can fix it either by separating the exception handling or by using ZK transactions :|
An initial version for creating the ZK nodes more carefully is committed and building... If the problem persists, we will try incremental path creation as done by the PortManager.
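For illustration, a minimal sketch of the "more careful" creation (an assumed shape, not the committed code): instead of checking for existence and then creating, treat NodeExists as a benign outcome:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public final class ZkNodes {

    // Ensures that the given path exists, tolerating concurrent creation.
    // Returns true if this call created the node, false if another component
    // created it first.
    public static boolean ensureNode(CuratorFramework client, String path, byte[] data)
            throws Exception {
        try {
            client.create().creatingParentsIfNeeded().forPath(path, data);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // Another component won the race between our existence check and
            // the create call; as noted above, that is fine as long as the
            // node is there at all.
            return false;
        }
    }
}
```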
Christoph, the SpringClientTwitter produces the following error. I don't know if this causes problems for the whole pipeline, but could you have a look?
clu04,w6703,2016-12-29T17:28:45.338+0200 o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-17,5,main] died
java.lang.ThreadDeath: null
at java.lang.Thread.stop(Thread.java:836) ~[na:1.7.0_80]
at eu.qualimaster.data.SpringConnector.DataConnector.stopRunning(DataConnector.java:278) ~[stormjar.jar:na]
at eu.qualimaster.data.SpringConnector.SpringClientTwitter.initConnector(SpringClientTwitter.java:67) ~[stormjar.jar:na]
at eu.qualimaster.data.SpringConnector.SpringClientTwitter.getInstance(SpringClientTwitter.java:24) ~[stormjar.jar:na]
at eu.qualimaster.data.impl.TwitterStreamData.TwitterStreamingDataSource.<init>(TwitterStreamingDataSource.java:60) ~[stormjar.jar:na]
at eu.qualimaster.data.stream.source.TwitterFileStreaming.<init>(TwitterFileStreaming.java:49) ~[stormjar.jar:na]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_80]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_80]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_80]
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_80]
at java.lang.Class.newInstance(Class.java:383) ~[na:1.7.0_80]
at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source0Source.open(PipelineVar_7_Source0Source.java:98) ~[stormjar.jar:na]
at backtype.storm.daemon.executor$fn__6579$fn__6594.invoke(executor.clj:522) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.util$async_loop$fn__459.invoke(util.clj:461) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_80]
That's actually not a problem. The SpringConnector is supposed to shut down after it has done its job. So that's fine.
Unfortunately, still no useful update, and no useful error to help us understand why multiple QM pipelines do not work together. All I got in the last execution were Netty reconnection errors and, after some time, some CuratorConnectionLossException errors.
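For reference, the client-side tolerance to such connection losses is fixed when the Curator client is created; a minimal sketch using the connection settings visible in the configuration dumps above (the actual construction inside the QM infrastructure may differ):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkClientSketch {
    public static void main(String[] args) {
        // Values taken from the logged configuration: retry interval 1000 ms,
        // 5 retries, session timeout 20000 ms, connection timeout 15000 ms.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "clu01.softnet.tuc.gr:2181,clu02.softnet.tuc.gr:2181,clu03.softnet.tuc.gr:2181",
                20000,                                  // session timeout (ms)
                15000,                                  // connection timeout (ms)
                new ExponentialBackoffRetry(1000, 5));  // retry policy
        client.start();
        // Once the retries are exhausted, pending operations surface
        // connection-loss exceptions like the ones observed here.
        client.close();
    }
}
```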
I am recapping all we know so far, in case anyone has an idea:
1. Multiple non-QM topologies execute in parallel without any problems.
2. A single QM topology executes in parallel with multiple other non-QM topologies without any problems.
3. Two QM topologies cannot be executed in parallel.
4. Components start getting restarted at random when the second topology is submitted.
5. No useful exceptions or errors appear in the Storm logs.
I am attaching two log files in case they help anyone come up with ideas: main.txt contains the infrastructure logs (the infrastructure was updated today); merged.txt contains the Storm logs from all workers and the nimbus, merged into one file and sorted chronologically.
I am thinking that, since the schedule is tight, it might be wiser to introduce some redundancy and have an alternative solution in case this issue is not resolved. The alternative would be to create a new pipeline that "merges" three pipelines (TransferPip, FocusPip, DynamicGraphPip) into one large pipeline. The initial idea (during the Bremen meeting) was to avoid creating a new pipeline (especially such a complicated one) because it would require a lot of work, and running multiple pipelines seemed like a good solution at the time. Yet it is taking too much time and effort, without any guarantee that the issue will be resolved.
My proposal: Since it seems that this problem is related to the QM infrastructure, and not the individual pipelines/components (which are independently tested and working properly), maybe SUH (@eichelbe) could continue looking into this issue (e.g. on Okeanos cluster), while at the same time we work on creating the full, merged pipeline. This way, whichever of the two approaches ends up working, the demo would be safe.
Thoughts?
We made several basic modifications yesterday to bring back the hardware profiling. Did you test with these changes? They involved some fixes for the event processing.
Which QM topologies did you test?
I updated the infrastructure today, so it should include the latest changes, but I did not re-create the pipelines from QM-IConf; I used older builds. I tried FocusPip with TransferPip.
This is ok. @cuiqin could you please run SwitchPip and RandomPip in parallel on Okeanos. Let's try with the simple things first...
From the log it seems that we have a similar problem even with a smaller pipeline on our cluster while other pipelines are running (the same pipeline is now running on Okeanos, but not on our side). We also observed "o.a.s.z.ClientCnxn [INFO] EventThread shut down", and the worker shuts down via the original Storm shutdown hook (we checked that) while Netty is not connected and keeps trying to reconnect, or even while the worker is still loading and initializing its components.
Nathan wrote in a post that the ZooKeeper may be getting overloaded. This is a plausible reason: upon startup we also create nodes and set watchers, so the bigger the pipelines, the more ZooKeeper operations. And as ZooKeeper is central, it may cause other/existing workers to die through a timeout. Others suggest increasing some specific buffer sizes.
We will try that and let you know. But, depending on the cluster, it seems that even smaller pipelines can fail without parallel execution, if it's the same problem.
Okay, let's hope it is as easy as tweaking some buffers. In the meantime, I will start working on the merged pipeline as a backup scenario until this is resolved.
Makes pretty much sense...
Another option would be to do the signals via the event bus avoiding additional Zookeeper connections. But this is rather untested...
I got the problematic pipeline up (the first time), but then the workers still continued dying... It could be the right direction although not the solution...
I ran both RandomPip and SwitchPip in parallel on the Okeanos cluster, with 3 and 5 workers respectively. Although I also saw the Netty reconnections and "EventThread shut down" logs, both pipelines are running well. If you want to check, they are still running on Okeanos now.
I made a series of modifications on the workers, the nimbus, and the ZooKeepers. The problematic pipeline comes up, but it is not stable; workers then restart at runtime, so it seems that this only defers the problem. At least in our cluster, the ZooKeepers are not powerful enough to handle the number of startup connections. Some sources say that the most powerful machines with the fastest hard disks should host the ZooKeepers rather than the workers.
Although there are still ways to mitigate this problem, it will always limit what can be run on a cluster alongside ZooKeepers. Options (for now):
1. Send events through the QM event bus to avoid creating ZooKeeper connections (currently possible except for the nodes requiring the port manager; try it lazily, hoping that loose integration still works then).
2. Shuffle the ZooKeeper sequence for creating the connections (see the sketch below). Not sure whether this can help at all...
3. Switch to a newer ZooKeeper version, hoping that it is faster.
4. ... resign (hopefully not).
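Option 2 would be a tiny change; a sketch of the idea (hypothetical helper, not existing code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ZkConnect {

    // Shuffles the ZooKeeper hosts so that concurrently starting workers do
    // not all open their first connection against the same server.
    public static String shuffledConnectString(String... hosts) {
        List<String> list = new ArrayList<String>(Arrays.asList(hosts));
        Collections.shuffle(list);
        StringBuilder sb = new StringBuilder();
        for (String host : list) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(host);
        }
        return sb.toString();
    }
}
```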
I'm checking the untested code for 1) and running the test cases. Looks surprisingly good, even for load shedding, replay, and pipeline commands... 2) is a small modification and potentially worth a trial.
I'll let you know...
Ok, it seems that I have good news: after some cluster reboots (for tracking the worker shutdowns), our problematic pipeline has been running without worker restarts for several minutes. I will collect the settings later and let you know...
Here are the settings that worked for us...

For the ZooKeeper servers (zoo.cfg):
minSessionTimeout=300000
maxSessionTimeout=500000

For Storm (storm.yaml):
nimbus.thrift.max_buffer_size: 20480000
nimbus.task.timeout.secs: 320
nimbus.task.launch.secs: 320
supervisor.worker.start.timeout.secs: 320
storm.zookeeper.session.timeout: 320
The Storm ZooKeeper session timeout should fit the minimum session timeout configured on the ZooKeepers; the other timeouts should be a bit larger. Switching from Curator to the QM event bus for the adaptation signals was not needed so far, but if you run out of ZooKeeper resources, setting pipeline.signals.curator=false in qm.infrastructure.cfg can help (to be safe, further enactment/execution tests are then needed). Please make sure to properly restart the entire cluster after all modifications. The infrastructure might need some time between Storm restarts (this happened only the first few times on our side).
The full setup description is here.
Good luck...
This is great news :). Let's hope the pipelines play nicely along with each other :o
I messaged our administrator so she can make the changes Holger proposed, since we are usually not allowed to change configuration arbitrarily on the SoftNet cluster. Hopefully this will be done soon and I can give it another try. Thanks Holger.
In the meantime, we are updating the LUH cluster together with Miroslav for experiments with adaptive Storm...
Good news everyone!
Both FocusPip and TransferPip seem to be running at the same time on the SoftNet cluster without the "restart" problems. We will keep testing, but at the moment both generate results without problems!!!
Maybe we should keep this infrastructure version and configuration as a "backup".
Good news indeed :) So SoftNet cluster and LUH can run multiple pipelines now :)
A backup is a good idea; maybe we can even tag the component commits in Git...
Run FocusPip & TransferPip concurrently on SoftNet cluster.