QualiMaster / qm-issues


Run multiple pipelines concurrently #56

Open ap0n opened 7 years ago

ap0n commented 7 years ago

Run FocusPip & TransferPip concurrently on the SoftNet cluster.

ap0n commented 7 years ago

SoftNet cluster ZooKeeper version: 3.4.6.2.2.4.2-2
SUH cluster ZooKeeper version: 3.4.6

ap0n commented 7 years ago

@eichelbe I came across this thread; we also get a similar error (stormconf.ser does not exist). Could you mail me the '/etc/hosts' file of your cluster so I can cross-reference it with ours?

ap0n commented 7 years ago

To see what's going on when submitting multiple pipelines, I tried sorting the logs by timestamp. I first submitted FocusPip and, after some time, TransferPip.

Below is the first problem from each worker's logfile, sorted by time of appearance.

2016-12-12T16:59:41.505+0200 [clu04-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-5-1481552990/stormconf.ser' does not exist

2016-12-12T17:14:49.771+0200 [clu02-w6704] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null

2016-12-12T17:15:06.166+0200 [clu16-w6700] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null

2016-12-12T17:18:53.646+0200 [clu06-w6701] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist

2016-12-12T17:19:26.047+0200 [clu07-w6701] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist

2016-12-12T17:19:54.275+0200 [clu04-w6702] o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-16,5,main] died
java.lang.ThreadDeath: null

2016-12-12T17:20:51.782+0200 [clu02-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist

2016-12-12T17:22:04.637+0200 [clu03-w6702] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-6-1481555597/stormconf.ser' does not exist

2016-12-12T17:23:42.720+0200 [clu04-w6704] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist

(Mon Dec 12 17:27:26 EET 2016, [clu03-w6703] org.apache.hadoop.hbase.client.RpcRetryingCaller@3abe110a, java.net.ConnectException: Connection refused
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129) ~[stormjar.jar:na])

2016-12-12T17:29:04.992+0200 [clu03-w6703] b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/TransferPip-7-1481555871/stormconf.ser' does not exist

There are also lots of entries like:

java.lang.RuntimeException: Returned channel was actually not established

I'm also attaching the log files of each worker for the corresponding time period: logs.zip
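(For reference, the chronological merge of per-worker logs can be scripted. A minimal Java sketch, assuming each relevant line starts with an ISO-8601 timestamp as in the excerpts above; the class name and in-memory handling are hypothetical, not part of the actual tooling used here:)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LogMerger {

    // Merges per-worker logs into one chronologically sorted stream.
    // Lines that do not start with a timestamp (stack-trace continuations)
    // inherit the sort key of the previous timestamped line in their file,
    // so they stay attached to it (List.sort is stable).
    static List<String> merge(List<List<String>> workerLogs) {
        List<String[]> keyed = new ArrayList<>(); // {sortKey, line}
        for (List<String> log : workerLogs) {
            String lastKey = "";
            for (String line : log) {
                String head = line.split("\\s", 2)[0];
                // ISO-8601 timestamps like 2016-12-12T16:59:41.505+0200
                // sort correctly as plain strings (within one UTC offset).
                if (head.matches("\\d{4}-\\d{2}-\\d{2}T.*")) {
                    lastKey = head;
                }
                keyed.add(new String[] { lastKey, line });
            }
        }
        keyed.sort(Comparator.comparing((String[] e) -> e[0]));
        List<String> out = new ArrayList<>();
        for (String[] e : keyed) {
            out.add(e[1]);
        }
        return out;
    }
}
```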

eichelbe commented 7 years ago

We can send you the hosts file, but I don't think it will really help you, as it just contains local IP addresses (on our mini cluster we do not even run DHCP).

It seems that for some reason Storm is not storing the pipeline on the worker machines. No wonder the workers are dying. Since the previous pipeline works, file permissions should not be the problem. Silly question: do you have enough disk space? Do you also have the supervisor logs? The supervisor is responsible for receiving the pipeline and for starting the workers.

ap0n commented 7 years ago

Disk space can't be the problem; each node has at least 200GB free. I'm uploading the supervisor logs (here), but I'll examine them later this afternoon and report back if I find something.

ap0n commented 7 years ago

We tried a few more things today.

  1. Ran three word-count topologies simultaneously without any errors (for about 2-3 hours).
  2. Ran FocusPip with a word-count without any errors (for about 5-6 minutes).
  3. Ran TransferPip with a word-count without any errors (for about 5-6 minutes).
  4. Ran FocusPip with TransferPip. Random component restarts began after about 2-3 minutes.
  5. For experiment 4 we collected every logfile from the cluster (nimbus, supervisors, workers) and merged them in chronological order (merged.txt). At line 4672 we spotted an exception related to the replay sink. After that, serious Netty issues begin (>20-30 reconnection attempts), and finally, at line 5114, we get a java.io.FileNotFoundException: File '/app/storm/supervisor/stormdist/FocusPip-1-1481640573/stormconf.ser' does not exist.

So, @antoine-tran @tahoangl3s, can you please check out the replay sink exception? @eichelbe, can you run both pipelines simultaneously on your cluster (you can find them here if you want to skip generation) to see what happens?

tahoangl3s commented 7 years ago

Thanks @ap0n for the updates. We inspected the merged.txt file and found the error message. It is actually a warning that the HBase table (used to store pairwise correlation scores) is not named properly, and the function returns a proper name. Nevertheless, we will clear up the issue soon. TA

eichelbe commented 7 years ago

I forwarded your idea to @cuiqin; not sure whether our cluster is sufficiently large, but we may try on the LUH cluster ;)

antoine-tran commented 7 years ago

Hi all @eichelbe @tahoangl3s @ap0n: I have fixed the exception regarding HBase table naming. Please let me know if you still notice the issue.

cuiqin commented 7 years ago

I don't think our cluster is able to handle both the Transfer and Focus pipelines, as ours is a rather small cluster. We can try it on the LUH cluster once we have access.

ap0n commented 7 years ago

That would be nice. @tahoangl3s any news from your side regarding the exception?

tahoangl3s commented 7 years ago

@ap0n We have fixed the issue. Please refer to my comment and Tuan's comment above.

ap0n commented 7 years ago

You are right. I somehow missed that comment... We'll check it out and let you know.

ap0n commented 7 years ago

Now the Twitter source for the FocusPip throws the following exception:

2016-12-19T12:39:06.950+0200 STDIO [ERROR] Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/storm/eventset_16_8_until_19_8/symbols.sym
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

Has something changed with the HDFS files?

ap0n commented 7 years ago

While running both Focus and Transfer pipelines we get this error:

clu02,w6704,2016-12-21T13:21:41.299+0200 STDIO [INFO] Reconfiguring infrastructure settings: {eventBus.port=14000, pipelines.ports=22000-22500, zookeeper.host=clu01.softnet.tuc.gr,clu02.softnet.tuc.gr,clu03.softnet.tuc.gr, monitoring.volume.enabled=true, zookeeper.retry.times=5, zookeeper.port=2181, eventBus.disableLogging=eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent, eventBus.host=clu01.softnet.tuc.gr, zookeeper.retry.interval=1000} from {"storm.id" "FocusPip-1-1482319272", "dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "storm.messaging.netty.client_worker_threads" 1, "ui.childopts" "-Xmx768m -Djava.net.preferIPv4Stack=true", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "storm.messaging.netty.flush.check.interval.ms" 10, "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m", "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib", "topology.executor.send.buffer.size" 16384, "storm.local.dir" "/app/storm", "storm.messaging.netty.buffer_size" 5242880, "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "storm.meta.serialization.delegate" "backtype.storm.serialization.DefaultSerializationDelegate", "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 
2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 16384, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000, "supervisor.enable" true, "storm.messaging.netty.server_worker_threads" 1, "storm.zookeeper.servers" ["clu01.softnet.tuc.gr" "clu02.softnet.tuc.gr" "clu03.softnet.tuc.gr"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" nil, "eventBus.port" "14000", "topology.kryo.decorators" (), "topology.name" "FocusPip", "topology.transfer.buffer.size" 32, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx6g -Djava.net.preferIPv4Stack=true -agentpath:/home/softnet/libyjpagent.so", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m", "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, "topology.tasks" nil, "eventBus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.max_retries" 300, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "nimbus.thrift.max_buffer_size" 1048576, "windowSize" 30, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "windowAdvance" 1, "topology.sleep.spout.wait.strategy.time.ms" 1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703 6704], "monitoring.volume.enabled" "true", "topology.environment" nil, "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.kryo.register" nil, "pipelines.ports" "22000-22500", "topology.message.timeout.secs" 100, "task.refresh.poll.secs" 10, "topology.workers" 9, "supervisor.childopts" "-Djava.net.preferIPv4Stack=true", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "SUBPIPELINE.NAME" 
"FocusPip", "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "logviewer.port" 8000, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.worker.receiver.thread.count" 1, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "logviewer.appender.name" "A1", "storm.messaging.netty.max_wait_ms" 1000, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8746, "confModel.initMode" "STATIC", "nimbus.childopts" "-Xmx1024m -Djava.net.preferIPv4Stack=true", "storm.cluster.mode" "distributed", "topology.max.task.parallelism" nil, "storm.messaging.netty.transfer.batch.size" 262144, "topology.classpath" nil, "eventBus.disableLogging" "eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent"}
clu02,w6704,2016-12-21T13:21:41.300+0200 STDIO [INFO] Reconfiguring infrastructure settings: {eventBus.port=14000, pipelines.ports=22000-22500, zookeeper.host=clu01.softnet.tuc.gr,clu02.softnet.tuc.gr,clu03.softnet.tuc.gr, monitoring.volume.enabled=true, zookeeper.retry.times=5, zookeeper.port=2181, eventBus.disableLogging=eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent, eventBus.host=clu01.softnet.tuc.gr, zookeeper.retry.interval=1000} from {"storm.id" "FocusPip-1-1482319272", "dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "storm.messaging.netty.client_worker_threads" 1, "ui.childopts" "-Xmx768m -Djava.net.preferIPv4Stack=true", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "storm.messaging.netty.flush.check.interval.ms" 10, "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m", "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib", "topology.executor.send.buffer.size" 16384, "storm.local.dir" "/app/storm", "storm.messaging.netty.buffer_size" 5242880, "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "storm.meta.serialization.delegate" "backtype.storm.serialization.DefaultSerializationDelegate", "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 
2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 16384, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000, "supervisor.enable" true, "storm.messaging.netty.server_worker_threads" 1, "storm.zookeeper.servers" ["clu01.softnet.tuc.gr" "clu02.softnet.tuc.gr" "clu03.softnet.tuc.gr"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" nil, "eventBus.port" "14000", "topology.kryo.decorators" (), "topology.name" "FocusPip", "topology.transfer.buffer.size" 32, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx6g -Djava.net.preferIPv4Stack=true -agentpath:/home/softnet/libyjpagent.so", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m", "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, "topology.tasks" nil, "eventBus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.max_retries" 300, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "nimbus.thrift.max_buffer_size" 1048576, "windowSize" 30, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "windowAdvance" 1, "topology.sleep.spout.wait.strategy.time.ms" 1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703 6704], "monitoring.volume.enabled" "true", "topology.environment" nil, "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.kryo.register" nil, "pipelines.ports" "22000-22500", "topology.message.timeout.secs" 100, "task.refresh.poll.secs" 10, "topology.workers" 9, "supervisor.childopts" "-Djava.net.preferIPv4Stack=true", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "SUBPIPELINE.NAME" 
"FocusPip", "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "logviewer.port" 8000, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.worker.receiver.thread.count" 1, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "logviewer.appender.name" "A1", "storm.messaging.netty.max_wait_ms" 1000, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8746, "confModel.initMode" "STATIC", "nimbus.childopts" "-Xmx1024m -Djava.net.preferIPv4Stack=true", "storm.cluster.mode" "distributed", "topology.max.task.parallelism" nil, "storm.messaging.netty.transfer.batch.size" 262144, "topology.classpath" nil, "eventBus.disableLogging" "eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent"}
clu02,w6704,2016-12-21T13:21:41.301+0200 STDIO [INFO] Reconfiguring infrastructure settings: {eventBus.port=14000, pipelines.ports=22000-22500, zookeeper.host=clu01.softnet.tuc.gr,clu02.softnet.tuc.gr,clu03.softnet.tuc.gr, monitoring.volume.enabled=true, zookeeper.retry.times=5, zookeeper.port=2181, eventBus.disableLogging=eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent, eventBus.host=clu01.softnet.tuc.gr, zookeeper.retry.interval=1000} from {"storm.id" "FocusPip-1-1482319272", "dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "storm.messaging.netty.client_worker_threads" 1, "ui.childopts" "-Xmx768m -Djava.net.preferIPv4Stack=true", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "storm.messaging.netty.flush.check.interval.ms" 10, "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m", "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib", "topology.executor.send.buffer.size" 16384, "storm.local.dir" "/app/storm", "storm.messaging.netty.buffer_size" 5242880, "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "storm.meta.serialization.delegate" "backtype.storm.serialization.DefaultSerializationDelegate", "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 
2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 16384, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000, "supervisor.enable" true, "storm.messaging.netty.server_worker_threads" 1, "storm.zookeeper.servers" ["clu01.softnet.tuc.gr" "clu02.softnet.tuc.gr" "clu03.softnet.tuc.gr"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" nil, "eventBus.port" "14000", "topology.kryo.decorators" (), "topology.name" "FocusPip", "topology.transfer.buffer.size" 32, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx6g -Djava.net.preferIPv4Stack=true -agentpath:/home/softnet/libyjpagent.so", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m", "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, "topology.tasks" nil, "eventBus.host" "clu01.softnet.tuc.gr", "storm.messaging.netty.max_retries" 300, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "nimbus.thrift.max_buffer_size" 1048576, "windowSize" 30, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "windowAdvance" 1, "topology.sleep.spout.wait.strategy.time.ms" 1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703 6704], "monitoring.volume.enabled" "true", "topology.environment" nil, "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.kryo.register" nil, "pipelines.ports" "22000-22500", "topology.message.timeout.secs" 100, "task.refresh.poll.secs" 10, "topology.workers" 9, "supervisor.childopts" "-Djava.net.preferIPv4Stack=true", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "SUBPIPELINE.NAME" 
"FocusPip", "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "logviewer.port" 8000, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.worker.receiver.thread.count" 1, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "logviewer.appender.name" "A1", "storm.messaging.netty.max_wait_ms" 1000, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8746, "confModel.initMode" "STATIC", "nimbus.childopts" "-Xmx1024m -Djava.net.preferIPv4Stack=true", "storm.cluster.mode" "distributed", "topology.max.task.parallelism" nil, "storm.messaging.netty.transfer.batch.size" 262144, "topology.classpath" nil, "eventBus.disableLogging" "eu.qualimaster.monitoring.events.PipelineElementMultiObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineElementObservationMonitoringEvent,eu.qualimaster.monitoring.events.PipelineObservationMonitoringEvent,eu.qualimaster.monitoring.events.PlatformMultiObservationHostMonitoringEvent"}
clu05,w6701,2016-12-21T13:21:41.304+0200 e.q.c.s.SignalMechanism [INFO] Changing namespace state: FocusPip ENABLE
clu05,w6701,2016-12-21T13:21:41.304+0200 e.q.c.s.SignalMechanism [INFO] Changing namespace state: FocusPip ENABLE
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.c.s.SignalMechanism [INFO] Changed namespace state: FocusPip ENABLE signal count 0
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.c.s.SignalMechanism [INFO] Changed namespace state: FocusPip ENABLE signal count 0
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.a.i.c.s.HayashiYoshidaBolt [INFO] Prepared--basesignalbolt.... FocusPip/TopoSoftwareCorrelationFinancialHayashiYoshidaBolt
clu05,w6701,2016-12-21T13:21:41.306+0200 e.q.F.t.PipelineVar_7_FamilyElement5FamilyElement [INFO] Prepared--basesignalbolt.... FocusPip/PipelineVar_7_FamilyElement5
clu05,w6701,2016-12-21T13:21:41.311+0200 b.s.d.executor [INFO] Prepared bolt TopoSoftwareCorrelationFinancialHayashiYoshidaBolt:(24)
clu05,w6701,2016-12-21T13:21:41.316+0200 e.q.a.i.c.s.HayashiYoshidaBolt [ERROR] Error SignalConnection:KeeperErrorCode = NodeExists for /qm/pipelines/FocusPip/TopoSoftwareCorrelationFinancialHayashiYoshidaBolt

It's related to the signal mechanism; @eichelbe, do you have any ideas on this one? Generally, when we run the pipelines one by one, everything goes smoothly. The same goes for running multiple (non-QM) Storm topologies concurrently. Therefore, it seems the problem lies somewhere in the QM infrastructure. Could you have a look at this issue from the infrastructure point of view?

eichelbe commented 7 years ago

Worker log? Seems to be a double creation; probably not problematic as long as the node is there at all.

ap0n commented 7 years ago

Here it is: worker-6701.log.txt, line 327. And here is the merged log from the whole cluster for this run: merged.log.txt.zip

eichelbe commented 7 years ago

There may be a problem due to the combined exception handling, in case some other component created the same node between our checking for its existence and operating on it. We can fix it either by separating the exception handling or by using ZkTransactions :|
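(The check-then-act window described above is usually closed by attempting the create unconditionally and treating "node exists" as success. A minimal sketch of that pattern; the FakeZk store and the local NodeExistsException are stand-ins for the real ZooKeeper client and KeeperException.NodeExistsException, which would need a live server to demonstrate:)

```java
import java.util.HashSet;
import java.util.Set;

public class EnsureNode {

    // Stand-in for KeeperException.NodeExistsException.
    static class NodeExistsException extends Exception {}

    // Stand-in for the ZooKeeper client: create() fails if the node exists.
    static class FakeZk {
        final Set<String> nodes = new HashSet<>();
        void create(String path) throws NodeExistsException {
            if (!nodes.add(path)) {
                throw new NodeExistsException();
            }
        }
    }

    // Instead of exists()-then-create() (racy: another worker may create the
    // node in between), create unconditionally and treat NodeExists as success.
    static void ensure(FakeZk zk, String path) {
        try {
            zk.create(path);
        } catch (NodeExistsException e) {
            // Some other worker created the node first; nothing to do.
        }
    }
}
```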

eichelbe commented 7 years ago

An initial version that creates ZkNodes more carefully is committed and building... If the problem persists, we will try incremental path creation as done by the PortManager.
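(Incremental path creation amounts to creating each ancestor of a node in turn, tolerating the ones that already exist. A sketch of the path decomposition only; the class and method names are hypothetical, not the actual PortManager code:)

```java
import java.util.ArrayList;
import java.util.List;

public class ZkPaths {

    // Decomposes a ZooKeeper path such as "/qm/pipelines/FocusPip" into the
    // node and all of its ancestors, shortest first, so each can be created
    // in turn (ignoring NodeExists for the ones already present).
    static List<String> pathAndAncestors(String path) {
        List<String> result = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty segment before the leading '/'
            }
            prefix.append('/').append(part);
            result.add(prefix.toString());
        }
        return result;
    }
}
```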

ap0n commented 7 years ago

Christoph, the SpringClientTwitter produces the following error. I don't know if this causes problems for the whole pipeline, but could you have a look?

clu04,w6703,2016-12-29T17:28:45.338+0200 o.a.s.z.s.NIOServerCnxnFactory [ERROR] Thread Thread[Thread-17,5,main] died
 java.lang.ThreadDeath: null
    at java.lang.Thread.stop(Thread.java:836) ~[na:1.7.0_80]
    at eu.qualimaster.data.SpringConnector.DataConnector.stopRunning(DataConnector.java:278) ~[stormjar.jar:na]
    at eu.qualimaster.data.SpringConnector.SpringClientTwitter.initConnector(SpringClientTwitter.java:67) ~[stormjar.jar:na]
    at eu.qualimaster.data.SpringConnector.SpringClientTwitter.getInstance(SpringClientTwitter.java:24) ~[stormjar.jar:na]
    at eu.qualimaster.data.impl.TwitterStreamData.TwitterStreamingDataSource.<init>(TwitterStreamingDataSource.java:60) ~[stormjar.jar:na]
    at eu.qualimaster.data.stream.source.TwitterFileStreaming.<init>(TwitterFileStreaming.java:49) ~[stormjar.jar:na]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_80]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_80]
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_80]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_80]
    at java.lang.Class.newInstance(Class.java:383) ~[na:1.7.0_80]
    at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source0Source.open(PipelineVar_7_Source0Source.java:98) ~[stormjar.jar:na]
    at backtype.storm.daemon.executor$fn__6579$fn__6594.invoke(executor.clj:522) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.util$async_loop$fn__459.invoke(util.clj:461) ~[storm-core-0.9.5.jar:0.9.5]
    at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_80]

ChristophHubeL3S commented 7 years ago

That's actually not a problem. The SpringConnector is supposed to shut down after it has done its job, so that's fine.

npav commented 7 years ago

Unfortunately, still no useful update, and no useful error to help us understand why multiple QM pipelines do not work together. All I got in the last execution was Netty reconnection errors and, after some time, some CuratorConnectionLossException errors.

I am recapping all we know so far in case anyone has an idea:

  1. Multiple non-QM topologies are executed in parallel without any problems.
  2. A single QM topology is executed in parallel with multiple other non-QM topologies without any problems.
  3. Two QM topologies cannot be executed in parallel.
  4. Components start getting restarted at random when the second topology is submitted.
  5. No useful exceptions or errors appear in the Storm logs.

I am attaching two log files in case they help anyone come up with ideas: main.txt contains the infrastructure logs (the infrastructure was updated today); merged.txt contains the Storm logs from all workers and the nimbus, merged into one file in chronological order.

main.txt merged.txt

I am thinking that, since the schedule is tight, it might be wiser to introduce some redundancy and have an alternative solution in case this issue is not resolved. The alternative would be to create a new pipeline that merges three pipelines (TransferPip, FocusPip, DynamicGraphPip) into one large pipeline. The initial idea (during the Bremen meeting) was to avoid creating a new pipeline, especially such a complicated one, because it would require a lot of work, and running multiple pipelines seemed like a good solution at the time. Yet it is taking too much time and effort, with no guarantee that the issue will be resolved.

My proposal: since this problem seems related to the QM infrastructure and not the individual pipelines/components (which are independently tested and working properly), maybe SUH (@eichelbe) could continue looking into this issue (e.g. on the Okeanos cluster), while we at the same time work on creating the full, merged pipeline. This way, whichever of the two approaches ends up working, the demo would be safe.

Thoughts?

eichelbe commented 7 years ago

We made several basic modifications yesterday to bring back the hardware profiling. Did you test with these changes? They involved some fixes to the event processing.

Which QM topologies did you test?

npav commented 7 years ago

I updated the infrastructure today, so it should include the latest changes, but I did not re-create the pipelines from QM-IConf; I used older builds. I tried FocusPip with TransferPip.

eichelbe commented 7 years ago

This is ok. @cuiqin could you please run SwitchPip and RandomPip in parallel on Okeanos? Let's try the simple things first...

eichelbe commented 7 years ago

From the log it seems that we have a similar problem even with a smaller pipeline on our cluster while other pipelines are running (the same pipeline is now running on Okeanos, but not on our side). We also observed "o.a.s.z.ClientCnxn [INFO] EventThread shut down": the worker shuts down via the original Storm shutdown hook (we checked that) while Netty is not connected and keeps trying to reconnect, or even while the worker is still loading and initializing its components.

Nathan wrote in a post that the Zookeeper may be getting overloaded. This is a plausible reason, because upon startup we also create nodes and set watchers: the bigger the pipelines, the more Zookeeper operations. And as the Zookeeper is central, an overload may cause other, already running workers to die through a session timeout. Others suggest increasing some specific buffer sizes.

We will try that and let you know. Depending on the cluster, though, it seems that even smaller pipelines can fail without parallel execution, if it is the same problem.
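For reference, the timeout/buffer tuning mentioned above usually touches settings like the following. The keys are standard ZooKeeper zoo.cfg and Storm storm.yaml options, but the values are illustrative only, not our actual configuration:

```
# zoo.cfg (illustrative values)
tickTime=2000             # base time unit for session timeouts
maxClientCnxns=200        # raise the per-host connection limit
maxSessionTimeout=60000   # allow longer sessions under startup load

# storm.yaml (illustrative values)
storm.zookeeper.session.timeout: 30000
storm.zookeeper.connection.timeout: 30000
storm.zookeeper.retry.times: 10
storm.zookeeper.retry.interval: 2000
```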

npav commented 7 years ago

Okay, let's hope it is as easy as tweaking some buffers. In the meantime, I will start working on the merged pipeline as a backup scenario until this is resolved.

eichelbe commented 7 years ago

Makes pretty much sense...

eichelbe commented 7 years ago

Another option would be to do the signals via the event bus avoiding additional Zookeeper connections. But this is rather untested...

eichelbe commented 7 years ago

I got the problematic pipeline up (the first time), but then the workers still continued dying... It could be the right direction although not the solution...

cuiqin commented 7 years ago

I ran both RandomPip and SwitchPip in parallel on the Okeanos cluster, with 3 and 5 workers respectively. Although I also saw the Netty reconnections and "EventThread shut down" logs, both pipelines are running well. If you want to check, they are still running on Okeanos now.

eichelbe commented 7 years ago

I did a series of modifications on the workers, the nimbus and the zookeepers. The problematic pipeline is coming up, but it is not stable. Workers still restart at runtime, so it seems this only defers the problem. At least in our cluster, the zookeepers are not powerful enough to handle the number of startup connections. Some sources say that the most powerful machines with the fastest hard disks should be used as zookeepers rather than as workers.

Although there are still ways to mitigate this problem, it will always limit what can be run on a cluster with zookeepers. Options (for now):

1) try sending the signals through the QM event bus to avoid creating additional Zookeeper connections (currently possible except for the nodes requiring the port manager; try it lazily, hoping that loose integration still works then)
2) shuffle the zookeeper sequence for creating the connections. Not sure whether that can help at all...
3) switch to a newer zookeeper version, hoping that it is faster
4) ... resign (hopefully not)

I'm checking the untested code for 1) and running the test cases. Looks surprisingly good, even for load shedding, replay, pipeline commands... 2) is a small modification and potentially worth a try.

I'll let you know...

eichelbe commented 7 years ago

Ok, it seems that I have good news: after some cluster reboots (for tracking down the worker shutdowns), our problematic pipeline has been running without worker restarts for several minutes. I will collect the settings later and let you know...

eichelbe commented 7 years ago

Here are the settings that worked for us...

Please make sure to properly restart the entire cluster after all modifications. It might be that the infrastructure needs some time between Storm restarts (this happened only the first few times on our side).

The full setup description is here.

Good luck...

antoine-tran commented 7 years ago

This is great news :). Let's hope the pipelines play nicely with each other :o

npav commented 7 years ago

I messaged our administrator so she can make the changes Holger proposed, since we are usually not allowed to change the configuration arbitrarily on the SoftNet cluster. Hopefully this will be done soon and I can give it another try. Thanks Holger.

eichelbe commented 7 years ago

In the meantime we are updating the LUH cluster together with Miroslav for experiments with adaptive Storm...

npav commented 7 years ago

Good news everyone!

Both FocusPip and TransferPip seem to be running at the same time on the SoftNet cluster without the "restart" problems. Will keep testing, but at the moment both generate results without problems!!!

Maybe we should keep this infrastructure version and configuration as a "backup"

eichelbe commented 7 years ago

Good news indeed :) So SoftNet cluster and LUH can run multiple pipelines now :)

A backup is a good idea; maybe we can even tag the component commits in Git...
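Tagging the known-good state could look like this (a sketch only; the repository path and tag name are hypothetical, and the real repos would already exist rather than being initialized here):

```shell
# Sketch: create a throwaway repo standing in for a component repo.
git init -q backup-demo
git -C backup-demo -c user.email=qm@example.org -c user.name=qm \
    commit -q --allow-empty -m "infrastructure state running pipelines in parallel"

# Annotated tag marking the known-good commit (hypothetical tag name).
git -C backup-demo -c user.email=qm@example.org -c user.name=qm \
    tag -a multi-pipeline-stable -m "known-good: FocusPip + TransferPip in parallel"
git -C backup-demo tag -l
```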