gaga0808 opened this issue 8 years ago
Is ZooKeeper stable? Has it ever gone down?
ZK is stable, it has been up the whole time~
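(For reference, a quick way to double-check ZK health from the shell, assuming the default client port 2181 and that ZooKeeper's own zkServer.sh is on the PATH; the host name is a placeholder:)
# "imok" in reply means the node is serving requests
echo ruok | nc zk-host-1 2181
# shows whether this node is leader/follower and that quorum is up
zkServer.sh status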
Search your logs for "Halting due to Out Of Memory Error".
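For example, something along these lines should find it (the log directory below is only a guess at a typical install; adjust it to yours):
# recursively search every JStorm log file for the OOM halt message
grep -R "Halting due to Out Of Memory Error" /path/to/jstorm/logs/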
I searched; it's not there.
gtalk with me @gaga0808 e.neverme(at)gmail.com
@unsleepy22 What does "e.neverme(at)gmail.com" mean? I don't get it...
gtalk... never mind. Then check whether the other logs show anything abnormal, e.g. .metrics.log and gc.log.
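Something like this should surface the obvious problems in those files (the paths are placeholders for wherever your JStorm logs live):
# errors or exceptions reported by the metrics subsystem
grep -iE "error|exception" /path/to/jstorm/logs/*.metrics.log
# Full GC events in the GC log, which would point at memory pressure
grep -i "full gc" /path/to/jstorm/logs/gc.log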
Thanks for the pointers~ @unsleepy22
I've run into this problem too: nimbus and supervisor keep shutting down for no obvious reason.
I restarted the ZooKeeper cluster and things improved, but the supervisor process on one JStorm node still shuts itself down after running for a few hours, and I can't find the cause. I can use the Linux supervisor tool to monitor the process and restart it whenever it goes down, but it still bugs me that I don't know the root cause.
Is there anything in the supervisor logs?
It's best to start nimbus and supervisor with something like this:
nohup jstorm nimbus >/dev/null 2>&1 & nohup jstorm supervisor >/dev/null 2>&1 &
Without nohup, the processes are sometimes killed by the terminal when you exit the session.
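A quick way to verify that the daemons really survive a closed terminal (a sketch; the grep pattern just assumes the JVM command line contains "jstorm"):
# after logging out and back in, the daemons should still be listed,
# re-parented to PID 1 and with an uptime that spans the logout
ps -eo pid,ppid,etime,cmd | grep -i jstorm | grep -v grep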
One particular supervisor among several kept shutting itself down after running for a few hours, so today I started the supervisor process on that machine the way you suggested: nohup jstorm supervisor >/dev/null 2>&1 &. So far the supervisor process on that machine has indeed kept running. Previously I had been starting it with jstorm supervisor > nohup.out 2>&1 &, i.e. without the nohup. @longdafeng
This morning I came in to find that the cluster's nimbus and several supervisors had all shut down for no apparent reason. The logs leave me baffled; I'd be grateful if someone who understands this could offer some pointers.
Supervisor log:
[INFO 2016-03-01 10:19:08 c.a.j.d.s.SyncProcessEvent:283 EventManagerImp] Successfully start worker 784579aa-9497-442e-a825-687065330707
[INFO 2016-03-01 10:19:08 c.a.j.d.s.SyncProcessEvent:283 EventManagerImp] Successfully start worker c2dc9b1e-96ec-454b-b67a-c310dcb704fe
[INFO 2016-03-01 10:19:08 c.a.j.d.s.SyncProcessEvent:283 EventManagerImp] Successfully start worker a708abc3-e9f1-40eb-86a5-435e97b469a9
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:91 Thread-8] Shutting down supervisor 40a52209-c38d-4769-abe9-2c8f8a817991
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:88 Thread-7] Supervisor has been shutdown before 40a52209-c38d-4769-abe9-2c8f8a817991
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:104 Thread-8] Successfully shutdown thread:Heartbeat
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:104 Thread-8] Successfully shutdown thread:EventManagerImp
[INFO 2016-03-01 16:20:52 c.a.j.e.EventManagerImp:78 EventManagerImp] InterruptedException when processing event
[INFO 2016-03-01 16:20:52 c.a.j.c.AsyncLoopRunnable:78 EventManagerImp] Succefully shutdown
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:104 Thread-8] Successfully shutdown thread:EventManagerImp
[INFO 2016-03-01 16:20:52 c.a.j.e.EventManagerImp:78 EventManagerImp] InterruptedException when processing event
[INFO 2016-03-01 16:20:52 c.a.j.c.AsyncLoopRunnable:78 EventManagerImp] Succefully shutdown
[INFO 2016-03-01 16:20:52 c.a.j.d.s.SupervisorManger:104 Thread-8] Successfully shutdown thread:EventManagerPusher
[INFO 2016-03-01 16:20:52 c.a.j.d.s.Httpserver:452 Thread-8] Successfully stop http server
[INFO 2016-03-01 16:20:52 c.a.j.u.JStormUtils:186 Thread-8] Halting process: !!!Shutdown!!!
[INFO 2016-03-01 16:20:53 c.a.j.d.s.Supervisor:208 main] Shutdown supervisor!!!
Nimbus log:
[INFO 2016-03-01 16:19:55 c.a.j.d.n.TopologyMetricsRunnable:175 pool-6-thread-1] cluster metrics force upload.
[INFO 2016-03-01 16:19:55 c.a.j.d.n.TopologyMetricsRunnable:607 pool-6-thread-1] send update event for cluster metrics, size : 8
[ERROR 2016-03-01 16:19:55 c.a.j.d.n.TopologyMetricsRunnable:686 TopologyMetricsRunnable] exceeding maxPendingUploadMetrics, skip metrics data for topology:CLUSTER
[INFO 2016-03-01 16:20:00 c.a.j.d.n.ServiceHandler:1256 pool-8-thread-46] Received topology metrics:NIMBUS
[ERROR 2016-03-01 16:20:00 c.a.j.d.n.TopologyMetricsRunnable:686 TopologyMetricsRunnable] exceeding maxPendingUploadMetrics, skip metrics data for topology:NIMBUS
[INFO 2016-03-01 16:20:25 c.a.j.d.n.TopologyMetricsRunnable:549 TopologyMetricsRunnable] refresh topologies, cost:1
[INFO 2016-03-01 16:20:34 c.a.j.d.n.ServiceHandler:1256 pool-8-thread-52] Received topology metrics:ping-alarm-count-1-1456798731
[INFO 2016-03-01 16:20:34 c.a.j.d.n.TopologyMetricsRunnable:355 TopologyMetricsRunnable] register metrics, topology:CLUSTER, size:8, cost:0
[ERROR 2016-03-01 16:20:34 c.a.j.d.n.TopologyMetricsRunnable:686 TopologyMetricsRunnable] exceeding maxPendingUploadMetrics, skip metrics data for topology:ping-alarm-count-1-1456798731
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:290 Thread-0] Begin to shutdown nimbus
[INFO 2016-03-01 16:20:45 c.a.j.d.n.ServiceHandler:86 Thread-0] Begin to shut down master
[INFO 2016-03-01 16:20:45 c.a.j.d.n.ServiceHandler:89 Thread-0] Successfully shut down master
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:314 Thread-0] Successfully shutdown TopologyAssign thread
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:319 Thread-0] Successfully shutdown follower thread
[INFO 2016-03-01 16:20:45 c.a.j.c.RocksDBCache:134 Thread-0] Begin to close rocketDb of /lib/soft/jstorm/data/nimbus/rocksdb
[INFO 2016-03-01 16:20:45 c.a.j.c.RocksDBCache:140 Thread-0] Successfully closed rocketDb of /lib/soft/jstorm/data/nimbus/rocksdb
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusData:262 Thread-0] Successfully shutdown Cache
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusData:265 Thread-0] Successfully shutdown ZK Cluster Instance
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusData:272 Thread-0] Successfully shutdown threadpool
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:324 Thread-0] Successfully shutdown NimbusData
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:329 Thread-0] Successfully shutdown thrift server
[INFO 2016-03-01 16:20:45 c.a.j.d.s.Httpserver:452 Thread-0] Successfully stop http server
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:334 Thread-0] Successfully shutdown httpserver
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:337 Thread-0] Successfully shutdown nimbus
[INFO 2016-03-01 16:20:45 c.a.j.u.JStormUtils:186 Thread-0] Halting process: !!!Shutdown!!!
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:286 main] Notify to quit nimbus
[INFO 2016-03-01 16:20:45 c.a.j.d.n.NimbusServer:135 main] Quit nimbus
This has happened twice now. It first occurred on Monday; I noticed it when I got to work on Tuesday and restarted everything, but since I didn't know how to fix it I decided to leave it for the moment. Then this morning (Wednesday) I found the same problem again. Oddly enough, on both Monday and Tuesday nimbus and the supervisors shut themselves down at around 16:20. We've been running JStorm for several months and it was fine before. Last weekend the company migrated some of the virtual machine nodes, and after Monday's jobs had run for a day this problem appeared. The Hadoop cluster is fine, though.~
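Since both daemons stopped at almost the same minute on two consecutive days, it may be worth ruling out an external trigger on those hosts (a scheduled job, the VM migration tooling, an operator script). A rough sketch of that check, with the log path as a placeholder:
# anything scheduled around 16:20?
crontab -l
ls /etc/cron.d/ /etc/cron.daily/
# any login/logout, reboot, or shutdown event at that time?
last -x | head -20
# do all JStorm logs on the host go quiet at the same timestamp?
grep -R "2016-03-01 16:20" /path/to/jstorm/logs/ | sort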