simonsigre opened this issue 7 years ago
As I look at this, it's the engine that seems to have an issue. The following logs are observed while it is restarting (that 5-10 minute window); it appears to be stuck in a stopping state.
2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 46 > 44 - dropped
2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 38 > 37 - dropped
2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 58 > 54 - dropped
Traceback (most recent call last):
File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 561, in _ioloop
conn.drain_events()
File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 323, in drain_events
return amqp_method(channel, args)
File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Basic.publish: (404) NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/'
<Greenlet at 0x381b9b0: <bound method AMQP._ioloop of <minemeld.comm.amqp.AMQP object at 0x35fbad0>>(9)> failed with NotFound
2017-07-18T16:40:20 (14926)amqp._ioloop_failure ERROR: _ioloop_failure: exception in ioloop
Traceback (most recent call last):
File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 567, in _ioloop_failure
g.get()
File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 251, in get
raise self._exception
NotFound: Basic.publish: (404) NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/'
2017-07-18T16:40:20 (14926)chassis.stop INFO: chassis stop called
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-spamhausDROP - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-spamhausDROP - # indicators: 831
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-dshieldblock - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-dshieldblock - # indicators: 20
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-binarydefensebanlist - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-binarydefensebanlist - # indicators: 8608
2017-07-18T16:40:54 (14912)launcher._sigterm_handler INFO: SIGTERM received
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph INFO: checkpoint_graph called, checking current state
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
[root@xxxxxxxx log]# date
Tue Jul 18 16:41:23 AEST 2017
[root@xxxxxxxx log]#
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph INFO: checkpoint_graph called, checking current state
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:41:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:42:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:43:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:44:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:45:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
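For reference, this is roughly how I am checking the engine processes and the AMQP exchanges while it is stuck. The supervisord and supervisorctl paths are assumed from the default Ansible install under /opt/minemeld; adjust if yours differ.

# status of the engine/web/traced processes under supervisord (paths assumed)
/opt/minemeld/engine/current/bin/supervisorctl -c /opt/minemeld/supervisor/config/supervisord.conf status

# list the exchanges currently declared in RabbitMQ, to check whether the
# '20170714-BL01-In-alienvaultreputation' exchange from the traceback still exists
rabbitmqctl list_exchanges name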
Hi @simonsigre,
looking at the error: NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/', did the instance run out of memory?
@jtschichold how am I able to determine this? The hosts themselves have plenty of resources in all directions (CPU, RAM and disk).
If they have plenty of resources, memory should not be a problem. Could you please check the RabbitMQ logs for errors of any sort?
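A rough sketch of what to look for, assuming a default RabbitMQ install that logs under /var/log/rabbitmq and a host where an OOM kill would show up in dmesg:

# memory or file-descriptor alarms raised by the broker
grep -iE 'memory|alarm|killed' /var/log/rabbitmq/rabbit@*.log

# whether the kernel OOM killer terminated rabbitmq or the engine
dmesg | grep -i 'out of memory'

# current memory headroom on the host
free -m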
We have recently deployed 2x MineMeld servers straight from the Ansible playbook (worked first time), both identical installs. However, we are finding that some nodes just seem to stop and others seem to not function at all. The two hosts display different results at any one time, and even commits seem to take 5-10 minutes to apply. I have reviewed the logs but can't seem to identify anything. I initially feared it was a SELinux issue, however no errors are observed in the audit.log.
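For completeness, this is roughly how I reviewed things. The engine log path is assumed from the default install under /opt/minemeld/log; ausearch is shown in case SELinux denials were not landing where I expected in audit.log.

# recent engine activity (log path assumed)
tail -n 200 /opt/minemeld/log/minemeld-engine.log

# any recent SELinux AVC denials
ausearch -m avc -ts recent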