PaloAltoNetworks / minemeld-ansible

Ansible playbook for installing MineMeld on Linux
Apache License 2.0

CentOS7 feels 'buggy' #15

Open simonsigre opened 7 years ago

simonsigre commented 7 years ago

We recently deployed two MineMeld servers straight from the Ansible playbook (it worked the first time), both identical installs. However, we are finding that some nodes seem to stop and others do not function at all. The two hosts display different results at any one time, and even commits take 5-10 minutes to apply. I have reviewed the logs but can't identify anything. I initially feared it was an SELinux issue; however, no errors are observed in the /audit.log

simonsigre commented 7 years ago

Looking at this further, it's the engine that seems to have the issue. The following logs are observed while it is restarting (those 5-10 minutes); it seems to be stuck in a stopping state:

2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 46 > 44 - dropped
2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 38 > 37 - dropped
2017-07-18T16:40:11 (14912)mgmtbus._merge_status ERROR: old clock: 58 > 54 - dropped
Traceback (most recent call last):
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 561, in _ioloop
    conn.drain_events()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 323, in drain_events
    return amqp_method(channel, args)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Basic.publish: (404) NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/'
<Greenlet at 0x381b9b0: <bound method AMQP._ioloop of <minemeld.comm.amqp.AMQP object at 0x35fbad0>>(9)> failed with NotFound

2017-07-18T16:40:20 (14926)amqp._ioloop_failure ERROR: _ioloop_failure: exception in ioloop
Traceback (most recent call last):
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 567, in _ioloop_failure
    g.get()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 251, in get
    raise self._exception
NotFound: Basic.publish: (404) NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/'
2017-07-18T16:40:20 (14926)chassis.stop INFO: chassis stop called
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-spamhausDROP - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-spamhausDROP - # indicators: 831
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-dshieldblock - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-dshieldblock - # indicators: 20
2017-07-18T16:40:20 (14926)base.state INFO: 20170714-BL01-In-binarydefensebanlist - transitioning to state 8
2017-07-18T16:40:20 (14926)basepoller.stop INFO: 20170714-BL01-In-binarydefensebanlist - # indicators: 8608
2017-07-18T16:40:54 (14912)launcher._sigterm_handler INFO: SIGTERM received
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph INFO: checkpoint_graph called, checking current state
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
[root@xxxxxxxx log]# date
Tue Jul 18 16:41:23 AEST 2017
[root@xxxxxxxx log]#

simonsigre commented 7 years ago

2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph INFO: checkpoint_graph called, checking current state
2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:41:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:42:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:43:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:44:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
2017-07-18T16:45:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
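
To put a number on how long the engine sits in this state, the wait can be measured from the engine log itself. A minimal sketch, assuming only the line format visible in the log excerpt (timestamp, PID in parentheses, source, level, message); the names LINE_RE and checkpoint_wait_seconds are hypothetical and not part of MineMeld:

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt in this thread:
# 2017-07-18T16:40:54 (14912)mgmtbus.checkpoint_graph ERROR: some nodes not started yet, waiting
# The lookahead ends a message at the next timestamp or at end of line,
# so entries pasted onto a single line are still separated.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\)(?P<source>\S+) (?P<level>[A-Z]+): "
    r"(?P<msg>.*?)(?=\s+\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} \(|\s*$)",
    re.MULTILINE,
)


def checkpoint_wait_seconds(text):
    """Seconds between the first and last 'waiting' message from checkpoint_graph."""
    stamps = []
    for match in LINE_RE.finditer(text):
        if (match.group("source") == "mgmtbus.checkpoint_graph"
                and "waiting" in match.group("msg")):
            stamps.append(datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S"))
    if not stamps:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds()
```

Passing the contents of the engine log to checkpoint_wait_seconds gives the span covered by the repeated "some nodes not started yet, waiting" messages; lines that do not match the assumed format, such as tracebacks, are skipped.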

jtschichold commented 7 years ago

Hi @simonsigre, looking at the error: NOT_FOUND - no exchange '20170714-BL01-In-alienvaultreputation' in vhost '/', did the instance run out of memory?
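
One way to check this is to look for OOM-killer messages in the kernel log. A minimal sketch, assuming the kernel log is available as plain text (on CentOS 7 that is typically /var/log/messages, which is an assumption about this setup, not something confirmed in this thread); the names OOM_RE and find_oom_lines are hypothetical:

```python
import re

# Phrases the Linux kernel writes when the OOM killer runs.
OOM_RE = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_RE.search(line)]
```

Calling find_oom_lines on an open file handle for the kernel log returns any matching lines; an empty result suggests no process was killed for memory during the period that log covers.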

simonsigre commented 7 years ago

@jtschichold how can I determine this? The hosts themselves have plenty of resources across the board (CPU, RAM and disk).

jtschichold commented 7 years ago

If they have plenty of resources, memory should not be a problem. Could you please check the RabbitMQ logs for errors of any sort?
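
A quick way to do that is to scan every log file in the RabbitMQ log directory for lines mentioning an error or a crash. A minimal sketch, assuming the logs are plain text under /var/log/rabbitmq (the usual location for packaged installs, but an assumption here); error_lines and scan_log_dir are hypothetical helper names:

```python
import glob
import os
import re

# Broad, case-insensitive match; expect some noise in the results.
ERROR_RE = re.compile(r"error|crash", re.IGNORECASE)


def error_lines(lines):
    """Return (line_number, text) pairs for lines mentioning an error or crash."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_RE.search(line)
    ]


def scan_log_dir(directory="/var/log/rabbitmq"):
    """Map each *.log file in the directory (assumed path) to its matching lines."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.log"))):
        with open(path, errors="replace") as handle:
            hits = error_lines(handle)
        if hits:
            results[path] = hits
    return results
```

The match is deliberately loose, so the output is a starting point for reading the surrounding log entries rather than a diagnosis.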