agccie / ACI-EnhancedEndpointTracker


503 error #48

Open xtprox opened 5 years ago

xtprox commented 5 years ago

Hello! I'm getting this quite often.

Loading App
(503) failed to connect to mongo database

The only way to return to normal operation is to disable the application and enable it again. APIC version 4.0.3d.

agccie commented 5 years ago

What version of the app are you running? Also, can you confirm whether this is the full version or Mini? Generally, "failed to connect to db" means the db is no longer running, and that usually means it ran out of memory and was killed by the kernel. The app has relatively verbose logs for each component, including the db. You can collect them via an app-techsupport. It may also be useful to check dmesg on each APIC for any OOM messages.
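For example, something along these lines should surface kernel OOM kills in dmesg (a minimal sketch; the grep pattern is just a suggestion and the prompt is illustrative):

apic1# dmesg -T | grep -iE 'out of memory|oom'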

xtprox commented 5 years ago

@agccie App version 2.1, full version. I will look into the logs and try to find something related to this failure.

xtprox commented 5 years ago

@agccie maybe you could give me advice on what I should look for and in which files, because there are so many of them.

Here is some info I found in the manager_m1.log file on APIC-3:

149||mgr-tracker||UTC 2019-08-27 20:04:57.376||DEBUG||ept_manager.py:(388)||manager stop fabric: <**********> (worker heartbeat timeout [w14, w13, w12, w11, w10, w5, w4, w3, w2, w1, w0])
149||mgr-stats||UTC 2019-08-27 20:04:58.897||DEBUG||common.py:(207)||failed to connect to mongo db: connection pool: connect failed sh0/db1-sh0-cisco-enhancedendpointtracker.service.apic.local:6202 : connect failed to replica set sh0/db1-sh0-cisco-enhancedendpointtracker.service.apic.local:6202
149||mgr-stats||UTC 2019-08-27 20:04:58.897||ERROR||ept_manager.py:(465)||db no longer reachable/alive
149||mgr-main||UTC 2019-08-27 20:04:58.899||ERROR||ept_manager.py:(96)||Traceback:
Traceback (most recent call last):
  File "app/models/aci/ept/ept_manager.py", line 94, in run
    self._run()
  File "app/models/aci/ept/ept_manager.py", line 140, in _run
    (q, data) = self.redis.blpop(MANAGER_WORK_QUEUE)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1269, in blpop
    return self.execute_command('BLPOP', *keys)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 668, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 680, in parse_response
    response = connection.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 390, in read_response
    bufflen = recv_into(self._sock, self._buffer)
  File "/usr/local/lib/python2.7/dist-packages/redis/_compat.py", line 73, in recv_into
    return _retryable_call(sock, sock.recv_into, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/redis/_compat.py", line 58, in _retryable_call
    return func(*args, **kwargs)
KeyboardInterrupt

149||mgr-main||UTC 2019-08-27 20:04:58.899||DEBUG||common.py:(542)||exiting background thread: mgr-stats
149||mgr-main||UTC 2019-08-27 20:04:58.900||DEBUG||common.py:(542)||exiting background thread: mgr-tracker
149||mgr-main||UTC 2019-08-27 20:04:58.900||DEBUG||utils.py:(545)||sending SIGTERM to pid(231)
231||sub-main||UTC 2019-08-27 20:04:58.900||ERROR||utils.py:(508)||exiting due to signal (15)SIGTERM
231||sub-main||UTC 2019-08-27 20:04:58.901||ERROR||ept_subscriber.py:(208)||Traceback:
Traceback (most recent call last):
  File "app/models/aci/ept/ept_subscriber.py", line 204, in run
    self._run()
  File "app/models/aci/ept/ept_subscriber.py", line 532, in _run
    time.sleep(self.subscription_check_interval)
KeyboardInterrupt

231||sub-main||UTC 2019-08-27 20:04:58.902||INFO||subscription_ctrl.py:(148)||unsubscribe: (alive:True, thread:True)
231||sub-main||UTC 2019-08-27 20:04:58.902||DEBUG||subscription_ctrl.py:(159)||setting ctrl to close
231||sub-main||UTC 2019-08-27 20:04:58.902||DEBUG||subscription_ctrl.py:(164)||waiting for worker thread to exit

I've also checked dmesg and didn't find anything suspicious.

agccie commented 5 years ago

You can also look at the logs on the APIC directly. You first need to determine which APIC the app is running on. In the example below, you can see that all services are running on apic1. Go to /data2/logs/Cisco_EnhancedEndpointTracker and all the logs will be there.


fab4-apic1# acidiag scheduler appstatus Cisco_EnhancedEndpointTracker-service-primary

Container            Node             Status       Modified             Taskgroup                                Image
----------------------------------------------------------------------------------------------------------------------------------
ept_mgr              fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_redis            fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_watch            fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker1          fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker2          fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker3          fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_db1              fab4-apic1       running      15d 10h 55m  2s      Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
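
Once you are in that directory, a quick way to spot problems across the many log files is to grep for error-level messages (just a sketch based on the log format shown above; adjust the pattern and path as needed):

fab4-apic1# grep -ri '||ERROR||' /data2/logs/Cisco_EnhancedEndpointTracker/ | tail -n 50
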
xtprox commented 5 years ago

For some reason I get no output at all when executing acidiag scheduler appstatus

APIC-1# acidiag scheduler appstatus

APIC-1#

agccie commented 5 years ago

I'm not sure in which version the appstatus details were added; it may not be there in 4.0.3. However, you found the logs, so you are good to go :-)

From the manager logs you provided, we can see that it restarted because it also failed to connect to mongo:

149||mgr-stats||UTC 2019-08-27 20:04:58.897||ERROR||ept_manager.py:(465)||db no longer reachable/alive

The corresponding traceback is expected. Either mongo crashed/restarted because of an error, or memory was exhausted and the kernel killed the container. Can you check the latest mongo logs:

/data2/logs/Cisco_EnhancedEndpointTracker/mongo/db1/

You would need to check the latest log in each of mongos, cfg, sh0, sh1, and sh2.
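
For example, something like the following (the directory layout is per the path above; the actual log file names will differ, so <latest-log-file> is just a placeholder):

apic1# cd /data2/logs/Cisco_EnhancedEndpointTracker/mongo/db1
apic1# ls -lt mongos cfg sh0 sh1 sh2
apic1# tail -n 100 sh0/<latest-log-file>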

xtprox commented 5 years ago

@agccie

Can you check the latest mongo logs

On all three controllers this folder is empty.

xtprox commented 5 years ago

I will probably wait until the app hangs again with this error and then try to collect the logs.