xtprox opened this issue 5 years ago
What version of the app are you running? Also, can you confirm whether this is the full version or Mini? Generally, "failed to connect to db" means the db is no longer running, which in turn usually means it ran out of memory and was killed by the kernel. The app keeps relatively verbose logs for each component, including the db; you can collect them via an app-techsupport. It may also be useful to check dmesg on each APIC for any OOM messages.
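For example, a minimal sketch of that dmesg check (assuming bash access on each APIC; the grep patterns are just the usual OOM-killer messages, and -T can be dropped if your dmesg does not support it):

# look for kernel OOM-killer activity around the time of the failure
dmesg -T | grep -iE "out of memory|oom-killer|killed process"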
@agccie App version 2.1, full version. I will look into the logs and try to find something related to this failure.
@agccie Maybe you could give me advice on what I should look for and in which files, because there are so many of them.
Here is a little info I found in the APIC-3 manager_m1.log file:
149||mgr-tracker||UTC 2019-08-27 20:04:57.376||DEBUG||ept_manager.py:(388)||manager stop fabric: <**********> (worker heartbeat timeout [w14, w13, w12, w11, w10, w5, w4, w3, w2, w1, w0])
149||mgr-stats||UTC 2019-08-27 20:04:58.897||DEBUG||common.py:(207)||failed to connect to mongo db: connection pool: connect failed sh0/db1-sh0-cisco-enhancedendpointtracker.service.apic.local:6202 : connect failed to replica set sh0/db1-sh0-cisco-enhancedendpointtracker.service.apic.local:6202
149||mgr-stats||UTC 2019-08-27 20:04:58.897||ERROR||ept_manager.py:(465)||db no longer reachable/alive
149||mgr-main||UTC 2019-08-27 20:04:58.899||ERROR||ept_manager.py:(96)||Traceback:
Traceback (most recent call last):
File "app/models/aci/ept/ept_manager.py", line 94, in run
self._run()
File "app/models/aci/ept/ept_manager.py", line 140, in _run
(q, data) = self.redis.blpop(MANAGER_WORK_QUEUE)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1269, in blpop
return self.execute_command('BLPOP', *keys)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 668, in execute_command
return self.parse_response(connection, command_name, **options)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 390, in read_response
bufflen = recv_into(self._sock, self._buffer)
File "/usr/local/lib/python2.7/dist-packages/redis/_compat.py", line 73, in recv_into
return _retryable_call(sock, sock.recv_into, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/redis/_compat.py", line 58, in _retryable_call
return func(*args, **kwargs)
KeyboardInterrupt
149||mgr-main||UTC 2019-08-27 20:04:58.899||DEBUG||common.py:(542)||exiting background thread: mgr-stats
149||mgr-main||UTC 2019-08-27 20:04:58.900||DEBUG||common.py:(542)||exiting background thread: mgr-tracker
149||mgr-main||UTC 2019-08-27 20:04:58.900||DEBUG||utils.py:(545)||sending SIGTERM to pid(231)
231||sub-main||UTC 2019-08-27 20:04:58.900||ERROR||utils.py:(508)||exiting due to signal (15)SIGTERM
231||sub-main||UTC 2019-08-27 20:04:58.901||ERROR||ept_subscriber.py:(208)||Traceback:
Traceback (most recent call last):
File "app/models/aci/ept/ept_subscriber.py", line 204, in run
self._run()
File "app/models/aci/ept/ept_subscriber.py", line 532, in _run
time.sleep(self.subscription_check_interval)
KeyboardInterrupt
231||sub-main||UTC 2019-08-27 20:04:58.902||INFO||subscription_ctrl.py:(148)||unsubscribe: (alive:True, thread:True)
231||sub-main||UTC 2019-08-27 20:04:58.902||DEBUG||subscription_ctrl.py:(159)||setting ctrl to close
231||sub-main||UTC 2019-08-27 20:04:58.902||DEBUG||subscription_ctrl.py:(164)||waiting for worker thread to exit
I've also checked dmesg and didn't find anything suspicious.
You can also look at the logs on the APIC directly. You need to determine which APIC the app is running on. In the example below, you can see that all services are running on apic1. Go to /data2/logs/Cisco_EnhancedEndpointTracker and all the logs will be there.
fab4-apic1# acidiag scheduler appstatus Cisco_EnhancedEndpointTracker-service-primary
Container Node Status Modified Taskgroup Image
----------------------------------------------------------------------------------------------------------------------------------
ept_mgr fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_redis fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_watch fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker1 fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker2 fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_worker3 fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
ept_db1 fab4-apic1 running 15d 10h 55m 2s Cisco_EnhancedEndpointTracker-service-primary.group-primary enhancedendpointtracker:2-1-1c
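Once you are on the APIC where the containers are running, a quick way to see which app log files were updated most recently (just a sketch, assuming a standard bash shell on the APIC):

cd /data2/logs/Cisco_EnhancedEndpointTracker
ls -lt | head    # most recently modified files first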
For some reason I get no output at all when executing acidiag scheduler appstatus:
APIC-1# acidiag scheduler appstatus
APIC-1#
I'm not sure in which version the appstatus details were added; they may not be there in 4.0.3. However, you found the logs, so you are good to go :-)
From the manager logs you provided, we can see that it restarted because it also failed to connect to mongo:
149||mgr-stats||UTC 2019-08-27 20:04:58.897||ERROR||ept_manager.py:(465)||db no longer reachable/alive
The corresponding traceback is expected. Either mongo crashed/restarted because of an error, or memory was exhausted and the kernel killed the container. Can you check the latest mongo logs:
/data2/logs/Cisco_EnhancedEndpointTracker/mongo/db1/
You would need to check the latest logs in each of mongos, cfg, sh0, sh1, and sh2.
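For example, something along these lines (a rough sketch based on the db1 path above; adjust the path if your db container is named differently) would show the tail of the newest log in each of those directories:

for d in mongos cfg sh0 sh1 sh2; do
    echo "=== $d ==="
    dir=/data2/logs/Cisco_EnhancedEndpointTracker/mongo/db1/$d
    # newest file in the directory, then its last lines
    latest=$(ls -t "$dir" | head -1)
    tail -n 50 "$dir/$latest"
done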
@agccie
"Can you check the latest mongo logs"

On all three controllers this folder is empty.
I will probably wait until the app hangs again with this error and then try to collect the logs.
Hello! I'm getting this quite often.
The only way to return to normal operation is to disable the application and enable it again. APIC 4.0.3d.