atsign-foundation / at_server

The software implementation of Atsign's core technology
https://docs.atsign.com
BSD 3-Clause "New" or "Revised" License

Investigate ability to detect out-of-memory-related restart looping #1318

Open gkc opened 1 year ago

gkc commented 1 year ago

Is your feature request related to a problem? Please describe.

We need a way to detect out-of-memory-related restart looping

Describe the solution you'd like

Have a tool which listens for events such as the Docker swarm manager killing a container. The tool could then check whether the container was killed because its memory usage exceeded its cap, and also whether it had previously been killed for the same reason within the last N (e.g. 10) minutes.

See https://docs.docker.com/engine/reference/commandline/events/
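The proposal above can be sketched as a small watcher (a hypothetical tool, not part of at_server; field names follow the `docker events` JSON format, and exit code 137 is used here as the usual SIGKILL/OOM-kill signature):

```python
"""Sketch of an OOM restart-loop watcher: streams `docker events` as JSON and
flags a service whose container exits with code 137 (SIGKILL, the usual
signature of an OOM kill) twice within a configurable window.
Hypothetical tool, not part of at_server."""
import json
import subprocess

WINDOW_SECS = 10 * 60  # "N (e.g. 10) minutes" from the proposal above


class OomLoopDetector:
    """Tracks suspected-OOM kill times per service and flags restart loops."""

    def __init__(self, window_secs=WINDOW_SECS):
        self.window = window_secs
        self.last_kill = {}  # service name -> unix time of last exit-137 die

    def observe(self, event):
        """Feed one docker event dict; return the service name if it appears
        to be in an OOM restart loop, else None."""
        if event.get("Type") != "container" or event.get("Action") != "die":
            return None
        attrs = event.get("Actor", {}).get("Attributes", {})
        if attrs.get("exitCode") != "137":  # not a SIGKILL exit
            return None
        # Swarm tasks carry the service name as a label; fall back to the
        # container name for plain containers.
        name = attrs.get("com.docker.swarm.service.name",
                         attrs.get("name", "?"))
        now = event.get("time", 0)
        prev = self.last_kill.get(name)
        self.last_kill[name] = now
        if prev is not None and now - prev <= self.window:
            return name
        return None


def watch():
    """Attach to the local daemon's event stream and print alerts."""
    detector = OomLoopDetector()
    proc = subprocess.Popen(
        ["docker", "events", "--filter", "type=container",
         "--filter", "event=die", "--format", "{{json .}}"],
        stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        looping = detector.observe(json.loads(line))
        if looping:
            print(f"ALERT: {looping} was OOM-killed twice within {WINDOW_SECS}s")
```

Exit code 137 also covers a manual `docker kill`, so a production version might instead correlate with the separate `oom` container event the daemon emits before the `die`.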

Describe alternatives you've considered

No response

Additional context

Linked to https://github.com/atsign-foundation/at_server/issues/1303

athandle commented 1 year ago

1st incident - logs from @kumarnarendra701; @sitaram-kalluri is checking

root@swarm0002-01:~# docker service logs -f uc0c4qz75h98k830v3nimbf1d
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | INFO|2023-04-25 12:32:20.817282|AtSecondaryServer|currentAtSign : @foolishgemini1
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | 
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | INFO|2023-04-25 12:32:20.829918|HiveBase|commit_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | 
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | INFO|2023-04-25 12:32:20.896658|HiveBase|access_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | 
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    | INFO|2023-04-25 12:32:27.270485|HiveBase|notifications_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal    |

kumarnarendra701 commented 1 year ago

2nd incident - @snowthe18raw (5835)

Logs -

INFO|2023-05-01 12:37:56.179259|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.209010|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.240765|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:38:04.473256|HiveBase|notifications_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.763544|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.774617|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.790753|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
^C
athandle commented 1 year ago

@sitaram-kalluri @murali-shris can you add your comments on this ticket?

gkc commented 1 year ago

@athandle : @sitaram-kalluri has made good progress in profiling memory usage on #1303 and has PR #1428 which reduces memory consumption during startup and in steady state

However, this ticket is about detecting out-of-memory restarts at the swarm level. See "Describe the solution you'd like" in this ticket's description above

kumarnarendra701 commented 10 months ago

Moving to the next sprint

cpswan commented 10 months ago

Moving to PR72. @kumarnarendra701 can you please make sure this one's on your list for this sprint.

kumarnarendra701 commented 10 months ago

@cpswan @gkc - Can we use swarmprom to monitor our swarm clusters? If we use Docker events instead, we'd have to set up a custom script on each of our swarm nodes. Please share your thoughts, and let me know if you see any other monitoring tools.

https://dockerswarm.rocks/swarmprom/

cpswan commented 10 months ago

Nice find @kumarnarendra701 Swarmprom looks really nice.

Let's get it set up on staging and see how we get on with it (and whether it can solve the problem we're looking at here).

kumarnarendra701 commented 9 months ago

@cpswan - I set up Swarmprom in our staging environment and am currently exploring its UI. Moving this to the next sprint for further work.

kumarnarendra701 commented 9 months ago

@cpswan - Didn't get a chance to work on this tool; moving to the next sprint.

kumarnarendra701 commented 8 months ago

@cpswan - Moving to the next sprint. I'm seeing an issue in the Swarmprom UI; I'll post further progress on the ticket.

kumarnarendra701 commented 8 months ago

@cpswan Quick update on Swarmprom - I'm currently experiencing some trouble as I'm unable to view all swarm nodes and their services on Swarmprom UI. Finding a solution has been quite challenging as there is very little documentation available on the internet. However, I'm actively working on resolving this issue and will keep you updated on the progress of the setup. cc: @athandle

kumarnarendra701 commented 7 months ago

@cpswan - I am seeing issues with the Swarmprom setup, as development on its repository has stopped. I have therefore found another tool and am exploring Portainer.

image cc: @athandle

kumarnarendra701 commented 7 months ago

@cpswan - Portainer UI setup is complete; I'm facing some issues with agent connectivity and am working on them.

cpswan commented 7 months ago

@cconstab I know that you tried Portainer a while ago, so it would be good to get your feedback on it.

kumarnarendra701 commented 7 months ago

@cpswan - I used Portainer in my staging Swarm cluster, but I noticed it's mainly for managing Docker Swarm itself and doesn't focus much on monitoring. Also, it gives limited visibility into stacks that were created outside of Portainer.

cconstab commented 7 months ago

I found it worked OK in small setups like my home lab, but it did not scale well to our setup. It used to, at the least, become laggy and unreliable. I also had security concerns.

My take in the end was to use the CLI, and if we needed tools, to look elsewhere.

The Portainer team also caught the "k8s" bug pretty badly, and that started to pull the project away from Swarm mode.

This was 2 years back so things may well have changed.

cpswan commented 7 months ago

Bumping to PR78 so that @kumarnarendra701 can continue. I've suggested:

If Portainer isn't suitable then maybe go back to swarmprom and let's see what it might take to get it up to scratch.

kumarnarendra701 commented 6 months ago

@cpswan - I tried to set up Swarmprom on a staging cluster. While all of its services seem to be running fine, the Swarm nodes dashboard only shows data for one node rather than the whole cluster. I've tried to find a solution, but it's proving very difficult to debug given the limited documentation available online. Active service:

image

Swarm UI:

image

Setup information:
Server: staging0001-01
Dir: /root/swarmprom
Command to start swarmprom:
ADMIN_USER=atadmin ADMIN_PASSWORD=**** SLACK_URL=https://hooks.slack.com/services/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 SLACK_CHANNEL=devops-alerts SLACK_USER=alertmanager docker stack deploy -c docker-compose.yml mon

Can you please quickly review this and let me know if you notice any issues with the setup? cc: @athandle

cpswan commented 6 months ago

@kumarnarendra701 looks like the mon_dockerd-exporter containers are unable to send their data:

...
17/Jan/2024:14:18:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:19:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused

My fault finding process:

I'd call out that 172.18.x addresses aren't in the LAN range for that Swarm.
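For reference (an assumption about the cause, not confirmed in this thread): swarmprom's dockerd-exporter scrapes the Docker daemon's own Prometheus endpoint on the docker_gwbridge address at port 9323. That endpoint is disabled by default, so "connection refused" is expected unless metrics are enabled in /etc/docker/daemon.json on every node and the daemon is restarted, along the lines of:

```json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
```

Older Docker versions require `experimental` for `metrics-addr` to take effect; also note that binding to 0.0.0.0 exposes the endpoint beyond the bridge network, so a tighter bind address may be preferable.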

kumarnarendra701 commented 6 months ago

@cpswan - Thanks for your input. I tried running "swarmprom" in the secondary Docker network, but it failed. Although I can ping the IP from the container, I cannot connect to port 9323.

Errors -

19/Jan/2024:13:33:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:33:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
staging0001-04 ~ # docker exec -it fe83602d5257 sh
/www # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1): 56 data bytes
64 bytes from 172.18.0.1: seq=0 ttl=64 time=0.239 ms
64 bytes from 172.18.0.1: seq=1 ttl=64 time=0.111 ms

64 bytes from 172.18.0.1: seq=2 ttl=64 time=0.126 ms
64 bytes from 172.18.0.1: seq=3 ttl=64 time=0.131 ms
64 bytes from 172.18.0.1: seq=4 ttl=64 time=0.111 ms
^C
--- 172.18.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.111/0.143/0.239 ms
/www # 
/www # 
/www # 
/www # telnet 172.18.0.1 9323
telnet: can't connect to remote host (172.18.0.1): Connection refused
/www # 
/www # 
/www # exit
staging0001-04 ~ # 
staging0001-04 ~ # 
staging0001-04 ~ # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1) 56(84) bytes of data.
64 bytes from 172.18.0.1: icmp_seq=1 ttl=64 time=0.192 ms
64 bytes from 172.18.0.1: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 172.18.0.1: icmp_seq=3 ttl=64 time=0.071 ms
^C
--- 172.18.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.051/0.104/0.192/0.062 ms

The IP it's trying to connect to is the Docker gateway bridge network:

docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::42:82ff:fe2f:5115  prefixlen 64  scopeid 0x20<link>
        ether 02:42:82:2f:51:15  txqueuelen 0  (Ethernet)
        RX packets 1126064  bytes 308309940 (294.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1351476  bytes 134159319 (127.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cc: @athandle

athandle commented 4 months ago

Reduced SP and moved to next sprint

cpswan commented 2 months ago

@kumarnarendra701 can you please try to get back into this and see if you can resolve the network issues.

kumarnarendra701 commented 2 months ago

@cpswan - I've tried using Swarmprom several times, but it looks like the repository was archived 4 years ago and there are very few blog posts about it. It seems like we might need to consider using other monitoring tools, but most tools are designed for Kubernetes with very few options for Docker swarm monitoring. If you know of any tools that can monitor a swarm cluster, please suggest them so that I can start implementing them. cc: @gkc

gkc commented 2 months ago

I've started running docker events to a log on each swarm, I'll take a look at the output tomorrow

gkc commented 1 month ago

I did look at the output and all interesting events are being logged. There weren't any "too much memory being used" restarts when last I looked after a couple of days; I will look again at the weekend

gkc commented 4 weeks ago

docker events has indeed been reporting container die messages which include the exit code - i.e. docker events produces enough information to allow creation of a script which listens to and acts on the event stream, as described in this issue's original description.
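A minimal sketch of the offline half of such a script (file name, log format, and exit-code heuristic are assumptions): given a per-node log captured with `docker events --filter type=container --filter event=die --format '{{json .}}' >> die-events.jsonl`, summarise suspected OOM kills per service:

```python
"""Hypothetical helper: summarise a captured docker-events JSON log,
counting exit-code-137 deaths (the usual OOM-kill signature) per service."""
import json
from collections import Counter


def count_oom_suspects(lines):
    """Return a Counter mapping service/container name -> exit-137 die count."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("Action") != "die":
            continue
        attrs = event.get("Actor", {}).get("Attributes", {})
        if attrs.get("exitCode") == "137":
            # Prefer the swarm service label; fall back to the container name.
            counts[attrs.get("com.docker.swarm.service.name",
                             attrs.get("name", "?"))] += 1
    return counts
```

For example, `count_oom_suspects(open("die-events.jsonl"))` could run periodically, alerting on any service whose count climbs across consecutive windows.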

gkc commented 2 weeks ago

I will create a script during this sprint and do some testing via my atServer to verify it