Open gkc opened 1 year ago
1st incident - logs from @kumarnarendra701, @sitaram-kalluri is checking
root@swarm0002-01:~# docker service logs -f uc0c4qz75h98k830v3nimbf1d
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.817282|AtSecondaryServer|currentAtSign : @foolishgemini1
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.829918|HiveBase|commit_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.896658|HiveBase|access_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:27.270485|HiveBase|notifications_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
2nd incident - @snowthe18raw (5835)
Logs -
INFO|2023-05-01 12:37:56.179259|AtSecondaryServer|currentAtSign : @snowthe18raw
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal | INFO|2023-05-01 12:37:56.209010|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal | INFO|2023-05-01 12:37:56.240765|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal | INFO|2023-05-01 12:38:04.473256|HiveBase|notifications_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal | INFO|2023-05-01 12:38:53.763544|AtSecondaryServer|currentAtSign : @snowthe18raw
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal | INFO|2023-05-01 12:38:53.774617|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal |
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal | INFO|2023-05-01 12:38:53.790753|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal |
^C
@sitaram-kalluri @murali-shris can you add your comments on this ticket
@athandle : @sitaram-kalluri has made good progress in profiling memory usage on #1303 and has PR #1428 which reduces memory consumption during startup and in steady state
However, this ticket is about detecting out-of-memory restarts at the swarm level. See "Describe the solution you'd like" in this ticket's description above
Moving to the next sprint
Moving to PR72. @kumarnarendra701 can you please make sure this one's on your list for this sprint.
@cpswan @gkc - Can we use this swarmprom to monitor our swarm clusters? Because if we use Docker events, we'd have to set up our custom script on each of our swarm nodes. Please share your thoughts and let me know if you see any other monitoring tools.
Nice find @kumarnarendra701 Swarmprom looks really nice.
Let's get it set up on staging and see how we get on with it (and whether it can solve the problem we're looking at here).
@cpswan - I set up Swarmprom in our staging environment and am currently exploring the UI. Moving this to the next sprint for further work.
@cpswan - Didn't get a chance to work on this tool; moving it to the next sprint.
@cpswan - Moving to the next sprint. I'm seeing the issue in Swarmprom UI. I'll update further progress on the ticket.
@cpswan Quick update on Swarmprom - I'm currently experiencing some trouble as I'm unable to view all swarm nodes and their services on Swarmprom UI. Finding a solution has been quite challenging as there is very little documentation available on the internet. However, I'm actively working on resolving this issue and will keep you updated on the progress of the setup. cc: @athandle
@cpswan - I'm seeing issues with the Swarmprom setup, as development on its repository has stopped. I've found another tool and am exploring Portainer.
cc: @athandle
@cpswan - The Portainer UI setup is complete; I'm facing some agent connectivity issues and am working on them.
@cconstab I know that you tried Portainer a while ago, so it would be good to get your feedback on it.
@cpswan - I used Portainer in my staging Swarm cluster, but I noticed it's mainly for managing Docker Swarm itself and doesn't focus much on monitoring. Also, it doesn't give much visibility into stacks created outside of Portainer.
I found it worked OK in small setups like my home lab but did not scale well to our setup. At the very least it used to become laggy and unreliable. I also had security concerns.
My take in the end was to use the CLI, and if we needed tools, to look elsewhere.
The Portainer team also caught the "k8s" bug pretty badly, and that started to pull the project away from Swarm mode.
This was 2 years back so things may well have changed.
Bumping to PR78 so that @kumarnarendra701 can continue. I've suggested:
If Portainer isn't suitable then maybe go back to swarmprom and let's see what it might take to get it up to scratch.
@cpswan - I tried to set up Swarmprom on a staging cluster, and while all services seem to be working fine, I'm unable to view data for all cluster nodes - the Swarm node view only shows one node. I've tried to find a solution for this, but it's proving very difficult to debug due to the limited documentation available online. Active services:
Swarm UI:
Setup information:
Server: staging0001-01
Dir: /root/swarmprom
Command to start swarmprom:
ADMIN_USER=atadmin ADMIN_PASSWORD=**** SLACK_URL=https://hooks.slack.com/services/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 SLACK_CHANNEL=devops-alerts SLACK_USER=alertmanager docker stack deploy -c docker-compose.yml mon
Can you please quickly review this and let me know if you notice any issues with the setup? cc: @athandle
@kumarnarendra701 looks like the mon_dockerd-exporter containers are unable to reach the Docker daemon metrics they're meant to expose:
...
17/Jan/2024:14:18:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:19:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
My fault finding process:
docker service ps mon_dockerd-exporter
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
j9l9814758ti mon_dockerd-exporter.f7fctkgsxqyzqbz2qvivpvmc2 stefanprodan/caddy:latest staging0001-03.us-central1-c.c.development-305719.internal Running Running 6 days ago
q3d5yyzdkxcv mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v stefanprodan/caddy:latest staging0001-04.us-central1-a.c.development-305719.internal Running Running 6 days ago
pn83y3720y3r mon_dockerd-exporter.hdyr1xdcacg9ahrsvec1n5jp6 stefanprodan/caddy:latest staging0001-01 Running Running 6 days ago
1ff406n2eqca mon_dockerd-exporter.ipumfp1ioq0ue6vmcw6y70he2 stefanprodan/caddy:latest staging0001-06.us-central1-c.c.development-305719.internal Running Running 6 days ago
vxj7ti0mvuth mon_dockerd-exporter.njs9res7cc75ny27qo9ixtsgy stefanprodan/caddy:latest staging0001-05.us-central1-b.c.development-305719.internal Running Running 6 days ago
ug2ri84zlfm6 mon_dockerd-exporter.pt2qgrl1usnb3iqbbal8v96h2 stefanprodan/caddy:latest staging0001-02 Running Running 6 days ago
docker logs mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v.q3d5yyzdkxcvqykdfr0kcqkqw
which yields the log snippet above. I'd call out that 172.18.x addresses aren't in the LAN range for that Swarm.
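Connection refused on 9323 usually means the Docker daemon on the node isn't exposing its Prometheus metrics endpoint at all. If that's what's happening here, one possible fix (a sketch only - the daemon.json path is the standard one, and whether the "experimental" flag is still needed depends on the Engine version) would be to enable the metrics address on each node and restart dockerd:

# /etc/docker/daemon.json - add or merge these keys on every swarm node
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

# then restart the daemon on that node
sudo systemctl restart docker

Binding to the docker_gwbridge address (172.18.0.1:9323) rather than 0.0.0.0 would also work and is narrower, at the cost of tying the config to that bridge existing.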
@cpswan - Thanks for your input. I tried running "swarmprom" in the secondary Docker network, but it failed. Although I can ping the IP from the container, I cannot connect to port 9323.
Errors -
19/Jan/2024:13:33:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:33:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
staging0001-04 ~ # docker exec -it fe83602d5257 sh
/www # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1): 56 data bytes
64 bytes from 172.18.0.1: seq=0 ttl=64 time=0.239 ms
64 bytes from 172.18.0.1: seq=1 ttl=64 time=0.111 ms
64 bytes from 172.18.0.1: seq=2 ttl=64 time=0.126 ms
64 bytes from 172.18.0.1: seq=3 ttl=64 time=0.131 ms
64 bytes from 172.18.0.1: seq=4 ttl=64 time=0.111 ms
^C
--- 172.18.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.111/0.143/0.239 ms
/www #
/www #
/www #
/www # telnet 172.18.0.1 9323
telnet: can't connect to remote host (172.18.0.1): Connection refused
/www #
/www #
/www # exit
staging0001-04 ~ #
staging0001-04 ~ #
staging0001-04 ~ # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1) 56(84) bytes of data.
64 bytes from 172.18.0.1: icmp_seq=1 ttl=64 time=0.192 ms
64 bytes from 172.18.0.1: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 172.18.0.1: icmp_seq=3 ttl=64 time=0.071 ms
^C
--- 172.18.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.051/0.104/0.192/0.062 ms
The IP it's trying to connect to is the gateway of the docker_gwbridge network:
docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.18.0.1 netmask 255.255.0.0 broadcast 172.18.255.255
inet6 fe80::42:82ff:fe2f:5115 prefixlen 64 scopeid 0x20<link>
ether 02:42:82:2f:51:15 txqueuelen 0 (Ethernet)
RX packets 1126064 bytes 308309940 (294.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1351476 bytes 134159319 (127.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
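For reference, a quick way to confirm on the host whether dockerd is actually listening on that port (illustrative commands using standard tooling, nothing swarmprom-specific):

ss -ltnp | grep 9323                                # anything listening on 9323?
curl -s http://172.18.0.1:9323/metrics | head -n 5  # does the daemon metrics endpoint answer?

If nothing is listening, the exporter can never connect, regardless of which Docker network it's attached to.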
cc: @athandle
Reduced SP and moved to next sprint
@kumarnarendra701 can you please try to get back into this and see if you can resolve the network issues.
@cpswan - I've tried using Swarmprom several times, but it looks like the repository was archived 4 years ago and there are very few blog posts about it. It seems like we might need to consider using other monitoring tools, but most tools are designed for Kubernetes with very few options for Docker swarm monitoring. If you know of any tools that can monitor a swarm cluster, please suggest them so that I can start implementing them. cc: @gkc
I've started running docker events to a log on each swarm, I'll take a look at the output tomorrow
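For reference, one way to capture the stream to a file is something like this (a sketch only; the actual invocation and log path may differ):

docker events --format '{{json .}}' >> /var/log/docker-events.log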
I did look at the output and all interesting events are being logged. There weren't any "too much memory being used" restarts when last I looked after a couple of days; I will look again at the weekend
docker events has indeed been reporting container die messages which include the exit code - i.e. docker events produces enough information to allow creation of a script which listens to and acts on the event stream as described in the original description of this PR.
I will create a script during this sprint and do some testing via my atServer to verify it
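For discussion, a minimal sketch of what such a listener could look like (illustrative only: the state-file path, the 10-minute window, and the alert action are assumptions, not the final design):

#!/usr/bin/env bash
# Sketch: watch for containers that die, check whether the kill was an OOM kill,
# and flag a likely restart loop if the same name was OOM-killed twice within
# WINDOW_SECS. Note: for swarm tasks you'd probably key on the
# com.docker.swarm.service.name label instead, since each restarted task gets a
# new container name.
WINDOW_SECS=600                 # the "N (e.g. 10) minutes" from the description
STATE_DIR=/var/run/oom-watch    # illustrative path for per-container timestamps
mkdir -p "$STATE_DIR"

docker events --filter 'type=container' --filter 'event=die' \
  --format '{{.ID}} {{.Actor.Attributes.name}}' |
while read -r cid name; do
  # OOMKilled is set by the daemon when the kernel killed the container for
  # exceeding its memory limit
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$cid" 2>/dev/null)
  [ "$oom" = "true" ] || continue

  now=$(date +%s)
  state_file="$STATE_DIR/$name"
  last=$(cat "$state_file" 2>/dev/null || echo 0)
  echo "$now" > "$state_file"

  if [ $((now - last)) -le "$WINDOW_SECS" ]; then
    echo "ALERT: $name OOM-killed twice within $WINDOW_SECS seconds" >&2
    # e.g. post to Slack / Alertmanager here
  fi
done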
Is your feature request related to a problem? Please describe.
We need a way to detect out-of-memory-related restart looping
Describe the solution you'd like
Have a tool which listens for events such as when the docker swarm manager has killed a container. Such a tool could then check if the container was killed due to container memory usage exceeding its cap, and could also check if it was previously killed within N (e.g. 10) minutes also due to container memory usage exceeding its cap
See https://docs.docker.com/engine/reference/commandline/events/
Describe alternatives you've considered
No response
Additional context
Linked to https://github.com/atsign-foundation/at_server/issues/1303