Observations:
1) Checking Docker service status

Primary manager machine

`docker service ls` gave the following error:

```
root@hlf1:~# docker service ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
root@hlf1:~# packet_write_wait: Connection to 157.230.111.128 port 22: Broken pipe
```
Secondary manager machine

`docker service ls` and `docker ps` gave the following error:

```
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
```
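Before trying to start `dockerd` by hand, the daemon state can be checked through systemd (a minimal sketch, assuming a systemd-managed host, which these machines appear to be given the `systemctl` commands used later):

```bash
# Check whether the Docker daemon is running and when it stopped
sudo systemctl status docker

# Inspect recent daemon logs for the cause of the stoppage
sudo journalctl -u docker --no-pager --since "1 hour ago"
```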
Tried starting the Docker daemon by running `dockerd` directly.
The daemon failed to start with the following error:

```
INFO[2021-07-05T10:48:40.800594620Z] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2021-07-05T10:48:40.800778252Z] manager selected by agent for new session: {tld00x72nnotw4cow1e3qs86l 157.230.111.128:2377}  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
INFO[2021-07-05T10:48:40.805403579Z] waiting 0s before registering session  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
ERRO[2021-07-05T10:48:40.914002957Z] agent: session failed  backoff=100ms error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online."  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
```
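The "swarm does not have a leader" error means fewer than half of the manager nodes were reachable (with only two managers, losing either one loses quorum). On a manager whose daemon is still up, the quorum can be inspected with standard Docker CLI commands, e.g.:

```bash
# List swarm nodes with their manager status (Leader / Reachable / Unreachable)
docker node ls

# Show this node's own view of the swarm, including known manager addresses
docker info --format '{{json .Swarm}}'
```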
2) Debugging the reason for the Docker daemon stoppage on the secondary manager

Running `dmesg` on the secondary manager showed the following output:

```
[868647.134539] Out of memory: Kill process 20759 (dockerd) score 280 or sacrifice child
[868647.137422] Killed process 20759 (dockerd) total-vm:2474268kB, anon-rss:1638528kB, file-rss:0kB, shmem-rss:0kB
```
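To confirm an OOM kill after the fact, the kernel log can be searched directly; a minimal sketch:

```bash
# Search the kernel ring buffer for OOM-killer activity
dmesg | grep -i "out of memory"

# Cross-check current memory headroom on the node
free -h
```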
Conclusion

The Docker daemon on the secondary manager was killed by the kernel OOM killer, which cost the two-manager swarm its quorum and left the swarm state corrupted on both managers.
Fix steps
Restart the Docker daemon on the secondary manager

Following the suggestion from https://github.com/docker/swarmkit/issues/1985#issuecomment-283191356, ran `rm -r /var/lib/docker/swarm` on the secondary manager machine, then restarted the Docker daemon with `sudo systemctl start docker`.
The Docker daemon now started successfully on the secondary manager.
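Since `/var/lib/docker/swarm` holds the node's certificates and raft state, a slightly safer variant of the same fix is to move the directory aside rather than delete it (a sketch, assuming enough free disk space):

```bash
# Stop the daemon, preserve the corrupted swarm state, then start fresh
sudo systemctl stop docker
sudo mv /var/lib/docker/swarm /var/lib/docker/swarm.bak
sudo systemctl start docker
```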
Purge and restart the swarm
In the 014.purge_swarm.yml playbook, commented out the last task ("rm -rf /root/hlft-store/*") to preserve the old data.
Ran the 014.purge_swarm.yml playbook; it completed successfully.
Ran the 014.spawn_swarm.yml playbook, but it failed at its last task, creating the overlay network on the primary manager.
Following the same suggestion from https://github.com/docker/swarmkit/issues/1985#issuecomment-283191356, ran `rm -r /var/lib/docker/swarm` on the primary manager machine and restarted its Docker daemon with `sudo systemctl restart docker`.
Ran the 014.spawn_swarm.yml playbook again and it completed successfully (see the verification sketch below).
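Once the playbook succeeds, the new swarm and its overlay network can be verified on the primary manager (a sketch using standard Docker CLI commands; the playbook's actual network name is not shown here):

```bash
# Confirm the swarm is active and this node is the leader
docker node ls

# List the overlay networks created on the swarm
docker network ls --filter driver=overlay
```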
Start all other services
Debugging the filesystem error:

This was caused by the GlusterFS volume on the primary manager not being online, as the output of `gluster volume status` showed (the brick on the primary manager was offline).
Following the suggestion from https://bobcares.com/blog/gluster-bring-brick-online/, ran `gluster volume start gfs0 force`, which brought the brick on the primary manager back online.
After this, all the playbooks ran successfully and all the Docker services started.
The old data was preserved on the GlusterFS volume but could not be reused.
Final Conclusion
The swarm setup had become corrupted on both the primary and secondary manager machines.
The Docker daemon on the secondary manager had been killed by the kernel OOM killer, as shown by the `dmesg` output in the observations above.
The out-of-memory condition could have been caused by the behaviour documented in https://github.com/bityoga/fabric_as_code/issues/45.
To avoid this issue in the future, it was suggested to use machines with more memory.
Hyperledger Network Crash - Agilia's Scenario
GlusterFS size: 13 GB
Ansible inventory:

```ini
[swarm_manager_prime]
hlf1 ansible_host=157.230.111.128 ansible_python_interpreter="/usr/bin/python3"

[swarm_managers]
hlf1 ansible_host=157.230.111.128 ansible_python_interpreter="/usr/bin/python3"
hlf2 ansible_host=157.230.104.94 ansible_python_interpreter="/usr/bin/python3"

[swarm_workers]
hlf3 ansible_host=157.230.104.190 ansible_python_interpreter="/usr/bin/python3"
hlf4 ansible_host=157.230.100.208 ansible_python_interpreter="/usr/bin/python3"
```