Observations:
1) Checking Docker service status

Primary manager machine

`docker service ls` gave the following error:

```
root@hlf1:~# docker service ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
root@hlf1:~# packet_write_wait: Connection to 157.230.111.128 port 22: Broken pipe
```
Secondary manager machine

`docker service ls` and `docker ps` gave the following error:

```
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
```
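Before trying to start `dockerd` by hand, the daemon state can be checked through systemd (a minimal sketch, assuming a systemd-managed host, which these machines appear to be given the `systemctl` commands used later):

```bash
# Check whether the Docker daemon is running and when it stopped
sudo systemctl status docker

# Inspect recent daemon logs for the cause of the stoppage
sudo journalctl -u docker --no-pager --since "1 hour ago"
```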
Tried starting the Docker daemon by running `dockerd` directly.
The daemon failed to start with the following error:

```
INFO[2021-07-05T10:48:40.800594620Z] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2021-07-05T10:48:40.800778252Z] manager selected by agent for new session: {tld00x72nnotw4cow1e3qs86l 157.230.111.128:2377}  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
INFO[2021-07-05T10:48:40.805403579Z] waiting 0s before registering session  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
ERRO[2021-07-05T10:48:40.914002957Z] agent: session failed  backoff=100ms error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online."  module=node/agent node.id=r2k5std2e3ul1ckm5lz84hym7
```
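The "swarm does not have a leader" error means fewer than half of the manager nodes were reachable (with only two managers, losing either one loses quorum). On a manager whose daemon is still up, the quorum can be inspected with standard Docker CLI commands, e.g.:

```bash
# List swarm nodes with their manager status (Leader / Reachable / Unreachable)
docker node ls

# Show this node's own view of the swarm, including known manager addresses
docker info --format '{{json .Swarm}}'
```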
2) Debugging the reason for the Docker daemon stoppage on the secondary manager

Running `dmesg` on the secondary manager showed the following output:

```
[868647.134539] Out of memory: Kill process 20759 (dockerd) score 280 or sacrifice child
[868647.137422] Killed process 20759 (dockerd) total-vm:2474268kB, anon-rss:1638528kB, file-rss:0kB, shmem-rss:0kB
```
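To confirm an OOM kill after the fact, the kernel log can be searched directly; a minimal sketch:

```bash
# Search the kernel ring buffer for OOM-killer activity
dmesg | grep -i "out of memory"

# Cross-check current memory headroom on the node
free -h
```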
Conclusion

The Docker daemon on the secondary manager was killed by the kernel OOM killer, which cost the two-manager swarm its quorum and left the swarm state corrupted on both managers.
Fix steps
Restart the Docker daemon on the secondary manager

Following the suggestion from https://github.com/docker/swarmkit/issues/1985#issuecomment-283191356, ran `rm -r /var/lib/docker/swarm` on the secondary manager machine, then restarted the Docker daemon with `sudo systemctl start docker`.
The Docker daemon now started successfully on the secondary manager.
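Since `/var/lib/docker/swarm` holds the node's certificates and raft state, a slightly safer variant of the same fix is to move the directory aside rather than delete it (a sketch, assuming enough free disk space):

```bash
# Stop the daemon, preserve the corrupted swarm state, then start fresh
sudo systemctl stop docker
sudo mv /var/lib/docker/swarm /var/lib/docker/swarm.bak
sudo systemctl start docker
```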
Purge and restart the swarm
In the 014.purge_swarm.yml playbook, commented out the last task ("rm -rf /root/hlft-store/*") to preserve the old data.
Ran the 014.purge_swarm.yml playbook; it completed successfully.
Ran the 014.spawn_swarm.yml playbook, but it failed at its last task, creating the overlay network on the primary manager.
Following the same suggestion from https://github.com/docker/swarmkit/issues/1985#issuecomment-283191356, ran `rm -r /var/lib/docker/swarm` on the primary manager machine and restarted its Docker daemon with `sudo systemctl restart docker`.
Ran the 014.spawn_swarm.yml playbook again and it completed successfully (see the verification sketch below).
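Once the playbook succeeds, the new swarm and its overlay network can be verified on the primary manager (a sketch using standard Docker CLI commands; the playbook's actual network name is not shown here):

```bash
# Confirm the swarm is active and this node is the leader
docker node ls

# List the overlay networks created on the swarm
docker network ls --filter driver=overlay
```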
Start all other services
Debugging the filesystem error:

This was caused by the GlusterFS volume on the primary manager not being online, as the output of `gluster volume status` showed (the brick on the primary manager was offline).
Following the suggestion from https://bobcares.com/blog/gluster-bring-brick-online/, ran `gluster volume start gfs0 force`, which brought the brick on the primary manager back online.
After this, all the playbooks ran successfully and all the Docker services started.
The old data was preserved on the GlusterFS volume but could not be reused.
Final Conclusion
The swarm setup had become corrupted on both the primary and secondary manager machines.
The Docker daemon on the secondary manager had been killed by the kernel OOM killer, as shown by the `dmesg` output in the observations above.
The out-of-memory condition could have been caused by the behaviour documented in https://github.com/bityoga/fabric_as_code/issues/45.
To avoid this issue in the future, it was suggested to use machines with more memory.
Hyperledger Network Crash - Agilia's Scenario
GlusterFS size: 13 GB
Ansible inventory:

```ini
[swarm_manager_prime]
hlf1 ansible_host=157.230.111.128 ansible_python_interpreter="/usr/bin/python3"

[swarm_managers]
hlf1 ansible_host=157.230.111.128 ansible_python_interpreter="/usr/bin/python3"
hlf2 ansible_host=157.230.104.94 ansible_python_interpreter="/usr/bin/python3"

[swarm_workers]
hlf3 ansible_host=157.230.104.190 ansible_python_interpreter="/usr/bin/python3"
hlf4 ansible_host=157.230.100.208 ansible_python_interpreter="/usr/bin/python3"
```