Open jpswinski opened 7 months ago
We need a way to check that the containers are running and healthy, both at startup and continually afterward, and to reset the node if they are not.
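One way such a check could look is sketched below. This is only a sketch, not how the project does it: the function names, the container names in the usage line, and the reset action are all hypothetical. It assumes `docker inspect` is available on the node and treats a container as bad if it is not running or if its healthcheck reports `unhealthy`.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog sketch; function names and the reset action are
# assumptions for illustration, not taken from the project.

# container_ok NAME
# Succeeds if the container is running and, when it defines a healthcheck,
# is not reporting "unhealthy" ("starting" is tolerated).
container_ok() {
    local name="$1" running health
    running="$(docker inspect --format '{{.State.Running}}' "$name" 2>/dev/null)" || return 1
    [ "$running" = "true" ] || return 1
    health="$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$name" 2>/dev/null)" || return 1
    [ "$health" != "unhealthy" ]
}

# all_containers_ok NAME...
# Succeeds only if every named container passes container_ok.
all_containers_ok() {
    local name
    for name in "$@"; do
        if ! container_ok "$name"; then
            echo "watchdog: container '$name' is not running or is unhealthy" >&2
            return 1
        fi
    done
}

# watch_containers INTERVAL_SECONDS RESET_CMD NAME...
# Polls until a check fails, then runs RESET_CMD once.
watch_containers() {
    local interval="$1" reset_cmd="$2"
    shift 2
    while all_containers_ok "$@"; do
        sleep "$interval"
    done
    "$reset_cmd"
}
```

A call might look like `watch_containers 30 reset_node app proxy`, where `reset_node` is a function we would have to write (for example, one that reboots or terminates the instance) and `app` and `proxy` are placeholder container names. Running this from something installed in the AMI rather than from user data would keep it independent of user data completing.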
Docker Compose should be doing this already, but we may need something that watches Docker Compose itself, or the Compose settings may be incorrect. Alternatively, the check may need to be baked into the AMI so that it does not depend on the EC2 user data running to completion.
Docker Compose isn't running, which is why none of the containers are started.
Maybe we need to move the startup logic into a bash script that retries failed steps and handles similar failure cases, and then have the user data just call that script.
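A minimal sketch of the retry part of such a script, assuming a generic helper is enough. The name `retry`, its argument order, and the attempt counts are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical retry helper; the name and defaults are illustrative only.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: '$*' failed after ${attempt} attempts" >&2
            return 1
        fi
        echo "retry: attempt ${attempt} of ${max_attempts} failed; sleeping ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

The user data would then reduce to calling the script, which in turn might run something like `retry 5 10 docker compose up -d` from the directory holding the Compose file. If every attempt fails, the non-zero return gives the script a clear point at which to reset the node instead of leaving it up but unregistered.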
When starting a cluster, we very rarely see an issue where a node comes up but does not register. When we SSH into the node, we see that the Docker containers never came up, and /var/log/cloud-init-output.log reports the following error: