SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Docker containers fail to start on EC2 instances #374

Open jpswinski opened 7 months ago

jpswinski commented 7 months ago

When starting a cluster, we very rarely see a node come up but fail to register. When we SSH into the node, we find that the Docker containers never came up, and /var/log/cloud-init-output.log reports the following error:

Error response from daemon: Get "https://742127912612.dkr.ecr.us-west-2.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-02-09 14:55:26,122 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2024-02-09 14:55:26,123 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 22.2.2 finished at Fri, 09 Feb 2024 14:55:26 +0000. Datasource DataSourceEc2.  Up 23.76 seconds

jpswinski commented 7 months ago

We need a way to check whether the containers are running and healthy (both at startup and continually afterward), and to reset the node if they are not.
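As a rough sketch of what such a check could look like, the helper below asks Docker for each container's state and health status. The function name `containers_ok` and the container names passed to it are hypothetical; it assumes the `docker` CLI is on the PATH and treats a container with no health check defined as healthy as long as it is running.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog helper (not part of the existing startup scripts).
# Returns 0 only if every named container is running and, where a health
# check is defined, reports "healthy".
containers_ok() {
    local name status
    for name in "$@"; do
        # Prints e.g. "running healthy", "running starting", "exited none".
        # "none" means the container defines no health check.
        status="$(docker inspect \
            --format '{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
            "$name" 2>/dev/null)" || return 1   # container does not exist
        case "$status" in
            "running healthy" | "running none") ;;
            *)
                echo "container ${name} not ok: ${status}" >&2
                return 1
                ;;
        esac
    done
    return 0
}
```

A watchdog (a cron job or systemd timer, for instance) could call `containers_ok` with the expected container names and trigger whatever reset action is chosen when it fails. Note that a container still in the `starting` health state counts as not ok here, so the caller would need to allow a grace period after boot.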

Docker Compose should be doing this, but maybe something is needed to watch Docker Compose itself, or maybe its settings aren't correct. Alternatively, the startup logic could be baked into the AMI so that it does not depend on the EC2 user data running to completion.

jpswinski commented 7 months ago

Docker Compose isn't running, which is why none of the containers are started.

Maybe we need to move the startup logic into a bash script that retries failed steps and has similar safeguards, so that the user data just calls this script.
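A minimal sketch of the retry idea, assuming the startup steps can each be expressed as a single command. The `retry` function, its argument order, and the doubling backoff are illustrative choices, not something taken from the existing user data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: '$*' failed after ${attempt} attempts" >&2
            return 1
        fi
        echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))   # double the wait before the next attempt
    done
}
```

The startup script could then wrap the steps that touch the network, for example `retry 5 5 docker compose up -d` (assuming the Compose v2 plugin; the attempt count and delay are placeholders). A transient registry timeout like the one in the log above would then cost a few retries instead of leaving the node without containers.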