Open JohnTigue opened 1 year ago
EC2 has built-in health checks that can detect an unhealthy machine in some cases. Seemingly they can detect when A1111 goes off into the weeds, at least sometimes, as shown here:
Since there's no way that machine could be brought back to health from within JupyterLab, this argues for simply working out a health check that ECS can use to kill off unhealthy machines. No amount of tuning will fix a machine in that state.
OK, given the "hang the VM" crash that Alex was able to recreate this week, it is clear that the built-in EC2 health checks are not sufficient to decide when a cluster renderer is sick (as shown in the screenshot below, taken while TakeFour was hung but not crashed; the EC2 health checks did not detect that). So, let's build a web test client that hits the API and requests an image. As long as ANY image comes back within a time limit, consider the machine healthy.
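A minimal sketch of such a test client, in Python with only the standard library. It assumes the A1111 web UI was started with `--api` so that `POST /sdapi/v1/txt2img` is available on port 7860 and returns JSON with an `images` list. The names `is_healthy`, `response_has_image`, and `main`, the prompt, the tiny image size, and the 60 second limit are all illustrative choices, not anything decided in this thread.

```python
import http.client
import json
import urllib.error
import urllib.request

# Assumption: A1111 runs with --api and listens on its default port.
DEFAULT_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"

# Talk to the renderer directly, ignoring any proxy environment variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def response_has_image(body: bytes) -> bool:
    """Return True if the JSON body carries at least one non-empty image."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    images = payload.get("images")
    return isinstance(images, list) and any(images)


def is_healthy(url: str = DEFAULT_URL, timeout_s: float = 60.0) -> bool:
    """Request one cheap image; healthy if ANY image comes back in time.

    Note: the timeout applies to each socket operation, so it is an
    approximate limit rather than a hard wall-clock deadline.
    """
    # Deliberately cheap job (assumed values): small canvas, one sampling step.
    job = {"prompt": "health check", "steps": 1, "width": 64, "height": 64}
    request = urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with _OPENER.open(request, timeout=timeout_s) as response:
            return response.status == 200 and response_has_image(response.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connection, HTTP error status, or timeout: all unhealthy.
        return False


def main() -> int:
    """Exit-code style result: 0 when healthy, 1 otherwise."""
    return 0 if is_healthy() else 1
```

Calling `sys.exit(main())` from a small script would give a command that exits 0 or 1, which is the shape a container-level health check command expects.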
ECS built-in healthcheck via Amazon ECS container agent: Container instance health
D'oh! Via SO, ECS Health check failures AWS - copilot:
I feel pretty silly about this but pretty sure I found the solution. While I configured the port: 3000 correctly on the image in the manifest.yml, I needed an additional environment variable called PORT: 3000 in the variables for the manifest. This seemed to do the trick... like I said silly mistake!
Maybe a better way of doing the health check is via the CLI. So far I've been trying via the web UI, but that is complicated. ECS seems to have CLI tools that are easier to use. The errors I've been seeing usually fall into one of two cases: either the web UI continues to work but the back end doesn't respond, or the website is unreachable. So a "CLI only" test might just catch those. (It would miss the web UI being down. Perhaps a separate custom pinger, not based on AWS tooling, is also a good idea as extra sys-admin instrumentation…)
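A rough sketch of what such a custom pinger could look like, in Python with only the standard library. It probes the front page and one API route separately so the two failure modes above can be told apart. The function names and status strings are made up for illustration, and `/sdapi/v1/progress` is only an assumed stand-in for "a cheap API call"; whether it actually stalls when the back end hangs would need to be confirmed against a hung machine.

```python
import http.client
import urllib.error
import urllib.request

# Talk to the target directly, ignoring any proxy environment variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if a plain GET on url answers with a 2xx status."""
    try:
        with _OPENER.open(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def classify(ui_ok: bool, api_ok: bool) -> str:
    """Map the two probe results onto the failure modes seen so far."""
    if ui_ok and api_ok:
        return "healthy"
    if ui_ok:
        return "backend-unresponsive"  # page loads, API does not answer
    if api_ok:
        return "ui-down"  # API answers, front page does not
    return "site-unreachable"  # nothing answers at all


def ping(base_url: str, api_path: str = "/sdapi/v1/progress") -> str:
    """Probe the front page and one API route, then classify the result.

    api_path is an assumption; swap in whichever route best reflects
    back-end liveness.
    """
    base = base_url.rstrip("/")
    return classify(probe(base + "/"), probe(base + api_path))
```

Run from cron or any scheduler outside AWS, something like `ping("http://host:7860")` (hypothetical host) would return one of the four status strings, which could then be logged or alerted on.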
Ignoring the title, this still contains useful info: HealthCheck on ECS task without an ELB
Hmm… maybe the copilot primitive can be used to construct a simple healthcheck to the API: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/
Then it's just a matter of figuring out how to express a simple prompt on the request URL. E.g.:
http:
  healthcheck:
    path: '/'
    port: 8080
    success_codes: '200'
    healthy_threshold: 3
    unhealthy_threshold: 2
    interval: 15s
    timeout: 10s
    grace_period: 60s
Seems HEALTHCHECK is a Docker thing, and ECS/copilot is simply built atop that. Good for portability, if that happens.
Jupyter example:
# HEALTHCHECK documentation: https://docs.docker.com/engine/reference/builder/#healthcheck
# This healthcheck works well for `lab`, `notebook`, `nbclassic`, `server` and `retro` jupyter commands
# https://github.com/jupyter/docker-stacks/issues/915#issuecomment-1068528799
HEALTHCHECK --interval=5s --timeout=3s --start-period=5s --retries=3 \
    CMD /etc/jupyter/docker_healthcheck.py || exit 1
It would be great to figure out a health check for the SD webui containers so ECS could kill them if unresponsive. Perhaps an HTTP request for a generated image. As long as ANY image comes back it's healthy?