MountaintopLotus / braintrust

A Dockerized platform for running Stable Diffusion, on AWS (for now)
Apache License 2.0

Docker healthcheck #76

Open JohnTigue opened 1 year ago

JohnTigue commented 1 year ago

It would be great to figure out a health check for the SD webui containers so ECS could kill them if unresponsive. Perhaps an HTTP request for a generated image. As long as ANY image comes back it's healthy?

JohnTigue commented 1 year ago

EC2 has a built-in health check which can detect that a machine is unhealthy in some cases. Seemingly it can detect when A1111 goes off into the weeds, at least sometimes, as shown here:

[Screenshot: Screen Shot 2023-02-14 at 9 44 30 AM]

JohnTigue commented 1 year ago

Since there's no way that machine could be brought back to health from within JupyterLab, this argues for simply working out a health check that ECS can use to kill off unhealthy machines. No amount of tuning will fix a machine in that state.

JohnTigue commented 1 year ago

OK, what with the "hang the VM" crash that Alex was able to recreate this week, it is clear that the built-in health checks of EC2 are not sufficient to decide when a cluster renderer is sick. (The screenshot below was taken while TakeFour was hung but not crashed, and the EC2 health checks couldn't detect that.) So, let's build a web test client that hits the API and requests an image. As long as ANY image comes back within a time limit, consider the machine healthy.

[Screenshot: Screen_Shot_2023-02-26_at_2 50 56_PM]

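A minimal sketch of such a test client, using only the Python standard library. It assumes the webui was started with its API enabled and that the txt2img endpoint lives at /sdapi/v1/txt2img on port 7860; the URL, the tiny job parameters, and the function names are illustrative assumptions, not something settled in this thread.

```python
import base64
import http.client
import json
import urllib.error
import urllib.request

# Assumption: the webui API is enabled and reachable at this address.
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def response_has_image(body: bytes) -> bool:
    """Return True if the JSON body carries at least one non-empty base64 image."""
    try:
        images = json.loads(body)["images"]
        return len(base64.b64decode(images[0])) > 0
    except (ValueError, KeyError, IndexError, TypeError):
        return False


def renderer_is_healthy(url: str = API_URL, timeout_s: float = 60.0) -> bool:
    """POST a tiny txt2img job; healthy means ANY image comes back in time.

    Note: timeout_s bounds each blocking socket operation, not the whole request.
    """
    # Assumption: a very small, one-step job is enough to prove the back end is alive.
    job = {"prompt": "a lotus", "steps": 1, "width": 64, "height": 64}
    request = urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status == 200 and response_has_image(response.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, and HTTP errors all count as unhealthy.
        return False
```

The image content is deliberately not inspected beyond being non-empty, which matches the "ANY image" criterion above.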
JohnTigue commented 1 year ago

ECS built-in healthcheck via Amazon ECS container agent: Container instance health

JohnTigue commented 1 year ago

D'oh! Via SO, ECS Health check failures AWS - copilot:

I feel pretty silly about this but pretty sure I found the solution. While I configured the port: 3000 correctly on the image in the manifest.yml, I needed an additional environment variable called PORT: 3000 in the variables for the manifest. This seemed to do the trick... like I said silly mistake!

JohnTigue commented 1 year ago

Pass application load balancer health checks in Amazon ECS

JohnTigue commented 1 year ago

Maybe a better way of doing the health check is via the CLI. So far I've been trying via the web UI, but that is complicated. ECS seems to have CLI tools that are easier to use. In the errors I've been seeing, usually either the web UI continues to work but the back end doesn't respond, or the website is unreachable. So, a "CLI only" test might just catch those. (It would miss the web UI being down. Perhaps simply having a separate custom pinger, not based on AWS tooling, as extra sys-admin instrumentation is also a good idea…)

JohnTigue commented 1 year ago

Ignoring the title, this still contains useful info: HealthCheck on ECS task without an ELB

JohnTigue commented 1 year ago

Hmm… maybe the copilot primitive can be used to construct a simple healthcheck to the API: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/

Simply figure out how to express a simple prompt on the request URL, e.g. in the path of a healthcheck block like this:

http:
  healthcheck:
    path: '/'
    port: 8080
    success_codes: '200'
    healthy_threshold: 3
    unhealthy_threshold: 2
    interval: 15s
    timeout: 10s
    grace_period: 60s
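
One wrinkle: load balancer health checks issue a plain GET against a path and look at the status code, while an image request to the API is normally a POST with a JSON body. A possible workaround, sketched here as an assumption rather than anything copilot provides, is a tiny shim that exposes GET /healthz and translates the result of a probe into 200 or 503. The names make_health_handler, serve_health, and the /healthz path are hypothetical; probe is any callable returning True when the renderer answered.

```python
import http.server
import threading
from typing import Callable


def make_health_handler(probe: Callable[[], bool]):
    """Build a request handler class that answers GET /healthz from probe()."""

    class HealthHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            healthy = probe()
            body = b"ok\n" if healthy else b"unhealthy\n"
            # 200 matches success_codes: '200'; anything else counts as a failure.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the periodic health-check polling out of the logs

    return HealthHandler


def serve_health(probe: Callable[[], bool], port: int = 8080):
    """Start the shim on a background thread and return the server object."""
    server = http.server.ThreadingHTTPServer(
        ("0.0.0.0", port), make_health_handler(probe)
    )
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With something like this running next to the webui, the manifest's path would point at /healthz instead of '/'. Since the check fires every interval, the probe should be cheap or its result cached, otherwise every health check costs a render.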
JohnTigue commented 1 year ago

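Seems HEALTHCHECK is a Docker thing, and the ECS/copilot container health check is built on the same mechanism (though ECS only acts on health checks declared in the task definition, not ones baked into the image). Good for portability, if that happens. Jupyter example: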

# HEALTHCHECK documentation: https://docs.docker.com/engine/reference/builder/#healthcheck
# This healthcheck works well for `lab`, `notebook`, `nbclassic`, `server` and `retro` jupyter commands
# https://github.com/jupyter/docker-stacks/issues/915#issuecomment-1068528799
HEALTHCHECK --interval=5s --timeout=3s --start-period=5s --retries=3 \
    CMD /etc/jupyter/docker_healthcheck.py || exit 1
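
For the SD webui containers, the equivalent of that docker_healthcheck.py could look roughly like the sketch below. Docker treats exit status 0 from the health command as healthy and 1 as unhealthy, so the only job of the script is to turn "did the server answer in time?" into that exit code. The host, port 7860, path, and function name are assumptions for illustration; this is not the Jupyter script itself.

```python
import http.client


def healthcheck_exit_code(
    host: str = "127.0.0.1",
    port: int = 7860,  # assumption: the webui port inside the container
    path: str = "/",
    timeout_s: float = 3.0,
) -> int:
    """Return 0 (healthy) if GET path answers HTTP 200 in time, else 1 (unhealthy)."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        connection.request("GET", path)
        return 0 if connection.getresponse().status == 200 else 1
    except (http.client.HTTPException, OSError):
        # Refused connections and timeouts both mean the container is unhealthy.
        return 1
    finally:
        connection.close()


# As a standalone script, the last line would be:
#     sys.exit(healthcheck_exit_code())
```

A GET on '/' only proves the web server is up. To catch the "UI works but back end is hung" case described earlier, path would need to point at something that exercises the renderer.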