f0cker / crackq

CrackQ: A Python Hashcat cracking queue system
MIT License

Add a healthcheck to the docker-compose file for crackq to detect when GPU devices have disappeared #46

Open hkelley opened 1 month ago

hkelley commented 1 month ago

In an attempt to address the NVIDIA GPU flakiness (the crackq container sometimes loses access to its devices; see https://github.com/NVIDIA/nvidia-container-toolkit/issues/48), I'm experimenting with:

1) Adding a healthcheck to the crackq service in docker-compose to detect when the GPUs go missing

    crackq:
        build:
            context: ./build
            dockerfile: Dockerfile
        image: "nvidia-ubuntu"
        ports:
            - "127.0.0.1:8080:8080"
        depends_on:
            - redis
        healthcheck:
            test: hashcat -I | grep 'Backend Device'
            interval: 5m
            retries: 1
            start_period: 60s
            timeout: 30s
        networks:
            - crackq_net
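
For anyone who prefers a script over the shell pipeline, the same check can be sketched in Python. This is only an illustration and not part of CrackQ: `has_backend_device` and `probe` are hypothetical names, and the only things taken from the compose snippet above are the `hashcat -I` command and the `Backend Device` marker that `grep` looks for.

```python
import subprocess

# Same marker string the compose healthcheck greps for.
MARKER = "Backend Device"


def has_backend_device(hashcat_info: str) -> bool:
    """Return True if any line of `hashcat -I` output mentions a backend device."""
    return any(MARKER in line for line in hashcat_info.splitlines())


def probe(cmd=("hashcat", "-I"), timeout=30) -> int:
    """Run the command and return a healthcheck-style exit code.

    0 means at least one device was listed; 1 means none was found,
    the command could not be started, or it timed out.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return 1
    return 0 if has_backend_device(result.stdout) else 1
```

A small wrapper that ends with `sys.exit(probe())` could then serve as the `test:` command, assuming Python is available inside the crackq image.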

2) Once I'm confident the healthcheck is reliable, adding a https://hub.docker.com/r/willfarrell/autoheal/ service to the docker-compose file. This should be able to restart the crackq container when the healthcheck marks it unhealthy (see https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck).
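
To make the restart-on-unhealthy idea in step 2 concrete, here is a minimal Python sketch. It is not autoheal's implementation, just one way the same behaviour could be approximated with the Docker CLI. The function names and the `watched` parameter are hypothetical, and it assumes the container is named `crackq` and that the caller is allowed to run `docker`.

```python
import subprocess


def unhealthy_containers(run=subprocess.run):
    """List names of running containers whose health status is 'unhealthy'."""
    result = run(
        ["docker", "ps", "--filter", "health=unhealthy",
         "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def restart_unhealthy(watched=("crackq",), run=subprocess.run):
    """Restart watched containers that are unhealthy; return the names restarted."""
    restarted = []
    for name in unhealthy_containers(run):
        if name in watched:
            run(["docker", "restart", name], check=True)
            restarted.append(name)
    return restarted
```

The `run` parameter exists only so the logic can be exercised without a Docker daemon. Calling `restart_unhealthy()` from cron or a loop would give a rough stand-in while evaluating autoheal.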

I will update this issue as I make progress.

hkelley commented 3 weeks ago

Step one complete. Docker healthcheck correctly marked the container as unhealthy:

    #  sudo docker inspect --format='{{json .State.Health}}' crackq
    {
        "Status": "unhealthy",
        "FailingStreak": 38,
        "Log": [
            {
                "Start": "2024-08-23T14:03:59.238735966Z",
                "End": "2024-08-23T14:03:59.339752004Z",
                "ExitCode": 1,
                "Output": "\u001B[31mcuInit(): no CUDA-capable device is detected\u001B[0m\n\n\u001B[31mclGetPlatformIDs(): CL_PLATFORM_NOT_FOUND_KHR\u001B[0m\n\n\u001B[31mATTENTION! No OpenCL-compatible or CUDA-compatible platform found.\u001B[0m\n\n"
            },
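
If this output needs to be consumed by a script rather than read by eye, a small Python helper along these lines might do. It is a sketch under assumptions: `summarize_health` is a hypothetical name, the input is the JSON printed by the `docker inspect --format='{{json .State.Health}}'` command above, and only the fields visible in that output (`Status`, `FailingStreak`, `Log`, `ExitCode`, `Output`) are read.

```python
import json
import re

# ANSI colour codes such as the \u001B[31m ... \u001B[0m seen in "Output".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def summarize_health(raw: str) -> dict:
    """Reduce a `.State.Health` JSON document to its most useful fields."""
    # `.State.Health` is null when the container has no healthcheck.
    health = json.loads(raw) or {}
    log = health.get("Log") or []
    last = log[-1] if log else {}
    output = ANSI_SGR.sub("", last.get("Output", ""))
    return {
        "status": health.get("Status"),
        "failing_streak": health.get("FailingStreak", 0),
        "last_exit_code": last.get("ExitCode"),
        # Non-empty lines of the latest probe output, colours stripped.
        "last_output": [line for line in output.splitlines() if line.strip()],
    }
```

Feeding it the captured stdout of the inspect command would yield the status, the failing streak, and the most recent probe's messages in plain text.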

We had previously been detecting this via log monitoring for the following:

    SystemError: <method 'hashcat_session_execute' of 'pyhashcat.hashcat' objects> returned a result with an error set
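
For comparison, the log-monitoring approach can be sketched in a few lines of Python. This is an illustration only, not the monitoring we actually ran: `find_gpu_loss_errors` is a hypothetical name, and matching on `SystemError` plus `hashcat_session_execute` is an assumption that could also catch failures unrelated to missing GPUs.

```python
# Substring taken from the SystemError message quoted above.
SIGNATURE = "hashcat_session_execute"


def find_gpu_loss_errors(lines):
    """Yield (line_number, line) for log lines that match the error above."""
    for number, line in enumerate(lines, start=1):
        if "SystemError" in line and SIGNATURE in line:
            yield number, line.rstrip("\n")
```

It accepts any iterable of lines, so an open log file can be passed directly.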

On to step two, autoheal.