Add a healthcheck to the dockercompose for crackq to detect when GPU devices have disappeared

f0cker / crackq

CrackQ: A Python Hashcat cracking queue system

MIT License

923 stars 100 forks

In an attempt to address the NVIDIA GPU flakiness (the crackq container sometimes loses the devices; see https://github.com/NVIDIA/nvidia-container-toolkit/issues/48), I'm experimenting with:

1) Adding a healthcheck to the crackq service in docker-compose to detect when the GPUs go missing

    crackq:
        build:
            context: ./build
            dockerfile: Dockerfile
        image: "nvidia-ubuntu"
        ports:
            - "127.0.0.1:8080:8080"
        depends_on:
            - redis
        healthcheck:
            test: hashcat -I | grep 'Backend Device'
            interval: 5m
            retries: 1
            start_period: 60s
            timeout: 30s
        networks:
            - crackq_net

2) Once I'm confident the healthcheck is reliable, adding a service for https://hub.docker.com/r/willfarrell/autoheal/ to the docker-compose file. Autoheal should be able to restart the crackq container when it is reported unhealthy; see https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
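
A rough sketch of what step 2 might look like, based on the autoheal README. It is untested against crackq. The `autoheal=true` label opt-in and the 60-second poll interval are assumptions chosen for illustration, and the existing crackq settings are elided:

```yaml
    crackq:
        # ...existing crackq settings and healthcheck from step 1...
        labels:
            - "autoheal=true"    # assumption: opt this container in to autoheal restarts

    autoheal:
        image: willfarrell/autoheal
        restart: always
        environment:
            # Only watch containers carrying the label autoheal=true
            - AUTOHEAL_CONTAINER_LABEL=autoheal
            # Seconds between health polls (assumed value, tune as needed)
            - AUTOHEAL_INTERVAL=60
        volumes:
            # autoheal talks to the Docker daemon through its socket
            - /var/run/docker.sock:/var/run/docker.sock
```

Mounting the Docker socket gives the autoheal container control over the Docker daemon, so this is a trade-off to weigh. Whether a plain container restart is enough to bring the GPUs back is something the healthcheck results would still need to confirm.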

I will update this issue as I make progress.

    # sudo docker inspect --format='{{json .State.Health}}' crackq
    {
      "Status": "unhealthy",
      "FailingStreak": 38,
      "Log": [
        {
          "Start": "2024-08-23T14:03:59.238735966Z",
          "End": "2024-08-23T14:03:59.339752004Z",
          "ExitCode": 1,
          "Output": "\u001B[31mcuInit(): no CUDA-capable device is detected\u001B[0m\n\n\u001B[31mclGetPlatformIDs(): CL_PLATFORM_NOT_FOUND_KHR\u001B[0m\n\n\u001B[31mATTENTION! No OpenCL-compatible or CUDA-compatible platform found.\u001B[0m\n\n"
        },

Add a healthcheck to the dockercompose for crackq to detect when GPU devices have disappeared #46