Open hkelley opened 1 month ago
Step one complete. Docker healthcheck correctly marked the container as unhealthy:
# sudo docker inspect --format='{{json .State.Health}}' crackq
{
"Status": "unhealthy",
"FailingStreak": 38,
"Log": [
{
"Start": "2024-08-23T14:03:59.238735966Z",
"End": "2024-08-23T14:03:59.339752004Z",
"ExitCode": 1,
"Output": "\u001B[31mcuInit(): no CUDA-capable device is detected\u001B[0m\n\n\u001B[31mclGetPlatformIDs(): CL_PLATFORM_NOT_FOUND_KHR\u001B[0m\n\n\u001B[31mATTENTION! No OpenCL-compatible or CUDA-compatible platform found.\u001B[0m\n\n"
},
We had previously been detecting this via log monitoring for the following:
SystemError: <method 'hashcat_session_execute' of 'pyhashcat.hashcat' objects> returned a result with an error set
On to step two, autoheal.
In an attempt to address the NVIDIA GPU flakiness (the crackq container sometimes loses its devices, see https://github.com/NVIDIA/nvidia-container-toolkit/issues/48), I'm experimenting with:
1) Adding a healthcheck to the crackq service in docker-compose to detect when the GPUs go missing
2) Once I'm confident the healthcheck is reliable, adding a service for https://hub.docker.com/r/willfarrell/autoheal/ to the docker-compose file. This should be able to restart the crackq container. See https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
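For anyone following along, the two steps could look roughly like the docker-compose fragment below. This is only a sketch, not the exact config in use here: the healthcheck command (`hashcat -I`, assumed to be on the PATH inside the crackq container and assumed to exit non-zero when no device is found), the interval/timeout/retry values, and the `autoheal` service name are all assumptions. The `autoheal=true` label, the `AUTOHEAL_CONTAINER_LABEL` variable, and the Docker socket mount follow the willfarrell/autoheal README.

```yaml
services:
  crackq:
    # ... existing crackq service definition (image, GPU settings, volumes) unchanged ...
    labels:
      # Opt this container in to autoheal restarts.
      - autoheal=true
    healthcheck:
      # Assumed check: list backend devices and fail if none are visible.
      test: ["CMD-SHELL", "hashcat -I || exit 1"]
      interval: 60s      # illustrative values, tune to taste
      timeout: 30s
      retries: 3
      start_period: 60s

  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      # Only watch containers carrying the label autoheal=true.
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      # autoheal needs the Docker socket to inspect and restart containers.
      - /var/run/docker.sock:/var/run/docker.sock
```

Scoping autoheal to labeled containers (rather than `AUTOHEAL_CONTAINER_LABEL=all`) keeps it from restarting other services on the host if their healthchecks fail for unrelated reasons.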
I will update this issue as I make progress.