BUTSpeechFIT / safe_gpu

Avoids race condition when acquiring GPUs in exclusive mode
MIT License

not working correctly when only one gpu is available #3

Open f-fuchs opened 2 weeks ago

f-fuchs commented 2 weeks ago

Hey,

When I try to use 3 GPUs but only 2 are available, the library behaves as expected and the third process crashes:

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 3 --nb-gpus 1
[1] 351
[2] 352
GPUOwner3 2024-08-26 10:36:12,762 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,762 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:12,767 [INFO] acquiring lock
GPUOwner2 2024-08-26 10:36:12,803 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,805 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner3 2024-08-26 10:36:22,043 [INFO] lock released
GPUOwner3 2024-08-26 10:36:22,043 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:36:22,044 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:22,081 [INFO] Set CUDA_VISIBLE_DEVICES=1
GPUOwner3 2024-08-26 10:36:31,044 [INFO] Finished
GPUOwner1 2024-08-26 10:36:31,214 [INFO] lock released
GPUOwner2 2024-08-26 10:36:31,214 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:31,214 [INFO] Allocated devices: [1]
GPUOwner2 2024-08-26 10:36:31,252 [INFO] lock released
Traceback (most recent call last):
  File "/home/fuchsfa/foundation-models/gpu-acquisitor.py", line 77, in <module>
    safe_gpu.claim_gpus(
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 153, in claim_gpus
    gpu_owner = GPUOwner(nb_gpus, placeholder_fn, logger, debug_sleep)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 132, in __init__
    raise RuntimeError(f"Required {nb_gpus} GPUs, only found these free: {free_gpus}. Somebody didn't properly declare their resources?")
RuntimeError: Required 1 GPUs, only found these free: []. Somebody didn't properly declare their resources?
[2]+  Exit 1                  python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1

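But when I try to use 2 GPUs and only 1 is available, both processes get the one available GPU. The log below shows no lock being acquired in this case, only the message "Running on a machine with single GPU used for actual display". Can I prevent this?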

python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1
[3] 6117
GPUOwner2 2024-08-26 10:35:35,666 [INFO] Running on a machine with single GPU used for actual display
GPUOwner2 2024-08-26 10:35:35,667 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Running on a machine with single GPU used for actual display
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:44,742 [INFO] Allocated devices: [0]
GPUOwner2 2024-08-26 10:35:44,850 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:35:53,743 [INFO] Finished
GPUOwner2 2024-08-26 10:35:53,850 [INFO] Finished
[2]   Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
[3]+  Done                    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1

ibenes commented 2 weeks ago

Hello @f-fuchs, thanks for bringing this up! I'm super busy with other stuff right now (a thesis deadline, in fact ;-) ). I will look into this in September, hopefully as early as next week. Please poke me if I don't.
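
In the meantime, one possible stopgap for the single-GPU case is to guard the GPU with an advisory file lock held for the lifetime of the process, so that a second process fails instead of silently sharing device 0. The sketch below is not part of safe_gpu and does not reflect how the library works internally. `GpuSlotLock`, the lock-file name and the lock directory are hypothetical, and it assumes a POSIX system where all competing processes see the same lock directory.

```python
import fcntl
import os
import tempfile


class GpuSlotLock:
    """Exclusive advisory lock for one GPU index.

    Hypothetical helper for illustration only; not part of safe_gpu.
    """

    def __init__(self, gpu_index, lock_dir=None):
        # Assumption: every competing process uses the same lock directory.
        lock_dir = lock_dir or tempfile.gettempdir()
        self.path = os.path.join(lock_dir, f"gpu-slot-{gpu_index}.lock")
        self._fd = None

    def acquire(self):
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o666)
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            raise RuntimeError(f"GPU slot lock {self.path} is already held")
        self._fd = fd

    def release(self):
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None
```

A script would call `GpuSlotLock(0).acquire()` before claiming the GPU and keep the lock object alive until it finishes. The kernel drops the lock when the process exits, so a crashed job does not block the GPU afterwards. Note that this only protects against processes that use the same lock; it does not stop unrelated programs from using the device.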