f-fuchs opened 2 weeks ago
Hey,
when I try to use 3 GPUs but only 2 are available, the library behaves as expected and the third process crashes.
```
$ python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 &
  python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1 &
  python3 gpu-acquisitor.py --backend pytorch --id 3 --nb-gpus 1
[1] 351
[2] 352
GPUOwner3 2024-08-26 10:36:12,762 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,762 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:12,767 [INFO] acquiring lock
GPUOwner2 2024-08-26 10:36:12,803 [INFO] acquiring lock
GPUOwner3 2024-08-26 10:36:12,805 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner3 2024-08-26 10:36:22,043 [INFO] lock released
GPUOwner3 2024-08-26 10:36:22,043 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:36:22,044 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:22,081 [INFO] Set CUDA_VISIBLE_DEVICES=1
GPUOwner3 2024-08-26 10:36:31,044 [INFO] Finished
GPUOwner1 2024-08-26 10:36:31,214 [INFO] lock released
GPUOwner2 2024-08-26 10:36:31,214 [INFO] lock acquired
GPUOwner1 2024-08-26 10:36:31,214 [INFO] Allocated devices: [1]
GPUOwner2 2024-08-26 10:36:31,252 [INFO] lock released
Traceback (most recent call last):
  File "/home/fuchsfa/foundation-models/gpu-acquisitor.py", line 77, in <module>
    safe_gpu.claim_gpus(
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 153, in claim_gpus
    gpu_owner = GPUOwner(nb_gpus, placeholder_fn, logger, debug_sleep)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/foundation-models/.venv/lib/python3.12/site-packages/safe_gpu/safe_gpu.py", line 132, in __init__
    raise RuntimeError(f"Required {nb_gpus} GPUs, only found these free: {free_gpus}. Somebody didn't properly declare their resources?")
RuntimeError: Required 1 GPUs, only found these free: []. Somebody didn't properly declare their resources?
[2]+  Exit 1    python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1
```
But when I try to use 2 GPUs while only 1 is available, both processes get assigned the one available GPU. Can I prevent this?
```
$ python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 &
  python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 1
[3] 6117
GPUOwner2 2024-08-26 10:35:35,666 [INFO] Running on a machine with single GPU used for actual display
GPUOwner2 2024-08-26 10:35:35,667 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Running on a machine with single GPU used for actual display
GPUOwner1 2024-08-26 10:35:35,670 [INFO] Set CUDA_VISIBLE_DEVICES=0
GPUOwner1 2024-08-26 10:35:44,742 [INFO] Allocated devices: [0]
GPUOwner2 2024-08-26 10:35:44,850 [INFO] Allocated devices: [0]
GPUOwner1 2024-08-26 10:35:53,743 [INFO] Finished
GPUOwner2 2024-08-26 10:35:53,850 [INFO] Finished
[2]   Done    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
[3]+  Done    python gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1
```
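In case it helps, here is a fail-fast workaround sketch I could run before `safe_gpu.claim_gpus`: it checks how many GPUs look free via `nvidia-smi` and raises if there are fewer than requested. The helper names (`parse_free_gpus`, `require_free_gpus`) and the memory threshold are my own invention, not part of safe_gpu, and it assumes `nvidia-smi` is on `PATH`; it is also racy (two processes can pass the check simultaneously), so it only narrows the window, it does not close it.

```python
import subprocess


def parse_free_gpus(csv_text, max_used_mib=100):
    """Parse the output of
    `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`
    and return indices of GPUs whose used memory is at or below the
    threshold, i.e. GPUs that look free."""
    free = []
    for line in csv_text.strip().splitlines():
        idx, used = (field.strip() for field in line.split(","))
        if int(used) <= max_used_mib:
            free.append(int(idx))
    return free


def require_free_gpus(nb_gpus):
    """Raise RuntimeError before claiming if fewer than nb_gpus look free."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    free = parse_free_gpus(out)
    if len(free) < nb_gpus:
        raise RuntimeError(
            f"Required {nb_gpus} GPUs, only found these free: {free}")
    return free
```

Calling `require_free_gpus(1)` right before `safe_gpu.claim_gpus(...)` would have made the second process in the log above exit instead of doubling up on GPU 0, as long as the first process had already allocated memory on it.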
Hello @f-fuchs, thanks for bringing this up! I'm super busy with other stuff right now (thesis deadline, in fact ;-) ); I will look into this in September, hopefully next week already. Please poke me if I don't.