elgalu opened this issue 2 years ago
@elgalu thank you for the FR!
It looks like a usable feature, however I'd like to get some clarification here, if that's ok.
Can you please explain why having 70% of each of 8 GPUs is better for a training job than, for example, having 5.6 full GPUs?
For example, these are equal in terms of metagpu allocation units:
resources:
  limits:
    cnvrg.io/metagpu: 70
    cnvrg.io/multigpu: 8
and
resources:
  limits:
    cnvrg.io/metagpu: 560 # each GPU equals 100 metagpus, so 70 * 8 = 560 => 5.6 GPUs
So why would the training job prefer to have 70% of each of the 8 GPUs, which is 560 units in total, rather than the same 560 units spread over 5 full GPUs plus 60% of a sixth?
Hi, thanks! Because the multi-GPU training job doesn't need 100% of each GPU's memory. It can parallelize well without occupying, for example, the entire 80 GB of an A100 on each device. It can't, however, occupy 100% of 80 GB on only 5 devices; I think the reason is how multi-GPU training works in the underlying frameworks like PyTorch.
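To illustrate, here is a minimal sketch under the assumption that the job uses PyTorch DistributedDataParallel (the model, sizes, and port below are made up, not from the plugin): data-parallel training pins one process to one GPU and keeps a full model replica plus activations on each of them, so the job needs some memory on every device it spans, but usually far less than 100% of each.

```python
# Sketch only: PyTorch DistributedDataParallel pins one process to one GPU,
# so an 8-GPU job needs *some* memory on each of the 8 devices rather than
# 100% of a smaller number of devices.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each process owns exactly one GPU and keeps a full model replica on it.
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(4096, 4096).cuda(rank)  # hypothetical small model
    ddp_model = DDP(model, device_ids=[rank])

    # The replica plus activations use only a fraction of the device memory,
    # which is why 70% of each GPU can be enough for the whole job.
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(64, 4096, device=f"cuda:{rank}")
    loss = ddp_model(x).sum()
    loss.backward()  # gradients are all-reduced across all replicas
    opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 8 on the server in question
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

With that layout, a 560-metagpu budget is only usable as 8 x 70, not as 5 x 100 + 60, because every one of the 8 replicas has to fit on its own device.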
Feature: to be able to request, for example, 70% of each GPU on an 8-GPU server for a single training job, e.g. by adding a new `multigpu` limit. That way multi-GPU training is possible while leaving 30% x 8 metagpus free.
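For illustration only (an assumption on my side, not something the metagpu plugin enforces or requires): a PyTorch job granted 70% of each device could also cap its own allocator to match, using the existing per-process memory fraction API.

```python
# Illustration only: voluntarily cap a PyTorch job to ~70% of every visible GPU,
# mirroring the "70% of each of the 8 GPUs" request above. This is not part of
# the metagpu plugin; it just shows the same fraction applied per device.
import torch

if __name__ == "__main__":
    fraction = 0.7  # corresponds to the 70-metagpus-per-device example above
    for idx in range(torch.cuda.device_count()):
        torch.cuda.set_per_process_memory_fraction(fraction, device=idx)
        total_gib = torch.cuda.get_device_properties(idx).total_memory / 2**30
        print(f"cuda:{idx}: allocator capped at ~{fraction * total_gib:.1f} GiB of {total_gib:.1f} GiB")
```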