cnvrg / metagpu

K8s device plugin for GPU sharing
https://cnvrg.io
MIT License

Allow requesting spread meta GPUs for fractional multi-GPU training #1

Open elgalu opened 2 years ago

elgalu commented 2 years ago

Feature: the ability to request, for example, 70% of each GPU on an 8-GPU server for a single training job, e.g. by adding a new multigpu limit:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

That way multi-GPU training is possible while leaving 30% × 8 meta GPUs free.

Dimss commented 2 years ago

@elgalu thank you for the FR! It looks like a useful feature; however, I'd like some clarification here, if that's OK.
Can you please explain why having 70% of each of 8 GPUs is better for a training job than, for example, having 5.6 full GPUs?

For example, these are equal in terms of metagpu allocation units:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

and

        resources:
          limits:
            cnvrg.io/metagpu: 560 # each GPU equals 100 metagpus, so 70 * 8 = 560 => 5.6 GPUs

So why would the training job prefer to have 70% of each of the 8 GPUs, which is 560 units in total, rather than the same 560 units spanning 5 full GPUs and 60% of a sixth?
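
To make the comparison concrete, here is a quick sketch of the arithmetic (purely illustrative Python, using the 1 GPU = 100 metagpus convention from the snippet above; this is not the plugin's scheduling code):

        # Each physical GPU exposes 100 metagpu units in this example.
        UNITS_PER_GPU = 100
        NUM_GPUS = 8

        # Proposed "spread" request: 70 units on each of the 8 GPUs.
        spread = [70] * NUM_GPUS
        # Equivalent "packed" request: the same 560 units, filling whole GPUs first.
        packed = [100, 100, 100, 100, 100, 60, 0, 0]

        assert sum(spread) == sum(packed) == 560  # identical in total allocation units

        # What differs is the per-device footprint and what stays free on each GPU.
        free_spread = [UNITS_PER_GPU - u for u in spread]  # [30, 30, ..., 30]
        free_packed = [UNITS_PER_GPU - u for u in packed]  # [0, 0, 0, 0, 0, 40, 100, 100]

        print("spread uses", sum(1 for u in spread if u > 0), "GPUs, free per GPU:", free_spread)
        print("packed uses", sum(1 for u in packed if u > 0), "GPUs, free per GPU:", free_packed)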

elgalu commented 2 years ago

Hi, thanks! Because the multi-GPU training job doesn't need 100% of each GPU's memory. It can parallelize well without occupying, for example, the entire 80 GB of an A100 on each device. It can't, however, occupy 100% of 80 GB on just 5 devices; I think the issue is how multi-GPU training works in the underlying frameworks like PyTorch.
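
A rough sketch of that memory argument (all numbers below are hypothetical and only illustrate the data-parallel case, e.g. PyTorch DDP; they are not measurements):

        # Illustrative only: per-device memory for a data-parallel job with made-up numbers.
        # In data parallelism every participating GPU holds a full model replica,
        # so the footprint repeats on each device instead of being packable onto
        # fewer, fuller devices.
        GPU_MEMORY_GB = 80           # e.g. an A100 80GB
        NUM_GPUS = 8

        # Hypothetical footprint of ONE replica (roughly the same on every device):
        model_and_optimizer_gb = 40  # parameters, gradients, optimizer state
        activations_gb = 16          # activations for this device's slice of the batch

        per_device_gb = model_and_optimizer_gb + activations_gb  # 56 GB
        per_device_fraction = per_device_gb / GPU_MEMORY_GB      # 0.7 -> "70 metagpus"
        print(f"each of the {NUM_GPUS} GPUs needs ~{per_device_fraction:.0%} of its memory")

        # The replica is indivisible: the framework places one replica per device,
        # so the job cannot take 100% of five GPUs plus 60% of a sixth instead of
        # 70% of each of the eight.

In other words, the per-replica footprint repeats on every device, so the knob is which fraction of each device the job gets, not how many devices it spans.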