Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

RFC: Deprecate `auto_select_gpus` Trainer argument #13079

Closed: awaelchli closed this issue 1 year ago

awaelchli commented 2 years ago

Proposed refactor

Deprecate the `Trainer(auto_select_gpus=True|False)` option and instead always enable the behavior when `devices=int` is passed.

Motivation

Pitch

Keep the feature, but simply enable it by default and remove the flag from the Trainer. The check that devices are actually free is still useful, for example on managed clusters where GPUs run in exclusive compute mode, and I can't think of a reason why it would be undesired. This would only apply when passing `devices=int`, not when explicit indices are passed.
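For illustration, the two call styles this distinction refers to:

```python
from pytorch_lightning import Trainer

# devices=int: the user only asks for a count, so Lightning is free to
# pick which physical GPUs to use; auto-selection would apply here.
trainer = Trainer(accelerator="gpu", devices=2)

# Explicit indices: the user has chosen specific GPUs, so no
# auto-selection takes place.
trainer = Trainer(accelerator="gpu", devices=[0, 3])
```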

Alternatives

Additional context

#13012



cc @borda @justusschock @awaelchli @rohitgr7 @tchaton @kaushikb11 @carmocca

tanmoyio commented 2 years ago

@awaelchli I want to work on it

carmocca commented 2 years ago

Sounds good to me.

Should we allow customizing it by passing `GPUAccelerator(auto_select=...)`?

ananthsub commented 2 years ago

FYI this also came up in https://github.com/PyTorchLightning/pytorch-lightning/issues/10535

awaelchli commented 2 years ago

@carmocca `auto_select` might raise some questions given we already have the `devices="auto"` syntax. I'm convinced that `auto_select_gpus` can always be set to `True` without the user having to think about anything. I just can't imagine a time when you would want to set it to `False`, given the current implementation.

@ananthsub indeed, I should have looked for the other issue first. This is a duplicate then, but it adds a bit more detail.

mogwai commented 2 years ago

I agree that the name isn't generic, and over the last few versions there has been a good trend in Lightning toward fixing names that don't match the accelerator.

It makes sense that if the CUDA accelerator is being used and a device count has been specified, the Trainer would automatically find GPUs that have 0% utilization and 0% memory allocated by torch. This would remove the need for the `auto_select_gpus` param.
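For illustration, a rough sketch of such a check (hypothetical helper, not Lightning code; `pynvml` is an assumed extra dependency for the utilization query):

```python
import pynvml
import torch


def gpu_is_idle(index: int) -> bool:
    # Utilization: query NVML for the device's current compute utilization.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    utilization = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    pynvml.nvmlShutdown()

    # "0% memory allocation from torch": memory_allocated() only tracks
    # this process's caching allocator, so it cannot see other processes.
    allocated = torch.cuda.memory_allocated(index)

    return utilization == 0 and allocated == 0
```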

However, it might still be important for users to be able to specify which GPUs they want to use.

Whatever the solution is, it should work with default CUDA settings and common setups, so that this feature is easy for people to use. That would be super cool!

awaelchli commented 1 year ago

In my proposal above, I originally suggested that we do this automatically whenever `devices=int` is supplied. However, after testing this in https://github.com/Lightning-AI/lightning/pull/13356 I am no longer convinced. The time it takes to check whether a GPU is available is quite high: multiple seconds on an 8-GPU machine. This is not great for debugging, and `devices=int` is very common.

In summary, the arguments against keeping this functionality in the Trainer are:

- the availability check is slow (multiple seconds on an 8-GPU machine), and
- `devices=int` is the common case, so every such run would pay that cost.

However, we can still keep the feature decoupled from the Trainer as a simple function:

```python
Trainer(devices=find_usable_cuda_devices(2))
```
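A minimal sketch of what such a function could look like, assuming the probe-allocation approach of the existing `auto_select_gpus` implementation (the actual helper may differ):

```python
import torch


def find_usable_cuda_devices(num_devices: int) -> list:
    # Sketch only: probe each visible CUDA device by allocating a tiny
    # tensor on it. On a GPU in exclusive compute mode that another
    # process already claims, the allocation raises a RuntimeError and
    # the device is skipped.
    usable = []
    for index in range(torch.cuda.device_count()):
        try:
            torch.ones(1, device=f"cuda:{index}")
        except RuntimeError:
            continue
        usable.append(index)
        if len(usable) == num_devices:
            return usable
    raise RuntimeError(f"Requested {num_devices} devices, found only {len(usable)} usable.")
```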