@awaelchli I want to work on it
Sounds good to me.
Should we allow customizing it by passing GPUAccelerator(auto_select=...)?
FYI this also came up in https://github.com/PyTorchLightning/pytorch-lightning/issues/10535
@carmocca auto_select might raise some questions given we already have the devices=auto syntax. I'm convinced that auto_select_gpus can always be set to True without the user having to think about anything. I just can't imagine a time when you would want to set it to False, given the current implementation.
@ananthsub indeed, I should have looked for the other issue first. This is a duplicate then, but it adds a bit more detail.
I agree that the name isn't generic, and there has been a good trend in Lightning over the last few versions toward fixing names that don't match the accelerator.
It makes sense that if the CUDA accelerator is being used and devices have been specified, the Trainer would automatically find GPUs that have 0% utilisation and 0% memory allocated from torch. This would then remove the need for the auto_select_gpus param.
However, it might still be important for users to specify which GPUs they want to use.
Whatever the solution is, it should work with default CUDA settings and common setups, to make it easier for people to work with this feature. That would be super cool!
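For illustration, here is a minimal sketch of what such a utilisation/memory check could look like via NVML. find_idle_gpus is a hypothetical helper (not a Lightning API) and requires the nvidia-ml-py package; in practice the driver itself reserves a little memory, so a small slack is allowed instead of a strict 0.

```python
# Hypothetical helper (not part of Lightning): pick GPUs that look idle
# according to NVML. Requires the nvidia-ml-py package (imported as pynvml).
import pynvml


def find_idle_gpus(num_needed, max_mem_bytes=256 * 1024**2):
    pynvml.nvmlInit()
    idle = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
            mem_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used   # bytes
            # "0% utilisation and ~0 memory": allow some slack for the driver
            if util == 0 and mem_used <= max_mem_bytes:
                idle.append(i)
            if len(idle) == num_needed:
                break
    finally:
        pynvml.nvmlShutdown()
    return idle
```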
In my proposal above, I originally suggested that we should do this automatically when devices=int is supplied. However, after testing this in https://github.com/Lightning-AI/lightning/pull/13356, I am not convinced anymore. The time it takes to check whether a GPU is available is quite high: multiple seconds on an 8-GPU machine. This is not great for debugging, and devices=int is very common.
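To make that cost concrete, here is a rough way to reproduce the measurement (my own illustration, not code from the PR): probing a device forces CUDA context initialization on it, which is what dominates the multi-second total.

```python
import time

import torch

start = time.perf_counter()
for i in range(torch.cuda.device_count()):
    # Allocating on the device initializes its CUDA context, the slow part.
    torch.ones(1, device=f"cuda:{i}")
print(f"Probing all GPUs took {time.perf_counter() - start:.1f}s")
```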
In summary, the arguments against keeping this functionality in Trainer are: the availability check is slow (multiple seconds on an 8-GPU machine), and devices=int is far too common a setting to pay that cost by default.
However, we can still keep the feature decoupled from the Trainer in a simple function:

```python
Trainer(devices=find_usable_cuda_devices(2))
```
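As a sketch of how such a function could work (an illustration, not necessarily the exact implementation): try to allocate a small tensor on each device and treat an allocation failure, e.g. on a busy device in exclusive compute mode, as "not usable".

```python
import torch


def find_usable_cuda_devices(num_devices):
    """Return indices of devices on which a small allocation succeeds."""
    usable = []
    for i in range(torch.cuda.device_count()):
        try:
            # Raises a RuntimeError if the device is in exclusive compute
            # mode and another process already holds it.
            torch.ones(1, device=f"cuda:{i}")
        except RuntimeError:
            continue
        usable.append(i)
        if len(usable) == num_devices:
            return usable
    raise RuntimeError(f"Found only {len(usable)} of {num_devices} requested devices")
```

The returned indices can then be passed straight to Trainer(accelerator="cuda", devices=...), as in the one-liner above.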
Proposed refactor

Deprecate the Trainer(auto_select_gpus=True|False) option and just enable it always when devices=int gets selected.

Motivation

Pitch

Keep the feature, but simply enable it by default and remove the flag from the Trainer. It is still useful to have the check against exclusivity, for example on managed clusters, and I can't think of a reason why it would be undesired. This would only apply when passing devices=int and not when indices get passed.

Alternatives

Keep it and come up with sophisticated methods to determine whether a GPU should be selected or not based on memory profile, utilization, etc. This IMO is not feasible because not enough information is available before training.

Make it clearer in the documentation what this actually does.

Additional context

#13012
cc @borda @justusschock @awaelchli @rohitgr7 @tchaton @kaushikb11 @carmocca