@awaelchli I want to work on it
Sounds good to me.
Should we allow customizing it by passing GPUAccelerator(auto_select=...)?
FYI this also came up in https://github.com/PyTorchLightning/pytorch-lightning/issues/10535
@carmocca auto_select might raise some questions given we already have the devices=auto syntax. I'm convinced that auto_select_gpus can always be set to True without the user having to think about anything. I just can't imagine a time when you would want to set it to False, given the current implementation.
@ananthsub indeed, I should have looked for the other issue first. This is a duplicate then, but it adds a bit more detail.
I agree that the name isn't generic, and there has been a good trend in Lightning over the last few versions toward fixing names that don't match the accelerator.
It makes sense that if the CUDA accelerator is being used and devices have been specified, the Trainer would automatically find GPUs that have 0% utilisation and 0% memory allocated from torch. This would then remove the need for the auto_select_gpus param.
However, it might still be important for users to specify which GPUs they want to use.
Whatever the solution is, it should work with default CUDA settings and common setups, to make it easier for people to work with this feature. That would be super cool!
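For illustration, here is a minimal sketch of what such a utilisation/memory check could look like via NVML. find_idle_gpus is a hypothetical helper (not a Lightning API) and requires the nvidia-ml-py package; in practice the driver itself reserves a little memory, so a small slack is allowed instead of a strict 0.

```python
# Hypothetical helper (not part of Lightning): pick GPUs that look idle
# according to NVML. Requires the nvidia-ml-py package (imported as pynvml).
import pynvml


def find_idle_gpus(num_needed, max_mem_bytes=256 * 1024**2):
    pynvml.nvmlInit()
    idle = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
            mem_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used   # bytes
            # "0% utilisation and ~0 memory": allow some slack for the driver
            if util == 0 and mem_used <= max_mem_bytes:
                idle.append(i)
            if len(idle) == num_needed:
                break
    finally:
        pynvml.nvmlShutdown()
    return idle
```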
In my proposal above, I originally suggested that we should do this automatically when devices=int is supplied. However, after testing this in https://github.com/Lightning-AI/lightning/pull/13356, I am not convinced anymore. The time it takes to check whether a GPU is available is quite high: multiple seconds on an 8-GPU machine. This is not great for debugging, and devices=int is very common.
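To make that cost concrete, here is a rough way to reproduce the measurement (my own illustration, not code from the PR): probing a device forces CUDA context initialization on it, which is what dominates the multi-second total.

```python
import time

import torch

start = time.perf_counter()
for i in range(torch.cuda.device_count()):
    # Allocating on the device initializes its CUDA context, the slow part.
    torch.ones(1, device=f"cuda:{i}")
print(f"Probing all GPUs took {time.perf_counter() - start:.1f}s")
```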
In summary, the arguments against keeping this functionality in Trainer are: the availability check is slow (multiple seconds on an 8-GPU machine), and devices=int is far too common a setting to pay that cost by default.
However, we can still keep the feature decoupled from the Trainer in a simple function:

```python
Trainer(devices=find_usable_cuda_devices(2))
```
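As a sketch of how such a function could work (an illustration, not necessarily the exact implementation): try to allocate a small tensor on each device and treat an allocation failure, e.g. on a busy device in exclusive compute mode, as "not usable".

```python
import torch


def find_usable_cuda_devices(num_devices):
    """Return indices of devices on which a small allocation succeeds."""
    usable = []
    for i in range(torch.cuda.device_count()):
        try:
            # Raises a RuntimeError if the device is in exclusive compute
            # mode and another process already holds it.
            torch.ones(1, device=f"cuda:{i}")
        except RuntimeError:
            continue
        usable.append(i)
        if len(usable) == num_devices:
            return usable
    raise RuntimeError(f"Found only {len(usable)} of {num_devices} requested devices")
```

The returned indices can then be passed straight to Trainer(accelerator="cuda", devices=...), as in the one-liner above.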
Proposed refactor

Deprecate the Trainer(auto_select_gpus=True|False) option and just enable it always when devices=int gets selected.

Motivation

Pitch

Keep the feature, but simply enable it by default and remove the flag from the Trainer. It is still useful to have the check against exclusivity, for example on managed clusters, and I can't think of a reason why it would be undesired. This would only apply when passing devices=int and not when indices get passed.

Alternatives

Keep it and come up with sophisticated methods to determine whether a GPU should be selected or not based on memory profile, utilization, etc. This IMO is not feasible because not enough information is available before training.

Make it clearer in the documentation what this actually does.

Additional context

#13012
cc @borda @justusschock @awaelchli @rohitgr7 @tchaton @kaushikb11 @carmocca