Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Does `Trainer(devices=1)` use all CPUs? #19595

Open MaximilienLC opened 4 months ago

MaximilienLC commented 4 months ago

Bug description

https://github.com/Lightning-AI/pytorch-lightning/blob/3740546899aedad77c80db6b57f194e68c455e28/src/lightning/fabric/accelerators/cpu.py#L75

`cpu_cores` being a list of integers will always raise an exception, which it shouldn't according to the Trainer documentation / this function's signature.

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

awaelchli commented 4 months ago

@MaximilienLC Do you mean this?

trainer = Trainer(
    accelerator="cpu",
    devices=[1, 3],
)

TypeError: `devices` selected with `CPUAccelerator` should be an int > 0.

This is correct. It does not make sense to select device indices on a CPU. Device indices are meant for accelerators like CUDA GPUs or TPUs.

If you select `devices=1` (int) on CPU, PyTorch will already use all cores and threads where appropriate. `devices > 1` just simulates DDP on CPU; it does not give you any speed benefit and is only meant for debugging DDP.
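For illustration, here is a minimal sketch (my addition, not part of the original reply) of the two CPU configurations being contrasted, assuming the `lightning.pytorch` import path:

```python
from lightning.pytorch import Trainer

# Single CPU process: PyTorch's own thread pool handles intra-op parallelism,
# so all cores can still be used for operations that parallelize.
trainer = Trainer(accelerator="cpu", devices=1)

# devices > 1 on CPU spawns that many processes with the DDP strategy;
# useful for debugging distributed code, not for speed.
trainer_ddp_debug = Trainer(accelerator="cpu", devices=4)
```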

awaelchli commented 4 months ago

If there is documentation that contradicts this, please point me to it so we can update it. Thanks!

MaximilienLC commented 4 months ago

https://github.com/Lightning-AI/pytorch-lightning/blob/3740546899aedad77c80db6b57f194e68c455e28/src/lightning/pytorch/trainer/trainer.py#L146

I guess this section could add that info for CPU.

So you mean devices=1 on CPU equates to using all CPUs?

awaelchli commented 4 months ago

So you mean devices=1 on CPU equates to using all CPUs?

Yes, PyTorch will use all CPU cores if it can parallelize the operation accordingly.
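One way to see this (my addition, not part of the reply) is through PyTorch's intra-op thread pool, which defaults to the number of available cores:

```python
import torch

# Number of threads PyTorch uses for intra-op parallelism on CPU;
# by default this matches the machine's core count.
print(torch.get_num_threads())

# The pool can be capped explicitly, e.g. on shared machines or in CI.
torch.set_num_threads(4)
```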

jneuendorf commented 4 months ago

I agree that this should not throw an exception, assuming the current documentation is correct. For example, the accelerator could come from an environment variable:

import os

trainer = Trainer(
    accelerator=os.environ.get("DEVICE", "cpu"),
    devices=-1,  # use all devices
)

So when deploying the same code to different machines, it would break for CPU. Maybe a warning would be more appropriate to notify the user that something is not optimally configured.
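As a hedged sketch of one possible workaround (my own suggestion, not current Lightning behavior), the "all devices" request could be mapped to a value the `CPUAccelerator` accepts before constructing the `Trainer`:

```python
import os

from lightning.pytorch import Trainer

accelerator = os.environ.get("DEVICE", "cpu")
# Assumption: `devices=-1` ("all devices") is only meaningful for GPU/TPU here,
# so fall back to a single CPU process, which already uses all cores.
devices = 1 if accelerator == "cpu" else -1

trainer = Trainer(accelerator=accelerator, devices=devices)
```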

Would devices="auto" also use all devices?