Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Device parsing improvements #10265

Open awaelchli opened 2 years ago

awaelchli commented 2 years ago

Proposed refactoring or deprecation

Introduce a dataclass that holds the user's device selection in a standardized format. Idea by @ananthsub.

Motivation

We have a _parse_devices function used in the Trainer and Lite that returns a tuple of parsed device indices.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9237106451f97393b17009a0ca571b6ff5ba5484/pytorch_lightning/trainer/trainer.py#L1459-L1470

From @ananthsub in https://github.com/PyTorchLightning/pytorch-lightning/pull/10230#discussion_r738806860

Returning a tuple isn't going to scale well with more device types: it's not easy to tell which positional index maps to which device type. It could be better to introduce a dataclass to represent the schema concretely. That would also naturally allow for extensions like IPUs.
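
As a quick illustration (the names below are hypothetical; DeviceSelection is sketched in the pitch that follows):

# today: a positional tuple return, where the meaning of each index is implicit
gpu_ids, tpu_cores, ipus = _parse_devices(...)

# proposed: a named schema that extends naturally to new device types
selection = DeviceSelection(devices=[0, 1], type=DeviceType.GPU)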

Pitch


from dataclasses import dataclass, field
from typing import List

from pytorch_lightning.utilities.enums import DeviceType  # existing CPU/GPU/TPU/IPU enum


@dataclass
class DeviceSelection:

    devices: List[int] = field(default_factory=list)
    type: DeviceType = DeviceType.CPU

    @staticmethod
    def parse_input(gpus, tpu_cores, ipus):  # plus any future accelerator inputs
        # validate the user inputs
        # map the various input formats to the standardized one in this dataclass
        ...
        return DeviceSelection(devices=..., type=...)

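Usage could then look roughly like this (a sketch only; the exact accepted input formats are up for discussion):

selection = DeviceSelection.parse_input(gpus="0,1", tpu_cores=None, ipus=None)
# -> DeviceSelection(devices=[0, 1], type=DeviceType.GPU)
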
The AcceleratorConnector would then get the DeviceSelection instance as input instead of a growing list of arguments. It currently takes: devices, gpus, gpu_ids, tpu_cores, ipus, num_processes.
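
Roughly (a sketch only; the remaining constructor arguments are omitted here):

# sketch: the connector consumes a single standardized object instead of
# devices, gpus, gpu_ids, tpu_cores, ipus, num_processes
class AcceleratorConnector:
    def __init__(self, device_selection: DeviceSelection) -> None:
        self.device_selection = device_selection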

Additional context

Alternative to #10231


cc @justusschock @awaelchli @akihironitta @rohitgr7 @tchaton @borda

awaelchli commented 2 years ago

@ananthsub feel free to edit this issue/pitch

justusschock commented 2 years ago

@awaelchli Would this list every single cpu core?

awaelchli commented 2 years ago

I'm not sure what should be done for the CPU case. But since we cannot select which CPU cores a program runs on, we probably shouldn't try to represent them in the list this way. For DDP on CPU, we can maybe just do [0] * num_devices.
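
For example, DDP on CPU with 4 processes could then be represented as (a sketch under the DeviceSelection proposal above; num_processes is the existing Trainer argument):

# CPU has no selectable device indices, so repeat index 0 once per process
num_processes = 4
selection = DeviceSelection(devices=[0] * num_processes, type=DeviceType.CPU)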

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!