Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Add device information to the accelerator config message #17355

Open carmocca opened 1 year ago

carmocca commented 1 year ago

Description & Motivation

Revamp

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

To

GPU available: M1, using 1 devices
TPU available: v4-8, using 0 devices
IPU available: False, using 0 devices
HPU available: False, using 0 devices

The relevant code is: https://github.com/Lightning-AI/lightning/blob/f14ee9edbc8269054e12daf30b8681d530e73369/src/lightning/pytorch/trainer/setup.py#L145-L171

Pitch

If the accelerator is available, True changes to the actual name of the accelerator used. If it's unavailable, we still show False.

For GPUs, the cuda|mps field is gone, as it should be clear from the device name.

I also propose that the GPU field shows the number of devices, instead of a used boolean.

We can get this info via

# CUDA
import torch
torch.cuda.get_device_name()

# TPU
from torch_xla.experimental import tpu
import torch_xla.core.xla_env_vars as xenv
# note: this needs a try-except as this will send a request
tpu.get_tpu_env()[xenv.ACCELERATOR_TYPE]

For MPS, HPU, and IPU we would need to find out whether we can get this information. In the meantime, we can still fall back to "True" for them.

This could be done by introducing an Accelerator.device_name(device) staticmethod.
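
A minimal sketch of what that staticmethod could look like, shown only for CUDA; the device_name name and signature are part of this proposal, not an existing Lightning API, and the base class keeps today's "True" as the fallback:

# Hypothetical sketch of the proposed Accelerator.device_name hook
# (not the real Lightning classes).
from typing import Optional, Union

import torch


class Accelerator:
    @staticmethod
    def device_name(device: Optional[Union[int, torch.device]] = None) -> str:
        # Fallback when the accelerator cannot report a human-readable name
        return "True"


class CUDAAccelerator(Accelerator):
    @staticmethod
    def device_name(device: Optional[Union[int, torch.device]] = None) -> str:
        # e.g. "Tesla P100-PCIE-16GB"
        return torch.cuda.get_device_name(device)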

Alternatives

One caveat is that this might be misleading with heterogeneous devices, as only rank zero prints this information.

Additional context

No response

cc @borda @justusschock @awaelchli

awaelchli commented 1 year ago

I think this is a good improvement. For Apple silicon, we would either have to show something generic or find a robust way to determine the name.

carmocca commented 1 year ago

Found this: https://stackoverflow.com/a/69997851, but it uses a PyPI package.
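
If we want to avoid the extra dependency, one possible fallback (a sketch that assumes macOS, where sysctl exposes the CPU brand string, e.g. "Apple M1") could be:

# Sketch only: read the chip name on macOS without a PyPI package.
import platform
import subprocess


def apple_silicon_name() -> str:  # hypothetical helper, not part of Lightning
    if platform.system() != "Darwin":
        return "True"  # keep the current generic output elsewhere
    try:
        # "sysctl -n machdep.cpu.brand_string" prints e.g. "Apple M1"
        return subprocess.check_output(
            ["sysctl", "-n", "machdep.cpu.brand_string"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return "True"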

ishandutta0098 commented 1 year ago

@carmocca Here is a minimal implementation for this. Is this what is required?

def _log_device_info(trainer: "pl.Trainer") -> None:
    def get_device_name(accelerator, device=None) -> str:
        if isinstance(accelerator, CUDAAccelerator):
            return torch.cuda.get_device_name(device)
        elif isinstance(accelerator, TPUAccelerator):
            try:
                from torch_xla.experimental import tpu
                import torch_xla.core.xla_env_vars as xenv
                return tpu.get_tpu_env()[xenv.ACCELERATOR_TYPE]
            except Exception:
                # fetching the TPU environment sends a request and can fail
                pass
        return "True"

    gpu_name = ""
    if isinstance(trainer.accelerator, (CUDAAccelerator, MPSAccelerator)):
        gpu_name = get_device_name(trainer.accelerator, trainer.strategy.root_device)
        gpu_name = f", using {trainer.num_devices} devices: {gpu_name}"

    rank_zero_info(f"GPU available: {CUDAAccelerator.is_available() or MPSAccelerator.is_available()}{gpu_name}")

    tpu_name = ""
    if isinstance(trainer.accelerator, TPUAccelerator):
        tpu_name = get_device_name(trainer.accelerator)
        tpu_name = f", using {trainer.num_devices} devices: {tpu_name}"

    rank_zero_info(f"TPU available: {TPUAccelerator.is_available()}{tpu_name}")

    num_ipus = trainer.num_devices if isinstance(trainer.accelerator, IPUAccelerator) else 0
    rank_zero_info(f"IPU available: {_IPU_AVAILABLE}, using: {num_ipus} IPUs")

    if _LIGHTNING_HABANA_AVAILABLE:
        from lightning_habana import HPUAccelerator

        num_hpus = trainer.num_devices if isinstance(trainer.accelerator, HPUAccelerator) else 0
    else:
        num_hpus = 0
    rank_zero_info(f"HPU available: {_HPU_AVAILABLE}, using: {num_hpus} HPUs")

    # TODO: Integrate MPS Accelerator here, once gpu maps to both
    if CUDAAccelerator.is_available() and not isinstance(trainer.accelerator, CUDAAccelerator):
        rank_zero_warn(
            "GPU available but not used. Set `accelerator` and `devices` using"
            f" `Trainer(accelerator='gpu', devices={CUDAAccelerator.auto_device_count()})`.",
            category=PossibleUserWarning,
        )

    if TPUAccelerator.is_available() and not isinstance(trainer.accelerator, TPUAccelerator):
        rank_zero_warn(
            "TPU available but not used. Set `accelerator` and `devices` using"
            f" `Trainer(accelerator='tpu', devices={TPUAccelerator.auto_device_count()})`."
        )

    if _IPU_AVAILABLE and not isinstance(trainer.accelerator, IPUAccelerator):
        rank_zero_warn(
            "IPU available but not used. Set `accelerator` and `devices` using"
            f" `Trainer(accelerator='ipu', devices={IPUAccelerator.auto_device_count()})`."
        )

    if _HPU_AVAILABLE:
        if not _LIGHTNING_HABANA_AVAILABLE:
            raise ModuleNotFoundError(
                "You are running on an HPU machine but the `lightning-habana` extension is not installed."
            )

        from lightning_habana import HPUAccelerator

        if not isinstance(trainer.accelerator, HPUAccelerator):
            rank_zero_warn(
                "HPU available but not used. Set `accelerator` and `devices` using"
                f" `Trainer(accelerator='hpu', devices={HPUAccelerator.auto_device_count()})`."
            )

    if MPSAccelerator.is_available() and not isinstance(trainer.accelerator, MPSAccelerator):
        rank_zero_warn(
            "MPS available but not used. Set `accelerator` and `devices` using"
            f" `Trainer(accelerator='mps', devices={MPSAccelerator.auto_device_count()})`."
        )

I tested this in a Kaggle notebook; the output looks like this for the different cases:

  1. GPU is not available: GPU available: False

  2. GPU Available: GPU available: True, using 1 devices: Tesla P100-PCIE-16GB

  3. TPU Available: TPU available: True, using 8 devices: v3-8

carmocca commented 1 year ago

@ishandutta0098 That's the idea, but I suggest that this is done through an Accelerator.device_name staticmethod instead of adding the logic to get the name directly there.

Also, I suggest adhering to the original proposal where the device name goes first and is separate from the number of devices.
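
For illustration, a sketch of how the GPU line could then be built with the name first and the device count separate (it assumes the hypothetical device_name staticmethod discussed above):

# Sketch only: message format with the device name first, count second.
from lightning.pytorch.utilities import rank_zero_info


def log_gpu_line(trainer) -> None:
    name = trainer.accelerator.device_name(trainer.strategy.root_device)  # hypothetical hook
    rank_zero_info(f"GPU available: {name}, using {trainer.num_devices} devices")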