Improved control of device stats callbacks

EricWiener commented 2 years ago

Proposed refactor

Separate device stats monitoring into separate callbacks per-device, but sub-class from DeviceStatsMonitor. This preserves the desired change from #9032 to consolidate the interface, but also allows for fine-grained control.

Motivation

With #9032, all the accelerators were combined under a single DeviceStatsMonitor callback. This consolidated the API, but it also removed fine-grained control. For instance, the GPUStatsMonitor that is now being deprecated used to provide fine-grained control over the nvidia-smi stats that were tracked: https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/callbacks/gpu_stats_monitor.py#L87-L95

However, the new interface defaults to using torch memory stats (which provide less info than nvidia-smi): https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/accelerators/gpu.py#L73-L75

Regardless of whether GPU stats are changed to default to nvidia-smi, the user no longer has control over what metrics are monitored. Additionally, if https://github.com/PyTorchLightning/pytorch-lightning/pull/11795 is merged, there will be additional CPU stats monitoring + whatever accelerator is used.

Pitch 1

If the user was allowed to specify specific stats to monitor, this would require the callback to look like:

DeviceStatsMonitor(
    cpu_stats: Optional[Union[bool, Set[str]]] = None,
    gpu_stats: Optional[Union[bool, Set[str]]] = None,
    tpu_stats: Optional[Union[bool, Set[str]]] = None,
)

This builds on top of the suggestion in https://github.com/PyTorchLightning/pytorch-lightning/issues/11253#issuecomment-1004778058 where the values allowed are:

None: To know if the user passed a value
bool: To easily enable/disable a default set of stats
Set[str]: To enable and show this specific set of stats.

# enable cpu stats + stats for the current accelerator
DeviceStatsMonitor(cpu_stats=True)

# enable these cpu stats + stats for the current accelerator
DeviceStatsMonitor(cpu_stats={"ram", "temp"})

This design provides no argument validation via type checking/auto-complete.

Pitch 2

Have a common interface via a base class:

class DeviceStatsMonitor(Callback):

For each device, sub-class DeviceStatsMonitor and allow for configuration:

class GPUStatsMonitor(DeviceStatsMonitor):
  def __init__(
        self,
        memory_utilization: bool = True,
        gpu_utilization: bool = True,
        intra_step_time: bool = False,
        inter_step_time: bool = False,
        fan_speed: bool = False,
        temperature: bool = False,
    )

Add a CPUStatsMonitor.

If you want to track both CPU stats + another accelerator you can now pass:

trainer=Trainer(callbacks=[CPUStatsMonitor(), GPUStatsMonitor()])

Pitch 3

Use a single DeviceStatsMonitor with the option to specify cpu_stats=True and provide sensible default metrics. This will be a friendly generic interface for quickly tracking stats.

For other users, they should be able to access get_device_stats() from the accelerator class and get_device_stats should take optional arguments for configuration (i.e., get_device_stats() with no arguments should be sufficient, but it should also allow additional optional arguments to be passed that change per-device). This allows for customization of the stats without needing to make each device callback unique and highly customizable.

Currently, (in my opinion) it is a pain to make a Callback since you have to override multiple hooks even if you want the same/similar behavior per-hook. I instead propose adding a new DecoratedCallback class that derives from the Callback class that allows you to specify decorators in order to specify which hooks should be called without needing to define a lot of one-line functions. I also think _prefix_metric_keys should be made a public utility.

The user could now do:

class MyGPUStatsMonitor(DecoratedCallback):
    @pl_hook.on_train_batch_start
    @pl_hook.on_train_batch_end
    @pl_hook.on_val_batch_start
    @pl_hook.on_val_batch_end
    def log_batch_stats(
        key: str, 
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        stats = GPUAccelerator.get_device_stats(use_nvidia_smi=True, metrics=["gpu.utilization", ...])
        prefixed_device_stats = prefix_metric_keys(device_stats, key)
        trainer.logger.log_metrics(prefixed_device_stats, step=trainer.global_step)

The current alternative to this would be:

class MyGPUStatsMonitor(Callback):
    def on_train_batch_start(
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        stats = GPUAccelerator.get_device_stats(use_nvidia_smi=True, metrics=["gpu.utilization", ...])
        prefixed_device_stats = prefix_metric_keys(device_stats, "on_train_batch_start")
        trainer.logger.log_metrics(prefixed_device_stats, step=trainer.global_step)

    def on_train_batch_end(
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        stats = GPUAccelerator.get_device_stats(use_nvidia_smi=True, metrics=["gpu.utilization", ...])
        prefixed_device_stats = prefix_metric_keys(device_stats, "on_train_batch_end")
        trainer.logger.log_metrics(prefixed_device_stats, step=trainer.global_step)

    # ...

Or if using a shared function it would be:

class MyGPUStatsMonitor(Callback):
    def _log_batch_stats(
        key: str, 
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        stats = GPUAccelerator.get_device_stats(use_nvidia_smi=True, metrics=["gpu.utilization", ...])
        prefixed_device_stats = prefix_metric_keys(device_stats, key)
        trainer.logger.log_metrics(prefixed_device_stats, step=trainer.global_step)

    def on_train_batch_start(
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        self._log_batch_stats("on_train_batch_start", trainer)

    def on_train_batch_end(
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0
    ):
        self._log_batch_stats("on_train_batch_end", trainer)

    # ...

By providing utilities to get device metrics easily and making it faster/less LOC to create a Callback, it becomes less of a pain to migrate away from DeviceStatsMonitor when you need to customize.

cc @justusschock @awaelchli @akihironitta @rohitgr7 @tchaton @borda @kaushikb11 @ananthsub @daniellepintz @edward-io @mauvilsa

EricWiener commented 2 years ago

@cowwoc @twsl @daniellepintz @ananthsub @carmocca @mauvilsa could I please have your thoughts (I saw you were either involved with #9032 or discussed this in slack)

cowwoc commented 2 years ago

LGTM

ananthsub commented 2 years ago

IMO, the most important part of #9032 was deprecating log_gpu_memory from the Trainer constructor and the internal logic for logging GPU memory it resulted in, which offered no chance of extensibility.

I am fine with undeprecating GPUStatsMonitor/XLAStatsMonitor. Building off your proposal, the DeviceStatsMonitor base class could require a get_device_stats() method to be implemented while filling out the logging information. Each of the child classes could handle per-accelerator customization.

In general, think the callbacks demonstrate how easy it is to access & extend this information.

twsl commented 2 years ago

I'm all for pitch 2.

daniellepintz commented 2 years ago

Hi @EricWiener, thanks for the proposal!

First of all, I think we are conflating two things here - the move from nvidia-smi to use torch.cuda.memory_stats is completely separate from this issue, in fact there was a whole separate issue for it https://github.com/PyTorchLightning/pytorch-lightning/issues/8780, and it was simply done as part of #9032 for convenience. It is just an implementation detail of GPUAccelerator.get_device_stats and can easily be changed.

Regarding the rest of your proposal, I agree that now that we are adding CPUStats and users may want both CPUStats + another accelerator stats, we need to change something in the design. I am still thinking about which option I like best. One potential downside of Pitch 2 is that technically users could add a statsmonitor for an accelerator that they aren't using, like they could add a GPUStatsMonitor if they are using only CPU, etc, so we would have to handle that. Another downside is that it replicates the list of all the accelerators, i.e. we already have CPU, GPU, TPU, IPU Accelerators, now we will also need to have CPUDeviceStats, GPUDeviceStats, TPUDeviceStats, IPUDeviceStats.

Also one question, if we went with Pitch 2 would we deprecate the get_device_stats from the Accelerator class?

Another technicality, if we go with Pitch 2, IMO we shouldn't just undeprecate GPU/XLAStatsMonitors, because there was also some unification that went into https://github.com/PyTorchLightning/pytorch-lightning/issues/9032 to unify the interfaces, which we should keep.

twsl commented 2 years ago

Couldnt we just add a generic statsmonitor as an additional class that always loggs cpu/sys mem and accelerator data? that way we would have an easy to use default and could still allow advanced users to configure logging the device of choice regardless of the accelerator. cause you might want to use a certain device and log the stats even if it isnt your accelerator

EricWiener commented 2 years ago

First of all, I think we are conflating two things here - the move from nvidia-smi to use torch.cuda.memory_stats is completely separate from this issue, in fact there was a whole separate issue for it #8780, and it was simply done as part of #9032 for convenience. It is just an implementation detail of GPUAccelerator.get_device_stats and can easily be changed.

Sorry for the confusion. I had meant for this to be an example of when finer user-control would be nice (specifying whether to use nvidia-smi or torch.cuda.memory_stats / what stats to display). Regardless of what the default is set to, it seems the user should have more control, which would be possible if the device stats weren't all constrained to the same interface.

Regarding the rest of your proposal, I agree that now that we are adding CPUStats and users may want both CPUStats + another accelerator stats, we need to change something in the design. I am still thinking about which option I like best. One potential downside of Pitch 2 is that technically users could add a statsmonitor for an accelerator that they aren't using, like they could add a GPUStatsMonitor if they are using only CPU, etc, so we would have to handle that.

We could either raise an error if the device wasn't supported (already done) or log a warning and just ignore the callback. Either way seems fine to me. Right now if the user specifies DeviceStatsMonitor and is only using CPU they will also get an error (at least the code makes it seem this way - I have not verified this to be the case).

Another downside is that it replicates the list of all the accelerators, i.e. we already have CPU, GPU, TPU, IPU Accelerators, now we will also need to have CPUDeviceStats, GPUDeviceStats, TPUDeviceStats, IPUDeviceStats. It seems like the list of accelerators mainly serves to get device stats currently.

IPU doesn't do much https://github.com/PyTorchLightning/pytorch-lightning/blob/1d2878523ac36703b851f0646b0ef17f14c582cc/pytorch_lightning/accelerators/ipu.py#L21-L33
CPU didn't do much either before my other open PR https://github.com/PyTorchLightning/pytorch-lightning/blob/1203094a201bd38f0b8b77d93bc39fc95f06d8ae/pytorch_lightning/accelerators/cpu.py#L25-L44
TPU mainly handles stats https://github.com/PyTorchLightning/pytorch-lightning/blob/cec2d7946b9da07289025e27e57597538d2c50ec/pytorch_lightning/accelerators/tpu.py#L25-L49

It seems like the accelerators should be condensed vs. condensing the device stats monitors.

Also one question, if we went with Pitch 2 would we deprecate the get_device_stats from the Accelerator class?

This would probably make sense.

Another technicality, if we go with Pitch 2, IMO we shouldn't just undeprecate GPU/XLAStatsMonitors, because there was also some unification that went into #9032 to unify the interfaces, which we should keep.

Good point

EricWiener commented 2 years ago

Couldnt we just add a generic statsmonitor as an additional class that always loggs cpu/sys mem and accelerator data? that way we would have an easy to use default and could still allow advanced users to configure logging the device of choice regardless of the accelerator. cause you might want to use a certain device and log the stats even if it isnt your accelerator

I was working on doing that in #11795, but there wasn't a very nice way to do that and still allow customization for both CPU metrics + accelerator metrics. This would be pitch 1

mauvilsa commented 2 years ago

Regarding the rest of your proposal, I agree that now that we are adding CPUStats and users may want both CPUStats + another accelerator stats, we need to change something in the design. I am still thinking about which option I like best. One potential downside of Pitch 2 is that technically users could add a statsmonitor for an accelerator that they aren't using, like they could add a GPUStatsMonitor if they are using only CPU, etc, so we would have to handle that.

We could either raise an error if the device wasn't supported (already done) or log a warning and just ignore the callback. Either way seems fine to me. Right now if the user specifies DeviceStatsMonitor and is only using CPU they will also get an error (at least the code makes it seem this way - I have not verified this to be the case).

One thing I do not like about the current deprecated GPUStatsMonitor is that if I am not using a GPU, then the execution fails. In my code I have a modification of it such that if no GPU is used, then the callback does nothing. If this and the new DeviceStatsMonitor and derived callbacks do not work like this, then how is it supposed to be used? Am I expected to change my source code every time I change the hardware I run things on? To me this is bad practice since the source code should be stable. Am I required to conditionally add callbacks? This would add boilerplate and wouldn't work with LightningCLI as a persistent configurable callback.

I might be more in favor of a single DeviceStatsMonitor callback. What I would expect by default from such a callback is to log stats for all devices that were used. If only CPU, then I would see only CPU stats. If multiple devices, then the stats of those multiple devices. No need for me to tell the callback which devices, it is all the ones that were used. I would not expect this callback to make execution fail in any circumstance. Also I don't see why it should give a warning if some device type is not found. Surely I would hope the callback to be configurable. But a parameter such as gpu_stats=True would mean, if a GPU is used, then log stats for it. If no GPU used, then ignore without any warning. On the other hand, a gpu_stats=False would mean, if a GPU is used, then don't log stats for it.

This view is more focused on the user perspective. I have not looked at the code to understand how this would fit or what complications there might be.

EricWiener commented 2 years ago

Just added a new pitch 3 based on the above feedback where we keep a single DeviceStatsMonitor with limited customization but make it easier for users to create custom device monitoring callbacks.

cowwoc commented 2 years ago

One thing I do not like about the current deprecated GPUStatsMonitor is that if I am not using a GPU, then the execution fails

I actually prefer code to fail-fast than failing silently. If you're going to go down this path, please log "[INFO] No GPU detected. Disabling GPUStstsMonitor" so the failure is not as silent.

mauvilsa commented 2 years ago

I actually prefer code to fail-fast than failing silently.

I also prefer code that fails fast and no silent failures. But just to clarify. What I am saying is that the purpose of a DeviceStatsMonitor could be "Log stats for all devices used". Not finding or not using a certain device should not be considered a failure from the perspective of this callback.

If you're going to go down this path, please log "[INFO] No GPU detected. Disabling GPUStstsMonitor" so the failure is not as silent.

Lightning already prints which devices are detected and which ones are used, e.g.

GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

If the callback ends up working as I suggested, would there be a need for this callback to show a message that a device is not detected? If a message is shown, maybe better the other way around, like [INFO] DeviceStatsMonitor: logging stats for devices: CPU, GPU.

mauvilsa commented 2 years ago

Just added a new pitch 3 based on the above feedback where we keep a single DeviceStatsMonitor with limited customization but make it easier for users to create custom device monitoring callbacks.

@EricWiener a single callback does not necessarily mean no per device type options. How about the following:

class DeviceStatsMonitor(Callback):
    def __init__(
        self,
        cpu_stats: bool = True,
        gpu_stats: bool = True,
        tpu_stats: bool = True,
        get_device_stats_gpu_kwargs: Dict[str, Any] = None,
        get_device_stats_tpu_kwargs: Dict[str, Any] = None,
    ):
        ...

When instantiating the callback, one could optionally give in get_device_stats_gpu_kwargs a dictionary with all options for GPUAccelerator.get_device_stats.

EricWiener commented 2 years ago

Just added a new pitch 3 based on the above feedback where we keep a single DeviceStatsMonitor with limited customization but make it easier for users to create custom device monitoring callbacks.

@EricWiener a single callback does not necessarily mean no per device type options. How about the following:
class DeviceStatsMonitor(Callback):

    def __init__(

        self,

        cpu_stats: bool = True,

        gpu_stats: bool = True,

        tpu_stats: bool = True,

        get_device_stats_gpu_kwargs: Dict[str, Any] = None,

        get_device_stats_tpu_kwargs: Dict[str, Any] = None,

    ):

        ...
When instantiating the callback, one could optionally give in get_device_stats_gpu_kwargs a dictionary with all options for GPUAccelerator.get_device_stats.

That would pretty much be pitch 1 (if I'm understanding you correctly).

mauvilsa commented 2 years ago

That would pretty much be pitch 1 (if I'm understanding you correctly).

Yes, it is similar to pitch 1. But you added pitch 3 as response to my feedback, which distracts a bit from the core of what I was saying.

daniellepintz commented 2 years ago

I agree with @mauvilsa, I think we should keep one DeviceStatsMonitor and it should log stats for all devices used. However, I do not think we need this complicated interface:

class DeviceStatsMonitor(Callback):
    def __init__(
        self,
        cpu_stats: bool = True,
        gpu_stats: bool = True,
        tpu_stats: bool = True,
        get_device_stats_gpu_kwargs: Dict[str, Any] = None,
        get_device_stats_tpu_kwargs: Dict[str, Any] = None,
    ):

daniellepintz commented 2 years ago

I actually think the best option here is the one proposed in https://github.com/PyTorchLightning/pytorch-lightning/issues/11253#issuecomment-1004778058 and the one you are adding in #11795

EricWiener commented 2 years ago

If we are no longer going to need to let the user choose whether they want to get GPU metrics from torch/nvidia-smi (which I understand is a separate issue - but it is the first example I have come across of the need for a per-accelerator flag), then that would reduce the number of flags needed to be passed to DeviceStatsMonitor. I'm good with pitch 1, but with the following assumptions/caveats:

The source that an accelerator gets its metrics from can't be changed (ex. it would be quite confusing to handle different metric keys if the user switches from using torch to nvidia-smi for metrics).

I also think we should have the flags look like this:

DeviceStatsMonitor(
cpu_stats: Optional[Union[bool, Set[str]]] = None,
gpu_stats: Optional[Union[bool, Set[str]]] = None,
tpu_stats: Optional[Union[bool, Set[str]]] = None,
)

(and not have get_device_stats_gpu_kwargs: Dict[str, Any] = None,, etc.).

No other per-accelerator configuration (besides the metrics being tracked) should be able to be passed as arguments to DeviceStatsMonitor. For any more customization, the user should use the corresponding get_device_stats function and create their own callback. I'm okay with adding more DeviceStatsMonitor configuration (i.e. specifying what hooks to log on), but this additional configuration should function the same regardless of what accelerator is used.
We log a warning if stats are requested to be tracked and that accelerator isn't used, but we don't raise an error. It would be a pain if you needed to comment out gpu_stats every time you ran on the CPU.

rohitgr7 commented 2 years ago

after some discussion with @carmocca

enabling this:

DeviceStatsMonitor(
    cpu_stats: Optional[Union[bool, Set[str]]] = None,
    gpu_stats: Optional[Union[bool, Set[str]]] = None,
    tpu_stats: Optional[Union[bool, Set[str]]] = None,
)

will let users track other accelerator stats (for eg GPU and TPU), even if they are not using those accelerators for their scripts.

thoughts @Lightning-AI/lai-frameworks ?

my thoughts:

if someone comes up with their custom accelerator, they might have to update the DeviceStatsMonitor and include another flag to make sure their stats are logged too because right now it's very hardware agnostic, but after including the flags, it won't be.
flag_count will increase with new accelerators.
Motivation for tracking the hardware stats during training and user is not even using?

Lightning-AI / pytorch-lightning