XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

Update pytorch_lightning.py callback #84

Closed lkhphuc closed 1 year ago

lkhphuc commented 1 year ago

Description

Use new trainer property, and fix callback function arguments.

Motivation and Context

Deprecated property in lightning. https://github.com/Lightning-AI/lightning/pull/12072/files#diff-667a2513c158c2b01d73b479e4dea75587d8531f9cd286891d34570a2fd145dcR2117
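Roughly, the change is of this shape (a sketch only; the exact property names are an assumption based on Lightning 1.6's deprecation of Trainer.root_gpu in favor of Trainer.strategy.root_device, not something stated in this thread — see the linked diff for the real change):

```diff
  # Hypothetical before/after inside the callback (property names assumed):
- device_index = trainer.root_gpu
+ device_index = trainer.strategy.root_device.index
```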

XuehaiPan commented 1 year ago

Hello @lkhphuc, thank you for your contribution to this PR.

My preference would be to retain this callback for now, and potentially consider deprecating it or marking it as defunct later. This is primarily because the code has been in place for several years, making it challenging to cater to a broad spectrum of pytorch-lightning or lightning versions. I believe a more suitable approach would be to encourage users to use the first-party callbacks maintained by the Lightning-AI team, such as the lightning.pytorch.callbacks.DeviceStatsMonitor callback. I would appreciate your thoughts on this.

It's important to note that nvitop doesn't depend on pytorch-lightning or lightning. As such, we don't have a release schedule aligned with Lightning-AI/lightning, and this autonomy means we occasionally struggle to keep up with upstream API changes. Here's a list of regressions in the Lightning API you may find helpful: Lightning-AI/lightning/docs/source-pytorch/upgrade/sections.

lkhphuc commented 1 year ago

Hi.

For me, the default DeviceStatsMonitor produces 122 scalar charts just for the on_train_batch_end step, which is pretty much useless. That's why I came to nvitop in the first place.

I understand your concern about keeping up with Lightning upstream. It would indeed be better incorporated into their library, if they ever want to do that.

One option might be to just provide the API for get_gpu_stats and move all the other callbacks into an examples folder maintained by the community. These callbacks are not very complicated, but some boilerplate to get them running quickly would be very helpful from a user's perspective.

For now, the simple changes in this PR are working fine for me, but feel free to close it.

XuehaiPan commented 1 year ago

For me, the default DeviceStatsMonitor produces 122 scalar charts just for the on_train_batch_end step, which is pretty much useless. That's why I came to nvitop in the first place.

@lkhphuc Have you tried nvitop.ResourceMetricCollector? It gives you a detailed list of host, device, and process metrics.

One option might be to just provide the API for get_gpu_stats

You can access get_gpu_stats via:

from nvitop import Device
from nvitop.callbacks.utils import get_gpu_stats

# Collect stats for all CUDA-visible devices (respects CUDA_VISIBLE_DEVICES)
gpu_stats = get_gpu_stats(Device.cuda.all())

You can also try nvitop.take_snapshots to implement your own get_gpu_stats-like APIs:

from nvitop import take_snapshots

# Returns a snapshot of the current device and GPU-process metrics
print(take_snapshots())

I think ResourceMetricCollector may fit your needs better. Here is a minimal example that creates a background monitor:

import time

from nvitop import Device, ResourceMetricCollector

def on_collect(metrics):
    print(metrics)
    return True  # return True to keep the collection daemon running

# Sample metrics every 1 second and invoke on_collect every 5 seconds
collector = ResourceMetricCollector(Device.cuda.all(), interval=1.0)
daemon = collector.daemonize(on_collect, interval=5.0)

time.sleep(3600)  # do something else; collection runs in the background

You can find more examples in the README or in the online documentation.


For me, the default DeviceStatsMonitor produces 122 scalar charts just for the on_train_batch_end step, which is pretty much useless. That's why I came to nvitop in the first place. It would indeed be better incorporated into their library, if they ever want to do that.

The pytorch-lightning callback in nvitop.callbacks is a ported version, refactored to use nvitop APIs rather than nvidia-smi calls. I don't think it will give more metrics than the original GpuStatsMonitor (now DeviceStatsMonitor). I can submit a PR to Lightning upstream to add more metrics if you'd like.

lkhphuc commented 1 year ago

Thanks. I will take a look at the ResourceMetricCollector.

For Lightning, if you have time to contribute upstream, that would be great. My issue with it is not that it logs too little but that it logs too much: active_bytes/inactive_split_bytes, small_pool/large_pool, peak/free/allocated, and so on, for a total of 122 charts per step.
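The filtering I actually want is tiny. Here is a sketch with stand-in metric names (the real keys would come from get_gpu_stats or ResourceMetricCollector; nothing below is nvitop's actual output):

```python
# Illustrative sketch: keep only a whitelisted subset of GPU metrics instead
# of logging all ~122 scalars. `stats` stands in for the dict of metric
# name -> value a collector would produce; the key names are hypothetical.

WANTED_SUBSTRINGS = ("utilization.gpu", "memory.used")

def filter_gpu_stats(stats: dict) -> dict:
    """Drop every metric whose name contains no wanted substring."""
    return {
        name: value
        for name, value in stats.items()
        if any(sub in name for sub in WANTED_SUBSTRINGS)
    }

# Example with stand-in data (not real nvitop output):
stats = {
    "device/0/utilization.gpu (%)": 87.0,
    "device/0/memory.used (MiB)": 10240.0,
    "device/0/fan.speed (%)": 45.0,
}
print(filter_gpu_stats(stats))  # only the two whitelisted metrics remain
```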

I'm going to close this and roll my own callback using your suggestion for now.