XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0
4.56k stars · 144 forks

[Question] How to log GPU performance to `wandb` #109

Closed · BitCalSaul closed this issue 9 months ago

BitCalSaul commented 9 months ago

Required prerequisites

Motivation

Hey, I'm a super fan of nvitop. I usually use another monitor to watch my GPU performance over time, but it's hard to keep a record of it. So I'd like to use nvitop together with wandb, but I don't know how to set it up. I'm wondering if you could provide an example for this. Thanks!

Solution

No response

Alternatives

No response

Additional context

No response

XuehaiPan commented 9 months ago

Hi @BitCalSaul, thanks for raising this. Logging metrics to wandb works much like logging to TensorBoard. You can read the example in the Resource Metric Collector section of the documentation.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb

from nvitop import CudaDevice, ResourceMetricCollector

# Build networks and prepare datasets
# (assumed to define `net`, `train_dataset`, `validation_dataset`,
#  `num_epoch`, and the `train()` / `validate()` functions)
...

# Resource status collector
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the descendant processes of the current process
                                    interval=1.0)              # snapshot interval for background daemon thread

# W&B Session
run = wandb.init()

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                algorithm_metrics = train(net, batch)

                # Collect batch level resource metrics
                resource_metrics = collector.collect()  # {'train/batch/<name>': value, ...}

                # Add a prefix if necessary
                algorithm_metrics = {
                    f'train/{key}': value for key, value in algorithm_metrics.items()
                }
                resource_metrics = {
                    f'resources/{key}': value for key, value in resource_metrics.items()
                }

                global_step += 1

                # Log metrics to W&B
                metrics = {**algorithm_metrics, **resource_metrics}
                run.log(metrics, step=global_step)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'train/<name>': value, ...}
        # Add a prefix if necessary
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}
        # W&B steps must be monotonically increasing, so reuse global_step
        run.log(resource_metrics, step=global_step)

    with collector(tag='validate'):
        algorithm_metrics = validate(net, validation_dataset)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'validate/<name>': value, ...}

        # Add a prefix if necessary
        algorithm_metrics = {f'validate/{key}': value for key, value in algorithm_metrics.items()}
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}

        # Log metrics to W&B (reuse global_step; W&B steps must increase monotonically)
        metrics = {**algorithm_metrics, **resource_metrics}
        run.log(metrics, step=global_step)

You can also run the collector in a background daemon thread. See the README for more details.
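For long runs it can be more convenient to let the collector push metrics on its own schedule instead of calling `collector.collect()` inside the training loop. Below is a minimal sketch of the callback pattern used by `ResourceMetricCollector.daemonize()` from the README. The sample metric key is illustrative, and the actual nvitop/wandb calls are shown in comments so the snippet runs without a GPU or a W&B account:

```python
logged = []  # stand-in for run.log() so the sketch is self-contained


def on_collect(metrics: dict) -> bool:
    """Called from the daemon thread with each fresh metrics snapshot."""
    # Group the collector's keys under a common 'resources/' prefix.
    prefixed = {f'resources/{key}': value for key, value in metrics.items()}
    logged.append(prefixed)  # real code: run.log(prefixed)
    return True              # return False to stop the daemon


# Real usage (requires an NVIDIA GPU and `wandb.init()`):
#   from nvitop import CudaDevice, ResourceMetricCollector
#   collector = ResourceMetricCollector(devices=CudaDevice.all())
#   collector.daemonize(on_collect, interval=5.0)

# Simulated snapshot, mimicking the collector's '<tag>/<name>' key layout:
on_collect({'train/gpu:0 (gpu:0)/gpu_utilization (%)/mean': 87.5})
print(logged[0])
# → {'resources/train/gpu:0 (gpu:0)/gpu_utilization (%)/mean': 87.5}
```

Because `on_collect` runs in the daemon thread, it should be quick and non-blocking; returning `False` is the clean way to shut the daemon down when training finishes.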

BitCalSaul commented 9 months ago

Thanks for the example and your hard work :)