Lightning-AI / torchmetrics

Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

Metric not moved to device, invalidating CPU-GPU offloading when combined with DeepSpeed #2473

Open qingquansong opened 7 months ago

qingquansong commented 7 months ago

🐛 Bug

torchmetrics version: 1.3.1

1) Similar issue to #531. When running the following code:

from lightning import LightningModule
from torch.nn import ModuleDict
from torchmetrics import AUROC, Accuracy, ClasswiseWrapper, MetricCollection


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.metrics = ModuleDict(
            {
                "train_metric": MetricCollection(
                    {
                        "train_accuracy_micro": Accuracy(
                            task="multiclass", num_classes=3, average="micro"
                        )
                    }
                ),
                "val_metric": MetricCollection(
                    {
                        "val_accuracy_micro": Accuracy(
                            task="multiclass", num_classes=3, average="micro"
                        ),
                        "val_auroc": ClasswiseWrapper(
                            AUROC(task="multiclass", num_classes=3, average=None),
                            labels=["1", "2", "3"],
                        ),
                    }
                ),
            }
        )

    def forward(self, input):
        print(f"self.device: {self.device}")
        print(f"metric device: {self.metrics['train_metric']['train_accuracy_micro'].device}")

I got:

self.device: cuda:0
metric device: cpu

2) When running with the DeepSpeed strategy, it gives me: Invalidate trace cache @ step 327: expected module 365, but got module 365, which also seems to slow down DeepSpeed evaluation. (Tried with both one and multiple GPUs using the config below; both produce the same warning.)

The DeepSpeed config is:

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "zero_hpz_partition_size": 4,
        "zero_quantized_gradients": true,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 16,
    "wall_clock_breakdown": false
}
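To sanity-check that a config like the one above is valid JSON before handing it to `DeepSpeedStrategy`, it can be parsed with the standard library (a minimal sketch using a trimmed subset of the config; the full file from the issue works the same way):

```python
import json

# Trimmed subset of the ZeRO-3 + CPU-offload config from the issue,
# parsed to verify it is well-formed JSON before training starts.
config = json.loads("""
{
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": true},
        "offload_optimizer": {"device": "cpu", "pin_memory": true}
    },
    "train_micro_batch_size_per_gpu": 16
}
""")
print(config["zero_optimization"]["stage"])  # → 3
```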

Trainer created via:

trainer = L.Trainer(
            # Hardware Setup
            # --------------------------------
            devices=self.num_gpus_per_node,
            num_nodes=self.num_nodes,
            accelerator="gpu",
            # Training Configuration
            # --------------------------------
            strategy=DeepSpeedStrategy(config=self.args.deepspeed),  # the path to the json config above
        )

trainer.model is the model containing the metrics above

Expected behavior

1) The metric is expected to be on cuda:0.
2) No warning such as: Invalidate trace cache @ step 327: expected module 365, but got module 365

Environment

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

github-actions[bot] commented 7 months ago

Hi! Thanks for your contribution, great first issue!