Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.15k stars 3.37k forks

Trainer does not work when accelerator="mps" (but works fine if using accelerator="cpu" or "gpu") #18597

Open plannaAlain opened 1 year ago

plannaAlain commented 1 year ago

Bug description

We are training a DETR model with transformers, and training works on any machine with a CUDA GPU. On a Mac it only works with the "cpu" accelerator. With "mps" it raises an error (full stack trace below): ValueError: boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([], device='mps:0', size=(0, 4))

Using Lightning v2.0.9 on a MacBook Pro M2 Max 64GB

What version are you seeing the problem on?

v2.0

How to reproduce the bug

from pytorch_lightning import Trainer

trainer = Trainer(gradient_clip_val=0.1, max_epochs=300, callbacks=[early_stop_callback], accelerator="mps")
trainer.fit(model)

Error messages and logs

Traceback (most recent call last):
  File "/Users/user/dev/project/company/server/ai/od_detr/training/train.py", line 235, in <module>
    trainer.fit(model)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 393, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/Users/user/dev/project/company/server/ai/od_detr/training/train.py", line 131, in validation_step
    loss, loss_dict = self.common_step(batch, batch_idx)
  File "/Users/user/dev/project/company/server/ai/od_detr/training/train.py", line 113, in common_step
    outputs = self.model(pixel_values=pixel_values, pixel_mask=pixel_mask, labels=labels)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/transformers/models/detr/modeling_detr.py", line 1625, in forward
    loss_dict = criterion(outputs_loss, labels)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/transformers/models/detr/modeling_detr.py", line 2238, in forward
    losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/transformers/models/detr/modeling_detr.py", line 2208, in get_loss
    return loss_map[loss](outputs, targets, indices, num_boxes)
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/transformers/models/detr/modeling_detr.py", line 2149, in loss_boxes
    generalized_box_iou(center_to_corners_format(source_boxes), center_to_corners_format(target_boxes))
  File "/Users/user/dev/miniconda3/envs/pytorch2/lib/python3.10/site-packages/transformers/models/detr/modeling_detr.py", line 2410, in generalized_box_iou
    raise ValueError(f"boxes1 must be in [x0, y0, x1, y1] (corner) format, but got {boxes1}")
ValueError: boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([], device='mps:0', size=(0, 4))

Environment

Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @justusschock

awaelchli commented 1 year ago

Hey @plannaAlain, this doesn't look like a Lightning issue to me. Could you report it to PyTorch? It looks like an issue with indexing the tensor.
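
For a PyTorch report, a Lightning-free reproduction would probably help. The sketch below is only a starting point and rests on assumptions: the traceback shows the ValueError raised for an empty boxes tensor of size (0, 4), so the sketch guesses that the check behind the error message compares the corner coordinates and reduces with `.all()`, and that this reduction may behave differently on MPS than on CPU for empty input. The helper names `corner_check_passes` and `compare_devices` are hypothetical and are not part of transformers or PyTorch.

```python
import torch


def corner_check_passes(boxes: torch.Tensor) -> bool:
    """Return True if every box satisfies x1 >= x0 and y1 >= y0.

    Assumption: this approximates the check named in the error message.
    It is not copied from the transformers source.
    """
    return bool((boxes[:, 2:] >= boxes[:, :2]).all())


def compare_devices() -> dict:
    """Run the check on an empty (0, 4) tensor on CPU and, if available, on MPS."""
    results = {"cpu": corner_check_passes(torch.zeros((0, 4)))}
    if torch.backends.mps.is_available():
        results["mps"] = corner_check_passes(torch.zeros((0, 4), device="mps"))
    return results


if __name__ == "__main__":
    for device, ok in compare_devices().items():
        print(f"{device}: check passes on empty boxes = {ok}")
```

On CPU, a reduction with `.all()` over an empty tensor is vacuously true, so the check should pass there. If the "mps" entry differs from the "cpu" entry on the affected machine, that difference, together with the PyTorch version, would be a compact thing to include in the report. If both entries agree, the cause is likely elsewhere, for example in how the empty boxes tensor is produced.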