Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support AMD GPUs with the MPS backend #15861

Open · dbl001 opened this issue 2 years ago

dbl001 commented 2 years ago

πŸš€ Feature

MPS support on macOS Ventura with an AMD Radeon Pro 5700 XT GPU

Motivation

MisconfigurationException: MPSAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into accelerator argument of Trainer: ['cpu']. [...]

Pitch

trainer = Trainer(accelerator="mps", devices=1)

Alternatives

Additional context



cc @akihironitta @justusschock

awaelchli commented 1 year ago

Hey @dbl001! How have you installed PyTorch and Lightning? And which version?

Note that your Python interpreter must run natively and not through Rosetta, otherwise it won't detect the M1 hardware correctly. If you are using conda for example, you can double check this by running

conda info

and your output should say something like

platform : osx-arm64

If it shows intel x86, then re-install the correct conda version with M1 support.
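As a quick cross-check (a minimal sketch using only the standard library, not Lightning-specific), the interpreter itself can report the architecture it runs under:

import platform

print(platform.machine())    # "arm64" for a native Apple Silicon interpreter, "x86_64" under Rosetta
print(platform.processor())  # "arm" on native Apple Silicon builds, "i386" under Rosetta/Intel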

Borda commented 1 year ago

cc: @justusschock :otter:

dbl001 commented 1 year ago

I am running on a 27" Intel iMac with an AMD GPU (not M1). Will Lightning support this configuration?

awaelchli commented 1 year ago

Since you don't have an M1, accelerator="mps" is not correct. If you want to use the AMD GPU, you need to install PyTorch with ROCm support. Select it in the installation matrix (fifth row): https://pytorch.org/

While I can't test it myself (I don't have an AMD GPU), the expectation is that torch will detect it. The CUDA semantics in torch are the same for AMD GPUs, meaning torch.cuda.device_count() will return 1 for you.

So once you have pytorch installed with ROCm, you should be able to use

Trainer(accelerator="gpu", devices=1)

Again, can't verify but this is the expected case based on torch's documentation.
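For reference, a minimal check of the ROCm build (a sketch, assuming torch was installed from a ROCm wheel) would be:

import torch

print(torch.version.hip)          # a HIP version string on ROCm builds, None on CUDA/CPU builds
print(torch.cuda.is_available())  # ROCm devices are exposed through the torch.cuda API
print(torch.cuda.device_count())  # expected to report 1 for a single AMD GPU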

dbl001 commented 1 year ago

ROCm only runs on Linux (if I'm not mistaken). I'm running macOS Ventura 13.0.1. MPS is currently working in the PyTorch 1.14 nightly as well as in tensorflow-macos/tensorflow-metal.

I was interested in whether you will support the MPS (i.e. Metal) interface.

Thanks in advance.


awaelchli commented 1 year ago

@dbl001 I understand now what you mean. I couldn't find any official reference from PyTorch regarding MPS support on AMD hardware (https://pytorch.org/docs/stable/notes/mps.html), but there are some users reporting that it works.

If you @dbl001 or someone from the community has the hardware setup to test this, please feel free to send a PR with the necessary changes to Lightning to enable this. The main change probably needs to be in the availability check here: https://github.com/Lightning-AI/lightning/blob/32cf1faa07bf9b6d774cb724d4e35328bbf48b57/src/lightning_lite/accelerators/mps.py#L61-L66

dbl001 commented 1 year ago

https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/

https://developer.apple.com/metal/pytorch/

Here's one of the PyTorch MPS threads on GitHub:

General MPS op coverage tracking issue #77764

https://github.com/pytorch/pytorch/issues/77764

dbl001 commented 1 year ago

I have the hardware/software required to test this. PySR uses pytorch_lightning. Here's what happened when I tried to run a model:

[Screenshot 2022-12-07 at 6 30 03 AM]

awaelchli commented 1 year ago

Does torch.backends.mps.is_available() return True on this machine?

If yes, could you try modifying the code that I posted in https://github.com/Lightning-AI/lightning/issues/15861#issuecomment-1338348087? The condition there probably needs to drop the platform.processor() in ("arm", "arm64") check. This isn't the proper fix, but at least you could then try to run the Trainer on the device (maybe).
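For reference, a rough sketch of such a relaxed check (experimental only, helper name made up for illustration; it simply drops the processor test and defers to torch's own MPS detection):

import torch

def _mps_backend_usable() -> bool:
    # is_built(): the installed torch binary ships the MPS backend
    # is_available(): macOS and the driver can actually use it on this machine
    return torch.backends.mps.is_built() and torch.backends.mps.is_available()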

dbl001 commented 1 year ago

After adjusting the code as per your recommendation,

torch.backends.mps.is_available()
True

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/davidlaxer/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/setup.py:200: UserWarning: MPS available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='mps', devices=1)`.
  rank_zero_warn(

The model is training on the CPU; I do not see the GPU being active in Activity Monitor.

[Screenshot 2022-12-08 at 7 49 21 AM]

I can try pytorch_lightning with the AMD GPU via MPS on other Lightning examples.

dbl001 commented 1 year ago

I tried the 'BERT' model ...

seed_everything(42)

dm = GLUEDataModule(model_name_or_path="albert-base-v2", task_name="cola")
dm.setup("fit")
model = GLUETransformer(
    model_name_or_path="albert-base-v2",
    num_labels=dm.num_labels,
    eval_splits=dm.eval_splits,
    task_name=dm.task_name,
)

trainer = Trainer(
    max_epochs=1,
    accelerator="auto",
    devices=1 if torch.backends.mps.is_available() else None,  # limiting got iPython runs
)
trainer.fit(model, datamodule=dm)
Global seed set to 42
Found cached dataset glue (/Users/davidlaxer/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 420.47it/s]
Loading cached processed dataset at /Users/davidlaxer/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-20965da3ce0503bd.arrow
Loading cached processed dataset at /Users/davidlaxer/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-7d22f08182fc38e7.arrow
  0%|                                                      | 0/2 [00:00<?, ?ba/s]/Users/davidlaxer/anaconda3/envs/pysr/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2304: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  warnings.warn(
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 38.85ba/s]
Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.bias', 'predictions.dense.bias', 'predictions.LayerNorm.weight', 'predictions.decoder.weight', 'predictions.LayerNorm.bias', 'predictions.decoder.bias', 'predictions.dense.weight']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
MisconfigurationException                 Traceback (most recent call last)
Cell In [8], line 12
      4 dm.setup("fit")
      5 model = GLUETransformer(
      6     model_name_or_path="albert-base-v2",
      7     num_labels=dm.num_labels,
      8     eval_splits=dm.eval_splits,
      9     task_name=dm.task_name,
     10 )
---> 12 trainer = Trainer(
     13     max_epochs=1,
     14     accelerator="auto",
     15     devices=1 if torch.backends.mps.is_available() else None,  # limiting got iPython runs
     16 )
     17 trainer.fit(model, datamodule=dm)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py:340, in _defaults_from_env_vars.<locals>.insert_env_defaults(self, *args, **kwargs)
    337 kwargs = dict(list(env_variables.items()) + list(kwargs.items()))
    339 # all args were already moved to kwargs
--> 340 return fn(self, **kwargs)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:408, in Trainer.__init__(self, logger, enable_checkpointing, callbacks, default_root_dir, gradient_clip_val, gradient_clip_algorithm, num_nodes, num_processes, devices, gpus, auto_select_gpus, tpu_cores, ipus, enable_progress_bar, overfit_batches, track_grad_norm, check_val_every_n_epoch, fast_dev_run, accumulate_grad_batches, max_epochs, min_epochs, max_steps, min_steps, max_time, limit_train_batches, limit_val_batches, limit_test_batches, limit_predict_batches, val_check_interval, log_every_n_steps, accelerator, strategy, sync_batchnorm, precision, enable_model_summary, num_sanity_val_steps, resume_from_checkpoint, profiler, benchmark, deterministic, reload_dataloaders_every_n_epochs, auto_lr_find, replace_sampler_ddp, detect_anomaly, auto_scale_batch_size, plugins, amp_backend, amp_level, move_metrics_to_cpu, multiple_trainloader_mode, inference_mode)
    405 # init connectors
    406 self._data_connector = DataConnector(self, multiple_trainloader_mode)
--> 408 self._accelerator_connector = AcceleratorConnector(
    409     num_processes=num_processes,
    410     devices=devices,
    411     tpu_cores=tpu_cores,
    412     ipus=ipus,
    413     accelerator=accelerator,
    414     strategy=strategy,
    415     gpus=gpus,
    416     num_nodes=num_nodes,
    417     sync_batchnorm=sync_batchnorm,
    418     benchmark=benchmark,
    419     replace_sampler_ddp=replace_sampler_ddp,
    420     deterministic=deterministic,
    421     auto_select_gpus=auto_select_gpus,
    422     precision=precision,
    423     amp_type=amp_backend,
    424     amp_level=amp_level,
    425     plugins=plugins,
    426 )
    427 self._logger_connector = LoggerConnector(self)
    428 self._callback_connector = CallbackConnector(self)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:213, in AcceleratorConnector.__init__(self, devices, num_nodes, accelerator, strategy, plugins, precision, amp_type, amp_level, sync_batchnorm, benchmark, replace_sampler_ddp, deterministic, auto_select_gpus, num_processes, tpu_cores, ipus, gpus)
    210 elif self._accelerator_flag == "gpu":
    211     self._accelerator_flag = self._choose_gpu_accelerator_backend()
--> 213 self._set_parallel_devices_and_init_accelerator()
    215 # 3. Instantiate ClusterEnvironment
    216 self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:547, in AcceleratorConnector._set_parallel_devices_and_init_accelerator(self)
    543 self._tpu_cores = self._devices_flag if not self._tpu_cores else self._tpu_cores
    545 self._set_devices_flag_if_auto_select_gpus_passed()
--> 547 self._devices_flag = accelerator_cls.parse_devices(self._devices_flag)
    548 if not self._parallel_devices:
    549     self._parallel_devices = accelerator_cls.get_parallel_devices(self._devices_flag)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/accelerators/mps.py:48, in MPSAccelerator.parse_devices(devices)
     45 @staticmethod
     46 def parse_devices(devices: Union[int, str, List[int]]) -> Optional[List[int]]:
     47     """Accelerator device parsing logic."""
---> 48     parsed_devices = _parse_gpu_ids(devices, include_mps=True)
     49     return parsed_devices

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/lightning_lite/utilities/device_parser.py:104, in _parse_gpu_ids(gpus, include_cuda, include_mps)
    101 # Check that GPUs are unique. Duplicate GPUs are not supported by the backend.
    102 _check_unique(gpus)
--> 104 return _sanitize_gpu_ids(gpus, include_cuda=include_cuda, include_mps=include_mps)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/lightning_lite/utilities/device_parser.py:136, in _sanitize_gpu_ids(gpus, include_cuda, include_mps)
    134 for gpu in gpus:
    135     if gpu not in all_available_gpus:
--> 136         raise MisconfigurationException(
    137             f"You requested gpu: {gpus}\n But your machine only has: {all_available_gpus}"
    138         )
    139 return gpus

MisconfigurationException: You requested gpu: [0]
 But your machine only has: []

This function in "lightning_lite/utilities/device_parser.py" is returning []:

    mps_gpus = accelerators.mps._get_all_available_mps_gpus() if include_mps else []

dbl001 commented 1 year ago

It appears that accelerators.mps._get_all_available_mps_gpus() is returning an empty list:

import lightning_lite.accelerators as accelerators
accelerators.mps._get_all_available_mps_gpus()
[]

When you force _get_all_available_mps_gpus to return [0], pytorch_lightning utilizes the AMD Radeon Pro 5700 XT GPU via MPS, e.g.:

def _get_all_available_mps_gpus() -> List[int]:
    """
    Returns:
        A list of all available MPS GPUs
    """
    return [0]  # forced to [0] for testing, bypassing the availability check below
    # return [0] if MPSAccelerator.is_available() else []

The Lightning BERT example runs until it gets an exception trying to convert to float64, which MPS does not support. This would also happen on M1 and M2 hardware. E.g.:

TypeError                                 Traceback (most recent call last)
Cell In [6], line 17
      5 model = GLUETransformer(
      6     model_name_or_path="albert-base-v2",
      7     num_labels=dm.num_labels,
      8     eval_splits=dm.eval_splits,
      9     task_name=dm.task_name,
     10 )
     12 trainer = Trainer(
     13     max_epochs=1,
     14     accelerator="mps",
     15     devices=1
     16 )
---> 17 trainer.fit(model, datamodule=dm)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:582, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    580     raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.__class__.__qualname__}")
    581 self.strategy._lightning_module = model
--> 582 call._call_and_handle_interrupt(
    583     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    584 )

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:38, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     36         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     37     else:
---> 38         return trainer_fn(*args, **kwargs)
     40 except _TunerExitException:
     41     trainer._call_teardown_hook()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:624, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    617 ckpt_path = ckpt_path or self.resume_from_checkpoint
    618 self._ckpt_path = self._checkpoint_connector._set_ckpt_path(
    619     self.state.fn,
    620     ckpt_path,  # type: ignore[arg-type]
    621     model_provided=True,
    622     model_connected=self.lightning_module is not None,
    623 )
--> 624 self._run(model, ckpt_path=self.ckpt_path)
    626 assert self.state.stopped
    627 self.training = False

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1061, in Trainer._run(self, model, ckpt_path)
   1057 self._checkpoint_connector.restore_training_state()
   1059 self._checkpoint_connector.resume_end()
-> 1061 results = self._run_stage()
   1063 log.detail(f"{self.__class__.__name__}: trainer tearing down")
   1064 self._teardown()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1140, in Trainer._run_stage(self)
   1138 if self.predicting:
   1139     return self._run_predict()
-> 1140 self._run_train()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1163, in Trainer._run_train(self)
   1160 self.fit_loop.trainer = self
   1162 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1163     self.fit_loop.run()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py:199, in Loop.run(self, *args, **kwargs)
    197 try:
    198     self.on_advance_start(*args, **kwargs)
--> 199     self.advance(*args, **kwargs)
    200     self.on_advance_end()
    201     self._restarting = False

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:267, in FitLoop.advance(self)
    265 self._data_fetcher.setup(dataloader, batch_to_device=batch_to_device)
    266 with self.trainer.profiler.profile("run_training_epoch"):
--> 267     self._outputs = self.epoch_loop.run(self._data_fetcher)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py:200, in Loop.run(self, *args, **kwargs)
    198     self.on_advance_start(*args, **kwargs)
    199     self.advance(*args, **kwargs)
--> 200     self.on_advance_end()
    201     self._restarting = False
    202 except StopIteration:

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:251, in TrainingEpochLoop.on_advance_end(self)
    249 if should_check_val:
    250     self.trainer.validating = True
--> 251     self._run_validation()
    252     self.trainer.training = True
    254 # update plateau LR scheduler after metrics are logged

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:310, in TrainingEpochLoop._run_validation(self)
    307 self.val_loop._reload_evaluation_dataloaders()
    309 with torch.no_grad():
--> 310     self.val_loop.run()

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py:206, in Loop.run(self, *args, **kwargs)
    203         break
    204 self._restarting = False
--> 206 output = self.on_run_end()
    207 return output

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py:180, in EvaluationLoop.on_run_end(self)
    177 self.trainer._logger_connector.epoch_end_reached()
    179 # hook
--> 180 self._evaluation_epoch_end(self._outputs)
    181 self._outputs = []  # free memory
    183 # hook

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py:288, in EvaluationLoop._evaluation_epoch_end(self, outputs)
    286 # call the model epoch end
    287 hook_name = "test_epoch_end" if self.trainer.testing else "validation_epoch_end"
--> 288 self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1305, in Trainer._call_lightning_module_hook(self, hook_name, pl_module, *args, **kwargs)
   1302 pl_module._current_fx_name = hook_name
   1304 with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{hook_name}"):
-> 1305     output = fn(*args, **kwargs)
   1307 # restore current_fx when nested context
   1308 pl_module._current_fx_name = prev_fx_name

Cell In [5], line 66, in GLUETransformer.validation_epoch_end(self, outputs)
     64 loss = torch.stack([x["loss"] for x in outputs]).mean()
     65 self.log("val_loss", loss, prog_bar=True)
---> 66 self.log_dict(self.metric.compute(predictions=preds, references=labels), prog_bar=True)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/core/module.py:511, in LightningModule.log_dict(self, dictionary, prog_bar, logger, on_step, on_epoch, reduce_fx, enable_graph, sync_dist, sync_dist_group, add_dataloader_idx, batch_size, rank_zero_only)
    477 """Log a dictionary of values at once.
    478 
    479 Example::
   (...)
    508         would produce a deadlock as not all processes would perform this log call.
    509 """
    510 for k, v in dictionary.items():
--> 511     self.log(
    512         name=k,
    513         value=v,
    514         prog_bar=prog_bar,
    515         logger=logger,
    516         on_step=on_step,
    517         on_epoch=on_epoch,
    518         reduce_fx=reduce_fx,
    519         enable_graph=enable_graph,
    520         sync_dist=sync_dist,
    521         sync_dist_group=sync_dist_group,
    522         add_dataloader_idx=add_dataloader_idx,
    523         batch_size=batch_size,
    524         rank_zero_only=rank_zero_only,
    525     )

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/core/module.py:405, in LightningModule.log(self, name, value, prog_bar, logger, on_step, on_epoch, reduce_fx, enable_graph, sync_dist, sync_dist_group, add_dataloader_idx, batch_size, metric_attribute, rank_zero_only)
    399 if "/dataloader_idx_" in name:
    400     raise MisconfigurationException(
    401         f"You called `self.log` with the key `{name}`"
    402         " but it should not contain information about `dataloader_idx`"
    403     )
--> 405 value = apply_to_collection(value, (torch.Tensor, numbers.Number), self.__to_tensor, name)
    407 if self.trainer._logger_connector.should_reset_tensors(self._current_fx_name):
    408     # if we started a new epoch (running its first batch) the hook name has changed
    409     # reset any tensors for the new hook name
    410     results.reset(metrics=False, fx=self._current_fx_name)

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py:47, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, *args, **kwargs)
     45 # Breaking condition
     46 if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 47     return function(data, *args, **kwargs)
     49 elem_type = type(data)
     51 # Recursively apply to collection items

File ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/core/module.py:541, in LightningModule.__to_tensor(self, value, name)
    537 def __to_tensor(self, value: Union[torch.Tensor, numbers.Number], name: str) -> Tensor:
    538     value = (
    539         value.clone().detach().to(self.device)
    540         if isinstance(value, torch.Tensor)
--> 541         else torch.tensor(value, device=self.device)
    542     )
    543     if not torch.numel(value) == 1:
    544         raise ValueError(
    545             f"`self.log({name}, {value})` was called, but the tensor must have a single element."
    546             f" You can try doing `self.log({name}, {value}.mean())`"
    547         )

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
awaelchli commented 1 year ago

@dbl001 Thanks for investigating! So in summary, these are the changes we need:

  1. We need to revise the availability check: https://github.com/Lightning-AI/lightning/blob/32cf1faa07bf9b6d774cb724d4e35328bbf48b57/src/lightning_lite/accelerators/mps.py#L61-L66, where platform.processor() in ("arm", "arm64") is not general enough (a rough sketch follows below).

  2. We need to update _get_all_available_mps_gpus to parse the (ROCm) "cuda" devices.

Still open for investigation is whether it would be possible to also use multiple GPUs.
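For illustration, a rough and untested sketch of how the relaxed availability check and the device listing could fit together (hypothetical helper names, and it does not attempt the ROCm/"cuda" parsing from item 2):

from typing import List

import torch


def _mps_available() -> bool:
    # Defer to torch's own MPS detection instead of platform.processor().
    return torch.backends.mps.is_built() and torch.backends.mps.is_available()


def _get_all_available_mps_gpus() -> List[int]:
    # MPS exposes a single logical device, so the list is either [0] or [].
    return [0] if _mps_available() else []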

dbl001 commented 1 year ago

Also, MPS does not support torch.float64 tensors, so I had to change this line in module.py ($ vi +541 ~/anaconda3/envs/pysr/lib/python3.9/site-packages/pytorch_lightning/core/module.py):

 else torch.tensor(value, device=self.device, dtype=torch.float32)
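A less invasive, user-side alternative (a hypothetical workaround rather than a patch to module.py) is to cast the metric values to float32 before logging them in validation_epoch_end:

import torch

def _to_float32(metrics: dict) -> dict:
    # Cast every (possibly float64/numpy) metric value to a float32 tensor
    # so nothing has to be converted to float64 on the MPS device.
    return {k: torch.tensor(float(v), dtype=torch.float32) for k, v in metrics.items()}

# inside validation_epoch_end (names as in the GLUE example above):
# self.log_dict(_to_float32(self.metric.compute(predictions=preds, references=labels)), prog_bar=True)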
dbl001 commented 1 year ago

And ...

Training: 0it [00:00, ?it/s]
Validation: 0it [00:00, ?it/s]
Validation: 0it [00:00, ?it/s]

Previously reported? https://github.com/Lightning-AI/lightning/issues/5039

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!