Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

MPSAccelerator not being detected on M1 macbook #16124

Closed Pouyaexe closed 9 months ago

Pouyaexe commented 1 year ago

Bug description

The Lightning AI library is not detecting the MPSAccelerator on my machine when using a Jupyter notebook in VS code.

How to reproduce the bug:

  1. Set up a Jupyter notebook in VS code
  2. Try to run a Lightning AI script that utilizes the GPU in a Jupyter notebook.


import lightning as L

trainer = L.Trainer(accelerator="mps", devices=1, max_epochs=2)

Error messages and logs

MisconfigurationException                 Traceback (most recent call last)
Cell In [1], line 3
      1 import lightning as L
----> 3 trainer = L.Trainer(accelerator="mps", devices=1, max_epochs=2)

File ~/miniconda3/envs/lightning/lib/python3.9/site-packages/lightning/pytorch/utilities/argparse.py:340, in _defaults_from_env_vars.<locals>.insert_env_defaults(self, *args, **kwargs)
    337 kwargs = dict(list(env_variables.items()) + list(kwargs.items()))
    339 # all args were already moved to kwargs
--> 340 return fn(self, **kwargs)

File ~/miniconda3/envs/lightning/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py:408, in Trainer.__init__(self, logger, enable_checkpointing, callbacks, default_root_dir, gradient_clip_val, gradient_clip_algorithm, num_nodes, num_processes, devices, gpus, auto_select_gpus, tpu_cores, ipus, enable_progress_bar, overfit_batches, track_grad_norm, check_val_every_n_epoch, fast_dev_run, accumulate_grad_batches, max_epochs, min_epochs, max_steps, min_steps, max_time, limit_train_batches, limit_val_batches, limit_test_batches, limit_predict_batches, val_check_interval, log_every_n_steps, accelerator, strategy, sync_batchnorm, precision, enable_model_summary, num_sanity_val_steps, resume_from_checkpoint, profiler, benchmark, deterministic, reload_dataloaders_every_n_epochs, auto_lr_find, replace_sampler_ddp, detect_anomaly, auto_scale_batch_size, plugins, amp_backend, amp_level, move_metrics_to_cpu, multiple_trainloader_mode, inference_mode)
    405 # init connectors
    406 self._data_connector = DataConnector(self, multiple_trainloader_mode)
--> 408 self._accelerator_connector = AcceleratorConnector(
    409     num_processes=num_processes,
    410     devices=devices,
    411     tpu_cores=tpu_cores,
    412     ipus=ipus,
    413     accelerator=accelerator,
    414     strategy=strategy,
    415     gpus=gpus,
    416     num_nodes=num_nodes,
    417     sync_batchnorm=sync_batchnorm,
    418     benchmark=benchmark,
...
    538     )
    540 self._set_devices_flag_if_auto_passed()
    542 self._gpus = self._devices_flag if not self._gpus else self._gpus

MisconfigurationException: `MPSAccelerator` can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into `accelerator` argument of `Trainer`: ['cpu'].

Environment

More info

I have no problem using PyTorch directly and utilizing the GPU; it works just fine:

import torch
torch.backends.mps.is_available()

returns

True

Also

import platform
# should print "arm"
print(platform.processor())

returns

arm

and this shell command

conda config --show subdir

returns

subdir: osx-arm64

Also, running the script from the "How to reproduce the bug" section in a regular Jupyter notebook (in the browser) works just fine.

(Screenshot attached: web capture, 20-12-2022, localhost)

cc @justusschock

awaelchli commented 1 year ago

@Pouyaexe Can you paste the output of conda info here? It is strange that you have pytorch-lightning: 0.8.5 in your environment. This version is too old, could you uninstall it?

When you instantiate the Trainer, you should hit this line of code before the error is shown. Could you check the values of these conditions please?

https://github.com/Lightning-AI/lightning/blob/14f441c393583e25c6e711b0320159d2dc40907c/src/lightning_lite/accelerators/mps.py#L64-L66
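To make the comparison easier, here is a small diagnostic sketch that mirrors the three conditions linked above (the ">= 1.12" guard is an assumption based on that source; torch is imported lazily so the script still reports platform info when it isn't installed):

```python
import platform


def mps_checks():
    """Collect the conditions the linked MPS availability check relies on,
    so they can be compared between environments (e.g. terminal vs.
    VS Code notebook kernel)."""
    results = {
        "processor": platform.processor(),
        "processor_is_arm": platform.processor() in ("arm", "arm64"),
    }
    try:
        import torch

        # Assumption from the linked source: MPS support needs torch >= 1.12
        major, minor = (int(p) for p in torch.__version__.split(".")[:2])
        results["torch_ge_1_12"] = (major, minor) >= (1, 12)
        has_mps = hasattr(torch.backends, "mps")
        results["mps_available"] = has_mps and torch.backends.mps.is_available()
    except ImportError:
        results["torch_ge_1_12"] = None  # torch not installed
        results["mps_available"] = None
    return results


if __name__ == "__main__":
    for name, value in mps_checks().items():
        print(f"{name}: {value}")
```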

Pouyaexe commented 1 year ago

I made some changes: uninstalling the old lightning package broke my entire environment, so I created a fresh one with CONDA_SUBDIR=osx-arm64 conda create -n larm. Running the collect_env_details.py script inside a Jupyter notebook cell in VS Code reports the following packages in the new environment:

* CUDA:
    - GPU:               None
    - available:         False
    - version:           None
* Lightning:
    - lightning:         1.8.5.post0
    - lightning-cloud:   0.5.13
    - lightning-utilities: 0.4.2
    - torch:             1.13.1
    - torchmetrics:      0.11.0
* Packages:
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - anyio:             3.6.2
    - appnope:           0.1.2
    - arrow:             1.2.3
    - asttokens:         2.0.5
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - backcall:          0.2.0
    - beautifulsoup4:    4.11.1
    - blessed:           1.19.1
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - click:             8.1.3
    - commonmark:        0.9.1
    - croniter:          1.3.8
    - debugpy:           1.5.1
    - decorator:         5.1.1
    - deepdiff:          6.2.2
    - dnspython:         2.2.1
    - email-validator:   1.3.0
    - entrypoints:       0.4
    - executing:         0.8.3
    - fastapi:           0.88.0
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - h11:               0.14.0
    - httpcore:          0.16.3
    - httptools:         0.5.0
    - httpx:             0.23.1
    - idna:              3.4
    - inquirer:          3.1.1
    - ipykernel:         6.15.2
    - ipython:           8.7.0
    - itsdangerous:      2.1.2
    - jedi:              0.18.1
    - jinja2:            3.1.2
    - jupyter-client:    7.4.7
    - jupyter-core:      4.11.2
    - lightning:         1.8.5.post0
    - lightning-cloud:   0.5.13
    - lightning-utilities: 0.4.2
    - markupsafe:        2.1.1
    - matplotlib-inline: 0.1.6
    - multidict:         6.0.3
    - nest-asyncio:      1.5.5
    - numpy:             1.24.0
    - ordered-set:       4.1.0
    - orjson:            3.8.3
    - packaging:         21.3
    - parso:             0.8.3
    - pexpect:           4.8.0
    - pickleshare:       0.7.5
    - pip:               22.3.1
    - prompt-toolkit:    3.0.20
    - protobuf:          3.20.1
    - psutil:            5.9.0
    - ptyprocess:        0.7.0
    - pure-eval:         0.2.2
    - pydantic:          1.10.2
    - pygments:          2.11.2
    - pyjwt:             2.6.0
    - pyparsing:         3.0.9
    - python-dateutil:   2.8.2
    - python-dotenv:     0.21.0
    - python-editor:     1.0.4
    - python-multipart:  0.0.5
    - pyyaml:            6.0
    - pyzmq:             23.2.0
    - readchar:          4.0.3
    - requests:          2.28.1
    - rfc3986:           1.5.0
    - rich:              12.6.0
    - setuptools:        65.5.0
    - six:               1.16.0
    - sniffio:           1.3.0
    - soupsieve:         2.3.2.post1
    - stack-data:        0.2.0
    - starlette:         0.22.0
    - starsessions:      1.3.0
    - tensorboardx:      2.5.1
    - torch:             1.13.1
    - torchmetrics:      0.11.0
    - tornado:           6.2
    - tqdm:              4.64.1
    - traitlets:         5.7.1
    - typing-extensions: 4.4.0
    - ujson:             5.6.0
    - urllib3:           1.26.13
    - uvicorn:           0.20.0
    - uvloop:            0.17.0
    - watchfiles:        0.18.1
    - wcwidth:           0.2.5
    - websocket-client:  1.4.2
    - websockets:        10.4
    - wheel:             0.37.1
    - yarl:              1.8.2
* System:
    - OS:                Darwin
    - architecture:
        - 64bit
        - 
    - processor:         i386
    - python:            3.10.8
    - version:           Darwin Kernel Version 22.1.0: Sun Oct  9 20:14:30 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T8103

Running the collect_env_details.py script in the same environment, as a .py file:

* CUDA:
        - GPU:               None
        - available:         False
        - version:           None
* Lightning:
        - lightning:         1.8.5.post0
        - lightning-cloud:   0.5.13
        - lightning-utilities: 0.4.2
        - torch:             1.13.1
        - torchmetrics:      0.11.0
* Packages:
        - Same as before
* System:
        - OS:                Darwin
        - architecture:
                - 64bit
                - 
        - processor:         arm <-- this one is different 
        - python:            3.10.8
        - version:           Darwin Kernel Version 22.1.0: Sun Oct  9 20:14:30 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T8103

Also, the output of conda info:

     active environment : larm
    active env location : /Users/pouya/miniconda3/envs/larm
            shell level : 3
       user config file : /Users/pouya/.condarc
 populated config files : 
          conda version : 22.11.1
    conda-build version : not installed
         python version : 3.9.12.final.0
       virtual packages : __archspec=1=arm64
                          __osx=13.0=0
                          __unix=0=0
       base environment : /Users/pouya/miniconda3  (writable)
      conda av data dir : /Users/pouya/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/pouya/miniconda3/pkgs
                          /Users/pouya/.conda/pkgs
       envs directories : /Users/pouya/miniconda3/envs
                          /Users/pouya/.conda/envs
               platform : osx-arm64
             user-agent : conda/22.11.1 requests/2.28.1 CPython/3.9.12 Darwin/22.1.0 OSX/13.0
                UID:GID : 501:20
             netrc file : None
           offline mode : False

Checking the values you asked about: when I call the Trainer from a .py file, they are all true:

- _TORCH_GREATER_EQUAL_1_12 == True
- torch.backends.mps.is_available() == True
- platform.processor() in ("arm", "arm64") == True
- platform.processor() == arm

But from a Jupyter notebook cell in VS Code:

- _TORCH_GREATER_EQUAL_1_12 == True
- torch.backends.mps.is_available() == True
- platform.processor() in ("arm", "arm64") == False
- platform.processor() == i386

It seems to me that the issue lies with the Jupyter notebook extension inside VS Code: the problem only occurs there, and the extension appears unable to correctly report platform.processor(). This suggests the problem may be on their end.
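A quick way to confirm the two contexts really differ is to print the interpreter path and reported architecture in both places (a sketch; run it once in the VS Code notebook cell and once from the terminal, then compare the output):

```python
import platform
import sys

# Print which interpreter is running and what architecture it reports.
# Differing outputs between the VS Code notebook kernel and a terminal
# session would confirm the two use different environments.
print("executable:", sys.executable)
print("processor: ", platform.processor())
print("machine:   ", platform.machine())
```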

awaelchli commented 1 year ago

Yes exactly. If Python returns platform.processor() == i386, that means Python is running under Rosetta emulation (the process presents itself as running on an Intel processor, while its x86 instructions are translated to ARM).

As long as you stay inside that conda environment (it correctly reports platform : osx-arm64), you should be fine. Your VS Code Jupyter extension must be using a different environment. The interpreter can probably be selected somewhere in VS Code; I'm not familiar with it.
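For the record, macOS can be asked directly whether the current process is being translated. This is a sketch using Apple's documented sysctl.proc_translated flag (the function name is mine; on non-macOS systems it simply reports that the check doesn't apply):

```python
import ctypes
import ctypes.util
import platform


def rosetta_status():
    """Report whether the current process is translated by Rosetta 2,
    using Apple's documented sysctl.proc_translated flag."""
    if platform.system() != "Darwin":
        return "not applicable (not macOS)"
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    val = ctypes.c_int(0)
    size = ctypes.c_size_t(ctypes.sizeof(val))
    # 1 under Rosetta, 0 when native; a nonzero return means the flag
    # is unknown (e.g. Intel Macs without Rosetta installed)
    ret = libc.sysctlbyname(
        b"sysctl.proc_translated", ctypes.byref(val), ctypes.byref(size), None, 0
    )
    if ret != 0:
        return "native (sysctl.proc_translated not present)"
    return "translated by Rosetta" if val.value == 1 else "native"


if __name__ == "__main__":
    print(rosetta_status())
```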

Pouyaexe commented 1 year ago

Thanks. I'll try reinstalling VS Code!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

ringohoffman commented 11 months ago

Related: https://github.com/microsoft/vscode-python/issues/22614?