Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Why can't I run with official examples? #18076

Closed InfernalAzazel closed 1 year ago

InfernalAzazel commented 1 year ago

Bug description

I found that installing the packages with Poetry can cause this error.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

Install the dependencies:

poetry add torchvision
poetry add lightning

code:

import os, torch, torch.nn as nn, torch.utils.data as data, torchvision as tv
import lightning as L

encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor())
trainer = L.Trainer()
trainer.fit(LitAutoEncoder(encoder, decoder), data.DataLoader(dataset, batch_size=64))

Error messages and logs

/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/bin/python /work/code/github/lightning_demo_01/main.py 
Traceback (most recent call last):
  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 168, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/V01/extittivns03/.pyenv/versions/3.10.10/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/work/code/github/lightning_demo_01/main.py", line 1, in <module>
    import os, torch, torch.nn as nn, torch.utils.data as data, torchvision as tv
  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 228, in <module>
    _load_global_deps()
  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 189, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 154, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
ValueError: libcublas.so.*[0-9] not found in the system path ['/work/code/github/lightning_demo_01', '/work/code/github/lightning_demo_01', '/home/V01/extittivns03/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/231.9161.41/plugins/python/helpers/pycharm_display', '/home/V01/extittivns03/.pyenv/versions/3.10.10/lib/python310.zip', '/home/V01/extittivns03/.pyenv/versions/3.10.10/lib/python3.10', '/home/V01/extittivns03/.pyenv/versions/3.10.10/lib/python3.10/lib-dynload', '/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages', '/home/V01/extittivns03/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/231.9161.41/plugins/python/helpers/pycharm_matplotlib_backend']

Process finished with exit code 1
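
The `ValueError` above reports that no file matching `libcublas.so.*[0-9]` was found under any entry of `sys.path`, and the first `OSError` names `libcurand.so.10`. The following is a minimal sketch for checking this by hand. It assumes the pip-installed `nvidia-*-cu11` wheels place their shared libraries under `nvidia/<folder>/lib` inside `site-packages`; that layout, the folder names `cublas` and `curand`, and the helper name `find_nvidia_libs` are assumptions for illustration, not torch's actual loader.

```python
import glob
import os
import sys


def find_nvidia_libs(search_paths, lib_folder, pattern):
    """Return files matching <path>/nvidia/<lib_folder>/lib/<pattern> for each search path.

    Hypothetical helper: the nvidia/<lib_folder>/lib layout is an assumption.
    """
    matches = []
    for path in search_paths:
        candidate_dir = os.path.join(path, "nvidia", lib_folder, "lib")
        matches.extend(sorted(glob.glob(os.path.join(candidate_dir, pattern))))
    return matches


if __name__ == "__main__":
    # Patterns follow the names that appear in the traceback.
    for folder, pattern in [("cublas", "libcublas.so.*[0-9]"), ("curand", "libcurand.so.*[0-9]")]:
        found = find_nvidia_libs(sys.path, folder, pattern)
        print(folder, "->", found or "not found")
```

If this prints "not found" when run with the Poetry virtualenv's interpreter, the CUDA libraries are probably missing from that environment even if the package list says otherwise.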

Environment

Current environment

* CUDA:
    - GPU: None
    - available: False
    - version: 11.7
* Lightning:
    - lightning: 2.0.5
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.0.5
    - torch: 2.0.1
    - torchmetrics: 1.0.0
    - torchvision: 0.15.2
* Packages:
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - anyio: 3.7.1
    - arrow: 1.2.3
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - backoff: 2.2.1
    - beautifulsoup4: 4.12.2
    - blessed: 1.20.0
    - certifi: 2023.5.7
    - charset-normalizer: 3.2.0
    - click: 8.1.4
    - cmake: 3.26.4
    - croniter: 1.4.1
    - dateutils: 0.6.12
    - deepdiff: 6.3.1
    - exceptiongroup: 1.1.2
    - fastapi: 0.100.0
    - filelock: 3.12.2
    - frozenlist: 1.4.0
    - fsspec: 2023.6.0
    - h11: 0.14.0
    - idna: 3.4
    - inquirer: 3.1.3
    - itsdangerous: 2.1.2
    - jinja2: 3.1.2
    - lightning: 2.0.5
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.9.0
    - lit: 16.0.6
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - networkx: 3.1
    - numpy: 1.25.1
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-cupti-cu11: 11.7.101
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - nvidia-cufft-cu11: 10.9.0.58
    - nvidia-curand-cu11: 10.2.10.91
    - nvidia-cusolver-cu11: 11.4.0.1
    - nvidia-cusparse-cu11: 11.7.4.91
    - nvidia-nccl-cu11: 2.14.3
    - nvidia-nvtx-cu11: 11.7.91
    - ordered-set: 4.1.0
    - packaging: 23.1
    - pillow: 10.0.0
    - pip: 22.3.1
    - psutil: 5.9.5
    - pydantic: 1.10.11
    - pygments: 2.15.1
    - pyjwt: 2.7.0
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.0.5
    - pytz: 2023.3
    - pyyaml: 6.0
    - readchar: 4.0.5
    - requests: 2.31.0
    - rich: 13.4.2
    - setuptools: 65.5.0
    - six: 1.16.0
    - sniffio: 1.3.0
    - soupsieve: 2.4.1
    - starlette: 0.27.0
    - starsessions: 1.3.0
    - sympy: 1.12
    - torch: 2.0.1
    - torchmetrics: 1.0.0
    - torchvision: 0.15.2
    - tqdm: 4.65.0
    - traitlets: 5.9.0
    - triton: 2.0.0
    - typing-extensions: 4.7.1
    - unicorn: 2.0.1.post1
    - unicornafl: 2.0.2
    - urllib3: 2.0.3
    - uvicorn: 0.22.0
    - wcwidth: 0.2.6
    - websocket-client: 1.6.1
    - websockets: 11.0.3
    - wheel: 0.40.0
    - yarl: 1.9.2
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.10
    - release: 5.15.0-72-generic
    - version: #79~20.04.1-Ubuntu SMP Thu Apr 20 22:12:07 UTC 2023

More info

I later found that installing with Poetry is what causes the problem.

awaelchli commented 1 year ago

@InfernalAzazel This looks like a problem with the torch installation, right? If you look at the traceback, it fails here:

  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 228, in <module>
    _load_global_deps()
  File "/home/V01/extittivns03/.cache/pypoetry/virtualenvs/lightning-demo-01-WPyJ1-ve-py3.10/lib/python3.10/site-packages/torch/__init__.py", line 189, in _load_global_deps

I recommend that you make a fresh environment and only install torch. You should then see the same error. Can you confirm this?
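
One way to run that check, as a rough sketch: after creating the fresh environment and installing only torch into it, run the script below with that environment's interpreter. It tries `import torch` in a separate Python process and reports the last line of any error. The helper name `check_import` is made up for this illustration.

```python
import subprocess
import sys


def check_import(module_name):
    """Try to import module_name in a fresh interpreter.

    Returns (ok, last_line_of_stderr). Hypothetical helper for illustration.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    stderr_lines = result.stderr.strip().splitlines()
    last_line = stderr_lines[-1] if stderr_lines else ""
    return result.returncode == 0, last_line


if __name__ == "__main__":
    ok, message = check_import("torch")
    print("torch imports cleanly" if ok else f"torch import failed: {message}")
```

If the import fails here too, with Lightning not installed at all, the problem lies in the torch installation rather than in Lightning.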

InfernalAzazel commented 1 year ago

It works normally on my Mac with an M2 Max.

awaelchli commented 1 year ago

Does that mean you were able to fix the issue and run the example?

InfernalAzazel commented 1 year ago

> Does that mean you were able to fix the issue and run the example?

No, it still does not work on Ubuntu 20.04 LTS.