BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

demo-training-prepare libcudart woes #217

Closed micsthepick closed 5 months ago

micsthepick commented 5 months ago

Getting the error below in a conda environment with Python 3.10.13:

mike@pop-os:~/source/repos$ conda create -n rwkv python=3.10
mike@pop-os:~/source/repos$ conda activate rwkv
(rwkv) mike@pop-os:~/source/repos$ pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
The file is already fully retrieved; nothing to do.

Traceback (most recent call last):
  File "/home/mike/source/repos/RWKV-LM/RWKV-v5/train.py", line 10, in <module>
    from pytorch_lightning import Trainer
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/__init__.py", line 35, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.batch_size_finder import BatchSizeFinder
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/callbacks/batch_size_finder.py", line 24, in <module>
    from pytorch_lightning.callbacks.callback import Callback
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/callbacks/callback.py", line 25, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/utilities/__init__.py", line 23, in <module>
    from pytorch_lightning.utilities.imports import (  # noqa: F401
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/pytorch_lightning/utilities/imports.py", line 28, in <module>
    _TORCHMETRICS_GREATER_EQUAL_0_11 = compare_version("torchmetrics", operator.ge, "0.11.0")  # using new API with task
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/lightning_utilities/core/imports.py", line 77, in compare_version
    pkg = importlib.import_module(package)
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/__init__.py", line 22, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio._deprecated import _permutation_invariant_training as permutation_invariant_training
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 22, in <module>
    from torchmetrics.utilities import rank_zero_warn
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 14, in <module>
    from torchmetrics.utilities.checks import check_forward_full_state_property
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 25, in <module>
    from torchmetrics.metric import Metric
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/metric.py", line 30, in <module>
    from torchmetrics.utilities.data import (
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 22, in <module>
    from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12, _TORCH_GREATER_EQUAL_1_13, _XLA_AVAILABLE
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 51, in <module>
    _TORCHAUDIO_GREATER_EQUAL_0_10: Optional[bool] = compare_version("torchaudio", operator.ge, "0.10.0")
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/lightning_utilities/core/imports.py", line 77, in compare_version
    pkg = importlib.import_module(package)
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/mike/.local/lib/python3.10/site-packages/torchaudio/__init__.py", line 1, in <module>
    from . import (  # noqa: F401
  File "/home/mike/.local/lib/python3.10/site-packages/torchaudio/_extension/__init__.py", line 45, in <module>
    _load_lib("libtorchaudio")
  File "/home/mike/.local/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 64, in _load_lib
    torch.ops.load_library(path)
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/site-packages/torch/_ops.py", line 573, in load_library
    ctypes.CDLL(path)
  File "/home/mike/miniconda3/envs/rwkv/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.12: cannot open shared object file: No such file or directory

BlinkDL commented 5 months ago

reinstall CUDA

micsthepick commented 5 months ago

@BlinkDL Just FYI, the latest NVCC available through apt on Ubuntu/POP!_OS is out of date, so just "[reinstalling] CUDA" isn't going to work here. Using a CUDA docker image, or updating over the system install, would probably work.
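
One detail in the traceback may be worth checking separately. torch is imported from the conda env (`miniconda3/envs/rwkv/...`), but torchaudio is loaded from `/home/mike/.local/lib/python3.10/site-packages`, which is the per-user site directory outside the env. That torchaudio build asks for `libcudart.so.12`, while the torch installed above is a cu117 wheel. Below is a minimal sketch, assuming only the Python standard library, that reports where each package would be imported from without actually importing it. The helper names (`locate_package`, `is_under`, `from_user_site`) are made up for illustration and are not part of RWKV-LM or PyTorch.

```python
import importlib.util
import site
from pathlib import Path


def locate_package(name):
    """Return the directory a top-level package would be imported from.

    Uses find_spec, so the package itself is not imported (and a broken
    shared library inside it is never loaded).
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def is_under(path, root):
    """True if `path` sits inside the directory `root`."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def from_user_site(name, user_site=None):
    """True if `name` resolves to the per-user site-packages directory."""
    user_site = user_site or site.getusersitepackages()
    location = locate_package(name)
    return location is not None and is_under(location, user_site)


if __name__ == "__main__":
    # Package names taken from the traceback above.
    for pkg in ("torch", "torchaudio", "torchmetrics", "pytorch_lightning"):
        loc = locate_package(pkg)
        flag = " (user site, outside the env)" if from_user_site(pkg) else ""
        print(f"{pkg}: {loc}{flag}")
```

If torchaudio is flagged as coming from the user site, running the script with the environment variable `PYTHONNOUSERSITE=1` set should make Python skip that directory, which is one way to test whether the stray install is what triggers the `libcudart.so.12` lookup. This is a guess based on the paths in the traceback, not a confirmed fix.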