bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Getting errors when attempting to run on remote cluster #530

Closed Thresher12 closed 10 months ago

Thresher12 commented 1 year ago

I'm attempting to run a training job on a remote SLURM cluster, using a program that depends on bitsandbytes. From the errors, it looks like some kind of CUDA issue. I've searched for solutions but couldn't find anything straightforward.

Here's the error I get when attempting to start training:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
ERROR: python: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: libcudart.so path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See https://github.com/TimDettmers/bitsandbytes/issues/85 for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
Disabled distributed training.
Loading from ./models/tortoise/dvae.pth
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Traceback (most recent call last):
  File "/bigdata/mlgroup/userA1/avc/./modules/dlas/dlas/train.py", line 485, in <module>
    trainer.do_training()
  File "/bigdata/mlgroup/userA1/avc/./modules/dlas/dlas/train.py", line 408, in do_training
    metric = self.do_step(train_data)
  File "/bigdata/mlgroup/userA1/avc/./modules/dlas/dlas/train.py", line 271, in do_step
    gradient_norms_dict = self.model.optimize_parameters(
  File "/bigdata/mlgroup/userA1/avc/modules/dlas/dlas/trainer/ExtensibleTrainer.py", line 396, in optimize_parameters
    self.consume_gradients(state, step, it)
  File "/bigdata/mlgroup/userA1/avc/modules/dlas/dlas/trainer/ExtensibleTrainer.py", line 445, in consume_gradients
    step.do_step(it)
  File "/bigdata/mlgroup/userA1/avc/modules/dlas/dlas/trainer/steps.py", line 398, in do_step
    self.scaler.step(opt)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 315, in step
    return optimizer.step(*args, **kwargs)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/optim/optimizer.py", line 269, in step
    self.update_step(group, p, gindex, pindex)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/optim/optimizer.py", line 517, in update_step
    F.optimizer_update_8bit_blockwise(
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1213, in optimizer_update_8bit_blockwise
    optim_func = str2optimizer8bit_blockwise[optimizer_name][0]
NameError: name 'str2optimizer8bit_blockwise' is not defined

Here is the report from python -m bitsandbytes:


(venv) userA1@sparrow:~/bigdata/avc$ python -m bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/cuda.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/ccuda.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/tests/test_ccuda.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/tests/test_ccudart.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/ccudart.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/cudart.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/_lib/ccudart/ccudart.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/cuda/_cuda/ccuda.cpython-39-x86_64-linux-gnu.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/lib/libc10_cuda.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/torch/lib/libtorch_cuda_linalg.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda111.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda112.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda110.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115_nocublaslt.so

++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
 /opt/linux/rocky/8.x/x86_64/pkgs/slurm/23.02.2/lib CUDA PATHS 

 /opt/linux/rocky/8.x/x86_64/pkgs/openmpi/4.1.2_slurm-23.02.2_mpi1-compat/lib/openmpi CUDA PATHS 

 /opt/linux/rocky/8.x/x86_64/pkgs/openmpi/4.1.2_slurm-23.02.2_mpi1-compat/lib CUDA PATHS 

 /opt/linux/rocky/8.x/x86_64/pkgs/java/17.0.2/lib/server/ CUDA PATHS 

 /opt/linux/rocky/8.x/x86_64/pkgs/java/17.0.2/lib CUDA PATHS 

 /opt/linux/rocky/8.x/x86_64/pkgs/R/4.2.2/lib64/R/lib CUDA PATHS 

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = False
Traceback (most recent call last):
  File "/opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/__main__.py", line 106, in <module>
    print(f"COMPUTE_CAPABILITIES_PER_GPU = {get_compute_capabilities(cuda)}")
  File "/bigdata/mlgroup/userA1/avc/venv/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 349, in get_compute_capabilities
    check_cuda_result(cuda, cuda.cuDeviceGetCount(ct.byref(nGpus)))
AttributeError: 'NoneType' object has no attribute 'cuDeviceGetCount'
(venv) userA1@sparrow:~/bigdata/avc$ 

And here is my job submission header, in case it's relevant:


#!/bin/bash
#SBATCH -p gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 # gpus per node?
#SBATCH --time 10:00:00
#SBATCH --mem=50GB
#SBATCH --job-name RemoteJob
#SBATCH --mail-user=userA1@mail.edu
#SBATCH --mail-type=ALL
#SBATCH --output=my.stdout

TimDettmers commented 1 year ago

This happens because the CUDA driver library is not being found. I will implement a workaround soon that should fix this. For now, the best option is to locate libcuda.so on your system and make it visible to your SLURM jobs.
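
One way to act on that advice is sketched below. It is only an illustration, not part of bitsandbytes: the directory list in CANDIDATE_DIRS and both helper names are assumptions, and the real driver location differs between clusters. The script looks for libcuda.so in a few common directories and prints an export line that could be added to the job script before the training command. It likely needs to run on a GPU compute node, since login nodes often do not have the driver installed.

```python
import os
from pathlib import Path

# Hypothetical candidate directories; adjust for your cluster.
CANDIDATE_DIRS = [
    "/usr/lib64",
    "/usr/lib/x86_64-linux-gnu",
    "/usr/local/cuda/lib64",
    "/usr/local/cuda/compat",
]


def find_driver_dirs(candidates, pattern="libcuda.so*"):
    """Return the directories in `candidates` that contain a file matching `pattern`."""
    found = []
    for directory in candidates:
        path = Path(directory)
        if path.is_dir() and any(path.glob(pattern)):
            found.append(str(path))
    return found


def prepend_to_ld_library_path(dirs, current=""):
    """Build an LD_LIBRARY_PATH value with `dirs` first, dropping duplicates."""
    existing = [p for p in current.split(":") if p]
    merged = list(dirs) + [p for p in existing if p not in dirs]
    return ":".join(merged)


if __name__ == "__main__":
    dirs = find_driver_dirs(CANDIDATE_DIRS)
    if dirs:
        value = prepend_to_ld_library_path(
            dirs, os.environ.get("LD_LIBRARY_PATH", "")
        )
        # The dynamic loader reads LD_LIBRARY_PATH at process start, so the
        # variable has to be exported in the job script, not set from Python.
        print(f'export LD_LIBRARY_PATH="{value}"')
    else:
        print("libcuda.so not found in the candidate directories.")
```

If nothing is found, the cluster administrators are the best source for where the driver library lives on the GPU nodes.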

icyblade commented 1 year ago

It can also happen when sharing the same Docker image between GPU and CPU models.
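
For a shared image, a small guard at startup can detect whether the CUDA driver is loadable before any GPU-only code path is taken. The sketch below uses only the standard library. The function names and the "gpu"/"cpu" labels are hypothetical, and this check is not how bitsandbytes itself selects its binary.

```python
import ctypes


def cuda_driver_loadable(lib_name="libcuda.so.1"):
    """Return True if the dynamic loader can open the CUDA driver library."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


def choose_backend(driver_ok):
    """Hypothetical selector: take the GPU path only when a driver is present."""
    return "gpu" if driver_ok else "cpu"


if __name__ == "__main__":
    backend = choose_backend(cuda_driver_loadable())
    print(f"Selected backend: {backend}")
```

On a CPU-only host the guard returns False, so the program can fall back to a regular optimizer instead of failing later inside an 8-bit optimizer step.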
