NanoCode012 opened this issue 1 year ago
I have the same issue. It occurs when running an 8-bit model in the following Docker container:
FROM nvidia/cuda:11.7.0-cudnn8-devel-ubuntu22.04
RUN apt update
RUN apt install git -y
RUN apt install wget -y
RUN apt install python3 python3-pip -y
# Install dependencies (one-by-one for better caching)
#RUN pip install --upgrade pip
RUN pip install torch
RUN pip install transformers
RUN pip install datasets
RUN pip install evaluate
RUN pip install xformers
RUN pip install wandb
RUN pip install peft
RUN pip install trl
RUN pip install scipy
RUN pip install accelerate
RUN pip install scikit-learn
RUN pip install pandas
RUN pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
RUN git clone https://github.com/EleutherAI/lm-evaluation-harness
RUN pip install -e lm-evaluation-harness
RUN git clone https://github.com/timdettmers/bitsandbytes.git
# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
ENV CUDA_VERSION=117
RUN cd bitsandbytes && git checkout ac5550a0238286377ee3f58a85aeba1c40493e17
RUN cd bitsandbytes && make cuda11x
RUN cd bitsandbytes && python3 setup.py install
#RUN pip install bitsandbytes
#RUN python3 check_bnb_install.py
# Init wandb
#COPY ./wandb /wandb
ENV WANDB_CONFIG_DIR=/wandb
ENV HF_DATASETS_CACHE="/hf_cache/datasets"
ENV HUGGINGFACE_HUB_CACHE="/hf_cache/hub"
# Copy the code
COPY . /code
# Set the working directory
WORKDIR /code
# Download a useful helper to check the bitsandbytes installation. Only works at runtime (the build step has no GPU access).
RUN wget https://gist.githubusercontent.com/TimDettmers/1f5188c6ee6ed69d211b7fe4e381e713/raw/4d17c3d09ccdb57e9ab7eca0171f2ace6e4d2858/check_bnb_install.py
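To actually run the check, the container needs GPU access, which is not available during docker build. A hypothetical invocation (the image tag is a placeholder):
docker run --rm --gpus all my-bnb-image python3 check_bnb_install.py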
+1ing this. I notice it with local conda on an H100 on Lambda Labs, although I'm unsure whether this is a bitsandbytes error or something to do with CUDA for the H100s.
+1
This is the same error as #533. The problem was that I forgot to compile the CUDA 11.8 binaries for sm_90, the compute capability of H100 GPUs. The error message basically says that the code is not compiled for your GPU. I will fix this soon. Please continue the discussion in issue #533 until I have fixed this issue.
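For anyone who wants to confirm the mismatch locally, a minimal check with plain PyTorch (nothing bitsandbytes-specific) looks like this:

import torch

# H100 reports compute capability (9, 0), i.e. sm_90.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

# Architectures the PyTorch build itself was compiled for. Note that
# bitsandbytes ships its own CUDA binary, so this is informational only.
print(torch.cuda.get_arch_list())

If sm_90 is missing from the architectures your bitsandbytes binary was built for, you will hit exactly this kind of error.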
Trying to run today on an H100 instance, with a confirmed installation of 0.40.1 (which I saw was supposed to work with this GPU now), I still get:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-23-3435b262f1ae> in <module>
----> 1 trainer.train()
~/.local/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1643 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1644 )
-> 1645 return inner_training_loop(
1646 args=args,
1647 resume_from_checkpoint=resume_from_checkpoint,
~/.local/lib/python3.8/site-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1936
1937 with self.accelerator.accumulate(model):
-> 1938 tr_loss_step = self.training_step(model, inputs)
1939
1940 if (
~/.local/lib/python3.8/site-packages/transformers/trainer.py in training_step(self, model, inputs)
2757
2758 with self.compute_loss_context_manager():
-> 2759 loss = self.compute_loss(model, inputs)
2760
2761 if self.args.n_gpu > 1:
~/.local/lib/python3.8/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
2782 else:
2783 labels = None
-> 2784 outputs = model(**inputs)
2785 # Save past state if it exists
2786 # TODO: this needs to be fixed and made cleaner later.
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/utils/operations.py in forward(*args, **kwargs)
579
580 def forward(*args, **kwargs):
--> 581 return model_forward(*args, **kwargs)
582
583 # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
~/.local/lib/python3.8/site-packages/accelerate/utils/operations.py in __call__(self, *args, **kwargs)
567
568 def __call__(self, *args, **kwargs):
--> 569 return convert_to_fp32(self.model_forward(*args, **kwargs))
570
571 def __getstate__(self):
/usr/lib/python3/dist-packages/torch/amp/autocast_mode.py in decorate_autocast(*args, **kwargs)
12 def decorate_autocast(*args, **kwargs):
13 with autocast_instance:
---> 14 return func(*args, **kwargs)
15 decorate_autocast.__script_unsupported = '@autocast() decorator is not supported in script mode' # type: ignore[attr-defined]
16 return decorate_autocast
~/.local/lib/python3.8/site-packages/peft/peft_model.py in forward(self, *args, **kwargs)
413 Forward pass of the model.
414 """
--> 415 return self.get_base_model()(*args, **kwargs)
416
417 def _get_base_model_class(self, is_prompt_tuning=False):
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
167
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1417 )
1418
-> 1419 outputs = self.model(
1420 input_features,
1421 attention_mask=attention_mask,
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
167
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1266 input_features = self._mask_input_features(input_features, attention_mask=attention_mask)
1267
-> 1268 encoder_outputs = self.encoder(
1269 input_features,
1270 head_mask=head_mask,
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
167
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, head_mask, output_attentions, output_hidden_states, return_dict)
854 return custom_forward
855
--> 856 layer_outputs = torch.utils.checkpoint.checkpoint(
857 create_custom_forward(encoder_layer),
858 hidden_states,
/usr/lib/python3/dist-packages/torch/utils/checkpoint.py in checkpoint(function, use_reentrant, *args, **kwargs)
247
248 if use_reentrant:
--> 249 return CheckpointFunction.apply(function, preserve, *args)
250 else:
251 return _checkpoint_without_reentrant(
/usr/lib/python3/dist-packages/torch/autograd/function.py in apply(cls, *args, **kwargs)
504 # See NOTE: [functorch vjp and autograd interaction]
505 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506 return super().apply(*args, **kwargs) # type: ignore[misc]
507
508 if cls.setup_context == _SingleLevelFunction.setup_context:
/usr/lib/python3/dist-packages/torch/utils/checkpoint.py in forward(ctx, run_function, preserve_rng_state, *args)
105
106 with torch.no_grad():
--> 107 outputs = run_function(*args)
108 return outputs
109
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in custom_forward(*inputs)
850 def create_custom_forward(module):
851 def custom_forward(*inputs):
--> 852 return module(*inputs, output_attentions)
853
854 return custom_forward
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
167
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, hidden_states, attention_mask, layer_head_mask, output_attentions)
429 residual = hidden_states
430 hidden_states = self.self_attn_layer_norm(hidden_states)
--> 431 hidden_states, attn_weights, _ = self.self_attn(
432 hidden_states=hidden_states,
433 attention_mask=attention_mask,
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
167
~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, hidden_states, key_value_states, past_key_value, attention_mask, layer_head_mask, output_attentions)
288
289 # get query proj
--> 290 query_states = self.q_proj(hidden_states) * self.scaling
291 # get key, value proj
292 # `past_key_value[0].shape[2] == key_value_states.shape[1]`
/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
~/.local/lib/python3.8/site-packages/peft/tuners/lora.py in forward(self, x)
1052
1053 def forward(self, x: torch.Tensor):
-> 1054 result = super().forward(x)
1055
1056 if self.disable_adapters or self.active_adapter not in self.lora_A.keys():
~/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py in forward(self, x)
412 self.bias.data = self.bias.data.to(x.dtype)
413
--> 414 out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
415
416 if not self.state.has_fp16_weights:
~/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py in matmul(A, B, out, state, threshold, bias)
561 if threshold > 0.0:
562 state.threshold = threshold
--> 563 return MatMul8bitLt.apply(A, B, out, bias, state)
564
565
/usr/lib/python3/dist-packages/torch/autograd/function.py in apply(cls, *args, **kwargs)
504 # See NOTE: [functorch vjp and autograd interaction]
505 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506 return super().apply(*args, **kwargs) # type: ignore[misc]
507
508 if cls.setup_context == _SingleLevelFunction.setup_context:
~/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py in forward(ctx, A, B, out, bias, state)
399 if using_igemmlt:
400 C32A, SA = F.transform(CA, "col32")
--> 401 out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
402 if bias is None or bias.dtype == torch.float16:
403 # we apply the fused bias here
~/.local/lib/python3.8/site-packages/bitsandbytes/functional.py in igemmlt(A, B, SA, SB, out, Sout, dtype)
1790 if has_error == 1:
1791 print(f'A: {shapeA}, B: {shapeB}, C: {Sout[0]}; (lda, ldb, ldc): {(lda, ldb, ldc)}; (m, n, k): {(m, n, k)}')
-> 1792 raise Exception('cublasLt ran into an error!')
1793
1794 torch.cuda.set_device(prev_device)
Exception: cublasLt ran into an error!
So frustrating...
Please help, and thank you for the great work!
Same error for me
Hello,
any news? Same error here; I cannot find anything useful on getting 8-bit quantization to work on the H100 GPUs.
This is the same error as #533. The problem was that I forgot to compile the CUDA 11.8 binaries for sm_90, the compute capability of H100 GPUs. The error message basically says that the code is not compiled for your GPU. I will fix this soon. Please continue the discussion in issue #533 until I have fixed this issue.
Hi @TimDettmers Do we have the fix yet?
Hello,
any news? Same error here; I cannot find anything useful on getting 8-bit quantization to work on the H100 GPUs.
@basteran Did you find the fix? @TimDettmers Any updates?
Are there any updates here? Am I missing something, or did they just "forget" to support H100 GPUs, with no fix even months later? Has anyone found a workaround? @TimDettmers?
This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special formats for 8-bit matrix multiplication. There are special formats for Ampere, Turing, and now Hopper GPUs. Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be implemented to make 8-bit work on Hopper GPUs.
I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.
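Such a guard could look roughly like this (a hypothetical sketch, not actual bitsandbytes code; the function name is made up):

import torch

def assert_int8_supported() -> None:
    # Hypothetical fail-fast check: raise a clear error on Hopper (sm_90+)
    # instead of letting cuBLASLt fail deep inside igemmlt.
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:
        raise NotImplementedError(
            "LLM.int8() matmul is not yet supported on Hopper (sm_90+) GPUs; "
            "use 4-bit quantization (nf4/fp4) instead."
        )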
Bitsandbytes did not support Windows before, but this method can add Windows support. (yuhuang)
1. Open the folder J:\StableDiffusion\sdwebui, click the folder's address bar and enter CMD (or press WIN+R, type CMD, press Enter, then run cd /d J:\StableDiffusion\sdwebui)
2. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes
3. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4. J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl
Replace J:\StableDiffusion\sdwebui\py310 with your SD venv directory (the folder containing python.exe).
Or, if you are on a Linux distribution (Ubuntu, macOS, etc.) with CUDA version 11.x, bitsandbytes can support Ubuntu as well. (yuhuang) Follow the same uninstall steps 1-3 above, then install this wheel instead:
J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/TimDettmers/bitsandbytes/releases/download/0.41.0/bitsandbytes-0.41.0-py3-none-any.whl
Replace your SD venv directory (the folder containing python.exe) as above.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Can we please keep this issue (or #383 or #599 ) open? I still want to see this issue resolved, if possible.
This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special formats for 8-bit matrix multiplication. There are special formats for Ampere, Turing, and now Hopper GPUs. Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be implemented to make 8-bit work on Hopper GPUs.
I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.
@TimDettmers could you use https://github.com/NVIDIA/TransformerEngine?
At first sight, the exposed API seems too high-level for your needs, but their building blocks are tailored for the Hopper (H100) and Ada (RTX 4090) architectures, e.g. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/gemm/cublaslt_gemm.cu
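For context, TransformerEngine's high-level FP8 path looks roughly like this (a sketch based on their quickstart; the recipe settings are assumptions, and only the building blocks underneath would be relevant to bitsandbytes):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 GEMMs dispatched through cuBLASLt on Hopper/Ada.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(32, 768, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)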
+1ing this. I notice it with local conda on an H100 on Lambda Labs, although I'm unsure whether this is a bitsandbytes error or something to do with CUDA for the H100s.
This error is related to the H100: I tried loading the model on an H100 and got the error, while the same load_in_8bit worked fine on an A100.
Anyone able to resolve this?
Is it still not available on H100 GPU instances?
Not yet unfortunately
Do you guys have any solution for this?
Observing the same issue with H100, too.
Also with H800.
This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special formats for 8-bit matrix multiplication. There are special formats for Ampere, Turing, and now Hopper GPUs. Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be implemented to make 8-bit work on Hopper GPUs.
I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.
Any plan to fix this?
The same problem occurs on the H20.
The same with H800
Hi all,
I will keep this issue open, but please be aware that, for now, 8-bit is not supported in bitsandbytes on Hopper. It is recommended to use nf4 or fp4 instead.
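For reference, switching from 8-bit to 4-bit in transformers looks roughly like this (a sketch; the model id is a placeholder and the compute dtype depends on your setup):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16, # H100 supports bf16 natively
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",       # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)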
Just want to add to this thread: tried on an H100 and it's not working. I really hope the bitsandbytes team can support this feature, given that more and more people are going to switch to newer-generation GPUs.
Same for me. It does not work after changing to bf16, fp16, fp4, or anything else.
Having the same issue with an H100E.
Same problem
The same with H800 and H100
Still having the same issue
Still having the same issue on H100
Still having the same issue on H100.
Well, just came here to say I also ran into this issue using 8-bit on an H100. It would be very useful to have this working!
Hi all! We are currently working on LLM.int8 support for Hopper in PR #1401. I cannot give an accurate ETA for a release at the moment, but it will be supported soon!
Same problem occurred here.
It would be much appreciated to have this working on the H100.
Still getting the same problem with an H100.
Problem
Hello, I'm getting this weird cublasLt error on a Lambda Labs H100 with CUDA 11.8, PyTorch 2.0.1, and Python 3.10 (Miniconda) while trying to fine-tune a 3B-parameter open-llama model using LoRA with 8-bit loading. This only happens if we turn on 8-bit loading; LoRA alone or 4-bit loading (QLoRA) works.
The same commands did work 2 weeks ago and stopped working a week ago.
I've tried bitsandbytes versions 0.39.0 and 0.39.1, as prior versions don't work with the H100. Building from source gives me a different issue, as mentioned in the Env section.
Expected
No error
Reproduce
Set up Miniconda, then follow the readme of https://github.com/OpenAccess-AI-Collective/axolotl on Lambda Labs and run the default open-llama LoRA config.
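The failing path boils down to something like this (a hypothetical minimal sketch, not the axolotl code; the model id and LoRA target modules are placeholders):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id; any model loaded with load_in_8bit=True hits the same path.
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",
    load_in_8bit=True,
    device_map="auto",
)
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

ids = torch.randint(0, model.config.vocab_size, (1, 16), device="cuda")
# The forward pass reaches bnb.matmul -> igemmlt -> cuBLASLt, which errors on sm_90.
out = model(input_ids=ids, labels=ids)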
Trace
0.39.0
Env
python -m bitsandbytes
on the main branch: I get the same error as in https://github.com/TimDettmers/bitsandbytes/issues/382
on 0.39.0
Misc
All related issues:
Also tried installing cudatoolkit via conda.