Model not able to quantize

System Info

Accelerate version: 0.34.2
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
accelerate bash location: ~/miniconda3/envs/trl/bin/accelerate
Python version: 3.10.14
Numpy version: 2.1.1
PyTorch version (GPU?): 2.4.1+cu121 (False)
PyTorch XPU available: False
PyTorch NPU available: False
PyTorch MLU available: False
PyTorch MUSA available: False
System RAM: 377.69 GB
Accelerate default config: Not found

Reproduction

#!/bin/bash

conda activate trl
cd trl-test/
pip install huggingface-hub accelerate
huggingface-cli login --token hf_xxx
git clone https://github.com/huggingface/trl.git

#### Multi GPU
yes "y" | ACCELERATE_LOG_LEVEL=info accelerate launch  \
  --config_file ./accelerate_configs/multi_gpu.yaml \
  trl/examples/scripts/sft.py \
  --model_name_or_path inceptionai/Jais-family-256m \
  --trust_remote_code \
  --dataset_name AbderrahmanSkiredj1/ArQuAD_train14k_test_1k6 \
  --dataset_text_field context \
  --output_dir ./jais256m-sft-ArQuAD \
  --load_in_8bit true \
  --use_peft true \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_target_modules "all-linear" \
  --lora_task_type 'CAUSAL_LM' \
  --learning_rate 3e-5 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4

This gives the following error:

  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
cuBLAS API failed with status 15
error detected
/nfs_users/users/ali.filali/miniconda3/envs/trl/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
A: torch.Size([263, 1088]), B: torch.Size([3264, 1088]), C: (263, 3264); (lda, ldb, ldc): (c_int(8416), c_int(104448), c_int(8416)); (m, n, k): (c_int(263), c_int(3264), c_int(1088))
cuBLAS API failed with status 15
error detected
cuBLAS API failed with status 15
error detected
A: torch.Size([195, 1088]), B: torch.Size([3264, 1088]), C: (195, 3264); (lda, ldb, ldc): (c_int(6240), c_int(104448), c_int(6240)); (m, n, k): (c_int(195), c_int(3264), c_int(1088))
A: torch.Size([216, 1088]), B: torch.Size([3264, 1088]), C: (216, 3264); (lda, ldb, ldc): (c_int(6912), c_int(104448), c_int(6912)); (m, n, k): (c_int(216), c_int(3264), c_int(1088))
cuBLAS API failed with status 15
error detected
A: torch.Size([125, 1088]), B: torch.Size([3264, 1088]), C: (125, 3264); (lda, ldb, ldc): (c_int(4000), c_int(104448), c_int(4000)); (m, n, k): (c_int(125), c_int(3264), c_int(1088))
[rank3]: Traceback (most recent call last):
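
For anyone triaging the log above, here is a minimal sketch of how the numbers in it can be read. It assumes the status value printed by bitsandbytes is a raw `cublasStatus_t` code, in which case 15 corresponds to `CUBLAS_STATUS_NOT_SUPPORTED`. The helper names (`describe_cublas_status`, `parse_c_ints`) are made up for illustration and are not part of bitsandbytes. This only decodes the log; it does not identify the root cause.

```python
import re

# Values of the cublasStatus_t enum from the cuBLAS headers.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def describe_cublas_status(code: int) -> str:
    """Map a numeric cuBLAS status to its enum name (hypothetical helper)."""
    return CUBLAS_STATUS.get(code, f"unknown status {code}")


def parse_c_ints(line: str) -> list[int]:
    """Extract every c_int(...) value from one log line, in order (hypothetical helper)."""
    return [int(v) for v in re.findall(r"c_int\((\d+)\)", line)]


# One of the shape lines copied from the log above.
LOG_LINE = (
    "A: torch.Size([263, 1088]), B: torch.Size([3264, 1088]), C: (263, 3264); "
    "(lda, ldb, ldc): (c_int(8416), c_int(104448), c_int(8416)); "
    "(m, n, k): (c_int(263), c_int(3264), c_int(1088))"
)

lda, ldb, ldc, m, n, k = parse_c_ints(LOG_LINE)

print(describe_cublas_status(15))
# In this log the leading dimensions are 32 times the row counts of A and B.
print(lda == 32 * m, ldb == 32 * n, ldc == 32 * m)
```

The same check holds for the other shape lines in the log (for example 6240 = 32 × 195 and 4000 = 32 × 125), so the failing calls differ only in the batch-dependent dimension `m`.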

Expected behavior

I expect training to start and finish in about 5 minutes, as it does when I run the same command without the --load_in_8bit true flag:

yes "y" | ACCELERATE_LOG_LEVEL=info accelerate launch  \
  --config_file ./accelerate_configs/multi_gpu.yaml \
  trl/examples/scripts/sft.py \
  --model_name_or_path inceptionai/Jais-family-256m \
  --trust_remote_code \
  --dataset_name AbderrahmanSkiredj1/ArQuAD_train14k_test_1k6 \
  --dataset_text_field context \
  --output_dir ./jais256m-sft-ArQuAD \
  --use_peft true \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_target_modules "all-linear" \
  --lora_task_type 'CAUSAL_LM' \
  --learning_rate 3e-5 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4

bitsandbytes-foundation / bitsandbytes

Model not able to quantize #1354
