bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

8-bit quantization and fine-tuning with LoRA is not working - receiving runtime error #1097

Open solomonmanuelraj opened 6 months ago

solomonmanuelraj commented 6 months ago

System Info

Hi Team,

When I run my QLoRA fine-tuning code for the OWL-ViT model (google/owlvit-base-patch32) with the 4-bit BitsAndBytesConfig below, fine-tuning runs without any error.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Once I change the config to the following:

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

I receive the following error trace.

#########################################################################################

RuntimeError                              Traceback (most recent call last)
Cell In[25], line 1
----> 1 trainer.train()

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs) 1535 hf_hub_utils.enable_progress_bars() 1536 else: -> 1537 return inner_training_loop( 1538 args=args, 1539 resume_from_checkpoint=resume_from_checkpoint, 1540 trial=trial, 1541 ignore_keys_for_eval=ignore_keys_for_eval, 1542 )

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval) 1851 self.control = self.callback_handler.on_step_begin(args, self.state, self.control) 1853 with self.accelerator.accumulate(model): -> 1854 tr_loss_step = self.training_step(model, inputs) 1856 if ( 1857 args.logging_nan_inf_filter 1858 and not is_torch_tpu_available() 1859 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step)) 1860 ): 1861 # if loss is nan or inf simply add the average of previous logged losses 1862 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/transformers/trainer.py:2744, in Trainer.training_step(self, model, inputs) 2742 scaled_loss.backward() 2743 else: -> 2744 self.accelerator.backward(loss) 2746 return loss.detach() / self.args.gradient_accumulation_steps

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/accelerate/accelerator.py:1907, in Accelerator.backward(self, loss, **kwargs) 1905 return 1906 elif self.scaler is not None: -> 1907 self.scaler.scale(loss).backward(**kwargs) 1908 else: 1909 loss.backward(**kwargs)

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs) 482 if has_torch_function_unary(self): 483 return handle_torch_function( 484 Tensor.backward, 485 (self,), (...) 490 inputs=inputs, 491 ) --> 492 torch.autograd.backward( 493 self, gradient, retain_graph, create_graph, inputs=inputs 494 )

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs) 246 retain_graph = create_graph 248 # The reason we repeat the same comment below is that 249 # some Python versions print out the first line of a multi-line function 250 # calls in the traceback and some print out the last line --> 251 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass 252 tensors, 253 grad_tensors, 254 retain_graph, 255 create_graph, 256 inputs, 257 allow_unreachable=True, 258 accumulate_grad=True, 259 )

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/torch/autograd/function.py:288, in BackwardCFunction.apply(self, *args) 282 raise RuntimeError( 283 "Implementing both 'backward' and 'vjp' for a custom " 284 "Function is not allowed. You should only implement one " 285 "of them." 286 ) 287 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn --> 288 return user_fn(self, *args)

File ~/miniconda3/envs/testenv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:491, in MatMul8bitLt.backward(ctx, grad_output) 485 print("state.CxB",state.CxB) 486 print("State ",state) 488 CB = ( 489 undo_layout(state.CxB, state.tile_indices) 490 .to(ctx.dtype_A) --> 491 .mul(state.SCB.unsqueeze(1).mul(1.0 / 127.0)) 492 ) 493 grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A) 494 else:

RuntimeError: The size of tensor a (32) must match the size of tensor b (4) at non-singleton dimension 0
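For context, the failing line in MatMul8bitLt.backward rescales the rows of the de-quantized weight by state.SCB, and the RuntimeError above is the standard PyTorch broadcast failure you get when those leading dimensions disagree. A minimal illustration with made-up shapes (not taken from the model):

```python
import torch

# Stand-ins for the tensors on the failing line of MatMul8bitLt.backward:
# CB plays the role of the de-quantized weight, SCB the per-row scales.
CB = torch.randn(32, 64)   # 32 rows
SCB = torch.randn(4)       # wrong number of per-row scales

# Same pattern as `CB.mul(state.SCB.unsqueeze(1).mul(1.0 / 127.0))` in the trace:
CB.mul(SCB.unsqueeze(1).mul(1.0 / 127.0))
# RuntimeError: The size of tensor a (32) must match the size of tensor b (4)
# at non-singleton dimension 0
```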

#######################################################################################

Need your help. Thanks.

Reproduction

I need your help to solve this problem.

Expected behavior

It works fine with 4-bit quantization and fine-tuning, but the same code and dataset fail with 8-bit quantization, producing a runtime error (RuntimeError: The size of tensor a (32) must match the size of tensor b (4) at non-singleton dimension 0).
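Since the full training script is not included in the report, here is a hypothetical sketch of the kind of setup being described; the model class, target module names, and hyperparameters are assumptions, not the reporter's actual code:

```python
import torch
from transformers import BitsAndBytesConfig, OwlViTForObjectDetection
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 8-bit config that triggers the error; the 4-bit variant shown above works.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = OwlViTForObjectDetection.from_pretrained(
    "google/owlvit-base-patch32",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Target module names assumed to follow OWL-ViT's CLIP-style attention naming;
# adjust if the actual module names differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

# Building a Trainer on top of this and calling trainer.train() is where the
# backward pass fails in MatMul8bitLt.backward, as in the trace above.
```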

maywind23 commented 5 months ago

I have the same issue, and the model being fine-tuned is Mixtral.

mkusy commented 5 months ago

I confirm that this problem also occurs in the llamafactory library (as bitsandbytes is a dependency of it).

I would kindly ask you to prioritise this bug, as it is holding up my work on an important project.

divisionblur commented 4 months ago

I had the same problem

nguyenvo09 commented 3 months ago

Got the same issue here; please fix it.

Titus-von-Koeller commented 1 month ago

Thanks for raising this, and for your friendly tone!

We'll look into this and provide a fix. Unfortunately, our resources (just one person, me) have been tied up in the multi-platform backend refactor (providing backends other than CUDA, e.g. Intel and AMD). However, a new maintainer started this month, so we can better address such issues going forward.

cc @matthewdouglas putting this on the backlog as priority item

matthewdouglas commented 1 month ago

Hi all,

Is there a full example (e.g. a notebook or code snippet) that can be used to reproduce this? Additional environment information would be helpful as well (e.g. the versions of PyTorch, Transformers, Accelerate, PEFT, and bitsandbytes, and the hardware being used).
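For anyone hitting this, the requested version information can be collected with a short snippet along these lines (a minimal sketch; it only assumes the packages mentioned above are importable in the training environment):

```python
import torch, transformers, accelerate, peft, bitsandbytes

# Print the version each library reports about itself.
for module in (torch, transformers, accelerate, peft, bitsandbytes):
    print(f"{module.__name__}: {module.__version__}")

# GPU details, if CUDA is available.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```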

laaqira commented 3 days ago

Hi @matthewdouglas, I got the same error while fine-tuning Mixtral-8x7B with 8-bit quantization.

###########################################

Model and tokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=token,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    use_fast=True,
    token=token,
)
tokenizer.pad_token = tokenizer.eos_token

LoRA config:

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "out_proj",
        "up_proj",
        "gate_proj",
        "w1",
        # "w3",
    ],
    layers_to_transform=[i for i in range(32) if i >= 16],
    bias="none",
    lora_dropout=0.05,  # conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

Trainer:

trainer = Trainer(
    model=model,
    train_dataset=train_data['train'],
    eval_dataset=eval_data['eval'],
    args=TrainingArguments(
        warmup_steps=1,
        per_device_train_batch_size=2,
        eval_strategy="steps",
        logging_strategy="steps",
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=300,
        learning_rate=2e-4,
        optim="paged_adamw_8bit",
        logging_steps=25,
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=25,
        output_dir="Mixtral-Finetuned",
        do_eval=True,
        remove_unused_columns=False,
        fp16=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train()

Error description

RuntimeError                              Traceback (most recent call last)
Cell In[24], line 1
----> 1 trainer.train()

File ~/venv/lib/python3.12/site-packages/transformers/trainer.py:1948, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs) 1946 hf_hub_utils.enable_progress_bars() 1947 else: -> 1948 return inner_training_loop( 1949 args=args, 1950 resume_from_checkpoint=resume_from_checkpoint, 1951 trial=trial, 1952 ignore_keys_for_eval=ignore_keys_for_eval, 1953 )

File ~/venv/lib/python3.12/site-packages/transformers/trainer.py:2289, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval) 2286 self.control = self.callback_handler.on_step_begin(args, self.state, self.control) 2288 with self.accelerator.accumulate(model): -> 2289 tr_loss_step = self.training_step(model, inputs) 2291 if ( 2292 args.logging_nan_inf_filter 2293 and not is_torch_xla_available() 2294 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step)) 2295 ): 2296 # if loss is nan or inf simply add the average of previous logged losses 2297 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/venv/lib/python3.12/site-packages/transformers/trainer.py:3359, in Trainer.training_step(failed resolving arguments) 3357 scaled_loss.backward() 3358 else: -> 3359 self.accelerator.backward(loss, **kwargs) 3361 return loss.detach() / self.args.gradient_accumulation_steps

File ~/venv/lib/python3.12/site-packages/accelerate/accelerator.py:2155, in Accelerator.backward(self, loss, **kwargs) 2153 return 2154 elif self.scaler is not None: -> 2155 self.scaler.scale(loss).backward(**kwargs) 2156 elif learning_rate is not None and self.has_lomo_optimizer: 2157 self.lomo_backward(loss, learning_rate)

File ~/venv/lib/python3.12/site-packages/torch/_tensor.py:521, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs) 511 if has_torch_function_unary(self): 512 return handle_torch_function( 513 Tensor.backward, 514 (self,), (...) 519 inputs=inputs, 520 ) --> 521 torch.autograd.backward( 522 self, gradient, retain_graph, create_graph, inputs=inputs 523 )

File ~/venv/lib/python3.12/site-packages/torch/autograd/__init__.py:289, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs) 284 retain_graph = create_graph 286 # The reason we repeat the same comment below is that 287 # some Python versions print out the first line of a multi-line function 288 # calls in the traceback and some print out the last line --> 289 _engine_run_backward( 290 tensors, 291 grad_tensors_, 292 retain_graph, 293 create_graph, 294 inputs, 295 allow_unreachable=True, 296 accumulate_grad=True, 297 )

File ~/venv/lib/python3.12/site-packages/torch/autograd/graph.py:768, in _engine_run_backward(t_outputs, *args, **kwargs) 766 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs) 767 try: --> 768 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass 769 t_outputs, *args, **kwargs 770 ) # Calls into the C++ engine to run the backward pass 771 finally: 772 if attach_logging_hooks:

File ~/venv/lib/python3.12/site-packages/torch/autograd/function.py:306, in BackwardCFunction.apply(self, *args) 300 raise RuntimeError( 301 "Implementing both 'backward' and 'vjp' for a custom " 302 "Function is not allowed. You should only implement one " 303 "of them." 304 ) 305 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn --> 306 return user_fn(self, *args)

File ~/venv/lib/python3.12/site-packages/torch/utils/checkpoint.py:313, in CheckpointFunction.backward(ctx, *args) 308 if len(outputs_with_grad) == 0: 309 raise RuntimeError( 310 "none of output has requires_grad=True," 311 " this checkpoint() is not necessary" 312 ) --> 313 torch.autograd.backward(outputs_with_grad, args_with_grad) 314 grads = tuple( 315 inp.grad if isinstance(inp, torch.Tensor) else None 316 for inp in detached_inputs 317 ) 319 return (None, None) + grads

File ~/venv/lib/python3.12/site-packages/torch/autograd/__init__.py:289, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs) 284 retain_graph = create_graph 286 # The reason we repeat the same comment below is that 287 # some Python versions print out the first line of a multi-line function 288 # calls in the traceback and some print out the last line --> 289 _engine_run_backward( 290 tensors, 291 grad_tensors_, 292 retain_graph, 293 create_graph, 294 inputs, 295 allow_unreachable=True, 296 accumulate_grad=True, 297 )

File ~/venv/lib/python3.12/site-packages/torch/autograd/graph.py:768, in _engine_run_backward(t_outputs, *args, **kwargs) 766 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs) 767 try: --> 768 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass 769 t_outputs, *args, **kwargs 770 ) # Calls into the C++ engine to run the backward pass 771 finally: 772 if attach_logging_hooks:

File ~/venv/lib/python3.12/site-packages/torch/autograd/function.py:306, in BackwardCFunction.apply(self, *args) 300 raise RuntimeError( 301 "Implementing both 'backward' and 'vjp' for a custom " 302 "Function is not allowed. You should only implement one " 303 "of them." 304 ) 305 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn --> 306 return user_fn(self, *args)

File ~/venv/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py:479, in MatMul8bitLt.backward(ctx, grad_output) 474 grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A) 475 elif state.CxB is not None: 476 CB = ( 477 undo_layout(state.CxB, state.tile_indices) 478 .to(ctx.dtype_A) --> 479 .mul(state.SCB.unsqueeze(1).mul(1.0 / 127.0)) 480 ) 481 grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A) 482 else:

RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0

#######################################################

pip list

torch 2.4.0
transformers 4.44.0
peft 0.12.0
bitsandbytes 0.43.1
accelerate 0.33.0

Additional info

GPU instance: A100 80G

Note:

I still get the same error when using load_in_8bit=True inside AutoModelForCausalLM.from_pretrained() and not using BitsAndBytesConfig. I am using my own custom dataset.
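For reference, the alternative loading path mentioned in the note would look roughly like this (a sketch reusing model_id and token from the snippet above, not the exact code used):

```python
# Passing load_in_8bit directly instead of a BitsAndBytesConfig; the same
# backward-pass error is reported with this variant as well.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    token=token,
    torch_dtype=torch.float16,
)
```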