axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes` #1354

Kushalamummigatti commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Should be able to run inference on the trained model.

Current behaviour

When I try to run the command accelerate launch -m axolotl.cli.inference test.yaml --lora_model_dir= "/home/aion/axolotl/lora-out" for inference, I get the error:

ImportError: Using bitsandbytes 8-bit quantization requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://pypi.org/simple/ bitsandbytes
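For reference, the conditions behind this message can be checked ahead of time. A diagnostic sketch, assuming transformers, accelerate, and bitsandbytes are installed in the active environment (the is_*_available helpers live in transformers.utils in recent releases but may move between versions); the CUDA check is included because the 8-bit path needs a GPU, so the error can appear even when both packages are installed:

# Diagnostic sketch: confirm what the bitsandbytes 8-bit path needs.
import torch
from importlib.metadata import version
from transformers.utils import is_accelerate_available, is_bitsandbytes_available

print("accelerate version:", version("accelerate"))
print("bitsandbytes version:", version("bitsandbytes"))
print("CUDA visible to torch:", torch.cuda.is_available())
print("transformers sees accelerate:", is_accelerate_available())
print("transformers sees bitsandbytes:", is_bitsandbytes_available())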

Steps to reproduce

Config yaml

base_model: codellama/CodeLlama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: CodeLlamaTokenizer
is_llama_derived_model: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: dataset_summaries.jsonl
    ds_type: json
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10.13

axolotl branch-commit

OpenAccess-AI-Collective / axolotl

Acknowledgements

NanoCode012 commented 6 months ago

Can you retry following the pip install instructions in the README?

ann-brown commented 6 months ago

As a note, I have been having similar issues basically everywhere, including the axolotl Docker image, the inference API on Hugging Face, and installs on other Python images on RunPod. Tools like https://github.com/mlabonne/llm-autoeval that previously worked for 8-bit quantization broke with the same error.

NanoCode012 commented 6 months ago

Hm, so, training worked, but inference failed?

ann-brown commented 6 months ago

Some means of inference fail, including the ones Axolotl seems to be using, though not quite every means. If I load the model this way:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "/path/to/model"

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_path)

... it gives me the warning: "Detected the presence of a quantization_config attribute in the model's configuration but you don't have the correct bitsandbytes version to support int8 serialization. Please install the latest version of bitsandbytes with pip install --upgrade bitsandbytes."

I am at bitsandbytes 0.42.0 in this environment and the given command does not upgrade it further, even though it looks like they made a version 0.43.0 recently.
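As an aside, when pip and the running interpreter seem to disagree like this, a small sketch (nothing assumed beyond bitsandbytes itself) shows which installation is actually being imported, which can differ from what pip reports when several environments exist:

# Sketch: print the version and location of the bitsandbytes actually imported.
import bitsandbytes as bnb

print(bnb.__version__)  # the version the interpreter really loads
print(bnb.__file__)     # the path reveals which environment it came from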

However, I can interact with the model normally from there on my MacBook in my Python environment, with, e.g.:

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate text using the model
prompt = "### Instruction: Write a poem about an acorn. ### Response:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

successfully running inference and returning output "### Instruction: Write a poem about an acorn. ### Response: The acorn is a small, hard seed that can grow into a mighty oak tree. It symbolizes growth, strength, and resilience. The poem should include vivid imagery and metaphors to convey the tree's significance." (Though it takes a while, I'd guess I'm not making use of Apple Metal in this circumstance.)
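On the Apple Metal point, a minimal sketch of the same generation moved onto PyTorch's MPS backend; the float16 dtype on MPS is an assumption to keep memory down, not something from the original setup:

# Sketch: run generation on Apple's MPS backend when it is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"
device = "mps" if torch.backends.mps.is_available() else "cpu"
dtype = torch.float16 if device == "mps" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype).to(device)

prompt = "### Instruction: Write a poem about an acorn. ### Response:"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))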

And yes, the training runs normally (on a RunPod GPU, not the MacBook). The inference errors also started appearing on previously trained models that had been able to run inference in the given environments before, and not just in Axolotl.

ann-brown commented 6 months ago

Wondering if this warning from the Docker image when running preprocessing is relevant: /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.

NanoCode012 commented 5 months ago

Hey, the last warning is possibly due to running without GPU (running with CPU). However, your above command shows otherwise. Did you manage to solve this?

ann-brown commented 5 months ago

> Hey, the last warning is possibly due to running without GPU (running with CPU). However, your above command shows otherwise. Did you manage to solve this?

Not unless it solved itself incidentally; I've just been avoiding 8-bit quantization. It's definitely running on the GPU when training, at least in float16 or float32. The adamw_bnb_8bit optimizer still seems to work fine (in those precisions).

brijesh24bs commented 4 months ago

I got a similar error while training in Google Colab, but I found that I wasn't using a Google GPU. I changed the runtime and it worked.
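For anyone hitting the same thing in Colab, a quick sketch (plain torch calls only, no Colab-specific APIs assumed) to confirm the runtime actually exposes a GPU before launching training:

# Sketch: verify a CUDA device is attached before starting a training run.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; change the runtime type to a GPU instance.")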