ggerganov / llama.cpp

LLM inference in C/C++

KeyError: 'I8' when trying to convert finetuned 8bit model to GGUF #4199

Closed · Lue-C closed this issue 5 months ago

Lue-C commented 10 months ago


Hi there, I am fine-tuning the model https://huggingface.co/jphme/em_german_7b_v01 on my own data (I just replaced the questions and answers with dots below to keep it short and simple). The model is loaded in 8-bit and a PEFT adapter is added, which is then trained. After merging the weights of the trained adapter with the original model and saving the result as a full model, I want to convert the merged model using convert.py.

Expected Behavior

The model is converted to GGUF and saved as a file.

Actual Behavior

I get a KeyError: 'I8'.

Environment and context

I am running the following code in Colab. The relevant package versions are given by the pip commands.

!pip install -q peft
!pip install -q accelerate
!pip install -q optimum
!pip install -q bitsandbytes
!pip install transformers==4.30.0

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextStreamer, TrainingArguments, Trainer

!git clone https://huggingface.co/jphme/em_german_7b_v01

model_name='em_german_7b_v01'

model=AutoModelForCausalLM.from_pretrained(model_name,low_cpu_mem_usage=True, local_files_only=True, load_in_8bit=True)
tokenizer=AutoTokenizer.from_pretrained(model_name, local_files_only=True)
tokenizer.pad_token_id=tokenizer.eos_token_id

from datasets import Dataset, DatasetDict

title_1 = ...
question_1 = ...
answer_1 = ...
title_2 = ...
question_2 = ....
answer_2 = ...
title_3 = ...
question_3 = ...
answer_3 = ...

my_dict = {"train":{'q_id':[0, 1], 'title':[title_1, title_2], 'selftext':[question_1, question_2], 'answers.text':[answer_1, answer_2], 'answers.score':[10, 10]},
           "test":{'q_id':[2], 'title':[title_3], 'selftext':[question_3], 'answers.text':[answer_3], 'answers.score':[10]}}

my_eli_train = Dataset.from_dict(my_dict["train"])
my_eli_test = Dataset.from_dict(my_dict["test"])
my_eli = DatasetDict({'train':my_eli_train, 'test':my_eli_test})

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_my_eli = my_eli.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=my_eli["train"].column_names,
)

block_size = 4

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

my_dataset = tokenized_my_eli.map(group_texts, batched=True, num_proc=4)

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

model.add_adapter(peft_config)

training_args = TrainingArguments(
    output_dir="my_finetuned_model",
    evaluation_strategy="epoch",
    num_train_epochs=15,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_dataset["train"],
    eval_dataset=my_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

# save finetuned adapter
trainer.save_model("my_new_adapter")

from peft import PeftModel

base_model_name='em_german_7b_v01'
model=AutoModelForCausalLM.from_pretrained(base_model_name,low_cpu_mem_usage=True, local_files_only=True, load_in_8bit=True)

peft_model_id = "my_new_adapter"

# construct model with adapter
model = PeftModel.from_pretrained(model, peft_model_id)

# merge weights
merged_model = model.merge_and_unload()

# save full model
merged_model.save_pretrained("new_model")

Now I want to convert the merged model to GGUF using

!python llama.cpp/convert.py new_model \
  --outfile my_7b.gguf \
  --outtype q8_0

and get

Loading model file new_one/model-00001-of-00002.safetensors
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1228, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1161, in main
    model_plus = load_some_model(args.model)
  File "/content/llama.cpp/convert.py", line 1076, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/content/llama.cpp/convert.py", line 753, in lazy_load_file
    return lazy_load_safetensors_file(fp, path)
  File "/content/llama.cpp/convert.py", line 732, in lazy_load_safetensors_file
    model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
  File "/content/llama.cpp/convert.py", line 732, in <dictcomp>
    model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
  File "/content/llama.cpp/convert.py", line 720, in convert
    data_type = SAFETENSORS_DATA_TYPES[info['dtype']]
KeyError: 'I8'

Considering that the error indicates a data-type problem and that converting the original model to GGUF works fine, I think the problem is caused by the 8-bit quantization.

Did I forget some option when converting or loading the merged model? How can I convert the merged model to GGUF?

KerfuffleV2 commented 10 months ago

You can only convert to GGUF format from models with data in float16, bfloat16 or float32 formats. You can't convert models that are already quantized to a non-GGML format.

What you can do, if you're willing to accept the quality loss of requantizing, is to convert the quantized tensors in your model to one of the formats I mentioned and then convert the result to GGUF. Just keep in mind you'll be quantizing, unquantizing, then quantizing again, and quantizing is a lossy process.
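
If you want to see which tensors ended up in an unsupported dtype, you can inspect the safetensors header directly; it is just a length-prefixed JSON block, which is exactly what convert.py is parsing when it raises the KeyError. A rough sketch (the shard path is only an example):

import json
import struct

# Point this at one of the merged model's shards (example path).
path = "new_model/model-00001-of-00002.safetensors"

with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # first 8 bytes: header size, little-endian u64
    header = json.loads(f.read(header_len))          # JSON with dtype/shape/offsets per tensor

dtypes = {info["dtype"] for name, info in header.items() if name != "__metadata__"}
print(dtypes)  # anything like 'I8' or 'U8' here is what convert.py cannot map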

arbitropy commented 9 months ago

You can only convert to GGUF format from models with data in float16, bfloat16 or float32 formats. You can't convert models that are already quantized to a non-GGML format.

I have used the same code above to load and fine-tune the model. This is my bitsandbytes config for loading the model:

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype="float16"
)

At which point do I have to change the model so that it is compatible with GGUF conversion from the beginning, without requantizing?

KerfuffleV2 commented 9 months ago

I think you'd have to do your fine-tuning at 16-bit or above, which likely isn't an option since it would at least double the memory requirements. So basically you probably have to convert the tensors back up to f16; I'm not sure there's anything else you can do. I am not that familiar with fine-tuning, though.

arbitropy commented 9 months ago

How do I convert the tensors back up to fp16 or another compatible format? I am also not sure where this compatibility issue is occurring (I don't understand the internals of the models). This is part of the config.json of my fine-tuned, merged model:

"quantization_config": {
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": false,
    "load_in_8bit": true,
    "quant_method": "bitsandbytes"
},
"torch_dtype": "float16",

So the dtype here is set to float16 and the quantization compute dtype is float32, both of which seem to be compatible types for conversion.

Lue-C commented 9 months ago

You can only convert to GGUF format from models with data in float16, bfloat16 or float32 formats. You can't convert models that are already quantized to a non-GGML format.

What you can do, if you're willing to accept the quality loss of requantizing, is to convert the quantized tensors in your model to one of the formats I mentioned and then convert the result to GGUF. Just keep in mind you'll be quantizing, unquantizing, then quantizing again, and quantizing is a lossy process.

Thanks for the reply, I see the problem. I did not try converting the tensors back because of the expected quality loss. Here is what I did instead: I used the finetune example from llama.cpp with a GGUF file as the base model. Afterwards I used export-lora to merge the adapter with the base model into a single GGUF (roughly the commands sketched below). The result can be used like any other GGUF in LangChain, which was my goal. I did the fine-tuning with the example text (Shakespeare), but unfortunately I do not know in which format I have to provide training data for question answering / causal LM. Does anyone have an idea?
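
For reference, the workflow looks roughly like this. This is a sketch only: it assumes the finetune and export-lora binaries have been built, the base model and output file names are placeholders, and the flag names follow the examples/finetune and examples/export-lora READMEs and may differ between llama.cpp versions, so check the READMEs for your build.

# train a LoRA adapter directly against a quantized GGUF base model
!./llama.cpp/finetune \
  --model-base em_german_7b_v01.q8_0.gguf \
  --train-data shakespeare.txt \
  --lora-out lora-shakespeare.bin \
  --threads 4 --adam-iter 30 --batch 4 --ctx 64

# merge the trained adapter back into the base model as a single GGUF
!./llama.cpp/export-lora \
  --model-base em_german_7b_v01.q8_0.gguf \
  --lora lora-shakespeare.bin \
  --model-out em_german_7b_v01-shakespeare.gguf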

KerfuffleV2 commented 9 months ago

I'm not sure exactly; if Torch supports that 8-bit quantization, then you could possibly load the model and use Torch operations to convert it to the correct format. I think it would be something like model["tensorname"] = model["tensorname"].to(dtype=torch.float16). Note this is just a hint at a direction that might help you; I don't know enough to give you the exact command. Anyway, if you can find and convert the tensors that are in the wrong format, you can torch.save() the result to a different file and then possibly convert it to GGUF format.

Unfortunately, you basically need to know some Python/Torch stuff to pull it off so if you don't then your best bet is to latch on to someone who does. (Not me though!)
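
A minimal sketch of that idea, assuming the merged checkpoint is stored as safetensors (paths are examples; also note that bitsandbytes typically saves its int8 scaling factors separately from the weights, so a plain dtype cast will not necessarily recover the original values and should be treated as a starting point, not a recipe):

import os
import torch
from safetensors.torch import load_file, save_file

# Example shard name; repeat for every shard of the merged model.
tensors = load_file("new_model/model-00001-of-00002.safetensors")

for name, t in tensors.items():
    # Cast anything that is not already a float type up to float16.
    if not t.is_floating_point():
        tensors[name] = t.to(dtype=torch.float16)

os.makedirs("new_model_f16", exist_ok=True)
save_file(tensors, "new_model_f16/model-00001-of-00002.safetensors")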

arbitropy commented 9 months ago

I fixed it by merging the LoRA adapter with the full base model instead of the bitsandbytes-quantized one. I just reloaded the base model, then merged, and it worked without any error.
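
For anyone hitting the same thing, here is a minimal sketch of that fix, following the code earlier in the thread; the only real change is loading the base model in float16 instead of with load_in_8bit=True before merging (the output directory name is just an example):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model_name = "em_german_7b_v01"

# Reload the base model unquantized, in float16, instead of with load_in_8bit=True.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    local_files_only=True,
)

# Apply the trained adapter and merge its weights into the float16 base model.
model = PeftModel.from_pretrained(base_model, "my_new_adapter")
merged_model = model.merge_and_unload()

# The saved tensors are now float16, which convert.py accepts.
merged_model.save_pretrained("new_model_f16")

After that, the same convert.py invocation as before, pointed at new_model_f16, should get past the KeyError.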

znelson32 commented 9 months ago

I have the same issue, except my error is "U8". I spent hours trying to figure this out and this thread saved me, thanks!

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

balaji-2k1 commented 4 months ago

I encountered the same "I8" error while converting my fine-tuned Mixtral model to a GGUF file. Fortunately, this thread helped me resolve the issue. Thank you!