ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Gemma2 adapter weights `lm_head` skipped on gguf conversion #9065

Closed · ltoniazzi closed this issue 1 month ago

ltoniazzi commented 2 months ago

What happened?

The `lm_head` layer for a Gemma2 LoRA adapter is not converted by `convert_lora_to_gguf.py`, and is therefore not applied at inference (ruining the adapter's performance).


How to reproduce:

1. LoRA fine-tune Gemma2 with `pytorch`/`peft`, including `lm_head` in the `target_modules` param:

   ```python
   config = LoraConfig(target_modules=["lm_head"], ...)
   ```

2. Save the adapter.

3. Convert the adapter:

   ```bash
   python convert_lora_to_gguf.py --base --outtype f32
   ```

   The `lm_head` layer is then skipped by [this line in `convert_hf_to_gguf.py`](https://github.com/ggerganov/llama.cpp/blob/4b9afbbe9037f8a2d659097c0c7d9fce32c6494c/convert_hf_to_gguf.py#L2648) (and no error is raised):

   ```python
   if name == "lm_head.weight":
       logger.debug(f"Skipping get tensor {name!r} in safetensors so that convert can end normally.")
       return []
   ```

4. Run `llama-cli` to check that indeed no lora layer is applied in the [respective line in llama.cpp](https://github.com/ggerganov/llama.cpp/blob/8b3befc0e2ed8fb18b903735831496b8b0c80949/src/llama.cpp#L12021):

   ```bash
   ./llama-cli -m base/model/path/Base-F32.gguf \
       --lora lora/model/path/Lora-F32-LoRA.gguf \
       -p "Hello Gemma2" -n 50
   ```
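
For reference, a minimal sketch of steps 1–2 with `peft`/`transformers` (the model id, rank and output path below are illustrative placeholders, not taken from the report):

```python
# Sketch: produce a LoRA adapter that targets lm_head (placeholder model id, rank and paths).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
config = LoraConfig(r=8, target_modules=["q_proj", "v_proj", "lm_head"])
model = get_peft_model(base, config)

# ... fine-tune as usual ...

model.save_pretrained("gemma2-lm_head-lora")  # step 2: save the adapter
```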


Expected behaviour

I think this is a bug because a user might have trained an adapter that is applied to the `lm_head` layer, so silently skipping it on conversion will destroy the adapter's performance. I think the code should either:

- convert the `lm_head` adapter weights so they are applied at inference,

or

- raise an error/warning so the user knows the `lm_head` adapter layer is not supported.

Name and Version

version: 3524 (bc0f887e) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0

What operating system are you seeing the problem on?

macOS, but it should be a platform-independent problem.

Relevant log output

No response

josharian commented 2 months ago

I have ~zero context here but I will note that the Gemma family of models uses a reversible embedding, so the lm_head layer is tied to be identical to the embedding layer.

qnixsynapse commented 2 months ago

Gemma 2 uses weight tying so lm_head weights are same as token embedding weights.
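
For context, a quick way to see the tying in `transformers` (a sketch, not part of the thread; the model id is a placeholder):

```python
# Sketch: verify that Gemma2's output head is tied to its token embeddings.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
print(model.config.tie_word_embeddings)  # True for Gemma2
# Both modules reference the very same weight tensor:
print(model.get_output_embeddings().weight is model.get_input_embeddings().weight)  # True
```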

ltoniazzi commented 2 months ago

> Gemma 2 uses weight tying so lm_head weights are same as token embedding weights.

@qnixsynapse @josharian Thanks, yes, I suspect this might be the reason these weights are skipped, but I still feel the conversion should not succeed, as the adapter will not work correctly.

ngxson commented 2 months ago

Yes, that's correct: the adapter should not have `lm_head`. In fact, we actively ignore this tensor in `convert_hf_to_gguf.py`.

The simple solution for now is to add a check in `convert_lora_to_gguf.py`, inside the function `modify_tensors`. We can check if the `super()` call returns empty for `lm_head`:

https://github.com/ggerganov/llama.cpp/blob/2339a0be1c8e31fcf4531427183b94f2ef019e56/convert_lora_to_gguf.py#L365-L369

Add the check:

```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    dest = list(super().modify_tensors(data_torch, name, bid))
    if name == "lm_head.weight" and len(dest) == 0:
        raise ValueError("lm_head is present in adapter, but is ignored in base model")
    return dest
```

What do you think? @compilade

compilade commented 2 months ago

I think it's still applied in the compute graph (where the token embeddings tensor is duplicated for the output), so it should probably not be ignored from LoRA adapters.

https://github.com/ggerganov/llama.cpp/blob/2339a0be1c8e31fcf4531427183b94f2ef019e56/src/llama.cpp#L12020-L12021

Although this doesn't affect tok_embd.weight (is that a problem?).

Not sure how to bypass the check. Maybe something like "don't call super().modify_tensors for Gemma2 when the tensor name is lm_head.weight", or maybe even generalize this to all model architectures.

Or yes, maybe an error could be appropriate.

ngxson commented 2 months ago

> I think it's still applied in the compute graph (where the token embeddings tensor is duplicated for the output), so it should probably not be ignored from LoRA adapters.

Yeah, you're right.

`model.output` points to the same tensor as `model.tok_embd`, and it seems `create_tensor_for` can create different tensors (with different names) that point to the same memory:

https://github.com/ggerganov/llama.cpp/blob/2339a0be1c8e31fcf4531427183b94f2ef019e56/src/llama.cpp#L7039

https://github.com/ggerganov/llama.cpp/blob/2339a0be1c8e31fcf4531427183b94f2ef019e56/src/llama.cpp#L4107

So I think we're good on the cpp side at least, as it can handle lora tensors separately even if output and tok_embd are the same.

> Although this doesn't affect tok_embd.weight (is that a problem?).

I think that should not be a problem (not 100% sure). I suppose that PEFT sees output and tok_embd as two different tensors. Maybe we need to check this later on, @ltoniazzi.

> Not sure how to bypass the check. Maybe something like "don't call super().modify_tensors for Gemma2 when the tensor name is lm_head.weight", or maybe even generalize this to all model architectures.

The problem was that calling super().modify_tensors on Gemma2 returns 0 tensors, which causes lm_head.weight to be dropped from the final gguf lora adapter. We could either:

Option 1: Override the default behavior of Gemma2.modify_tensors and allow it to accept lm_head.weight only when converting a lora

We can do this by adding `and not self.is_lora` to the if condition below:

https://github.com/ggerganov/llama.cpp/blob/2339a0be1c8e31fcf4531427183b94f2ef019e56/convert_hf_to_gguf.py#L2696-L2700
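
For illustration, the modified condition could look roughly like this (a sketch only; it assumes the `Model` base class exposes an `is_lora` flag set by `convert_lora_to_gguf.py`):

```python
# Sketch of Option 1 inside Gemma2Model.modify_tensors (convert_hf_to_gguf.py):
# only skip lm_head for full-model conversion, keep it when converting a LoRA adapter.
if name == "lm_head.weight" and not self.is_lora:
    logger.debug(f"Skipping get tensor {name!r} in safetensors so that convert can end normally.")
    return []
```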

Option 2: Or, as in my previous comment, just raise an error if super().modify_tensors returns 0 tensors for lm_head.weight:

```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    dest = list(super().modify_tensors(data_torch, name, bid))
    if name == "lm_head.weight" and len(dest) == 0:
        raise ValueError("lm_head is present in adapter, but is ignored in base model")
    return dest
```

ltoniazzi commented 2 months ago

@compilade I think currently convert_lora_to_gguf.py will skip the lm_head adapter, so these lines you mention in build_gemma2 will not find an adapter to apply.

As you mentioned, one could avoid skipping the lm_head adapter during conversion, and then I agree llm_build_lora_mm will apply it correctly to the base output layer.

But I think the tricky part might be merging the adapter (export-lora), because then one needs to save a new tensor for lm_head, as merging the adapter makes the output layer different from the embedding. This new tensor will then need to be applied instead of the embedding layer, altering the way build_gemma2 builds the architecture.
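
A tiny torch sketch of why merging breaks the tying (the shapes are made up for illustration):

```python
import torch

# Hypothetical sizes for illustration only.
vocab, hidden, r = 8, 4, 2
tok_embd = torch.randn(vocab, hidden)  # shared by tok_embd and lm_head (tied weights)
lora_a = torch.randn(r, hidden)        # LoRA A for lm_head
lora_b = torch.randn(vocab, r)         # LoRA B for lm_head

# Merging the lm_head adapter yields a tensor that no longer equals the embeddings,
# so export-lora would have to write it out as a separate lm_head tensor.
merged_lm_head = tok_embd + lora_b @ lora_a
print(torch.equal(merged_lm_head, tok_embd))  # False
```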

So it looks to me that either:

- raising an error when converting an `lm_head` adapter, or
- converting the `lm_head` adapter and applying it at inference, while not supporting it in `export-lora`,

are the cleaner options. And if users show a demand to be able to merge `lm_head` adapters for Gemma2, then address this further.

ltoniazzi commented 2 months ago

> I suppose that PEFT sees output and tok_embd as two different tensors.

Had a quick look and it looks like for Gemma2 (and probably all models with config.tie_word_embeddings = True) lm_head and tok_embd are the same tensor in PEFT.

It indeed happens that merging the lm_head adapter merges it into the embedding layer as well: https://github.com/huggingface/peft/issues/2018.

Also, at inference, if the adapter is in an unmerged state (`lora_layer.merged = False`), then whether lm_head and tok_embd are the same should not matter, as the forward pass performs the adapter multiplication separately and then adds it to the base, as in llm_build_lora_mm.
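
A sketch of how one might check this (the model id and adapter path are placeholders; it assumes a Gemma2 LoRA adapter that targets `lm_head`):

```python
# Sketch: with tied weights, merging an lm_head LoRA adapter also changes the embeddings.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
embd_before = base.get_input_embeddings().weight.detach().clone()

model = PeftModel.from_pretrained(base, "path/to/lm_head-lora")  # placeholder adapter path
merged = model.merge_and_unload()

# If lm_head and tok_embd really are one tensor in PEFT, the embeddings moved too:
print(torch.equal(merged.get_input_embeddings().weight, embd_before))  # expected: False
```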

ngxson commented 2 months ago

Hmm, interesting. The problem with llama.cpp is that we currently don't support lora for tok_embd. Although it's simple to fix, I'm not sure if it's worth it, because I've never seen anyone fine-tuning a LoRA with tok_embd.

> But I think the tricky part might be merging the adapter (export-lora), because then one needs to save a new tensor for lm_head, as merging the adapter makes the output layer different from the embedding.

We could, for example, detect if the model is missing lm_head and duplicate tok_embd in that case. But IMO that's too many small details, given that lora adapters for gemma/gemma2 are not very popular atm.

Probably we should go with the simple way for now: don't allow the user to convert such an adapter in the first place. Then we will fix it when more users use lora adapters with gemma.

ltoniazzi commented 2 months ago

> Probably we should go with the simple way for now: don't allow the user to convert such an adapter in the first place.

Agree, I can have a go at it later this week.