I have ~zero context here but I will note that the Gemma family of models uses a reversible embedding, so the lm_head layer is tied to be identical to the embedding layer.
Gemma 2 uses weight tying, so the lm_head weights are the same as the token embedding weights.
@qnixsynapse @josharian Thanks, yes, I suspect this might be the reason why these weights are skipped, but I still feel the conversion should not succeed, as the adapter will not work correctly.
Yes, that's correct, the adapter should not have `lm_head`. In fact, we actively ignore this tensor in `convert_hf_to_gguf.py`.

The simple solution for now is to add a check in `convert_lora_to_gguf.py`, inside the `modify_tensors` function. We can check whether the `super()` call returns an empty list for `lm_head`. Add the check:
```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    dest = list(super().modify_tensors(data_torch, name, bid))
    # The base-model conversion drops lm_head (it is tied to the token embeddings),
    # so an adapter targeting lm_head would be silently lost; fail loudly instead.
    if name == "lm_head.weight" and len(dest) == 0:
        raise ValueError("lm_head is present in adapter, but is ignored in base model")
    return dest
```
What do you think? @compilade
I think it's still applied in the compute graph (where the token embeddings tensor is duplicated for the output), so it should probably not be ignored from LoRA adapters.
Although this doesn't affect `tok_embd.weight` (is that a problem?).

Not sure how to bypass the check. Maybe something like "don't call `super().modify_tensors` for Gemma2 when the tensor name is `lm_head.weight`", or maybe even generalize this to all model architectures.
Or yes, maybe an error could be appropriate.
> I think it's still applied in the compute graph (where the token embeddings tensor is duplicated for the output), so it should probably not be ignored from LoRA adapters.
Yeah, you're right.
`model.output` is pointed to the same tensor as `model.tok_embd`, and it seems like `create_tensor_for` can create different tensors (with different names) that point to the same memory.

So I think at least we're good on the cpp side, as it can handle lora tensors separately even if `output` and `tok_embd` are the same.
> Although this doesn't affect `tok_embd.weight` (is that a problem?).
I think that should not be a problem (not 100% sure). I suppose that PEFT sees `output` and `tok_embd` as 2 different tensors. Maybe we need to check this later on @ltoniazzi
> Not sure how to bypass the check. Maybe something like "don't call `super().modify_tensors` for Gemma2 when the tensor name is `lm_head.weight`", or maybe even generalize this to all model architectures.
The problem was that calling `super().modify_tensors` on Gemma2 returns 0 tensors, so `lm_head.weight` ends up being removed from the final gguf lora adapter. We could either:
Option 1: Override the default behavior of `Gemma2.modify_tensors` and allow it to accept `lm_head.weight` only if it's converting a lora. We can do this by adding `and not self.is_lora` to the `if` condition below:
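A rough sketch of what that could look like in `Gemma2Model.modify_tensors` inside `convert_hf_to_gguf.py` (assuming an `is_lora` flag is set on the model class when it is driven by `convert_lora_to_gguf.py`; the method body here is illustrative, not the exact upstream code):

```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    # lm_head is tied to the token embeddings in the base model, so the base
    # conversion skips it; keep it only when converting a LoRA adapter.
    if name == "lm_head.weight" and not self.is_lora:
        logger.debug(f"Skipping tensor {name!r}, lm_head is tied to tok_embd")
        return []
    return [(self.map_tensor_name(name), data_torch)]
```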
Option 2: Or, as in my last comment, just throw an error if `super().modify_tensors` returns 0 tensors for `lm_head.weight`:

```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    dest = list(super().modify_tensors(data_torch, name, bid))
    if name == "lm_head.weight" and len(dest) == 0:
        raise ValueError("lm_head is present in adapter, but is ignored in base model")
    return dest
```
@compilade I think currently `convert_lora_to_gguf.py` will skip the `lm_head` adapter, so the lines you mention in `build_gemma2` will not find an adapter to apply.

As you mentioned, one could avoid skipping the `lm_head` adapter during conversion, and then I agree `llm_build_lora_mm` would apply it correctly to the base output layer.
But I think the tricky part might be merging the adapter (`export-lora`), because then one needs to save a new tensor for the `lm_head`, as merging the adapter makes the output layer different from the embedding. And then this new tensor will need to be applied instead of the embedding layer, altering the way `build_gemma2` builds the architecture.
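To make that concrete, here is a toy illustration in plain PyTorch (shapes and names are made up) of why merging an `lm_head` adapter forces a separate output tensor:

```python
import torch

hidden, vocab, r, alpha = 16, 32, 4, 8
W = torch.randn(vocab, hidden)      # tied weight: serves as both tok_embd and lm_head
A = torch.randn(r, hidden)          # LoRA A matrix for lm_head
B = torch.randn(vocab, r)           # LoRA B matrix for lm_head
scale = alpha / r

# Merging folds the low-rank update into the output weight only.
W_out_merged = W + scale * (B @ A)

# The embeddings must stay untouched, so after the merge the output weight can
# no longer alias tok_embd and has to be stored as its own tensor.
print(torch.allclose(W, W_out_merged))  # False
```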
So it looks to me that either:

- raising an error at conversion when the adapter contains `lm_head`, or
- converting `lm_head` but raising an error for `export-lora` when merging the adapter

are the cleaner options.
And if users show a demand to be able to merge `lm_head` adapters for Gemma2, then address this further.
> I suppose that PEFT sees `output` and `tok_embd` as 2 different tensors.
Had a quick look, and it looks like for Gemma2 (and probably all models with `config.tie_word_embeddings = True`) `lm_head` and `tok_embd` are the same tensor in PEFT.
It indeed happens that merging the `lm_head` adapter merges it into the embed layer as well: https://github.com/huggingface/peft/issues/2018.
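A quick way to confirm the tying from Python (a minimal sketch using the `transformers` API; the checkpoint name is only an example):

```python
from transformers import AutoModelForCausalLM

# Any checkpoint with config.tie_word_embeddings = True behaves this way.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

# With tied weights, the output head and the input embeddings are the same
# Parameter object, so merging an adapter into one also changes the other.
print(model.get_output_embeddings().weight is model.get_input_embeddings().weight)  # True
```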
Also at inference, if the adapter is in an unmerged state (`lora_layer.merged = False`), then whether `lm_head` and `tok_embd` are the same should not matter, as the forward pass performs the adapter multiplication separately and then adds it to the base, as in `llm_build_lora_mm`.
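Roughly, the unmerged forward pass does the following (a plain PyTorch sketch with illustrative shapes, analogous to what `llm_build_lora_mm` does on the C++ side):

```python
import torch

hidden, vocab, r, alpha = 16, 32, 4, 8
x = torch.randn(1, hidden)          # a hidden state going into the output head
W = torch.randn(vocab, hidden)      # tied base weight (tok_embd / lm_head)
A = torch.randn(r, hidden)          # LoRA A matrix for lm_head
B = torch.randn(vocab, r)           # LoRA B matrix for lm_head (zero-initialized in real training)
scale = alpha / r

# Base projection plus the adapter's low-rank correction, computed separately
# and added on top; the base weight itself is never modified.
logits = x @ W.T + scale * (x @ A.T @ B.T)
print(logits.shape)  # torch.Size([1, 32])
```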
Hmm interesting. The problem with llama.cpp is that we currently don't support lora for `tok_embd`. Although it's simple to fix, I'm not sure if it's worth it, because I have never seen anyone fine-tune a LoRA with `tok_embd`.
> But I think the tricky part might be merging the adapter (`export-lora`), because then one needs to save a new tensor for the `lm_head`, as merging the adapter makes the output layer different from the embedding.
We could, for example, detect if the model is missing `lm_head`, and duplicate `tok_embd` if one is missing. But IMO that's too many small details, given that lora adapters for gemma/gemma2 are not very popular atm.
Probably we should go with the simple way for now: don't allow the user to convert such an adapter in the first place. Then, we will fix it when more users use lora adapters with gemma.
> Probably we should go with the simple way for now: don't allow the user to convert such an adapter in the first place.
Agree, I can have a go at it later this week
What happened?
The `lm_head` layer for a Gemma2 LoRA adapter is not converted by `convert_lora_to_gguf.py`, and therefore not applied at inference (ruining the performance of the adapter).

How to reproduce:
1. LoRA fine-tune Gemma2 with `pytorch`/`peft`, including `lm_head` in the `target_modules` param:
   ```python
   config = LoraConfig(target_modules=["lm_head"], ...)
   ```
2. Save the adapter.
3. Convert the adapter, debugging `convert_lora_to_gguf.py`:
   ```bash
   python convert_lora_to_gguf.py
   ```
Expected behaviour
I think this is a bug because a user might have trained an adapter that is applied to the `lm_head` layer, so skipping it on conversion will destroy the adapter's performance. I think the code should either:

- raise an error, e.g. `Cannot convert Gemma2 adapter with lm_head layer`, or
- convert the `lm_head` layer (although the `lm_head` layer shares the weights with the `embed` layer in Gemma2, probably leading to having to create a new tensor for the `lm_head` to merge the adapter to).

Comments
- `convert_lora_to_gguf.py` was introduced in PR #8332, so maybe @ngxson knows if skipping the `lm_head` is the desired outcome or if it is actually a bug. Otherwise I'm happy to try to figure out why this happens.
- … `lm_head` lora layer correctly.

Name and Version
version: 3524 (bc0f887e) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
What operating system are you seeing the problem on?
MacOS, but it should be a platform-independent problem.
Relevant log output
No response