
Unfreeze layers in mixtral does not work as expected #1464

Open funeric555 opened 7 months ago

funeric555 commented 7 months ago


Expected Behavior

The example config mixtral.yml contains sample unfreeze patterns such as "^model.embed_tokens.weight$[:32000]" and "^lm_head.weight$".

The expected behavior is that the parameters listed in mixtral.yml are unfrozen: in this case rows 0-31999 of model.embed_tokens.weight, and all of lm_head.weight.

Current behaviour

The layers are not unfrozen as expected.

"All parameters are frozen. Model will not be trained."

Steps to reproduce

Uncomment the lines in the mixtral.yml config:

unfrozen_parameters:
  - ^lm_head.weight$
  - ^model.embed_tokens.weight$[:32000]

Then run as shown in the README:

accelerate launch -m axolotl.cli.train examples/mistral/mixtral.yml --deepspeed deepspeed_configs/zero2.json

Config yaml

base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./qlora-out

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
  - ^lm_head.weight$
  - ^model.embed_tokens.weight$[:32000]
#  - model.layers.2[0-9]+.block_sparse_moe.gate
#  - model.layers.2[0-9]+.block_sparse_moe.experts
#  - model.layers.3[0-9]+.block_sparse_moe.gate
#  - model.layers.3[0-9]+.block_sparse_moe.experts

model_config:
  output_router_logits: true

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
#lora_target_modules:
#  - gate
#  - q_proj
#  - k_proj
#  - v_proj
#  - o_proj
#  - w1
#  - w2
#  - w3

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

Two-part fix:

1. Change re.match to re.search at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/freeze.py#L195
2. Match a substring of the layer name instead of an anchored match, i.e. change the yml regexes to:
   - embed_tokens.[^\s]*weight$[:32000]
   - lm_head.[^\s]*weight

Some details: when running the config as shown, the actual parameter names differ from those in the example patterns. For example:

base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight
base_model.model.model.embed_tokens.modules_to_save.default.weight
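
To illustrate the difference, here is a standalone sketch against plain re, not the actual freeze.py code; the [:32000] slice suffix is axolotl-specific and omitted from the regex:

```python
import re

# Parameter names exactly as they appear when training with (Q)LoRA (from the run above).
names = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.embed_tokens.modules_to_save.default.weight",
]

# Proposed substring-style pattern; the [:32000] slice suffix is handled
# separately by axolotl and is not part of the regex itself.
pattern = r"embed_tokens.[^\s]*weight$"

for name in names:
    anchored = bool(re.match(pattern, name))   # anchored at position 0: always False here
    anywhere = bool(re.search(pattern, name))  # True only for the embed_tokens parameter
    print(f"{anchored=} {anywhere=} {name}")
```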

Also, in a use case where the goal is to freeze the existing embeddings and unfreeze only newly added ones, the slice should be [32000:], indicating that the newly added token embeddings are unfrozen rather than the existing ones. If the goal is to unfreeze the entire embedding layer, there is no need for a slice at the end.
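
For reference, a minimal PyTorch sketch (not axolotl code) of what unfreezing only rows 32000 and up amounts to: the embedding weight stays a single trainable tensor, and a gradient hook zeroes out the rows that should remain frozen. The sizes below are toy values chosen only for illustration.

```python
import torch
import torch.nn as nn

orig_vocab, new_tokens, dim = 32000, 8, 16
embed = nn.Embedding(orig_vocab + new_tokens, dim)  # original vocab plus newly added tokens

def keep_only_new_rows(grad):
    # "[32000:]" semantics: gradients survive only for the newly added rows,
    # so the original embeddings are effectively frozen.
    mask = torch.zeros_like(grad)
    mask[orig_vocab:] = 1.0
    return grad * mask

embed.weight.register_hook(keep_only_new_rows)

ids = torch.randint(0, orig_vocab + new_tokens, (4,))
embed(ids).sum().backward()
print(embed.weight.grad[:orig_vocab].abs().sum())  # tensor(0.): original rows untouched
```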

Which Operating Systems are you using?

Python Version

3.10.12

axolotl branch-commit

main/dbrx


winglian commented 7 months ago

This was last modified in https://github.com/OpenAccess-AI-Collective/axolotl/pull/1390. Do you mind looking in there to see if that helps?

funeric555 commented 7 months ago

I looked, and the issue I have is that I can't do something like lm_head.[^\s]*weight. Changing match to search permits freezing/unfreezing layers when running LoRA.

winglian commented 7 months ago

What are the current failure modes that lead you to recommend switching from match to search? I want to make sure we can capture those in unit tests. Thanks!

funeric555 commented 7 months ago

Thanks! A few 'issues':

1. If you are using (Q)LoRA, the layer names are very different, as I pasted above in the original issue.

2. You can never do an intermediate match. If you wanted to match all layers that contain lm_head, there is currently no way: "lm_head.[^\s]*weight" would never be satisfied, because re.match implies a left anchor, whereas re.search matches anywhere in the string. In your example config you added "^", implying you expected to have to anchor the match at the start explicitly; that expectation is consistent with re.search and is unnecessary with re.match.

So the main failure mode is the first one: with (Q)LoRA, the left-anchored patterns never match the renamed parameters, and nothing gets unfrozen.

Separately, a minor nit: if you have a "." in your regex, the code currently escapes it to "\.", which means you can never use "." as a wildcard, although "[^\s]" can substitute for "." as long as layer (or parameter) names never contain spaces.
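
Regarding the request for unit tests, here is a rough pytest-style sketch of the two failure modes above. It exercises plain re rather than axolotl's freeze.py, and the parameter names are the (Q)LoRA ones observed earlier, so treat it as a starting point only.

```python
import re

LORA_PARAM_NAMES = [
    "base_model.model.lm_head.weight",
    "base_model.model.model.embed_tokens.modules_to_save.default.weight",
]

def test_search_matches_lora_prefixed_names():
    # Failure mode: re.match is left-anchored, so an unanchored pattern never
    # hits the "base_model.model..." prefixed names that (Q)LoRA produces.
    pattern = r"lm_head.[^\s]*weight"
    assert re.search(pattern, LORA_PARAM_NAMES[0])
    assert re.match(pattern, LORA_PARAM_NAMES[0]) is None

def test_escaping_dots_breaks_wildcards():
    # Minor nit: escaping every "." to "\." means "." can no longer act as a
    # wildcard ("lm.head" below is intended to match "lm_head").
    pattern = r"lm.head.*weight"
    escaped = pattern.replace(".", r"\.")  # roughly what the current escaping does
    assert re.search(pattern, LORA_PARAM_NAMES[0])
    assert re.search(escaped, LORA_PARAM_NAMES[0]) is None
```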

funeric555 commented 7 months ago

I wonder if this could be related to, or could address, https://github.com/OpenAccess-AI-Collective/axolotl/issues/1188#issuecomment-1909101590. If you are running LoRA, could we require that only layers whose names contain "lora" are unfrozen, or would that not work?

funeric555 commented 7 months ago

After some experimenting, I have a guess about some of what is going on. I would appreciate being corrected if I am wrong, as well as answers to the questions below.

Examining the layers for a full fine-tune (FFT, e.g. with examples/llama-2/fft_optimized.yml), the parameter names are the normal ones, as expected, e.g. model.embed_tokens.weight or lm_head.weight.

When running LoRA (or QLoRA), each targeted layer is replaced by three parameters: the original, renamed to something like base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight, plus the LoRA A and B matrices, e.g. base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight and base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight. For LoRA, in theory you specify the layers to train via the lora_target_modules config; unfreezing the original weights instead would break things.
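
To see this renaming directly, here is a small sketch using transformers and peft; the tiny random Llama checkpoint is only an assumption so the script runs quickly, and any small causal LM behaves the same way.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Tiny test checkpoint (an assumption, chosen for speed); the naming scheme is
# the same for a real Mixtral/Llama model.
model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM"
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)

# Original weights show up as *.base_layer.weight (frozen), adapters as
# *.lora_A/lora_B.*.weight (trainable), and embed_tokens/lm_head get a full
# extra copy under *.modules_to_save.default.weight.
for name, param in model.named_parameters():
    print(param.requires_grad, name)
```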

I noticed a few things, and wonder what I am missing...

For lora_target_modules: when I set it to include only "q_proj" and "k_proj", the other layers still got LoRA A and B parameters, e.g. base_model.model.model.layers.1.self_attn.o_proj.lora_B.default.weight. I even added an invalid value and did not see any errors. Examining the code, it looks like lora_target_modules is passed into the LoRA config, which is passed to get_peft_model as a peft_config. I am not sure what is happening here.

There is also a separate option, lora_modules_to_save, where you can specify lm_head or embed_tokens. If you do not specify embed_tokens or lm_head there, they appear as base_model.model.lm_head.weight (with requires_grad = False).

When you do include embed_tokens (after adding extra tokens) or lm_head, you instead see two parameters, e.g. base_model.model.model.embed_tokens.original_module.weight and base_model.model.model.embed_tokens.modules_to_save.default.weight.

So, if I understand this correctly, when LoRA is applied to the embeddings it does not actually perform any low-rank reduction, and in theory you could freeze part of the non-original (modules_to_save) embedding copy if you only want to re-train the embeddings of newly added tokens...

My questions are:

1. Is it expected that if you list at least one lora_target_module, the other modules should not get the "A" and "B" LoRA layers (since you are not re-training them)?
2. Can you confirm that for embeddings or lm_head no reduced matrices (no A, B) should be built, i.e. that the observed behavior is correct?

seungduk-yanolja commented 7 months ago

If you are using LoRA, you need to touch the "modules_to_save" parameter instead of the "original" one. I am not sure what you want to ask here. Can you ask the question to ChatGPT or Claude and then elaborate on what the problem in the code is? Corrections to the test cases would be very helpful. Thanks!