axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/

ValueError: Attempting to unscale FP16 gradients. #1031

Open hengjiUSTC opened 6 months ago

hengjiUSTC commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Should run correctly.

Current behaviour

The run crashes with the traceback below.

wandb: WARNING Saving files without folders. If you want to preserve sub directories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
[2024-01-02 12:15:40,565] [INFO] [axolotl.callbacks.on_train_begin:572] [PID:9425] [RANK:0] The Axolotl config has been saved to the WandB run under files.
  0%|                                                                                                                                                            | 0/90 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/ubuntu/axolotl/src/axolotl/train.py", line 136, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1896, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2124, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2087, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

Steps to reproduce

I use the following config:

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: HenryJJ/tangshi
    type:
      system_prompt: ""
      field_system: system
      format: |-
        {instruction}
        Input: {input}
        Output: 
      no_input_format: "[INST] {instruction} [/INST]"
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./out

adapter: lora
lora_model_dir:

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true
eval_sample_packing: false

lora_r: 64
lora_alpha: 16
lora_dropout: 0.1
lora_target_linear: true
lora_modules_to_save:
  - embed_tokens
  - lm_head

lora_fan_in_fan_out:

wandb_project: tangshi
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: constant
learning_rate: 0.0001

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

device_map: auto

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 2
xformers_attention:
flash_attention: false

warmup_steps: 10
eval_steps: 2
eval_batch_size: 4
eval_table_size:
eval_table_max_new_tokens:
save_steps: 10
save_total_limit: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<unk>"

and run with python3 -m axolotl.cli.train mix_tangshi/config.yml

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main commit 3678a6c41d051ca6376d013c11c948e55b4c8b4f

Acknowledgements

hengjiUSTC commented 6 months ago

However, I am able to run LoRA with fp16 in my other experiments (https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L316), so I am not sure what the expected behavior is.

hengjiUSTC commented 6 months ago

I found that the bug happens when I set

lora_modules_to_save:
  - embed_tokens
  - lm_head
  1. Why did I set it? Because of the detection at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L153. However, my only special token setting is pad_token: <unk>, and <unk> is already in the vocabulary, so I feel this detection shouldn't be triggered (see the sketch after this list).

  2. I am not sure why setting lora_modules_to_save together with fp16 leads to a crash.

  3. Another problem is at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L123: when flash_attention is false and is_mistral_derived_model is true, the padding side is not set to left, which is incorrect for Mistral training.
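
For point 1, here is a minimal sketch of the check I have in mind (my own illustration, not axolotl's actual detection code): a special token that already exists in the tokenizer vocabulary reuses an existing embedding row, so embed_tokens/lm_head should not need to be trainable.

    # Illustration only, not axolotl's detection logic.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    special_tokens = {"pad_token": "<unk>"}

    # Only a token that is missing from the vocabulary adds a new embedding row
    # and therefore needs embed_tokens/lm_head in lora_modules_to_save.
    vocab = tokenizer.get_vocab()
    adds_new_tokens = any(tok not in vocab for tok in special_tokens.values())
    print("needs lora_modules_to_save:", adds_new_tokens)  # False, <unk> already exists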

winglian commented 6 months ago

I'm wondering if we are even supposed to be recasting to fp16. The original qlora only recasts when bf16 is used: https://github.com/artidoro/qlora/blame/main/qlora.py#L396-L405
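
For reference, the pattern in the linked qlora code is roughly the following (a paraphrased sketch, not axolotl's implementation): LoRA layers and the embedding/lm_head weights are only downcast when bf16 is enabled, so under fp16 the trainable weights stay in fp32 and the GradScaler can unscale their gradients.

    # Paraphrase of the recast loop in the linked qlora.py, for illustration.
    import torch
    from peft.tuners.lora import LoraLayer

    def recast_modules(model, bf16: bool):
        for name, module in model.named_modules():
            if isinstance(module, LoraLayer) and bf16:
                module.to(torch.bfloat16)
            if "norm" in name:
                module.to(torch.float32)
            if ("lm_head" in name or "embed_tokens" in name) and hasattr(module, "weight"):
                if bf16 and module.weight.dtype == torch.float32:
                    module.to(torch.bfloat16)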

winglian commented 6 months ago

@hengjiUSTC if you comment out these lines for your configuration above, does that fix the issue?

hengjiUSTC commented 6 months ago

I am using LoRA instead of QLoRA, so these lines won't be triggered: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L554-L561/

    if (cfg.adapter == "lora" and load_in_8bit) or (
        cfg.adapter == "qlora" and cfg.load_in_4bit
    ):

Both load_in_8bit and load_in_4bit are false in my config, so neither branch applies.

hengjiUSTC commented 6 months ago

See the relevant discussions in https://github.com/huggingface/transformers/issues/23165 and https://github.com/huggingface/peft/issues/341.

Here are some experiments. This combination breaks with ValueError("Attempting to unscale FP16 gradients."):

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )

There is no error with the two configs below:

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float32,
    )
    training_args = TrainingArguments(
        fp16=True,
        ...
    )

    model = AutoModelForCausalLM.from_pretrained(
        ...
        torch_dtype=torch.float16,
    )
    training_args = TrainingArguments(
        fp16=False,
        ...
    )

I am a bit new to these settings; does anyone know the reason? (I am using a T4 GPU, so I am not able to use bf16.) How should we handle this error in axolotl?

hengjiUSTC commented 6 months ago

I got confirmation that we should not load the model in float16 when fp16 is enabled in the PEFT config: https://github.com/huggingface/peft/issues/341#issuecomment-1884911753. But I do see a lot of code in other finetune repos doing this, and it is the reason the error is raised in Axolotl: when fp16 is true in config.yml, the model is loaded in float16 and fp16 is enabled in PEFT.
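
So, as far as I understand it, the safe combination is to keep the weights that receive gradients in fp32 and let fp16 autocast handle the forward pass, for example (illustrative snippet, arguments trimmed):

    import torch
    from transformers import AutoModelForCausalLM, TrainingArguments

    # Illustrative only: load in fp32 when training with fp16 mixed precision,
    # so the GradScaler unscales fp32 gradients.
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        torch_dtype=torch.float32,  # not torch.float16 when fp16=True below
    )
    training_args = TrainingArguments(
        output_dir="./out",
        fp16=True,  # autocast to fp16 in the forward pass, master weights stay fp32
    )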

ehartford commented 5 months ago

I also have these lines because I am using ChatML and adding new tokens to the base model:

lora_modules_to_save:
  - embed_tokens
  - lm_head

NanoCode012 commented 3 months ago

Based on what @hengjiUSTC linked, if I understand it correctly, fp16 adapter training must keep the trainable parameters in fp32 and the non-trainable ones in fp16. They provide a utility function, cast_mixed_precision_params(peft_model, dtype), for us to use, but since we also handle gate/norm layers, we may need to adjust it ourselves.
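
For reference, a rough usage sketch of that utility (the base model and LoRA config below are placeholders, not a recommendation):

    import torch
    from peft import LoraConfig, get_peft_model
    from peft.helpers import cast_mixed_precision_params
    from transformers import AutoModelForCausalLM

    # Placeholder model and config, just to show the call.
    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    peft_model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM", r=8))

    # Casts trainable (adapter) params to fp32 and frozen params to fp16,
    # which is the mix fp16 adapter training needs.
    cast_mixed_precision_params(peft_model, dtype=torch.float16)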