axolotl-ai-cloud / axolotl


RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3) #933

Open RicardoDominguez opened 9 months ago

RicardoDominguez commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I fine-tune a Mistral model with the default zero3.json.

Training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

My accelerate config is

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Current behaviour

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

yields the error

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3756, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

and

  model = transformers.AutoModelForCausalLM.from_pretrained('test',
                                                            device_map='auto',
                                                            torch_dtype=torch.bfloat16,
                                                            trust_remote_code=True,
                                                            low_cpu_mem_usage=True)

yields the error

Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32002, 4096])), this look incorrect.
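
A minimal diagnostic sketch, assuming the run saved safetensors files into the output_dir ("test" below mirrors the config further down), can confirm which saved file actually holds the zero-sized embedding:

# Diagnostic sketch: print the shape stored for model.embed_tokens.weight in
# every safetensors file in the output dir. Assumes the model was saved as
# safetensors; adjust the glob if the save produced .bin shards instead.
import glob
from safetensors import safe_open

for path in sorted(glob.glob("test/*.safetensors")):
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            if key.endswith("embed_tokens.weight"):
                print(path, key, tuple(f.get_tensor(key).shape))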

Steps to reproduce

accelerate launch -m axolotl.cli.train mistral_config.yml  --deepspeed deepspeed/zero3.json

and thereafter

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

Config yaml

base_model: model_dir/mistral-7b-v0.1/
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
    - path: dset_dir/slim-orca/slim-orca.jsonl
      type: sharegpt
      ds_type: json
      conversation: chatml

dataset_prepared_path: prep-datasets/
val_set_size: 0
output_dir: test/
sequence_len: 8192 
sample_packing: true
pad_to_sequence_len: true

wandb_project: orca
wandb_entity:
wandb_watch:
wandb_run_id: mistral-slimorca
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0 # gradient clipping max norm
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens:
save_steps: 0.9999
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

Seems related to #705 and #709
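
It may also be worth checking (an assumption on my part, not confirmed in this thread) that the zero3.json actually being launched enables stage3_gather_16bit_weights_on_model_save: under ZeRO-3, DeepSpeed only writes full consolidated weights at save time when that flag is set, and without it the saved tensors can be empty placeholders like the torch.Size([0]) embedding above. A quick sanity check of the config in use:

# Hedged sanity check: does the zero3.json being launched gather full
# 16-bit weights when the model is saved?
import json

with open("deepspeed/zero3.json") as f:
    zero_opt = json.load(f).get("zero_optimization", {})

print("stage:", zero_opt.get("stage"))
print("stage3_gather_16bit_weights_on_model_save:",
      zero_opt.get("stage3_gather_16bit_weights_on_model_save"))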

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/3e3229e2d99bb509784ac72e6589f8a8e406247f

Acknowledgements

winglian commented 9 months ago

Are you using a model from a checkpoint folder or the output folder?

RicardoDominguez commented 9 months ago

From the output folder

  File "<stdin>", line 1, in <module>
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3931, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

RicardoDominguez commented 9 months ago

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

maxidl commented 8 months ago

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

I just ran into the same error and can confirm that switching from zero3 to zero2 "solved" the issue.
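
If staying on ZeRO-3 is required, one possible workaround, sketched here under the assumption that a DeepSpeed global_step* folder was saved inside a checkpoint-* directory (the checkpoint name below is a placeholder, and the 32002 vocab size comes from the error above), is to rebuild a full fp32 state dict with DeepSpeed's zero_to_fp32 utilities and re-save a plain HF checkpoint:

# Sketch: consolidate a ZeRO-3 partitioned checkpoint into a regular HF save.
import transformers
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "test/checkpoint-XXXX"   # placeholder checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

model = transformers.AutoModelForCausalLM.from_pretrained("model_dir/mistral-7b-v0.1/")
model.resize_token_embeddings(32002)             # the run added the ChatML tokens
model.load_state_dict(state_dict, strict=False)  # strict=False in case some buffers are absent
model.save_pretrained("test-consolidated/")

DeepSpeed also writes a zero_to_fp32.py script into each checkpoint folder that performs the same conversion from the command line.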

mgoulao commented 7 months ago

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

winglian commented 7 months ago

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

@mgoulao is this a transformers regression then? That particular commit works with zero3?

mgoulao commented 7 months ago

Yes, it does work with ZeRO 3; however, you will get this problem: #1035

luijait commented 7 months ago

I had the same error; the transformers library fixes it, but now I get this one:

    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 813, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!

tcapelle commented 6 months ago

I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.

loading model
Traceback (most recent call last):
  File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
    model = AutoModelForCausalLM.from_pretrained(config.model_path, torch_dtype=getattr(torch, config.torch_dtype))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3977, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

maxidl commented 6 months ago

I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.


This post is old; I think there is no solution, you simply cannot use QLoRA + DeepSpeed ZeRO 3. Fortunately, there is now a quite good alternative that was recently implemented in Axolotl: FSDP (full sharding) + QLoRA. Link

The solution I found most viable was to use a non-quantized LoRA with DeepSpeed ZeRO 3.

Apart from that, I believe that, as of today, there is no way to load QLoRA adapters with DeepSpeed Stage 3.

I hope I'm wrong, but all the answers I found on the internet basically came down to this.

This issue is about full finetune, no lora involved.

tcapelle commented 6 months ago

I am doing a full fine-tune, no QLoRA.

0-hero commented 6 months ago

+1 Zero3_bf16 + Full-finetune

RuntimeError: Error(s) in loading state_dict for MistralModel:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32006, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

EDIT - Can confirm zero2 works

JCRPaquin commented 4 months ago

I encountered this too, although mine was with Llama 3 + zero3. The model safetensors were being output as shards, but there was also a model.safetensors file that HF seems to load by default, even though it is not included in the index.json. Once I (re)moved the model.safetensors file, the model seems to have loaded successfully.
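
Building on that observation, a small sketch along these lines (the "test" path mirrors the output_dir in the config above and is only illustrative) can flag and set aside a stray model.safetensors that the shard index does not reference:

# Sketch: move aside a model.safetensors that is not listed in the shard index,
# so from_pretrained falls back to the sharded files.
import json, os

out_dir = "test"
with open(os.path.join(out_dir, "model.safetensors.index.json")) as f:
    indexed_files = set(json.load(f)["weight_map"].values())

stray = os.path.join(out_dir, "model.safetensors")
if os.path.exists(stray) and "model.safetensors" not in indexed_files:
    os.rename(stray, stray + ".bak")   # keep a backup instead of deleting
    print("Moved stray model.safetensors aside; retry from_pretrained.")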