OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

[BUG] The text cannot be generated successfully during the Raft step #861

Open biaoliu-kiritsugu opened 1 week ago

biaoliu-kiritsugu commented 1 week ago

Describe the bug

When I use the fine-tuned Llama 3 model to run the examples/raft_align.py script, I encounter the following error:

Traceback (most recent call last):
  File "/home/work/user-job-dir/app/liubiao/llm/LMflow/examples/raft_align.py", line 220, in <module>
    main()
  File "/home/work/user-job-dir/app/liubiao/llm/LMflow/examples/raft_align.py", line 183, in main
    outputs = model.generate(**inputs, **generation_kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1164, in forward
    outputs = self.model(
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 968, in forward
    layer_outputs = decoder_layer(
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 713, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/naie/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 331, in forward
    query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[2, 206, 32, 128]' is invalid for input of size 412
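
For reference, the numbers in the error line up as follows (a quick sanity check, not part of the original script): the .view() call expects bsz * q_len * num_heads * head_dim elements, but the tensor only contains bsz * q_len elements, i.e. each token carries 1 value instead of 4096, which suggests the attention projection did not see the full weight matrix.

# Quick arithmetic check on the failing reshape, using the shapes from the error message above.
bsz, q_len, num_heads, head_dim = 2, 206, 32, 128
expected = bsz * q_len * num_heads * head_dim     # 1,687,552 elements required by .view()
actual = 412                                      # elements actually present (from the RuntimeError)
print(expected, actual, actual // (bsz * q_len))  # 1687552 412 1 -> one value per token instead of 4096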

This is my running script:

accelerate launch --main_process_port 8836 \
    --config_file configs/deepspeed_zero3.yaml --num_processes 1 \
    examples/raft_align.py \
    --model_name_or_path ${model_name_or_path} \
    --reward_model_or_path ${reward_model_or_path} \
    --tokenizer_name ${tokenizer_name} \
    --num_raft_iteration 20 \
    --learning_rate 2e-5 \
    --block_size 512 \
    --fp16 \
    --dataset_path ${dataset_path} \
    --output_reward_path log/raft_aligner/reward.txt \
    --output_dir ${output_dir} --overwrite_output_dir \
    --run_name "${exp_id}_${timestamp}" \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --validation_split_percentage 0 \
    --logging_steps 1 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 7777 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 12 \
    --inference_batch_size_per_device 1 \
    --collection_strategy "local" \
    --raft_batch_size 1024 \
    --output_min_length 96 \
    --output_max_length 512 \
    --top_reward_percentage 0.125

However, when I use the following test script, the text is generated successfully during the generate step without any errors:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "/home/work/user-job-dir/app/liubiao/huggingface/merge_instruct_llama3_sft"
tokenizer_name = "/home/naie/work/liubiao/huggingface/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "npu"
model.to(device)

input_texts = ["<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nShould you buy a case to protect your cell phone?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIt depends on your circumstances.  If you carry your phone in a pocket or a purse then you probably want a case.  But if you only need a phone for quick interactions, a case may actually cause more harm than good.  What do you need the phone for?  Are you a parent, or do you work from home?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat harm could it do?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA phone case can damage the screen, for one thing.  It can also get you in trouble if you have your phone turned off for some reason.  Then you will turn it back on and it won’t do anything.  If you can afford to replace it, then you need a case to protect it.  The problem is that most people aren’t able to afford to replace their phones all the time.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThanks for letting me know.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"] * 2 
# generation_kwargs = {
#     "max_new_tokens": 50 
# }

stop_token = "<|eot_id|>"
stop_token_id = tokenizer.encode(stop_token)[0]
# tokenizer.add_special_tokens({"eos_token": "<|eot_id|>"})

# print(tokenizer.eos_token)

generation_kwargs = {
    "max_new_tokens": 96,
    "min_length": 1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "eos_token_id": stop_token_id,
    "temperature": 0.85,
    "repetition_penalty": 1.2,
}

tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)
print("Input IDs size:", inputs["input_ids"].size())

with torch.no_grad():
    outputs = model.generate(**inputs, **generation_kwargs)
    print("Generated Outputs size:", outputs.size())

outputs = outputs.cpu()

generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for i, text in enumerate(generated_texts):
    print(f"Generated text {i+1}: {text}")

Expected behavior

Text is generated successfully during the Raft step.

biaoliu-kiritsugu commented 1 week ago

It seems to be a problem with DeepSpeed. When I use ZeRO-3 mode, model.generate does not work properly. However, when I use multi_gpu mode, it works well.
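
In case it helps anyone hitting the same error, here is a rough sketch of two workarounds that are commonly suggested for generation under ZeRO-3 (not verified on this setup; model, inputs, and generation_kwargs refer to the objects from the test script above):

# Sketch only, not a confirmed fix. Under ZeRO-3 the model weights are sharded across
# ranks, so a naive model.generate() can end up running against partitioned parameters.
import torch
import deepspeed

# Option 1: keep all ranks stepping through generation together
# (generate() accepts the synced_gpus flag for exactly this DeepSpeed ZeRO-3 case):
outputs = model.generate(**inputs, **generation_kwargs, synced_gpus=True)

# Option 2: temporarily gather the full parameters on every rank before generating.
# deepspeed.zero.GatheredParameters is DeepSpeed's context manager for this; gathering
# the whole model is memory-hungry, so it is only practical for smaller models.
with torch.no_grad(), deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=None):
    outputs = model.generate(**inputs, **generation_kwargs)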