CPO reproduction: model produces repetitive output

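Hi, I ran CPO on ALMA-7B-LoRA using the script's default hyperparameters (learning rate) together with the parameter settings and preference data described in the paper. However, the trained model often repeats earlier text at length, or does not translate at all, as shown in the screenshot below (zh->en; raw_res is the output without the clean function from utils applied). Did I set a hyperparameter incorrectly? Thank you.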

Here is my training script:

accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config_bf16.yaml \
    run_cpo_llmmt.py \
    --model_name_or_path xxxx/ALMA-7B-Pretrain \
    --tokenizer_name xxxx/ALMA-7B-Pretrain \
    --peft_model_id xxxx/ALMA-7B-Pretrain-LoRA \
    --cpo_scorer kiwi_xcomet \
    --beta 0.1 \
    --use_flash_attention_2 True \
    --use_peft \
    --use_fast_tokenizer False \
    --cpo_data_path xxxx/ALMA-R-Preference \
    --do_train \
    --language_pairs ${pairs} \
    --low_cpu_mem_usage \
    --bf16 \
    --learning_rate 1e-4 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True \
    --lr_scheduler_type inverse_sqrt \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 16 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 2 \
    --logging_strategy steps \
    --logging_steps 0.05 \
    --output_dir ${OUTPUT_DIR} \
    --num_train_epochs 1 \
    --prediction_loss_only \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --max_prompt_length 256 \
    --max_length 512 \
    --seed 42 \
    --overwrite_output_dir \
    --report_to tensorboard \
    --overwrite_cache
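
To put a number on how often the outputs degenerate, one option is to measure repeated n-grams in each raw output. The snippet below is only a minimal sketch and is not part of the ALMA repository: the function names `repeated_ngram_ratio` and `flag_degenerate`, the whitespace tokenization, and the 0.3 threshold are all illustrative assumptions, and it only catches repetition, not untranslated output.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text.

    Hypothetical helper: splits on whitespace, which is a rough fit for
    English output (zh->en) and would need a real tokenizer for Chinese.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens cannot contain a repeated n-gram.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def flag_degenerate(outputs, n: int = 4, threshold: float = 0.3):
    """Return indices of outputs whose repeat ratio exceeds the (arbitrary) threshold."""
    return [
        i for i, out in enumerate(outputs)
        if repeated_ngram_ratio(out, n) > threshold
    ]
```

Running `flag_degenerate` over the raw_res strings before and after CPO training would show whether the repetition is introduced by the CPO step or is already present in the starting checkpoint.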

fe1ixxu / ALMA

CPO reproduction: model produces repetitive output #65