OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

Train_loss = 0 and Eval_loss = NaN in stage2_sft #31

Open xuxiaoang opened 4 months ago

xuxiaoang commented 4 months ago

Hello! Thank you for your work on this MLLM. I ran into a fine-tuning bug that I couldn't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger shows that the train loss is 0 the whole time and the eval loss is NaN, as shown in the attached screenshot (2024-07-20 210750).

The command in stage2_sft.sh is as follows:

torchrun \
    --nproc_per_node 2 \
    anygpt/src/train/stage2_sft.py \
    --model_name_or_path "${METAROOT}" \
    --run_name "mm_sft" \
    --cache_dir ${CACHEROOT} \
    --report_to "wandb" \
    --speech_conv_datasets "$speech_conv_datasets" \
    --speech_datasets "$speech_datasets" \
    --preprocessing_num_workers 100 \
    --bf16 True \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --model_max_length 4096 \
    --save_strategy "steps" \
    --save_steps 5 \
    --evaluation_strategy "steps" \
    --eval_steps 5 \
    --max_steps 5 \
    --concatenating False \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --val_set_size 10 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --log_level debug \
    --logging_steps 1 \
    --overwrite_output_dir False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --use_flash_attn True \
    --ddp_timeout 7200 \
    --save_total_limit 10

I'm using the following Python environment:

transformers              4.34.1
huggingface-hub           0.24.0
tokenizers                0.14.1
torch                     2.1.0
torchaudio                2.1.0
torchvision               0.16.0
flash-attn                2.5.9.post1
JunZhan2000 commented 3 months ago

Hi, is your training data very small? Maybe you could try a larger training set?

xuxiaoang commented 3 months ago

Hi, thank you for your reply.

I changed the dataset to the whole metadata.jsonl in part 1 of the AnyInstruct dataset, but the issue persists.

While debugging, I found that the preprocess method in anygpt/src/train/stage2_sft.py masks all tokens in targets with IGNORE_TOKEN_ID and returns them as labels, as shown in the masked_targets screenshot.

I noticed the comment on line 248 of the source code: "Mask targets. Only compute loss on the assistant outputs." Does this mean that the anygpt_system_prompt part and the user_message part need to be masked, and only the anygpt_message part should remain? I personally think there are some minor bugs in the token-masking part of the preprocess method.
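For clarity, this is roughly what I expected the masking step to produce (a minimal sketch based on my understanding, not the repo's actual preprocess code; the segment format and role names are hypothetical, only IGNORE_TOKEN_ID = -100 follows the usual convention):

```python
# Minimal sketch of label masking for chat-style SFT (not the repo's actual code).
# Assumption: each example is a list of (role, token_ids) segments produced by
# the chat template, and IGNORE_TOKEN_ID = -100.
IGNORE_TOKEN_ID = -100

def build_labels(segments):
    """Return (input_ids, labels) where only assistant tokens keep their ids."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Loss is computed only on the assistant (model) response.
            labels.extend(token_ids)
        else:
            # System prompt and user message are masked out of the loss.
            labels.extend([IGNORE_TOKEN_ID] * len(token_ids))
    return input_ids, labels

# Toy example: system + user tokens masked, assistant tokens kept.
example = [
    ("system", [1, 2, 3]),
    ("user", [10, 11]),
    ("assistant", [20, 21, 22]),
]
ids, labels = build_labels(example)
# ids    == [1, 2, 3, 10, 11, 20, 21, 22]
# labels == [-100, -100, -100, -100, -100, 20, 21, 22]
# If *every* label ends up as -100 (as in my run), no token contributes to the
# loss, which would be consistent with train_loss = 0 and eval_loss = NaN.
```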

By the way, could you explain why the user_message part needs to be masked? Is this based on a rule or on experience? What happens if the user_message part is not masked?

Looking forward to your reply.

Thanks.

JunZhan2000 commented 3 months ago

"Does this mean that the anygpt_system_prompt part and the user_message part need to be masked, and only the anygpt_message part should remain?" Yes, that's right.

I think this code works fine on my data. Ideally, except for the model's response, the targets for all other tokens are set to -100, which means no loss is computed on them. We do this because it seems to be common practice for instruction fine-tuning, but we also tried skipping the masking and computing the loss on the entire sequence, and I don't think there is much difference.
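For anyone else debugging this: -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to it. A small illustration with toy values (not AnyGPT code):

```python
# Demonstrate that label -100 is skipped by cross-entropy (default ignore_index).
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)             # 4 token positions, vocab of 32000
labels = torch.tensor([-100, -100, 7, 9])  # first two positions are masked

# Only positions 2 and 3 are averaged into the loss -> finite value.
partial = F.cross_entropy(logits, labels)

# If every position is masked, there is nothing to average over and the
# result is NaN -- consistent with eval_loss = NaN when preprocess masks
# the whole sequence.
all_masked = F.cross_entropy(logits, torch.full((4,), -100, dtype=torch.long))

print(partial.item(), all_masked.item())
```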