FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

save_steps and save_total_limit don't work #1202

Open nmquang003 opened 3 weeks ago

nmquang003 commented 3 weeks ago

My Issue:

  1. No matter what value I set for the --save_steps parameter, training always saves a checkpoint after exactly 500 steps (see the sketch after this list).
  2. No matter what value I set for the --save_total_limit parameter, every checkpoint saved at those 500-step intervals is kept. Kaggle's output directory has a storage limit, so I want older checkpoints to be deleted when new ones are saved.
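
For reference, "every 500 steps" and "keep everything" are exactly the Hugging Face TrainingArguments defaults, i.e. what the underlying Trainer uses if these flags never reach it. A minimal sketch with plain transformers (not the FlagEmbedding launcher) showing the defaults versus what the command below requests:

# Minimal sketch using plain transformers, not the FlagEmbedding code path.
# The observed behaviour matches the TrainingArguments defaults, which is what
# the Trainer falls back to if --save_steps / --save_total_limit are not forwarded.
from transformers import TrainingArguments

defaults = TrainingArguments(output_dir="./models")
print(defaults.save_steps)        # 500  -> checkpoint every 500 steps
print(defaults.save_total_limit)  # None -> no old checkpoints are deleted

# What the command below asks for instead:
requested = TrainingArguments(
    output_dir="./models",
    save_steps=10,        # checkpoint every 10 optimizer steps
    save_total_limit=1,   # keep only the most recent checkpoint
)
print(requested.save_steps, requested.save_total_limit)  # 10 1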

Notebook Cell for Training bge-m3 on Kaggle Notebook with 2 T4 GPUs:


!WANDB_MODE=disabled torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path BAAI/bge-m3 \
    --cache_dir ./cache/model \
    --train_data /kaggle/input/data-process/train_model_01.json \
    --cache_path ./cache/data \
    --train_group_size 2 \
    --query_max_len 64 \
    --passage_max_len 392 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Biểu diễn câu này để tìm kiếm đoạn văn có liên quan: ' \
    --query_instruction_format '{}{}' \
    --knowledge_distillation False \
    --same_dataset_within_batch True \
    --small_threshold 0 \
    --drop_threshold 0 \
    --output_dir ./models \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --deepspeed ./ds_stage0.json \
    --logging_steps 1000 \
    --save_steps 10 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type m3_kd_loss \
    --unified_finetuning True \
    --use_self_distill True \
    --fix_encoder False \
    --self_distill_start_step 0 \
    --save_total_limit 1
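
As a stopgap on Kaggle, old checkpoints can be pruned manually between saves. A rough sketch, assuming the usual Hugging Face layout where checkpoints land in ./models/checkpoint-<step> (prune_checkpoints is just a helper name for this example):

# Workaround sketch: keep only the newest checkpoint-<step> directory under
# output_dir and delete the rest, to stay under Kaggle's output-size limit.
import shutil
from pathlib import Path

def prune_checkpoints(output_dir: str, keep: int = 1) -> None:
    ckpts = [
        p for p in Path(output_dir).glob("checkpoint-*")
        if p.is_dir() and p.name.split("-")[-1].isdigit()
    ]
    # Sort by the step number encoded in the directory name, oldest first.
    ckpts.sort(key=lambda p: int(p.name.split("-")[-1]))
    for old in ckpts[:-keep]:
        shutil.rmtree(old)

prune_checkpoints("./models", keep=1)
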
hanhainebula commented 2 weeks ago

Hello, @nmquang003! Sorry for the late response. Have you fixed this issue? We haven't encountered this issue before🤔.

nmquang003 commented 2 weeks ago

@hanhainebula You can try adding --gradient_checkpointing to the script to reproduce the issue.
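
If it helps to narrow this down, here is a generic transformers sketch (not the FlagEmbedding code path; LogSaveArgsCallback is a made-up name for this example) that prints the save settings the Trainer actually ends up with at the start of training:

# Generic transformers sketch: a callback that prints the effective save
# settings, so it is visible whether --save_steps / --save_total_limit were
# overridden somewhere before training starts.
from transformers import TrainerCallback

class LogSaveArgsCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        print(
            f"save_strategy={args.save_strategy}, "
            f"save_steps={args.save_steps}, "
            f"save_total_limit={args.save_total_limit}"
        )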