01-ai / Yi

A series of large language models trained from scratch by developers @01-ai
https://01.ai
Apache License 2.0

Custom data: train.jsonl has 80k+ rows and eval.jsonl has 105 — why does SFT report length of train dataset: 2852, length of eval dataset: 9 #493

Closed 15024287710Jackson closed 4 months ago

15024287710Jackson commented 5 months ago

Environment

- OS:
- Python: 3.10
- PyTorch: 2.0.1+cu117
- CUDA: 11.6
- Model: Yi-6B-200K

Current Behavior

(screenshot of the SFT log showing "length of train dataset: 2852, length of eval dataset: 9")

Expected Behavior

Why does it report only 2,852 train examples instead of the 80k+ in train.jsonl, and 9 eval examples instead of 105?

Steps to Reproduce

The SFT shell script is configured as follows:

#!/usr/bin/env bash

cd "$(dirname "${BASH_SOURCE[0]}")/../sft/"

deepspeed main.py \
  --data_path ../yi_2024041001_govern_data \
  --model_name_or_path /home/ma-user/work/Yi-6B-200K \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --max_seq_len 4096 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --num_train_epochs 8 \
  --training_debug_steps 20 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --num_warmup_steps 10 \
  --seed 1234 \
  --gradient_checkpointing \
  --zero_stage 2 \
  --deepspeed \
  --offload \
  --print_loss \
  --output_dir ./2024041101finetuned_modelfloat32_governdata_lr_2e-5_warmup_10_epochs_8
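To rule out a data-format problem before blaming the trainer, one quick sanity check is to count the valid JSONL rows yourself and compare against the "length of train dataset" the SFT log prints. This is a minimal sketch; the `count_jsonl` helper and the commented-out paths are hypothetical, not part of the Yi codebase:

```python
import json

# Hypothetical helper: count well-formed, non-empty JSONL rows so the number
# can be compared against what the SFT script reports after loading.
def count_jsonl(path):
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError if a row is malformed JSON
            n += 1
    return n

# Example usage (paths assumed to match --data_path in the script above):
# print(count_jsonl("../yi_2024041001_govern_data/train.jsonl"))
# print(count_jsonl("../yi_2024041001_govern_data/eval.jsonl"))
```

If this count matches your expected 80k+/105 but the trainer still reports 2852/9, the script is loading a different (likely cached or default) dataset rather than dropping rows.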

Anything Else?

No response

Yimi81 commented 5 months ago

It looks like the default yi_example_dataset was loaded instead of your data. Did you forget to clear the cache?
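If a stale cache is the culprit, removing the Hugging Face datasets cache before rerunning is one way to force a reload. This is a sketch under assumptions: `HF_DATASETS_CACHE` is the standard override variable, but whether the Yi SFT script caches elsewhere is not confirmed here — adjust the path to your setup:

```shell
# Assumed cleanup: wipe the HF datasets cache so the next run re-processes
# the files under --data_path instead of reusing a cached (possibly default)
# dataset. HF_DATASETS_CACHE is honored if set; otherwise use the default dir.
CACHE_DIR="${HF_DATASETS_CACHE:-$HOME/.cache/huggingface/datasets}"
echo "Removing cached datasets under: $CACHE_DIR"
rm -rf "$CACHE_DIR"
```

After clearing, rerun the SFT script and check that the logged dataset lengths now match your train.jsonl/eval.jsonl counts.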