OpenBMB / MiniCPM

MiniCPM-2B: An end-side LLM outperforming Llama2-13B.
Apache License 2.0

[Feature Request]: Can you provide a detailed requirements.txt #4

Closed fpcsong closed 6 months ago

fpcsong commented 6 months ago

Feature request

Your nice work helps me a lot! I ran into some bugs when fine-tuning openbmb/MiniCPM-2B-sft-bf16. I suspect they are caused by version inconsistencies in some packages (torch, accelerate, etc.). I have checked the requirements here; could you provide a detailed requirements.txt? Thanks.

ShengdingHu commented 6 months ago

Could you paste your bug report? Thanks!

fpcsong commented 6 months ago

It does not crash outright, but it creates multiple processes on CUDA device 0.

SwordFaith commented 6 months ago

Would you mind pasting your script? It looks like CUDA_VISIBLE_DEVICES is not being used correctly for device isolation.
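
For reference, here is a minimal sketch of the usual per-rank device binding under a deepspeed/torchrun launch. This is not code from benchmark.py; it assumes only that the launcher exports LOCAL_RANK for every process.

    # Sketch only (not from benchmark.py): how a rank is normally pinned to its
    # own GPU under a deepspeed/torchrun launch. Assumes the launcher exports
    # LOCAL_RANK for every process.
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)        # bind this process to GPU local_rank
    device = torch.device("cuda", local_rank)

    # Tensors and models created after set_device land on the per-rank GPU.
    # If set_device is skipped, or the model is loaded with device_map="auto",
    # every rank can end up allocating memory on cuda:0.
    model = torch.nn.Linear(8, 8).to(device)
    print(f"rank {local_rank} uses cuda:{torch.cuda.current_device()}")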

fpcsong commented 6 months ago

It is part of our internal toolkit and is adapted to many transformer-based models. The script:

        deepspeed --num_gpus 8 benchmark.py \
        -it \
        -t_data $TRAINDATA \
        -te \
        -v_data $EVALDATA \
        --model_path $BASEMODEL \
        --model_name $2 \
        --gen_config $3 \
        --bf16 \
        -output_dir $OUTDIR \
        -m_bsz $4 \
        -e_bsz $4 \
        -max_len 1024 \
        --max_steps 3072 \
        --save_steps 1024 \
        --template_name none \
        -lr 2e-5 \
        -bsz 64 \
        --gradient_checkpointing \
        --train_files_pattern  '/*/train/*.jsonl' \
        --val_files_pattern '/*/eval/*.jsonl' \
        -output \
        --deepspeed true

I have encountered similar problems before; they are usually caused by a memory-management bug in a particular library, such as torch, deepspeed, peft, or flash_attn on a specific CUDA version. So I suspect it is a version mismatch in our environment.
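
One way to check where each rank actually lands, before digging into package versions, is to have every process report its bound device and the memory it has allocated on GPU 0. This is a hedged debugging sketch (not part of the internal toolkit), assuming the launcher sets LOCAL_RANK; nvidia-smi gives the same picture per PID.

    # Debugging sketch: each rank prints the device it is bound to and how much
    # memory it has allocated on cuda:0. Non-zero allocations on cuda:0 from
    # ranks other than 0 reproduce the reported behaviour.
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    mib_on_gpu0 = torch.cuda.memory_allocated(0) / 2**20
    print(f"rank {local_rank}: current device cuda:{torch.cuda.current_device()}, "
          f"{mib_on_gpu0:.1f} MiB allocated by this process on cuda:0")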

SwordFaith commented 6 months ago

It appears that benchmark.py is not included in our repository. Could you please provide more details? My suspicion is that device_map might be the root cause.
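
If that suspicion is right, the failure pattern would look roughly like the sketch below. The loading code is an assumption about benchmark.py, not its actual contents; only the model name comes from this issue.

    # Hypothetical illustration of the device_map pitfall under deepspeed --num_gpus 8.
    # Every one of the 8 processes runs this; the from_pretrained pattern is an
    # assumption, not code from benchmark.py.
    import os
    import torch
    from transformers import AutoModelForCausalLM

    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Problematic: device_map="auto" lets accelerate spread the model over all
    # visible GPUs in every process, so all ranks start allocating on cuda:0.
    # model = AutoModelForCausalLM.from_pretrained(
    #     "openbmb/MiniCPM-2B-sft-bf16", trust_remote_code=True,
    #     torch_dtype=torch.bfloat16, device_map="auto")

    # Plays well with deepspeed: load without device_map and move the model to
    # the local rank's GPU (or leave placement to the deepspeed engine).
    model = AutoModelForCausalLM.from_pretrained(
        "openbmb/MiniCPM-2B-sft-bf16",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    model = model.to(f"cuda:{local_rank}")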

fpcsong commented 6 months ago

It is part of our internal toolkit. In short, could you please provide your versions of CUDA, torch, deepspeed, flash_attn, xformers, and other key packages?
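
For completeness, here is a quick way to dump those versions from the training environment; the package list follows this comment, and anything not installed is reported as missing.

    # Prints the CUDA build of torch plus the installed versions of the packages
    # mentioned above; adjust the list as needed.
    import importlib.metadata as md
    import torch

    print("torch CUDA build:", torch.version.cuda)
    for pkg in ("torch", "deepspeed", "flash-attn", "xformers",
                "transformers", "accelerate", "peft"):
        try:
            print(pkg, md.version(pkg))
        except md.PackageNotFoundError:
            print(pkg, "not installed")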