HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism

Output is not getting saved #29

Open dittops opened 1 year ago

dittops commented 1 year ago

I tried fine-tuning LLaMA 30B on a node with two A100 80 GB GPUs. The script finished running in about 5 minutes, but no output was generated. I couldn't find any error either.


The command used to run the script:

```
deepspeed --include A1:0,1 --master_port 22384 train.py \
    --output_dir output \
    --init_ckpt /root/llama-30b-init-ckpt/ \
    --data_path /root/alpaca_deepspeed.json \
    --max_seq_len 1024 \
    --train_steps 1000 \
    --eval_steps 10 \
    --save_steps 200 \
    --log_steps 1 \
    --pipe_parallel_size 2 \
    --model_parallel_size 1 \
    --use_flash_attn true \
    --deepspeed_config ./configs/ds_config_zero1.json
```
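As a quick sanity check while the job runs, here is a minimal sketch (standard library only; the `output` path is just the `--output_dir` value from the command above) that polls the output directory to see whether any checkpoint files are being written at all:

```python
# Sketch: periodically list the contents of --output_dir to check whether
# checkpoints ever appear. With --save_steps 200, something should show up
# well before --train_steps 1000 completes.
import time
from pathlib import Path

output_dir = Path("output")  # assumed: matches --output_dir in the command above

for _ in range(12):  # check once a minute for ~12 minutes
    files = sorted(p.name for p in output_dir.rglob("*") if p.is_file())
    print(f"{len(files)} file(s) under {output_dir}: {files[:5]}")
    time.sleep(60)
```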

HuangLK commented 1 year ago

It might just be that training is very slow. You can check the GPU usage to confirm it is actually running.
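If it helps, a minimal sketch for that check (assuming `nvidia-smi` is on the PATH) that polls per-GPU utilization, so you can tell whether the job is still computing or exited early:

```python
# Sketch: sample GPU utilization via nvidia-smi. Sustained 0% on both GPUs
# would suggest the run exited early rather than training slowly.
import subprocess
import time

def gpu_utilization():
    """Return per-GPU utilization percentages reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

for _ in range(10):  # sample roughly every 5 seconds
    print("GPU utilization (%):", gpu_utilization())
    time.sleep(5)
```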