Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
390 stars 55 forks source link

LLaMA-7B SFT died with <Signals.SIGABRT: 6> #539

Open PussyCat0700 opened 6 months ago

PussyCat0700 commented 6 months ago

配置:单卡A100 在Finetune时遇到SIGABRT: 6错误

  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/yfliu/anaconda3/envs/oneflow/bin/python3', '-u', 'projects/Llama/train_net.py', '--config-file', 'projects/Llama/configs/llama_sft.py']' died with <Signals.SIGABRT: 6>.
set -e
if [ -z "$1" ]; then
    echo "Usage: $0 <number>"
    exit 1
fi
libai_path=../libai
cd $libai_path
# scripts split in case blocks.
case $1 in
1)
# See https://github.com/Oneflow-Inc/libai/tree/main/projects/Llama for reference
# Notice:
# 1. Please make sure you have setup destination_path and checkpoint_dir
# For example, our checkpoint_dir is /data1/yfliu/models/LLaMA2/LLaMA2_hf_7B downloaded from https://llama.meta.com/llama-downloads/
# our destination dir is /data1/yfliu/alpaca
# 2. You should also modify terms in projects/Llama/configs/llama_config.py
python projects/Llama/utils/prepare_alpaca.py
;;
2)
# full finetune
# Please set the finetuning parameters in projects/Llama/configs/llama_sft.py, such as dataset_path and pretrained_model_path
# Type python3 -m oneflow.distributed.launch -h for more usage
FILE=projects/Llama/train_net.py
CONFIG=projects/Llama/configs/llama_sft.py
GPUS=1
NODE=1
NODE_RANK=0
ADDR=127.0.0.1
PORT=12345
LOGDIR=/home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow

export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true

python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT --logdir $LOGDIR --redirect_stdout_and_stderr \
$FILE --config-file $CONFIG
;;
esac

bash llama_sft.sh 2

在执行SFT训练时报错,似乎无法定位到是哪里出了问题。