  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/yfliu/anaconda3/envs/oneflow/bin/python3', '-u', 'projects/Llama/train_net.py', '--config-file', 'projects/Llama/configs/llama_sft.py']' died with <Signals.SIGABRT: 6>.
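Note that this traceback comes from the launcher (`oneflow/distributed/launch.py`), not from the training process: it only reports that the child command died with SIGABRT, the signal a process receives when it calls `abort()`. As a small aid for reading such failures, here is a sketch of a shell helper that turns an exit status into a signal name. The function name `signal_from_status` is hypothetical (it is not part of OneFlow or LiBai), and it assumes the usual shell convention that a process killed by a signal reports 128 plus the signal number.

```shell
# Hypothetical helper, not part of OneFlow or LiBai.
# Assumes the common shell convention: exit status = 128 + signal number.
signal_from_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$((status - 128))"
    else
        echo "exit $status"
    fi
}
```

For the failure above, `signal_from_status 134` corresponds to signal 6, since 134 - 128 = 6.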
My script:
set -e

if [ -z "$1" ]; then
    echo "Usage: $0 <number>"
    exit 1
fi

libai_path=../libai
cd "$libai_path"

# The steps are split into case branches.
case $1 in
    1)
        # See https://github.com/Oneflow-Inc/libai/tree/main/projects/Llama for reference.
        # Notice:
        # 1. Please make sure you have set up destination_path and checkpoint_dir.
        #    For example, our checkpoint_dir is /data1/yfliu/models/LLaMA2/LLaMA2_hf_7B, downloaded from https://llama.meta.com/llama-downloads/
        #    Our destination dir is /data1/yfliu/alpaca
        # 2. You should also modify the relevant settings in projects/Llama/configs/llama_config.py
        python projects/Llama/utils/prepare_alpaca.py
        ;;
    2)
        # Full fine-tune.
        # Please set the fine-tuning parameters in projects/Llama/configs/llama_sft.py, such as dataset_path and pretrained_model_path.
        # Run `python3 -m oneflow.distributed.launch -h` for more usage.
        FILE=projects/Llama/train_net.py
        CONFIG=projects/Llama/configs/llama_sft.py
        GPUS=1
        NODE=1
        NODE_RANK=0
        ADDR=127.0.0.1
        PORT=12345
        LOGDIR=/home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow
        export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true
        python3 -m oneflow.distributed.launch \
            --nproc_per_node "$GPUS" --nnodes "$NODE" --node_rank "$NODE_RANK" \
            --master_addr "$ADDR" --master_port "$PORT" \
            --logdir "$LOGDIR" --redirect_stdout_and_stderr \
            "$FILE" --config-file "$CONFIG"
        ;;
esac
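Setup: a single A100 GPU. Fine-tuning fails with SIGABRT (signal 6) when I run:
bash llama_sft.sh 2
The error is raised during SFT training, and I can't pinpoint where the problem is.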
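Because the script passes `--logdir $LOGDIR --redirect_stdout_and_stderr`, the worker's own output, including whatever it printed just before aborting, is presumably written under `$LOGDIR` rather than to the terminal. I have not verified the exact file layout the launcher uses there, so the sketch below makes no assumption about file names: it walks the directory and prints the tail of every file it finds. The helper name `dump_rank_logs` is hypothetical.

```shell
# Hypothetical helper: print the last lines of every file under a log directory.
# Usage: dump_rank_logs <logdir> [lines]   (lines defaults to 50)
dump_rank_logs() {
    local logdir="$1"
    local lines="${2:-50}"
    if [ ! -d "$logdir" ]; then
        echo "no such log directory: $logdir" >&2
        return 1
    fi
    # Sort for a stable order; file names under the log dir are not assumed.
    find "$logdir" -type f | sort | while IFS= read -r f; do
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
}
```

With the script above this would be called as `dump_rank_logs /home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow 100`. Dropping `--redirect_stdout_and_stderr` from the launch command may also make the worker's messages appear directly in the terminal.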