Open AlphaNext opened 3 days ago
It seems that this is not due to the model, but a torch error. Are you doing distributed training?
Yes, a single node with 4 GPUs, using the scripts finetune_single_rank.sh and accelerate_config_machine_single.yaml. num_processes has been changed, and the distributed type is left at the default, distributed_type: DEEPSPEED. I run the command with:
CUDA_VISIBLE_DEVICES="0,1,2,3" accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \
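For a setup like this, one quick sanity check is that the number of GPUs exposed through CUDA_VISIBLE_DEVICES matches the num_processes value in the accelerate config, since accelerate normally starts one process per GPU. Below is a minimal sketch of that check. The helper names are hypothetical, and the value 4 is only assumed from "single node with 4 GPUs"; the actual value in the YAML is not shown in this thread. It does not diagnose the torch error itself.

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count the device IDs listed in a CUDA_VISIBLE_DEVICES string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_process_count(cuda_visible_devices: str, num_processes: int):
    """Return (ok, message) comparing visible GPUs with num_processes.

    Hypothetical helper: num_processes is whatever value is set in the
    accelerate config file; it is passed in by hand here.
    """
    gpus = visible_gpu_count(cuda_visible_devices)
    ok = gpus == num_processes
    message = f"visible GPUs: {gpus}, num_processes: {num_processes}"
    return ok, message


if __name__ == "__main__":
    # "0,1,2,3" is taken from the launch command above; 4 is an assumed value.
    ok, message = check_process_count("0,1,2,3", 4)
    print("match" if ok else "mismatch", "-", message)
```

If the two numbers differ, that mismatch may be worth ruling out before looking further into the torch error.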
System Info
Python 3.10.12, torch 2.4.0+cu121, CUDA 12.2, accelerate 1.1.1
Information
Reproduction
Log errors:
Expected behavior
Training should run without this error.