Open WuYeeh opened 4 weeks ago
Can you show me your command? I want to reproduce the error.
```shell
CUDA_VISIBLE_DEVICES=5 accelerate launch \
    --num_processes=1 \
    train.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --batch_size_per_device 2 \
    --condition_dropout_prob 0.01 \
    --lr 1e-3 \
    --use_lora \
    --lora_rank 8 \
    --json_file ./toy_data/toy_subject_data.jsonl \
    --image_path ./toy_data/images \
    --max_input_length_limit 18000 \
    --keep_raw_resolution \
    --max_image_size 1024 \
    --gradient_accumulation_steps 1 \
    --ckpt_every 50 \
    --epochs 200 \
    --log_every 1 \
    --results_dir ./results/toy_finetune_lora
```
Add the guard `if dist.is_available() and dist.is_initialized():` before the calls to `dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)` and `dist.barrier()`. Also change `model.module.save_pretrained(checkpoint_path)` to:

```python
if dist.is_available() and dist.is_initialized():
    model.module.save_pretrained(checkpoint_path)
else:
    model.save_pretrained(checkpoint_path)
```

Then you can run train.py on a single GPU.
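A minimal sketch of the same guard pattern for the loss reduction (illustrative only; the `distributed_ready` helper name and the averaging step are assumptions, not code from `train.py`):

```python
import torch
import torch.distributed as dist

def distributed_ready():
    """True only when torch.distributed is usable and a process group has been initialized."""
    return dist.is_available() and dist.is_initialized()

avg_loss = torch.tensor(1.5)  # placeholder for the per-process loss
if distributed_ready():
    # Multi-GPU path: sum the loss across ranks, then average by world size.
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss /= dist.get_world_size()
# Single-process path: avg_loss is already the local value; no collective call,
# so no "Default process group has not been initialized" error.
```

The point of checking `is_initialized()` (not just `is_available()`) is that PyTorch builds distributed support into most binaries, so `is_available()` alone is usually true even when no process group exists.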
Thanks for your suggestions! I have updated these code.
When I use a single GPU for fine-tuning, I encounter the following error. How can I solve it?

```
Traceback (most recent call last):
  File "/OmniGen/train.py", line 368, in <module>
    main(args)
  File "/OmniGen/train.py", line 236, in main
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2208, in all_reduce
    group = _get_default_group()
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```