Open WuYeeh opened 4 weeks ago
Can you show me your command? I want to reproduce the error.
```shell
CUDA_VISIBLE_DEVICES=5 accelerate launch \
    --num_processes=1 \
    train.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --batch_size_per_device 2 \
    --condition_dropout_prob 0.01 \
    --lr 1e-3 \
    --use_lora \
    --lora_rank 8 \
    --json_file ./toy_data/toy_subject_data.jsonl \
    --image_path ./toy_data/images \
    --max_input_length_limit 18000 \
    --keep_raw_resolution \
    --max_image_size 1024 \
    --gradient_accumulation_steps 1 \
    --ckpt_every 50 \
    --epochs 200 \
    --log_every 1 \
    --results_dir ./results/toy_finetune_lora
```
Add the guard `if dist.is_available() and dist.is_initialized():` before the calls to `dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)` and `dist.barrier()`. Also change `model.module.save_pretrained(checkpoint_path)` to:

```python
if dist.is_available() and dist.is_initialized():
    model.module.save_pretrained(checkpoint_path)
else:
    model.save_pretrained(checkpoint_path)
```

Then you can run train.py on a single GPU.
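A minimal sketch of the same guard pattern for the loss reduction (illustrative only; the `distributed_ready` helper name and the averaging step are assumptions, not code from `train.py`):

```python
import torch
import torch.distributed as dist

def distributed_ready():
    """True only when torch.distributed is usable and a process group has been initialized."""
    return dist.is_available() and dist.is_initialized()

avg_loss = torch.tensor(1.5)  # placeholder for the per-process loss
if distributed_ready():
    # Multi-GPU path: sum the loss across ranks, then average by world size.
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss /= dist.get_world_size()
# Single-process path: avg_loss is already the local value; no collective call,
# so no "Default process group has not been initialized" error.
```

The point of checking `is_initialized()` (not just `is_available()`) is that PyTorch builds distributed support into most binaries, so `is_available()` alone is usually true even when no process group exists.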
Thanks for your suggestions! I have updated these code.
When I use a single GPU for fine-tuning, I encounter the following error. How can I solve it?

```
Traceback (most recent call last):
  File "/OmniGen/train.py", line 368, in <module>
    main(args)
  File "/OmniGen/train.py", line 236, in main
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2208, in all_reduce
    group = _get_default_group()
  File "/anaconda3/envs/omnigen/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```