dingyuan-shi opened this issue 1 month ago
I tried running it, but it seems I need to remove `shuffle=(args.split=='train')` because it conflicts with the sampler. A minimal repro of the conflict is below.
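For anyone hitting the same error: PyTorch's `DataLoader` rejects `shuffle` and `sampler` passed together. A minimal sketch (the `num_replicas`/`rank` values are placeholders so it runs outside a real process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))

# Passing both raises:
#   ValueError: sampler option is mutually exclusive with shuffle
# loader = DataLoader(dataset, shuffle=True,
#                     sampler=DistributedSampler(dataset, num_replicas=2, rank=0))

# Shuffling has to be delegated to the sampler instead:
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, sampler=sampler)
```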
@dingyuan-shi @visionxyz , could you share a sample configuration file for launching training across multiple nodes with `accelerate launch`?
My training gets stuck after the data loads onto one GPU per node.
Hello, I just run `accelerate launch --multi_gpu train.py`, after first creating a config with `accelerate config`.
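Since a multi-node sample config was asked for above: below is a sketch of what `accelerate config` typically writes (to `~/.cache/huggingface/accelerate/default_config.yaml`) for a hypothetical 2-node, 4-GPU-per-node setup. The IP, port, and counts are placeholders you must adapt; exact keys can vary by accelerate version.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2
num_processes: 8            # total GPUs across ALL nodes, not per node
machine_rank: 0             # set to 1 when running on the second node
main_process_ip: 10.0.0.1   # placeholder: reachable IP of the rank-0 node
main_process_port: 29500
gpu_ids: all
mixed_precision: 'no'
same_network: true
use_cpu: false
```

Run the same `accelerate launch` command on every node, changing only `machine_rank`. A hang after data loads onto one GPU per node is often the nodes failing to rendezvous, so check that `main_process_ip`/`main_process_port` are reachable from all nodes.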
Hello, it seems the dataloader is not adapted to the distributed setting (line 881 of train.py). The same data entries will be loaded and trained on by every process. A sampler should probably be added; a sketch is below:
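A minimal sketch using PyTorch's `DistributedSampler`. Here `dataset` and `args` stand in for whatever train.py builds around line 881, and it assumes the process group is already initialized (which `accelerate launch` / `torch.distributed.init_process_group` handles):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Shuffling moves onto the sampler; do NOT also pass shuffle= to the
# DataLoader, or it raises "sampler option is mutually exclusive with
# shuffle" (the conflict mentioned at the top of this thread).
sampler = DistributedSampler(dataset, shuffle=(args.split == 'train'))

loader = DataLoader(
    dataset,
    batch_size=args.batch_size,
    sampler=sampler,            # each process now sees a disjoint shard
    num_workers=args.num_workers,
    pin_memory=True,
)

# At the start of each epoch, reseed the sampler so shuffling differs
# between epochs:
# sampler.set_epoch(epoch)
```

The `set_epoch` call matters because `DistributedSampler` derives its shuffle seed from the epoch number; without it, every epoch replays the same order.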