SalesforceAIResearch / DiffusionDPO

Code for "Diffusion Model Alignment Using Direct Preference Optimization"
https://arxiv.org/abs/2311.12908
Apache License 2.0

Potential Issue on data loader in distributed setting. #15

Open dingyuan-shi opened 1 month ago

dingyuan-shi commented 1 month ago

Hello, it seems that the dataloader is not adapted to the distributed setting (line 881 of train.py): the same data entries will be loaded and trained on repeatedly by every process. A sampler should probably be added, e.g.:

train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    shuffle=(args.split == 'train'),
    collate_fn=collate_fn,
    batch_size=args.train_batch_size,
    num_workers=args.dataloader_num_workers,
    drop_last=True,
    sampler=torch.utils.data.distributed.DistributedSampler(train_dataset),
)
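To illustrate why the sampler matters (this is a standalone toy sketch, not code from the repo — `TensorDataset(torch.arange(8))` stands in for `train_dataset`): `DistributedSampler` gives each rank a disjoint shard of the dataset, so the processes no longer see duplicate entries. Passing `num_replicas` and `rank` explicitly lets you inspect the split without initializing a process group.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for train_dataset: 8 samples with indices 0..7.
dataset = TensorDataset(torch.arange(8))

# Simulate a 2-process run; in real training, num_replicas/rank
# default to the values from the initialized process group.
s0 = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
s1 = DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False)

# With shuffle=False the shards are strided and disjoint:
print(sorted(s0))  # [0, 2, 4, 6]
print(sorted(s1))  # [1, 3, 5, 7]
```

Without the sampler, both ranks would iterate all 8 indices, i.e. every sample is trained on twice per epoch.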
visionxyz commented 2 weeks ago

I tried running it, but I had to remove shuffle=(args.split=='train') because it conflicts with sampler.
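Right — PyTorch's DataLoader raises an error when both `shuffle` and `sampler` are given. The shuffling can instead be requested on the sampler itself. A minimal sketch (again with a toy dataset; `num_replicas`/`rank` are passed explicitly only to make it runnable outside a process group):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16))

# shuffle moves onto the sampler; the DataLoader takes only sampler=...
sampler = DistributedSampler(dataset, num_replicas=2, rank=0,
                             shuffle=True, seed=0)
loader = DataLoader(dataset, batch_size=4, sampler=sampler, drop_last=True)

# Call set_epoch(epoch) each epoch so the shuffle differs across epochs.
sampler.set_epoch(0)
batches = [b[0].tolist() for b in loader]
print(len(batches))  # 2 batches of 4 from this rank's half of the data
```

In the repo's code this would mean `shuffle=(args.split=='train')` moves into the `DistributedSampler(...)` call.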

avinabsaha commented 1 week ago

@dingyuan-shi @visionxyz , could you share a sample configuration file to launch training across multiple nodes using "Accelerate Launch"?

My training is stuck after data loads into 1 GPU per node.

visionxyz commented 1 week ago

> @dingyuan-shi @visionxyz , could you share a sample configuration file to launch training across multiple nodes using "Accelerate Launch"?
>
> My training is stuck after data loads into 1 GPU per node.

Hello, I just run `accelerate launch --multi_gpu train.py`, after creating a config with `accelerate config` first.
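For the multi-node case, a sketch of the commands (a config fragment, not a tested recipe — node count, rank, and the head-node address below are placeholders you must fill in per machine):

```shell
# One-time interactive setup; answers are saved to
# ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config

# Then launch on EVERY node. The same command runs on each node,
# with --machine_rank set to that node's index (0 on the head node).
accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --machine_rank 0 \
    --main_process_ip <head-node-ip> \
    --main_process_port 29500 \
    train.py
```

If training hangs after data loads onto one GPU per node, a common cause is a mismatch between `num_processes` in the saved config and the actual total GPU count across nodes, or nodes unable to reach `main_process_ip` on the chosen port.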