PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.

About SP seed config: why is the seed different on each device? #366

Open Edwardmark opened 2 months ago

Edwardmark commented 2 months ago

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/opensora/npu_config.py#L336 Why seed += self.rank? Why not just use the same seed on each device?

rob-hen commented 2 months ago

If we use the same seed on all devices, there is no benefit to using multiple devices; they would all do exactly the same thing (in theory).

However, there are a few things unclear to me.

  1. seed_everything is applied only on NPU devices. So how did you train on the H100 GPUs?
  2. When the sampler is created, no generator is provided: https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/train/train_t2v_diffusers.py#L433 Consequently, on each GPU the sampler uses a generator seeded with seed=42 (a sketch of this fallback follows below): https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L248 https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L251

Is that correct?
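
A minimal sketch of the fallback pattern being described, assuming the linked code falls back to a fixed default seed when no generator is passed (hypothetical, not the repo's exact code):

import torch

def get_sampler_generator(generator=None, default_seed=42):
    # Hypothetical sketch of the fallback described above: if no generator is
    # supplied, each rank builds its own generator seeded with the same fixed
    # default, so every rank produces an identical shuffle order.
    if generator is None:
        generator = torch.Generator()
        generator.manual_seed(default_seed)
    return generator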

Edwardmark commented 2 months ago

@rob-hen I am using an NPU to do inference. What I care about is how to set the seed when doing inference.
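
A minimal sketch of one way to handle this during inference (a hypothetical helper, not the repo's documented inference setup): derive the seed from the rank if you want each device to produce different samples, or keep it identical on every rank if you want reproducible, identical outputs.

import torch
import torch.distributed as dist

def seed_for_inference(global_seed: int, per_rank: bool = True) -> int:
    # Hypothetical helper: with per_rank=True every device gets a different
    # seed (different samples per rank); with per_rank=False all devices
    # share the seed and generate identical outputs.
    rank = dist.get_rank() if dist.is_initialized() else 0
    seed = global_seed + rank if per_rank else global_seed
    torch.manual_seed(seed)
    return seed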

LinB203 commented 2 months ago

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/opensora/npu_config.py#L336 Why seed += self.rank? Why not just use the same seed on each device?

If each GPU uses the same random seed, then at every training step the sampled timestep is the same on every GPU, which we worry is unfavourable for training.
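
To illustrate the concern, a small sketch (assuming timesteps are drawn with torch.randint from the default RNG, and num_train_timesteps = 1000 just for illustration):

import torch

num_train_timesteps = 1000  # assumed value, for illustration only
batch_size = 4
global_seed = 42

for rank in (0, 1):
    torch.manual_seed(global_seed + rank)  # the "seed += self.rank" scheme
    timesteps = torch.randint(0, num_train_timesteps, (batch_size,))
    print(rank, timesteps)
# With global_seed + rank the two ranks draw different timesteps;
# with a plain global_seed they would draw identical ones every step.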

LinB203 commented 2 months ago

If we use the same seed on all devices, there is no benefit to using multiple devices; they would all do exactly the same thing (in theory).

However, there are a few things unclear to me.

  1. seed_everything is applied only on NPU devices. So how did you train on the H100 GPUs?
  2. When the sampler is created, no generator is provided: https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/train/train_t2v_diffusers.py#L433

    Consequently, on each GPU the sampler uses a generator seeded with seed=42 https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L248

    https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L251

Is that correct?

  1. According to the code referenced here, each NPU rank is actually seeded differently.
  2. The purpose is to support dynamic training; see the linked code. Also, the seed should indeed be the same for each device's sampler, since each device maintains its own sampler; refer to how torch handles this (a sketch follows below).
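
To illustrate why the sampler seed must be the same on every rank, here is a minimal sketch of the pattern torch's DistributedSampler follows (a simplification, not the repo's code):

import torch

def global_permutation(num_samples: int, seed: int, epoch: int) -> torch.Tensor:
    # Every rank seeds its own generator with the same (seed + epoch), so all
    # ranks compute the *same* global shuffle before taking their own slice.
    g = torch.Generator()
    g.manual_seed(seed + epoch)
    return torch.randperm(num_samples, generator=g)

perm = global_permutation(num_samples=8, seed=42, epoch=0)
rank0_shard = perm[0::2]  # rank 0 keeps every other index
rank1_shard = perm[1::2]  # rank 1 keeps the rest; the shards are disjoint
                          # only because both ranks built the identical perm

If the ranks used different sampler seeds here, the shards could overlap or miss samples entirely.
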
rob-hen commented 2 months ago
  1. For the NPU, yes, the seed is set differently per rank. But if you do not use an NPU (as with the H100s in your last training stages), I don't see where the seed is set differently.
  2. Agreed, the seed should be the same on all devices for the sampler. However, the data should be split so that each rank gets its own part of the entire dataset. I did not find that logic in your LengthGroupedSampler, while it is implemented in DistributedSampler. Why is that?
LinB203 commented 2 months ago
  1. We need to set a different seed per rank so that each rank samples a different timestep within the same training step. For the NPU, we use seed_everything but with seed += self.rank, so every rank's seed is different. For the GPU, we do not apply seed_everything, so each process keeps its own default seed and the seeds differ anyway.
  2. The data is split by accelerate. Here is some example code; the command to run it is accelerate launch --num_processes 2 testddp.py
import torch
from torch.utils.data import DataLoader, Dataset
from accelerate import Accelerator
from torch.utils.data import Sampler

class RandomDataset(Dataset):
    def __init__(self, length):
        self.len = length
        self.data = torch.arange(length) * 10

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        return self.data[index]

class LengthGroupedSampler(Sampler):
    def __init__(
        self,
        batch_size: int,
        world_size: int,
        lengths=None,
        group_data=False,
        generator=None,
    ):
        if lengths is None:
            raise ValueError("Lengths must be provided.")

        self.batch_size = batch_size
        self.world_size = world_size
        self.lengths = lengths
        self.group_data = group_data
        self.generator = generator

    def __len__(self):
        return self.lengths  # lengths is passed as the total dataset size (an int) in this example

    def __iter__(self):
        # Every rank builds the same global index order (the seeds are assumed
        # identical); the per-rank split is done later by accelerator.prepare.
        indices = torch.arange(self.lengths)
        return iter(indices)

def main():
    accelerator = Accelerator()

    total_data_size = 8
    batch_size_per_gpu = 2
    dataset = RandomDataset(length=total_data_size)
    sampler = LengthGroupedSampler(
                batch_size_per_gpu,
                world_size=accelerator.num_processes,
                lengths=total_data_size, 
            ) 
    dataloader = DataLoader(
        dataset, 
        batch_size=batch_size_per_gpu, 
        shuffle=False, 
        sampler=sampler
        )
    dataloader = accelerator.prepare(dataloader)
    for batch in dataloader:
        print(f'rank: {accelerator.process_index}, x: {batch}')

if __name__ == "__main__":
    main()

Then, we get:

rank: 1, x: tensor([20, 30], device='cuda:1')
rank: 1, x: tensor([60, 70], device='cuda:1')
rank: 0, x: tensor([ 0, 10], device='cuda:0')
rank: 0, x: tensor([40, 50], device='cuda:0')

But if we comment out this line:

# dataloader = accelerator.prepare(dataloader)

We get:

rank: 0, x: tensor([ 0, 10])
rank: 0, x: tensor([20, 30])
rank: 0, x: tensor([40, 50])
rank: 0, x: tensor([60, 70])
rank: 1, x: tensor([ 0, 10])
rank: 1, x: tensor([20, 30])
rank: 1, x: tensor([40, 50])
rank: 1, x: tensor([60, 70])
rob-hen commented 2 months ago

Thanks for all the explanations!

rob-hen commented 2 months ago

@LinB203 Why do you set the seed to be the same for all devices after seed_everything, though?

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/train/train_t2v_diffusers.py#L202

This should result in all GPUs using the same noise, with H100 and with NPU.

LinB203 commented 2 months ago

@LinB203 Why do you set the seed to be the same for all devices after seed_everything, though?

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/train/train_t2v_diffusers.py#L202

This should result in all GPUs using the same noise, with H100 and with NPU.

It was an error introduced when merging code. We have fixed it.

rob-hen commented 2 months ago

@LinB203 Now, with the new code, when we set the seed and are not using an NPU, all GPUs will work with the same seed. However, we should have a GPU-specific seed derived from the global seed, as you do for the NPU.

LinB203 commented 2 months ago

No, we need the same seed for all GPUs.

rob-hen commented 2 months ago

Why is that? Each GPU should have independent timestep sampling (when not using SP). With SP, you need to use the same timestep for all members of an SP group, which you ensure here. However, different SP groups should have different timesteps. So you might set the same seed within an SP group, but different seeds should be used across SP groups.
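
A minimal sketch of such a scheme, assuming SP groups are formed from contiguous ranks and using a hypothetical helper name:

import torch
import torch.distributed as dist

def seed_per_sp_group(global_seed: int, sp_size: int) -> int:
    # Hypothetical scheme for the suggestion above: ranks inside one
    # sequence-parallel (SP) group share a seed (and therefore sample the
    # same timestep), while different SP groups receive different seeds.
    rank = dist.get_rank() if dist.is_initialized() else 0
    sp_group_id = rank // sp_size  # assumes SP groups are contiguous ranks
    seed = global_seed + sp_group_id
    torch.manual_seed(seed)
    return seed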

LinB203 commented 2 months ago

Sorry for the confusion. All GPUs require the same seed in the sampler. However, in the training script we did not set the seed for the GPUs, which means the timesteps are still different across ranks.