Edwardmark opened this issue 2 months ago
When we use the same seed on all devices, there is no benefit of using multiple devices, they would all do exactly the same (in theory).
However, there are a few things unclear to me.
seed_everything is set only on NPU devices. So how did you train on the H100 GPUs then?
On each GPU, the sampler is using a generator with seed=42:
https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L248
https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L251
Is that correct?
@rob-hen I am using NPU to do inference. What I care about is how to set the seed when doing inference.
https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/opensora/npu_config.py#L336
Why seed += self.rank? Why not just use the same seed on each device?
If each GPU sets the same random seed, then at every training step the sampled timestep is the same on every GPU, and I worry this is unfavourable for training.
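To illustrate the concern, a minimal sketch (not the repo's code; drawing timesteps with torch.randint is only an assumption about typical diffusion training):

import torch

def sample_timesteps(seed, num_train_timesteps=1000, batch=4):
    # Draw a batch of diffusion timesteps from a generator seeded with `seed`.
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, num_train_timesteps, (batch,), generator=g)

global_seed = 42
for rank in range(2):  # simulate two ranks in one process
    same = sample_timesteps(global_seed)            # same seed everywhere -> identical timesteps
    offset = sample_timesteps(global_seed + rank)   # seed offset by rank -> timesteps differ
    print(f"rank {rank}: same-seed {same.tolist()}  per-rank-seed {offset.tolist()}")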
When we use the same seed on all devices, there is no benefit of using multiple devices, they would all do exactly the same (in theory).
However, there are a few things unclear to me.
seed_everything is set only on NPU devices. So how did you train on the H100 GPUs then?
When the sampler is created, no generator is provided: https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/train/train_t2v_diffusers.py#L433
Consequently, on each GPU, the sampler is using a generator with seed set to seed=42:
https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/3c08cb2fef031dc659fbcfb33acac4c515d7bc51/opensora/utils/dataset_utils.py#L248
Is that correct?
Also: LengthGroupedSampler, while it is implemented in DistributedSampler. Why is that?

On NPU, seed += self.rank, so the seed of every rank is different. For GPU, we do not apply seed_everything, so it is different as well. Here is a small test, run with accelerate launch --num_processes 2 testddp.py:
import torch
from torch.utils.data import DataLoader, Dataset, Sampler
from accelerate import Accelerator

class RandomDataset(Dataset):
    def __init__(self, length):
        self.len = length
        self.data = torch.arange(length) * 10

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        return self.data[index]

class LengthGroupedSampler(Sampler):
    def __init__(
        self,
        batch_size: int,
        world_size: int,
        lengths=None,
        group_data=False,
        generator=None,
    ):
        if lengths is None:
            raise ValueError("Lengths must be provided.")
        self.batch_size = batch_size
        self.world_size = world_size
        self.lengths = lengths  # here: the total number of samples
        self.group_data = group_data
        self.generator = generator

    def __len__(self):
        return self.lengths

    def __iter__(self):
        # Every rank yields the same index order (assuming the seeds are the same).
        indices = torch.arange(self.lengths)
        return iter(indices)

def main():
    accelerator = Accelerator()
    total_data_size = 8
    batch_size_per_gpu = 2
    dataset = RandomDataset(length=total_data_size)
    sampler = LengthGroupedSampler(
        batch_size_per_gpu,
        world_size=accelerator.num_processes,
        lengths=total_data_size,
    )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size_per_gpu,
        shuffle=False,
        sampler=sampler,
    )
    dataloader = accelerator.prepare(dataloader)
    for batch in dataloader:
        print(f'rank: {accelerator.process_index}, x: {batch}')

if __name__ == "__main__":
    main()
Then, we get:
rank: 1, x: tensor([20, 30], device='cuda:1')
rank: 1, x: tensor([60, 70], device='cuda:1')
rank: 0, x: tensor([ 0, 10], device='cuda:0')
rank: 0, x: tensor([40, 50], device='cuda:0')
But if we comment out this line:
# dataloader = accelerator.prepare(dataloader)
We get:
rank: 0, x: tensor([ 0, 10])
rank: 0, x: tensor([20, 30])
rank: 0, x: tensor([40, 50])
rank: 0, x: tensor([60, 70])
rank: 1, x: tensor([ 0, 10])
rank: 1, x: tensor([20, 30])
rank: 1, x: tensor([40, 50])
rank: 1, x: tensor([60, 70])
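For comparison, a minimal sketch (my own, not part of the repo) of how torch.utils.data.DistributedSampler does the same kind of per-rank sharding explicitly, even though every rank is constructed with the same seed:

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8) * 10)

# Simulate two ranks in a single process by passing rank/num_replicas explicitly.
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True, seed=42)
    sampler.set_epoch(0)  # each rank builds the same permutation, then keeps its own slice
    print(f"rank {rank}: {[dataset[i][0].item() for i in sampler]}")

Each rank shuffles with the same seed and keeps only its own slice of the indices; accelerator.prepare achieves a similar effect by splitting the batches of the prepared dataloader across processes, which is why the two ranks above see different data despite the shared sampler seed.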
thanks for all the explanations!
@LinB203 Why do you set the seed the same for all devices after seed_everything, though?
This should result in all GPUs using the same noise, both with H100 and with NPU.
It is an error introduced when merging code. We have fixed it.
@LinB203 Now, with the new code, when we set the seed and we are not using NPU, all GPUs will work with the same seed. However, we should have a GPU-specific seed, derived from the global seed, as you do for NPU.
No, we need the same seed for all GPUs.
Why is that? Each GPU should have independent timestep sampling (when not using SP). With SP, you need to use the same timestep for all members of an SP group, which you ensure here. However, different SP groups should have different timesteps. So you might want to set the same seed within an SP group, but use different seeds across SP groups.
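A minimal sketch of the scheme I mean (my own illustration, not the repo's code; sp_size and the seed derivation are hypothetical): the seed is derived from the global seed and the SP group index, so ranks within a group share the timestep RNG while different groups do not:

import torch

def timestep_seed(global_seed: int, rank: int, sp_size: int) -> int:
    sp_group = rank // sp_size       # ranks [k*sp_size, (k+1)*sp_size) form group k
    return global_seed + sp_group    # same seed within a group, different across groups

global_seed, sp_size, world_size = 42, 2, 4
for rank in range(world_size):  # simulate four ranks in one process
    g = torch.Generator().manual_seed(timestep_seed(global_seed, rank, sp_size))
    t = torch.randint(0, 1000, (2,), generator=g)
    print(f"rank {rank} (SP group {rank // sp_size}): timesteps {t.tolist()}")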
Sorry for the confusion. All GPUs require the same seed in the sampler. However, in the training script we did not set the seed for the GPUs, which means the timesteps are still different across GPUs.
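A minimal sketch of what this means in practice (an illustration, not the exact repo code): the sampler's generator is seeded identically on every rank, while the timestep draw uses the default, unseeded RNG, so each process still samples its own timesteps:

import torch

# Same on every rank: the sampler's generator is explicitly seeded, so the data order agrees.
sampler_generator = torch.Generator().manual_seed(42)
data_order = torch.randperm(8, generator=sampler_generator)

# Different on every rank: the default RNG is never seeded on GPU, so timesteps diverge.
timesteps = torch.randint(0, 1000, (4,))

print(f"data order {data_order.tolist()}  timesteps {timesteps.tolist()}")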