Closed z-fabian closed 3 years ago
@mmuckley I agree that this is probably expected behavior when using `pl_seed_everything()`
and that it works correctly as-is, but I think it makes sense to generate completely unique masks in each worker (even though the data they are applied to will be different). There is even a slight improvement in validation SSIM on smaller subsampled datasets.
Also, I agree that

```python
seed = (worker_info.seed + torch.distributed.get_rank() * worker_info.num_workers) % (2 ** 32 - 1)
```

should be used in the distributed case when there is a single dataset. In the non-distributed setting we should fall back on the same seed as before.
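That fallback logic can be sketched as a pure helper (a minimal sketch with illustrative names, not the actual fastMRI API):

```python
def mask_seed(default_worker_seed: int, rank: int, num_workers: int, is_ddp: bool) -> int:
    """Per-worker mask seed: offset by rank in DDP, default seed otherwise (sketch)."""
    if is_ddp:
        # each rank shifts its workers' seeds by rank * num_workers
        return (default_worker_seed + rank * num_workers) % (2 ** 32 - 1)
    # non-distributed: keep the default per-worker seed
    return default_worker_seed % (2 ** 32 - 1)
```

The modulo keeps the result in the range accepted by 32-bit seeding APIs such as `numpy.random.seed`.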
I included this part

```python
if torch.distributed.is_initialized():
    is_ddp = True
    device_rank = torch.distributed.get_rank()
    num_gpus = torch.distributed.get_world_size()
```

to check whether distributed training is being used. If it is not, we probably cannot call `torch.distributed.get_rank()`. Is there a simpler way of doing this? I might have misunderstood your comment on this part.
@z-fabian I thought your original code with a serial check of `torch.distributed.is_available()` and then `torch.distributed.is_initialized()` was reasonable. After this we should be able to call `torch.distributed.get_rank()` and get the global rank, unless I'm missing something. We shouldn't need the GPU count; we just need our process rank.
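The serial check described here could be wrapped like this (a sketch; returns rank 0 outside distributed runs):

```python
import torch


def get_global_rank() -> int:
    # serial check: torch.distributed may be unavailable (unsupported build),
    # or available but never initialized (single-process training)
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return torch.distributed.get_rank()
    return 0
```

In a non-distributed run this simply returns 0, so the same seeding code path can be used everywhere.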
@mmuckley Oh, I see. I use the GPU count to generate a unique seed when we have multiple datasets here:

```python
seed_i = (
    base_seed
    + device_rank * num_workers
    + i * num_gpus * num_workers
) % (2 ** 32 - 1)
```

because here we have three 'dimensions' that can vary. That is, we need a unique seed for each dataset on each worker on each GPU. But I could base it on the total number of datasets if that's preferable, like this:

```python
seed_i = (
    base_seed
    + i * num_workers
    + device_rank * num_datasets * num_workers
) % (2 ** 32 - 1)
```

Am I missing something here? Is there anything else I should change?
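As a sanity check, both orderings map every (worker, rank, dataset) combination to a distinct seed. A small sketch with illustrative sizes (4 workers, 2 GPUs, 2 datasets), modeling `base_seed` as a base value plus the worker id, as in PyTorch's default worker seeding:

```python
W, G, ND = 4, 2, 2  # workers per GPU, GPUs, datasets (illustrative sizes)


def scheme_1(base: int, w: int, r: int, i: int) -> int:
    # first proposal: rank stride W, dataset stride G * W
    return (base + w + r * W + i * G * W) % (2 ** 32 - 1)


def scheme_2(base: int, w: int, r: int, i: int) -> int:
    # second proposal: dataset stride W, rank stride ND * W
    return (base + w + i * W + r * ND * W) % (2 ** 32 - 1)


seeds_1 = {scheme_1(0, w, r, i) for w in range(W) for r in range(G) for i in range(ND)}
seeds_2 = {scheme_2(0, w, r, i) for w in range(W) for r in range(G) for i in range(ND)}
```

Both schemes enumerate 16 consecutive, distinct seeds; they differ only in which dimension gets which stride.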
Oh I see, in that case the "spread" for each GPU process is `num_workers * num_datasets`. So for a single dataset we need

```python
seed = base_seed + torch.distributed.get_rank() * worker_info.num_workers
```

For single-dataset code on a 2-GPU problem with 4 workers per GPU this gives:
```
rank: 1, worker: 0, seed: 734796318
rank: 0, worker: 1, seed: 734796315
rank: 0, worker: 2, seed: 734796316
rank: 1, worker: 1, seed: 734796319
rank: 0, worker: 3, seed: 734796317
rank: 1, worker: 2, seed: 734796320
rank: 1, worker: 3, seed: 734796321
```
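This pattern can be reproduced by plugging the numbers into the formula. Here `worker_info.seed` is modeled as a base plus the worker id (which PyTorch adds by default); the base 734796314 is inferred from the rank-0 entries in the log above:

```python
def single_dataset_seed(base: int, rank: int, worker_id: int, num_workers: int) -> int:
    # worker_info.seed == base + worker_id in PyTorch's default worker seeding
    return (base + worker_id + rank * num_workers) % (2 ** 32 - 1)


seeds = sorted(
    single_dataset_seed(734796314, r, w, 4) for r in range(2) for w in range(4)
)
```

All 8 (rank, worker) combinations land on consecutive, distinct seeds, matching the logged values.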
For multiple datasets, we need:

```python
seed_i = (
    base_seed
    - worker_info.id
    + torch.distributed.get_rank()
    * (worker_info.num_workers * len(data.datasets))
    + worker_info.id * len(data.datasets)
    + i
) % (2 ** 32 - 1)
```
This multiple-dataset code on a 2-GPU problem with 4 workers per GPU gives:

```
rank: 0, worker: 0, dataset: 0, seed: 2723649642
rank: 0, worker: 0, dataset: 1, seed: 2723649643
rank: 1, worker: 0, dataset: 0, seed: 2723649650
rank: 1, worker: 0, dataset: 1, seed: 2723649651
rank: 0, worker: 1, dataset: 0, seed: 2723649644
rank: 0, worker: 1, dataset: 1, seed: 2723649645
rank: 1, worker: 1, dataset: 0, seed: 2723649652
rank: 1, worker: 1, dataset: 1, seed: 2723649653
rank: 0, worker: 2, dataset: 0, seed: 2723649646
rank: 0, worker: 2, dataset: 1, seed: 2723649647
rank: 1, worker: 2, dataset: 0, seed: 2723649654
rank: 1, worker: 2, dataset: 1, seed: 2723649655
rank: 0, worker: 3, dataset: 0, seed: 2723649648
rank: 0, worker: 3, dataset: 1, seed: 2723649649
rank: 1, worker: 3, dataset: 0, seed: 2723649656
rank: 1, worker: 3, dataset: 1, seed: 2723649657
```
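The same exercise works for the multiple-dataset formula (with the base 2723649642 taken from the rank-0, worker-0, dataset-0 line): all 16 (rank, worker, dataset) combinations map to consecutive, distinct seeds.

```python
def multi_dataset_seed(base, rank, worker_id, dataset_idx, num_workers, num_datasets):
    # PyTorch's default per-worker seed already includes the worker id
    base_seed = base + worker_id
    return (
        base_seed
        - worker_id
        + rank * (num_workers * num_datasets)
        + worker_id * num_datasets
        + dataset_idx
    ) % (2 ** 32 - 1)


seeds = sorted(
    multi_dataset_seed(2723649642, r, w, i, 4, 2)
    for r in range(2)
    for w in range(4)
    for i in range(2)
)
```

Subtracting `worker_id` first undoes the offset already baked into `base_seed`, so the explicit `worker_id * num_datasets` stride can take its place.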
For reference only: the logs below show how the default seeding works. On a 2-GPU DDP job with 4 workers per GPU, we get:
```
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077902
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077902
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077903
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077903
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077904
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077904
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077905
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077905
```
For a 4-node, 32-GPU job, on the 5th GPU on the 3rd node with 10 workers I get:
```
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313846
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313847
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313848
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313849
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313850
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313851
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313852
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313853
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313854
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313855
```
The 2nd GPU on the 1st node gives:

```
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313846
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313847
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313848
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313849
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313850
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313851
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313852
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313853
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313854
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313855
```
Exactly, we are on the same page.
For multiple datasets, we need:

```python
seed_i = (
    base_seed
    - worker_info.id
    + torch.distributed.get_rank()
    * (worker_info.num_workers * len(data.datasets))
    + worker_info.id * len(data.datasets)
    + i
) % (2 ** 32 - 1)
```
This seems to be equivalent to

```python
seed_i = (
    base_seed
    + i * worker_info.num_workers
    + torch.distributed.get_rank() * len(data.datasets) * worker_info.num_workers
) % (2 ** 32 - 1)
```

but with the roles of `i` and `worker_info.id` swapped, because `base_seed` already has `worker_info.id` built in. Anyway, I will update the commit with the version you suggested since we have both verified that it works well.
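The claimed equivalence can be checked directly: modeling `base_seed` as a base plus the worker id, the two expressions produce the same overall set of seeds, just with the worker and dataset strides swapped (illustrative sizes: 4 workers, 2 datasets, 2 ranks):

```python
W, D, R = 4, 2, 2  # num_workers, num_datasets, world size (illustrative)


def variant_a(r: int, w: int, i: int) -> int:
    base_seed = w  # stand-in for worker_info.seed (base 0 + worker id)
    return (base_seed - w + r * (W * D) + w * D + i) % (2 ** 32 - 1)


def variant_b(r: int, w: int, i: int) -> int:
    base_seed = w  # here the built-in worker-id offset is kept as-is
    return (base_seed + i * W + r * D * W) % (2 ** 32 - 1)


seeds_a = {variant_a(r, w, i) for r in range(R) for w in range(W) for i in range(D)}
seeds_b = {variant_b(r, w, i) for r in range(R) for w in range(W) for i in range(D)}
```

Individual (rank, worker, dataset) triples map to different values under the two variants, but the full set of seeds is identical, so mask diversity is the same either way.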
@mmuckley Updated, it should be good now.
Updated! Mask seeds should be truly random now.
When using `ddp` training, workers with the same `id` across different devices share the same initial random seed for the mask function. This decreases the diversity of random masks during training, since the same mask will be generated on different devices throughout training. This fix generates a unique initial random seed for each worker based on their `id`, device `rank`, and dataset (in case of `combine_train_val = True`).