Closed z-fabian closed 3 years ago
@mmuckley I agree that this is probably expected behavior when using `pl_seed_everything()`
and that it works correctly as-is, but I think it makes sense to generate completely unique masks in each worker (even though the data they are applied to will be different). There is even a slight improvement in validation SSIM on smaller subsampled datasets.
Also, I agree that

```python
seed = (worker_info.seed + torch.distributed.get_rank() * worker_info.num_workers) % (2 ** 32 - 1)
```

should be used in the distributed case when there is a single dataset. In the non-distributed setting we should fall back on the same seed as before.
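That fallback logic can be sketched as a pure helper (a minimal sketch with illustrative names, not the actual fastMRI API):

```python
def mask_seed(default_worker_seed: int, rank: int, num_workers: int, is_ddp: bool) -> int:
    """Per-worker mask seed: offset by rank in DDP, default seed otherwise (sketch)."""
    if is_ddp:
        # each rank shifts its workers' seeds by rank * num_workers
        return (default_worker_seed + rank * num_workers) % (2 ** 32 - 1)
    # non-distributed: keep the default per-worker seed
    return default_worker_seed % (2 ** 32 - 1)
```

The modulo keeps the result in the range accepted by 32-bit seeding APIs such as `numpy.random.seed`.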
I included this part

```python
if torch.distributed.is_initialized():
    is_ddp = True
    device_rank = torch.distributed.get_rank()
    num_gpus = torch.distributed.get_world_size()
```

to check whether distributed training is being used. If it is not, we probably cannot call `torch.distributed.get_rank()`. Is there a simpler way of doing this? I might have misunderstood your comment on this part.
@z-fabian I thought your original code with a serial check of `torch.distributed.is_available()` and then `torch.distributed.is_initialized()` was reasonable. After this we should be able to call `torch.distributed.get_rank()` and get the global rank, unless I'm missing something. We shouldn't need the GPU count; we just need our process rank.
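The serial check described here could be wrapped like this (a sketch; returns rank 0 outside distributed runs):

```python
import torch


def get_global_rank() -> int:
    # serial check: torch.distributed may be unavailable (unsupported build),
    # or available but never initialized (single-process training)
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return torch.distributed.get_rank()
    return 0
```

In a non-distributed run this simply returns 0, so the same seeding code path can be used everywhere.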
@mmuckley Oh, I see. I use the GPU count to generate a unique seed when we have multiple datasets here:

```python
seed_i = (
    base_seed
    + device_rank * num_workers
    + i * num_gpus * num_workers
) % (2 ** 32 - 1)
```

because here we have three 'dimensions' that can vary. That is, we need a unique seed for each dataset on each worker on each GPU. But I could base it on the total number of datasets if that's preferable, like this:

```python
seed_i = (
    base_seed
    + i * num_workers
    + device_rank * num_datasets * num_workers
) % (2 ** 32 - 1)
```

Am I missing something here? Is there anything else I should change?
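As a sanity check, both orderings map every (worker, rank, dataset) combination to a distinct seed. A small sketch with illustrative sizes (4 workers, 2 GPUs, 2 datasets), modeling `base_seed` as a base value plus the worker id, as in PyTorch's default worker seeding:

```python
W, G, ND = 4, 2, 2  # workers per GPU, GPUs, datasets (illustrative sizes)


def scheme_1(base: int, w: int, r: int, i: int) -> int:
    # first proposal: rank stride W, dataset stride G * W
    return (base + w + r * W + i * G * W) % (2 ** 32 - 1)


def scheme_2(base: int, w: int, r: int, i: int) -> int:
    # second proposal: dataset stride W, rank stride ND * W
    return (base + w + i * W + r * ND * W) % (2 ** 32 - 1)


seeds_1 = {scheme_1(0, w, r, i) for w in range(W) for r in range(G) for i in range(ND)}
seeds_2 = {scheme_2(0, w, r, i) for w in range(W) for r in range(G) for i in range(ND)}
```

Both schemes enumerate 16 consecutive, distinct seeds; they differ only in which dimension gets which stride.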
Oh I see, in that case the "spread" for each GPU process is `num_workers * num_datasets`. So for a single dataset we need

```python
seed = base_seed + torch.distributed.get_rank() * worker_info.num_workers
```

For single-dataset code on a 2-GPU problem with 4 workers per GPU this gives:
```
rank: 1, worker: 0, seed: 734796318
rank: 0, worker: 1, seed: 734796315
rank: 0, worker: 2, seed: 734796316
rank: 1, worker: 1, seed: 734796319
rank: 0, worker: 3, seed: 734796317
rank: 1, worker: 2, seed: 734796320
rank: 1, worker: 3, seed: 734796321
```
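This pattern can be reproduced by plugging the numbers into the formula. Here `worker_info.seed` is modeled as a base plus the worker id (which PyTorch adds by default); the base 734796314 is inferred from the rank-0 entries in the log above:

```python
def single_dataset_seed(base: int, rank: int, worker_id: int, num_workers: int) -> int:
    # worker_info.seed == base + worker_id in PyTorch's default worker seeding
    return (base + worker_id + rank * num_workers) % (2 ** 32 - 1)


seeds = sorted(
    single_dataset_seed(734796314, r, w, 4) for r in range(2) for w in range(4)
)
```

All 8 (rank, worker) combinations land on consecutive, distinct seeds, matching the logged values.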
For multiple datasets, we need:

```python
seed_i = (
    base_seed
    - worker_info.id
    + torch.distributed.get_rank()
    * (worker_info.num_workers * len(data.datasets))
    + worker_info.id * len(data.datasets)
    + i
) % (2 ** 32 - 1)
```
This multiple-dataset code on a 2-GPU problem with 4 workers per GPU gives:

```
rank: 0, worker: 0, dataset: 0, seed: 2723649642
rank: 0, worker: 0, dataset: 1, seed: 2723649643
rank: 1, worker: 0, dataset: 0, seed: 2723649650
rank: 1, worker: 0, dataset: 1, seed: 2723649651
rank: 0, worker: 1, dataset: 0, seed: 2723649644
rank: 0, worker: 1, dataset: 1, seed: 2723649645
rank: 1, worker: 1, dataset: 0, seed: 2723649652
rank: 1, worker: 1, dataset: 1, seed: 2723649653
rank: 0, worker: 2, dataset: 0, seed: 2723649646
rank: 0, worker: 2, dataset: 1, seed: 2723649647
rank: 1, worker: 2, dataset: 0, seed: 2723649654
rank: 1, worker: 2, dataset: 1, seed: 2723649655
rank: 0, worker: 3, dataset: 0, seed: 2723649648
rank: 0, worker: 3, dataset: 1, seed: 2723649649
rank: 1, worker: 3, dataset: 0, seed: 2723649656
rank: 1, worker: 3, dataset: 1, seed: 2723649657
```
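The same exercise works for the multiple-dataset formula (with the base 2723649642 taken from the rank-0, worker-0, dataset-0 line): all 16 (rank, worker, dataset) combinations map to consecutive, distinct seeds.

```python
def multi_dataset_seed(base, rank, worker_id, dataset_idx, num_workers, num_datasets):
    # PyTorch's default per-worker seed already includes the worker id
    base_seed = base + worker_id
    return (
        base_seed
        - worker_id
        + rank * (num_workers * num_datasets)
        + worker_id * num_datasets
        + dataset_idx
    ) % (2 ** 32 - 1)


seeds = sorted(
    multi_dataset_seed(2723649642, r, w, i, 4, 2)
    for r in range(2)
    for w in range(4)
    for i in range(2)
)
```

Subtracting `worker_id` first undoes the offset already baked into `base_seed`, so the explicit `worker_id * num_datasets` stride can take its place.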
For reference only: the logs below show how the default seeding works. On a 2-GPU DDP job with 4 workers per GPU, we get:
```
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077902
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077902
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077903
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077903
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077904
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077904
torch.distributed.get_rank(): 1, worker.info.seed: 8314211556539077905
torch.distributed.get_rank(): 0, worker.info.seed: 8314211556539077905
```
For a 4-node, 32-GPU job, on the 5th GPU on the 3rd node with 10 workers I get:
```
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313846
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313847
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313848
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313849
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313850
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313851
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313852
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313853
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313854
torch.distributed.get_rank(): 20, worker.info.seed: 6298923453764313855
```
The 2nd GPU on the 1st node gives:

```
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313846
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313847
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313848
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313849
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313850
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313851
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313852
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313853
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313854
torch.distributed.get_rank(): 1, worker.info.seed: 6298923453764313855
```
Exactly, we are on the same page.
For multiple datasets, we need:

```python
seed_i = (
    base_seed
    - worker_info.id
    + torch.distributed.get_rank()
    * (worker_info.num_workers * len(data.datasets))
    + worker_info.id * len(data.datasets)
    + i
) % (2 ** 32 - 1)
```
This seems to be equivalent to

```python
seed_i = (
    base_seed
    + i * worker_info.num_workers
    + torch.distributed.get_rank() * len(data.datasets) * worker_info.num_workers
) % (2 ** 32 - 1)
```

but with the roles of `i` and `worker_info.id` swapped, because `base_seed` already has `worker_info.id` built in. Anyway, I will update the commit with the version you suggested since we have both verified that it works well.
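The claimed equivalence can be checked directly: modeling `base_seed` as a base plus the worker id, the two expressions produce the same overall set of seeds, just with the worker and dataset strides swapped (illustrative sizes: 4 workers, 2 datasets, 2 ranks):

```python
W, D, R = 4, 2, 2  # num_workers, num_datasets, world size (illustrative)


def variant_a(r: int, w: int, i: int) -> int:
    base_seed = w  # stand-in for worker_info.seed (base 0 + worker id)
    return (base_seed - w + r * (W * D) + w * D + i) % (2 ** 32 - 1)


def variant_b(r: int, w: int, i: int) -> int:
    base_seed = w  # here the built-in worker-id offset is kept as-is
    return (base_seed + i * W + r * D * W) % (2 ** 32 - 1)


seeds_a = {variant_a(r, w, i) for r in range(R) for w in range(W) for i in range(D)}
seeds_b = {variant_b(r, w, i) for r in range(R) for w in range(W) for i in range(D)}
```

Individual (rank, worker, dataset) triples map to different values under the two variants, but the full set of seeds is identical, so mask diversity is the same either way.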
@mmuckley Updated, it should be good now.
Updated! Mask seeds should be truly random now.
When using `ddp` training, workers with the same `id` across different devices share the same initial random seed for the mask function. This decreases the diversity of random masks during training, since the same mask will be generated on different devices throughout training. This fix generates a unique initial random seed for each worker based on their `id`, device `rank`, and dataset (in case of `combine_train_val = True`).