AidenDurrant / SimCLR-Pytorch

An implementation of "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR)

hi #1

Closed jun0wanan closed 3 years ago

jun0wanan commented 3 years ago

Will training with multiple GPUs make the results worse?

AidenDurrant commented 3 years ago

I have only tested with multi-GPU on one machine, data-parallel, and distributed data-parallel. The results shouldn't change significantly based on multi-GPU.

jun0wanan commented 3 years ago

> I have only tested with multi-GPU on one machine, data-parallel, and distributed data-parallel. The results shouldn't change significantly based on multi-GPU.

Hi, but then each GPU has fewer negative samples, so why doesn't that make the results worse?

AidenDurrant commented 3 years ago

You are correct if you use distributed data-parallel. Sorry for the confusion, I mis-typed: I have not tested distributed data-parallel. For the standard data-parallel the same batch size still applies, therefore it should not reduce performance.
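To make the batch-size point concrete, here is a rough sketch (my own illustration, not code from this repository) of how the number of negatives per anchor in the NT-Xent loss depends on the batch the loss actually sees:

```python
# Rough illustration (not repository code): in SimCLR's NT-Xent loss, a batch of N
# images yields 2N augmented views, and each anchor has 1 positive and 2N - 2 negatives.

def negatives_per_anchor(batch_size: int) -> int:
    return 2 * batch_size - 2

N = 64
# nn.DataParallel: the loss is computed on the gathered outputs, so it still sees N.
print(negatives_per_anchor(N))                 # 126

# A naive distributed data-parallel setup that computes the loss per process
# (without gathering embeddings) only sees N / world_size samples locally.
world_size = 4
print(negatives_per_anchor(N // world_size))   # 30
```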

jun0wanan commented 3 years ago

> You are correct if you use distributed data-parallel. Sorry for the confusion, I mis-typed: I have not tested distributed data-parallel. For the standard data-parallel the same batch size still applies, therefore it should not reduce performance.

Regarding "For the standard data-parallel the same batch size still applies, therefore it should not reduce performance": do you mean that one GPU gets a batch of 64, so 8 GPUs together get 8*64? Is that why performance is not reduced?

jun0wanan commented 3 years ago

> You are correct if you use distributed data-parallel. Sorry for the confusion, I mis-typed: I have not tested distributed data-parallel. For the standard data-parallel the same batch size still applies, therefore it should not reduce performance.

I also use `nn.DataParallel`:

> To train with traditional nn.DataParallel with multiple GPUs, use: `python main.py --no_distributed`

AidenDurrant commented 3 years ago

Use the `--no_distributed` flag as my implementation of distributed training currently does not work as intended.

Check out https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html for more details on data-parallel.

Basically, a subset of the mini-batch is sent to each GPU along with a copy of the same model. The activations are computed on each GPU and then gathered, the loss is computed over the aggregate of all the outputs, and the backprop step is then computed in parallel for each copy of the model, resulting in identical copies of the model on each GPU. This means the algorithm sees the full batch size as intended.

So a batch of 64 on 4 GPUs will mean each GPU will see 16 samples (64/4).

https://erickguan.me/2019/pytorch-parallel-model is a nice post on how this is implemented.
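As a minimal sketch of that pattern (simplified and assumed, not the repository's actual `main.py`; the ResNet-18 and the toy `nt_xent` below are stand-ins for the repo's encoder and contrastive loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Simplified sketch of SimCLR training with nn.DataParallel (not the repo's main.py).
# The ResNet-18 with a 128-d output is a stand-in for the encoder + projection head.
encoder = nn.DataParallel(torchvision.models.resnet18(num_classes=128)).cuda()
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.3)  # illustrative hyperparameters

def nt_xent(z, temperature=0.5):
    """Toy NT-Xent: expects 2N embeddings where z[i] and z[i + N] are two views of image i."""
    z = F.normalize(z, dim=1)
    n = z.size(0) // 2
    sim = (z @ z.t()) / temperature
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Two augmented views of the same batch of 64 images (random tensors here).
x1 = torch.randn(64, 3, 224, 224).cuda()
x2 = torch.randn(64, 3, 224, 224).cuda()

# DataParallel scatters the 128 inputs across the GPUs, runs the replicas in parallel,
# then gathers the embeddings back onto the default GPU, so the loss below is
# computed over the full batch of negatives.
optimizer.zero_grad()
z = encoder(torch.cat([x1, x2], dim=0))   # shape: (128, 128)
loss = nt_xent(z)
loss.backward()                           # gradients are accumulated on the original model
optimizer.step()
```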

jun0wanan commented 3 years ago

> Use the `--no_distributed` flag as my implementation of distributed training currently does not work as intended.
>
> Check out https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html for more details on data-parallel.
>
> Basically, a subset of the mini-batch is sent to each GPU along with a copy of the same model. The activations are computed on each GPU and then gathered, the loss is computed over the aggregate of all the outputs, and the backprop step is then computed in parallel for each copy of the model, resulting in identical copies of the model on each GPU. This means the algorithm sees the full batch size as intended.
>
> So a batch of 64 on 4 GPUs will mean each GPU will see 16 samples (64/4).
>
> https://erickguan.me/2019/pytorch-parallel-model is a nice post on how this is implemented.

Thank you for your prompt reply. (●'◡'●)

I know what you mean: a batch of 64 on 4 GPUs means each GPU will see 16 samples (64/4). However, in this case each GPU only has 16 samples' worth of negatives (16 = batch/GPUs), so won't the results become worse?
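For reference, the point from the earlier comment can be checked directly with a tiny hypothetical snippet (assuming multiple visible GPUs): the output returned by `nn.DataParallel` is gathered back into one tensor spanning the full batch, so the loss computed on it still sees all 64 samples' negatives even though each GPU only ran the forward pass on 16.

```python
import torch
import torch.nn as nn

# Hypothetical check (assumes at least 4 visible GPUs): nn.DataParallel splits the
# batch across GPUs for the forward pass but gathers the outputs back together.
model = nn.DataParallel(nn.Linear(512, 128)).cuda()
x = torch.randn(64, 512).cuda()   # full batch of 64 (each GPU only runs ~16 rows)
z = model(x)
print(z.shape)                    # torch.Size([64, 128]): the loss computed on z
                                  # still sees negatives from the whole batch
```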