ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Batchnorm support #129

Closed · ramcherukuri closed this 7 months ago

ramcherukuri commented 7 months ago

Fixes to Sync batchnorm test scripts.

Test result:

```
$ torchrun --nnodes 1 --nproc-per-node 2 two_gpu_test_different_batch_size.py --apex
[2024-01-18 19:22:08,932] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
====SBN two gpu with different batches test passed
```

```
$ torchrun --nnodes 1 --nproc-per-node 2 two_gpu_unit_test.py
[2024-01-18 19:43:50,447] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
--- count : tensor([25600, 25600], device='cuda:0', dtype=torch.int32)
--- count : tensor([25600, 25600], device='cuda:1', dtype=torch.int32)
====SBN two gpu passed tests
====SBN two gpu passed tests
```

(The two `count` lines and the two "passed tests" lines are printed once per rank; the raw log interleaves them.)
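For context on what the `count` tensors in the log represent: sync batchnorm computes per-channel statistics by all-reducing each rank's local sum, sum of squares, and element count across the process group, which is exactly what lets ranks contribute different batch sizes. The sketch below illustrates that aggregation with a single-process `gloo` group on CPU so it runs anywhere; it is a hedged illustration of the technique, not apex's actual `SyncBatchNorm` code, and the tensor shapes are made up for the example (the real tests launch two CUDA processes via `torchrun` as shown above).

```python
import os
import torch
import torch.distributed as dist

def sync_bn_stats(x):
    """Aggregate per-channel batchnorm statistics across ranks.

    A sketch of the sync-batchnorm reduction: each rank contributes its
    local per-channel sum, sum of squares, and element count, and the
    global mean/variance are derived from the all-reduced totals.
    """
    n, c, h, w = x.shape
    local_sum = x.sum(dim=(0, 2, 3))
    local_sqsum = (x * x).sum(dim=(0, 2, 3))
    # per-channel element count on this rank; with different per-rank
    # batch sizes these differ, which the test above exercises
    count = torch.full((c,), n * h * w, dtype=torch.int32)
    dist.all_reduce(local_sum)
    dist.all_reduce(local_sqsum)
    dist.all_reduce(count)
    total = count.to(local_sum.dtype)
    mean = local_sum / total
    var = local_sqsum / total - mean * mean  # biased (population) variance
    return mean, var, count

# single-process gloo group just to make the sketch runnable on CPU;
# MASTER_ADDR/MASTER_PORT values here are arbitrary placeholders
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.randn(4, 2, 8, 8)  # hypothetical shapes, not from the tests
mean, var, count = sync_bn_stats(x)

# with one rank the reduction is a no-op, so the results must match
# plain per-channel statistics computed directly on x
assert torch.allclose(mean, x.mean(dim=(0, 2, 3)), atol=1e-6)
assert torch.allclose(var, x.var(dim=(0, 2, 3), unbiased=False), atol=1e-5)
assert count.tolist() == [4 * 8 * 8, 4 * 8 * 8]

dist.destroy_process_group()
```

With two ranks the all-reduced `count` is the sum of both ranks' `N * H * W`, matching the `tensor([25600, 25600], ...)` printed once per device in the unit-test log.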