ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Consider both contiguous and channels_last tensors for FusedSGD #97

Closed hubertlu-tw closed 1 year ago

hubertlu-tw commented 1 year ago

Authored by @luise1030 to address tensor memory-format inconsistency observed when ResNet50 is trained in NHWC (channels_last) format. To run the unit tests covering the code changes in this PR:

$ python tests/L0/run_test.py --include run_optimizers

test_float (test_fused_optimizer.TestFusedSGD) ... ok
test_half (test_fused_optimizer.TestFusedSGD) ... ok
test_multi_device (test_fused_optimizer.TestFusedSGD) ... ok

Internal JIRA ticket for the context: https://ontrack-internal.amd.com/browse/SWDEV-357815

hubertlu-tw commented 1 year ago

This PR resolves the issue where a parameter p in parameters() of apex.optimizers.FusedSGD and its gradient p.grad are not in the same memory format.

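The mismatch can be inspected directly on a channels_last model. Below is a minimal sketch (a hypothetical illustration, not part of this PR or its tests; it uses a single Conv2d in place of ResNet50) showing how to check the memory format of a parameter and of its gradient after a backward pass:

```python
import torch
import torch.nn as nn

# Hypothetical illustration: convert a small model and its input to channels_last (NHWC),
# run a backward pass, then compare the memory formats of a parameter and its gradient.
model = nn.Conv2d(3, 64, kernel_size=7, padding=3).cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 64, 64, device="cuda").to(memory_format=torch.channels_last)

model(x).sum().backward()

p = model.weight
print(p.is_contiguous(memory_format=torch.channels_last))       # parameter is channels_last
print(p.grad.is_contiguous(memory_format=torch.channels_last))  # gradient may still be in torch.contiguous_format
```
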
Before this PR:

| p in parameters() (Test A) | p (Test B) | p.grad (Test A) | p.grad (Test B) | Comparison | Result |
| --- | --- | --- | --- | --- | --- |
| torch.contiguous_format | torch.channels_last | torch.contiguous_format | torch.contiguous_format | Test A = torch.optim.SGD, Test B = torch.optim.SGD | Same |
| torch.contiguous_format | torch.channels_last | torch.contiguous_format | torch.contiguous_format | Test A = torch.optim.SGD, Test B = apex.optimizers.FusedSGD | Different |

With this PR:

| p in parameters() (Test A) | p (Test B) | p.grad (Test A) | p.grad (Test B) | Comparison | Result |
| --- | --- | --- | --- | --- | --- |
| torch.contiguous_format | torch.channels_last | torch.contiguous_format | torch.contiguous_format | Test A = torch.optim.SGD, Test B = torch.optim.SGD | Same |
| torch.contiguous_format | torch.channels_last | torch.contiguous_format | torch.contiguous_format | Test A = torch.optim.SGD, Test B = apex.optimizers.FusedSGD | Same |
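For context, here is a minimal sketch of the kind of comparison the tables summarize. It is not the actual unit test from this PR; it assumes apex is installed with the fused optimizers built, and the tensor shape, learning rate, and momentum are arbitrary placeholders:

```python
import torch
from apex.optimizers import FusedSGD

torch.manual_seed(0)

# Test A: contiguous parameter updated by torch.optim.SGD.
p_ref = torch.randn(8, 16, 4, 4, device="cuda", requires_grad=True)
grad = torch.randn_like(p_ref)

# Test B: same values, but the parameter is laid out as channels_last
# while its gradient stays in torch.contiguous_format (the mismatch above).
p_cl = p_ref.detach().clone().to(memory_format=torch.channels_last).requires_grad_(True)

p_ref.grad = grad.clone()
p_cl.grad = grad.clone().contiguous()

opt_a = torch.optim.SGD([p_ref], lr=0.1, momentum=0.9)
opt_b = FusedSGD([p_cl], lr=0.1, momentum=0.9)

opt_a.step()
opt_b.step()

# Before this PR the FusedSGD update could diverge from torch.optim.SGD
# for this layout combination; with this PR the results should match.
print(torch.allclose(p_ref, p_cl))
```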