ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Cannot train on gfx803 #342

Open Burstaholic opened 5 years ago

Burstaholic commented 5 years ago

🐛 Bug

Compiling PyTorch in the rocm/pytorch:rocm2.1 docker image, I get a large number of "warning: loop not unrolled" messages. I don't see them in any of your CI output or in other snippets posted here, so I wondered whether they might be the cause of my problems. I have three tests failing, two with errors similar to another open issue, and neural network training isn't working for me.

In the PyTorch beginner tutorial, there are no errors, but the network is clearly not being trained:

[1,  2000] loss: 2.304
[1,  4000] loss: 2.303
[1,  6000] loss: 2.303
[1,  8000] loss: 2.303
[1, 10000] loss: 2.303
[1, 12000] loss: 2.304
[2,  2000] loss: 2.303
[2,  4000] loss: 2.303
[2,  6000] loss: 2.303
[2,  8000] loss: 2.304
[2, 10000] loss: 2.304
[2, 12000] loss: 2.303
Finished Training

Just to be clear, the loss should converge towards 1.0, and it does when the same code runs on the CPU.

My PyTorch build is at least partly working: I've been using it to run https://github.com/xinntao/ESRGAN, and the results are clearly better than running the same model on the CPU. I have no idea whether I'm doing something wrong in the compile or whether there's a bug somewhere, but it seems to be training, rather than inference, that is broken.
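
For reference, a minimal check that exercises the training path (backward and optimizer steps) rather than just the forward pass could look like the sketch below. It is not the tutorial code, just a stripped-down stand-in; on a healthy install the printed loss should fall towards zero.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the tutorial code): a linear model fitting a trivially
# learnable target. If backward() and optimizer.step() work on the GPU, the
# printed loss should drop towards zero within a few hundred steps.
device = torch.device("cuda")
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(256, 10, device=device)
y = x.sum(dim=1, keepdim=True)          # a target the model can represent exactly

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())
```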

Environment

rocm/pytorch:rocm2.1 docker after an apt full-upgrade. Host: Ubuntu 18.10, Ryzen 5 1600X, 16 GB RAM. I've tried both lowering MAX_JOBS and creating a large swap file to avoid memory issues, but neither affects the errors.

Here's everything from your environment script that got a value:

PyTorch version: 1.1.0a0+c751cf8
Is debug build: No

OS: Ubuntu 16.04.5 LTS
CMake version: version 3.6.3

Python version: 2.7
Is CUDA available: Yes

Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.1.0a0+c751cf8
[pip] torchvision==0.2.1

GPU

R9 Fury, target gfx803. I wonder whether using an older, non-default target is part of my problem. I understand older GPUs naturally receive less focus, but I hope you'll be able to look into it if there is a gfx803 issue.

Output

Example warning:

In file included from /data/development/rocm-pytorch/aten/src/THH/THHTensorSort.cuh:8:
/data/development/rocm-pytorch/aten/src/THH/THHSortUtils.cuh:141:1: 
warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

Test Output:

======================================================================
FAIL: test_broadcast_batched_matmul (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/development/rocm-pytorch/test/common_utils.py", line 296, in wrapper
    method(*args, **kwargs)
  File "/data/development/rocm-pytorch/test/test_cuda.py", line 2218, in test_broadcast_batched_matmul
    _TestTorchMixin._test_broadcast_batched_matmul(self, lambda t: t.cuda())
  File "/data/development/rocm-pytorch/test/test_torch.py", line 3760, in _test_broadcast_batched_matmul
    verify_batched_matmul(*indices)
  File "/data/development/rocm-pytorch/test/test_torch.py", line 3752, in verify_batched_matmul
    self.assertEqual(truth, maybe_squeeze_result(l, r, out))
  File "/data/development/rocm-pytorch/test/common_utils.py", line 427, in assertEqual
    assertTensorsEqual(x, y)
  File "/data/development/rocm-pytorch/test/common_utils.py", line 408, in assertTensorsEqual
    self.assertTrue(torch.equal(nan_mask, torch.isnan(b)), message)
AssertionError: False is not true : 

======================================================================
FAIL: test_broadcast_fused_matmul (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/development/rocm-pytorch/test/common_utils.py", line 296, in wrapper
    method(*args, **kwargs)
  File "/data/development/rocm-pytorch/test/test_cuda.py", line 2215, in test_broadcast_fused_matmul
    _TestTorchMixin._test_broadcast_fused_matmul(self, lambda t: t.cuda())
  File "/data/development/rocm-pytorch/test/test_torch.py", line 3689, in _test_broadcast_fused_matmul
    self.assertEqual(r0, r1)
  File "/data/development/rocm-pytorch/test/common_utils.py", line 427, in assertEqual
    assertTensorsEqual(x, y)
  File "/data/development/rocm-pytorch/test/common_utils.py", line 419, in assertTensorsEqual
    self.assertLessEqual(max_err, prec, message)
AssertionError: tensor(9., device='cuda:0', dtype=torch.float32) not less than or equal to 1e-05 : 

======================================================================
FAIL: test_randperm_cuda (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/development/rocm-pytorch/test/common_utils.py", line 296, in wrapper
    method(*args, **kwargs)
  File "/data/development/rocm-pytorch/test/test_cuda.py", line 2513, in test_randperm_cuda
    self.assertEqual(res1, res2, 0)
  File "/data/development/rocm-pytorch/test/common_utils.py", line 427, in assertEqual
    assertTensorsEqual(x, y)
  File "/data/development/rocm-pytorch/test/common_utils.py", line 419, in assertTensorsEqual
    self.assertLessEqual(max_err, prec, message)
AssertionError: tensor(9223372036854775492, device='cuda:0') not less than or equal to 0 : 

----------------------------------------------------------------------
Ran 150 tests in 7.430s

FAILED (failures=3, skipped=92)
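
For what it's worth, the failing operation can be exercised outside the test suite with something like the sketch below (a stand-in, not the actual test); on a healthy build the printed difference should be essentially zero.

```python
import torch

# Sketch of the kind of broadcast batched matmul that
# test_broadcast_batched_matmul exercises: compute on CPU and GPU and compare.
a = torch.randn(4, 1, 5, 3)
b = torch.randn(2, 3, 6)                       # batch dims broadcast against a's (4, 1)
cpu_result = torch.matmul(a, b)
gpu_result = torch.matmul(a.cuda(), b.cuda()).cpu()
print((cpu_result - gpu_result).abs().max())   # should be ~0 on a healthy build
```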
iotamudelta commented 5 years ago

The loop unroll warning is just a performance warning that the compiler backend recently introduced. It will not affect correctness.

That being said, your training accuracy being off is a problem. Let's look into that.

First: We do not observe the unit test failures you report on gfx900 or gfx906. Can you confirm you are running the unit tests with PYTORCH_TEST_WITH_ROCM=1?
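
For example, something along these lines from the top of the source tree (a sketch of the invocation, not our exact CI command; the test path matches the tracebacks above):

```python
# Sketch: run the CUDA test file with the ROCm skip list enabled.
import os
import subprocess

env = dict(os.environ, PYTORCH_TEST_WITH_ROCM="1")
subprocess.check_call(["python", "test/test_cuda.py"], env=env)
```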

Second: Can you point to the script you are using so that we can check whether we observe the same issue on gfx900?

Thanks!

Burstaholic commented 5 years ago

Sure - I'm following https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html#building-pytorch-for-rocm and I do see a number of skipped "test doesn't currently work on the ROCm stack" messages in my output.

The tutorial is here: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#training-on-gpu

The actual changes needed to use CUDA are not in the downloadable code there; they are left as an exercise for the reader. However, while making sure I was doing it correctly, I found https://github.com/pytorch/examples/tree/master/mnist, which is almost exactly what the tutorial has you build.
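
Roughly, the GPU changes the tutorial asks for look like the sketch below (with simplified stand-ins for the tutorial's Net and CIFAR-10 trainloader, since the GPU version of the code isn't downloadable):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Sketch of the tutorial's "Training on GPU" changes; the net and the batches
# below are simplified stand-ins for the tutorial's Net and CIFAR-10 trainloader.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = nn.Linear(3 * 32 * 32, 10).to(device)    # stand-in for Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
trainloader = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))
               for _ in range(100)]            # stand-in batches

for epoch in range(2):
    for inputs, labels in trainloader:
        # The key change: move each batch to the same device as the network.
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs.view(inputs.size(0), -1))
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
```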

The MNIST example shows a slightly different problem:

Train Epoch: 1 [0/60000 (0%)]   Loss: 2.300039
Train Epoch: 1 [640/60000 (1%)] Loss: 2.182527
Train Epoch: 1 [1280/60000 (2%)]    Loss: 2.288186
Train Epoch: 1 [1920/60000 (3%)]    Loss: 3.501651
Train Epoch: 1 [2560/60000 (4%)]    Loss: nan
Train Epoch: 1 [3200/60000 (5%)]    Loss: nan
...

All subsequent lines simply show Loss: nan.
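
If it helps narrow things down, a check along the lines of the sketch below (called right after loss.backward() in the MNIST script) should show whether the NaNs first appear in the loss, the gradients, or the weights; torch.isnan is the only extra API it needs.

```python
import torch

def report_nans(model, loss):
    """Debugging sketch: print which tensors have gone NaN after loss.backward()."""
    if torch.isnan(loss).any():
        print("NaN loss:", loss.item())
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print("NaN gradient in", name)
        if torch.isnan(param).any():
            print("NaN weights in", name)
```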

iotamudelta commented 5 years ago

OK, we can train the MNIST example correctly on gfx900 and gfx906, so this must be a gfx803-specific issue.