Burstaholic opened this issue 5 years ago
The loop unroll warning is just a performance warning that the compiler backend recently introduced. It will not affect correctness.
That being said, your training accuracy being off is a problem. Let's look into that.
First: We do not observe the unit test failures you report on gfx900 or gfx906. Can you confirm you are running the unit tests with PYTORCH_TEST_WITH_ROCM=1?
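For example, one way to run a single test file with that flag set looks roughly like this (the test file name here is just an example, not one of your failing tests):

```python
# Rough sketch: run one PyTorch test file with ROCm-enabled tests turned on.
# test/test_nn.py is only an example; any test module works the same way.
import os
import subprocess

env = dict(os.environ, PYTORCH_TEST_WITH_ROCM="1")
subprocess.run(["python", "test/test_nn.py", "-v"], env=env, check=True)
```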
Second: Can you point to the script you are using, so that we can check whether we observe the same issue on gfx900?
Thanks!
Sure - I'm following https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html#building-pytorch-for-rocm and I do see a number of skipped "test doesn't currently work on the ROCm stack" messages in my output.
The tutorial is here: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#training-on-gpu
The actual changes to use CUDA are not in the downloadable code there; they are left as an exercise for the reader. To make sure I was doing it correctly, I also found https://github.com/pytorch/examples/tree/master/mnist, which is almost exactly the network the tutorial has you build.
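For reference, the GPU change I made follows the pattern the tutorial describes. Here's a rough, self-contained sketch of that pattern with a stand-in model and random data (not my actual script):

```python
import torch
import torch.nn as nn

# Pick the device once, then move the model and every batch onto it -
# this is the part the CIFAR-10 tutorial leaves as an exercise.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = nn.Linear(3 * 32 * 32, 10).to(device)      # stand-in for the tutorial's CNN
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

inputs = torch.randn(4, 3, 32, 32).to(device)    # random stand-in for a CIFAR batch
labels = torch.randint(0, 10, (4,)).to(device)

optimizer.zero_grad()
loss = criterion(net(inputs.view(inputs.size(0), -1)), labels)
loss.backward()
optimizer.step()
print(loss.item())
```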
The MNIST example shows a slightly different problem:
Train Epoch: 1 [0/60000 (0%)] Loss: 2.300039
Train Epoch: 1 [640/60000 (1%)] Loss: 2.182527
Train Epoch: 1 [1280/60000 (2%)] Loss: 2.288186
Train Epoch: 1 [1920/60000 (3%)] Loss: 3.501651
Train Epoch: 1 [2560/60000 (4%)] Loss: nan
Train Epoch: 1 [3200/60000 (5%)] Loss: nan
...
All subsequent lines simply show Loss: nan.
OK, we can correctly train the MNIST example on gfx900 and gfx906, so this must be a gfx803 issue.
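To help narrow the gfx803 case down, something along these lines would tell us which batch first produces a non-finite loss (just a sketch; the helper name is made up, not part of the example script):

```python
import torch

# Make autograd raise an error when a backward op produces NaN,
# and flag the first batch whose loss goes non-finite.
torch.autograd.set_detect_anomaly(True)

def check_loss(step, loss):
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")

# In the MNIST example's training loop, call it right after computing the loss:
#     check_loss(batch_idx, loss)
```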
🐛 Bug
Compiling PyTorch in the rocm/pytorch:rocm2.1 docker, I'm getting a ton of "warning: loop not unrolled" messages printing out. I don't see them in any of your CI output or other snippets posted here, so I wondered if this might be the reason for my problems. I have three tests failing, two with errors similar to another open issue, and neural network training isn't working for me. In the PyTorch beginner tutorial, there are no errors, but the network is clearly not being trained:
Just to be clear, the loss value should converge towards 1.0, and it does when run on the CPU.
My PyTorch build is at least partly working - I've been using it to run https://github.com/xinntao/ESRGAN, and the results are clearly superior to running on the CPU. I have no idea whether I'm doing something wrong with the compile or there's a bug somewhere, but it seems to be training rather than inference that is broken.
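If it helps, this is the kind of minimal check I can run to compare a backward pass on the GPU against the CPU and see whether the gradients diverge (a rough sketch with an arbitrary small model, assuming the GPU is visible to PyTorch):

```python
import torch
import torch.nn as nn

# Run the same forward/backward on CPU and GPU with identical weights and data,
# then compare gradients; the model and sizes are arbitrary stand-ins.
torch.manual_seed(0)
model_cpu = nn.Linear(16, 4)
model_gpu = nn.Linear(16, 4).to("cuda")
model_gpu.load_state_dict(model_cpu.state_dict())

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
loss_fn = nn.CrossEntropyLoss()

loss_fn(model_cpu(x), y).backward()
loss_fn(model_gpu(x.to("cuda")), y.to("cuda")).backward()

for (name, p_cpu), p_gpu in zip(model_cpu.named_parameters(), model_gpu.parameters()):
    diff = (p_cpu.grad - p_gpu.grad.cpu()).abs().max().item()
    print(name, diff)   # large or NaN differences point at a broken backward kernel
```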
Environment
rocm/pytorch:rocm2.1 docker after apt full-upgrade. Host: Ubuntu 18.10, Ryzen 5 1600x, 16GB RAM. I've tried both lowering MAX_JOBS and creating a large swap file to avoid memory issues, but none of that affects the errors.
Here's everything from your environment script that got a value:
GPU: R9 Fury, target gfx803. I wonder if using an older, non-default target may be part of my problem. I understand older GPUs naturally receive less focus, though I hope you'll be able to look at it if there is a gfx803 issue.
Output
Example warning:
Test Output: