ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Wrong result while training SIGNS Dataset with RX580 but code works in NVIDIA environment #337

Closed: YongyiZhou closed this issue 3 years ago

YongyiZhou commented 5 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
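For reporters who prefer a script over the one-liner, the same information can be gathered with a short Python sketch. The function name `collect_environment` is hypothetical and not part of TensorFlow or ROCm. It reads `tf.__version__` rather than `tf.VERSION`, on the assumption that the attribute is available in the installed release, and it reports `None` when TensorFlow cannot be imported.

```python
import platform
import sys


def collect_environment():
    """Gather basic version details to paste into a bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        # Imported lazily so the script still runs without TensorFlow.
        import tensorflow as tf
    except ImportError:
        info["tensorflow"] = None
    else:
        info["tensorflow"] = getattr(tf, "__version__", None)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

This does not report the GPU model or the ROCm version, which still need to be stated by hand.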

Describe the current behavior: I'm taking the DeepLearning.ai course "Improving Deep Neural Networks" by Andrew Ng, and I have a homework assignment to train a classification model on the SIGNS dataset.

Describe the expected behavior: The training in the "model()" cell is supposed to succeed. With the same code and dataset, my classmate's NVIDIA computer produced the correct result. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/NVIDA_result.png

But when I copied the dataset and code from him, I got a wrong result while minimizing the cost. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/AMD_result.png

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/Tensorflow%2BTutorial.ipynb

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The whole project of dataset and code: https://github.com/YongyiZhou/issue-tf-rocm/tree/master/assignment3

sunway513 commented 5 years ago

Hi @YongyiZhou , thanks for reporting the issue. I'll try to repro it locally and let you know.

YongyiZhou commented 5 years ago

Hello @sunway513 , thank you for your time. Does "standalone script" mean a .py file that can be run from the console?

I've converted the main code to "Tensorflow_Tutorial.py" at https://github.com/YongyiZhou/issue-tf-rocm/tree/master/assignment3 , with the matplotlib cells removed.

Here are some details of my Docker environment, which has some extra libraries installed:

  1. I pulled the latest Docker image from https://hub.docker.com/r/rocm/tensorflow a week ago and created a container with the recommended command.
  2. I installed some libraries from PyPI: numpy, matplotlib, and scipy.
  3. I used apt install to install Jupyter Notebook.
  4. I haven't changed anything else in my container.

Here are some details on reproducing the issue:

  1. I followed the tutorial and completed the cells one by one, and all results were correct before cell "In [28]".
  2. When I ran cell In [28], the code ran but the result was wrong. Meanwhile, I sent my project to my classmate, and the result was correct in his CUDA environment. I also copied my project to another Windows laptop and ran it in CPU mode, and the result was correct there as well.
  3. A few minutes ago, I ran "python3 Tensorflow_Tutorial.py" in my container console. The result was wrong as before, while it was correct on another Windows computer in CPU mode.
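When comparing a run that trains correctly with one that does not, it can help to find the first epoch at which the logged costs disagree. The sketch below is one way to do that in plain Python. The helper name `first_divergence` is hypothetical, and the two cost lists are made-up placeholder values, not measurements from this issue.

```python
import math


def first_divergence(reference, candidate, rel_tol=1e-3, abs_tol=1e-6):
    """Return the index of the first cost pair that disagrees, or None.

    NaN values never compare as close, so a NaN in either list is
    reported as a divergence.
    """
    for i, (ref, cand) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None


# Placeholder numbers for illustration only.
cpu_costs = [1.86, 1.02, 0.73]
gpu_costs = [1.86, 1.90, 1.95]
print(first_divergence(cpu_costs, gpu_costs))
```

The comparison is only meaningful if both runs use the same random seed and the same mini-batch order. Otherwise small differences are expected even on a healthy setup, and the tolerances need to be loosened.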

I noticed that this issue happens in pytorch-rocm as well; see ROCmSoftwarePlatform/pytorch#342, which suggests that only gfx803 has this problem.

As a beginner, I can hardly give more info ...

Finally, thank you for your time again :)

sunway513 commented 5 years ago

@YongyiZhou Thanks a lot for the script, that helps! I can confirm the convergence issue is observed on Polaris (gfx803) GPUs only, for both TF1.12 and TF1.13; results converge fine on GFX900 and GFX906 GPUs. This issue is potentially related to issue #297 , cc @jerryyin .

sunway513 commented 5 years ago

Hi @YongyiZhou , the convergence issue is caused by the AdamOptimizer implementation on GFX803 targets, potentially related to the compiler codegen. To work around the issue, you can place the AdamOptimizer on the CPU, e.g. with the following patch:

diff --git a/assignment3/Tensorflow_Tutorial.py b/assignment3/Tensorflow_Tutorial.py
index b012cc9..e5e8c76 100644
--- a/assignment3/Tensorflow_Tutorial.py
+++ b/assignment3/Tensorflow_Tutorial.py
@@ -203,7 +203,8 @@ def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,

     # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
     ### START CODE HERE ### (1 line)
-    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
+    with tf.device("/cpu:0"): 
+        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
     ### END CODE HERE ###

     # Initialize all the variables

cc @whchung for awareness.
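As a device-independent point of comparison, the Adam update rule can be written out in a few lines of plain Python. This is a minimal sketch of the textbook algorithm with bias-corrected moments, not TensorFlow's kernel (which applies epsilon in a slightly different place). The function name `adam_minimize` and the toy objective are assumptions made for illustration.

```python
import math


def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, num_steps=1000):
    """Minimize a scalar function with Adam, given its gradient function."""
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, num_steps + 1):
        g = grad_fn(x)
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)
```

On a convex toy problem like this, a correct Adam implementation should move steadily toward the minimum at x = 3. A cost that stalls or grows on such a problem points to the optimizer rather than the model.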

YongyiZhou commented 5 years ago

Thanks @sunway513 , it helps me continue my lessons. I'll try other optimizers later. I hope I can use my GPU in the near future. Thanks :)

sunway513 commented 5 years ago

@YongyiZhou You are welcome! We are looking into the issue and will update you when we have a fix for it.

Bengt commented 5 years ago

I saw better performance with running everything on CPU (Threadripper 1950X) than with running only the optimizers on CPU and everything else on GPU (gfx803, Fiji, Fury X). So the communication overhead makes using a GPU not worth it at the moment.
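A comparison like this can be reproduced with a small timing helper that averages the wall-clock time of a training step after a few warm-up calls. The sketch below uses only the standard library. The name `mean_step_time` and the idea of passing a zero-argument `step_fn` (for example, a closure around one `sess.run` call) are assumptions, not code from this thread.

```python
import time


def mean_step_time(step_fn, n_steps=100, warmup=10):
    """Return the average seconds per call of step_fn, excluding warm-up."""
    for _ in range(warmup):
        step_fn()  # Warm-up calls absorb one-time setup costs.
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


if __name__ == "__main__":
    # Stand-in workload; replace with one training step per device placement.
    print(mean_step_time(lambda: sum(range(1000))))
```

Running it once per device placement gives per-step numbers that can be compared directly.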

ROCmSupport commented 3 years ago

Thanks for reaching out. gfx8 is not a supported configuration now. We do not officially support gfx8 devices with ROCm; please refer to the supported hardware section of the ROCm docs: https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support