Closed YongyiZhou closed 3 years ago
Hi @YongyiZhou , thanks for reporting the issue. I'll try to repro it locally and let you know.
Hello @sunway513 , thank you for your time. Does "standalone script" mean a .py file that can be run from the console?
I've converted the main code to "Tensorflow_Tutorial.py" at https://github.com/YongyiZhou/issue-tf-rocm/tree/master/assignment3 , with the matplotlib cells removed.
Here are some details of my Docker environment, which has some extra libraries installed:
Here are some details on reproducing the issue:
I noticed that this issue happens in pytorch-rocm as well, see ROCmSoftwarePlatform/pytorch#342, which shows that only gfx803 hits this problem.
As a beginner, I can hardly give more info ...
Finally, thank you for your time again :)
@YongyiZhou Thanks a lot for the script, that helps! I can confirm the convergence issue is observed on Polaris (gfx803) GPUs only, for both TF1.12 and TF1.13; results converge fine on GFX900 and GFX906 GPUs. This issue is potentially related to issue #297 , cc @jerryyin .
Hi @YongyiZhou , the convergence issue is caused by the AdamOptimizer implementation on GFX803 targets, potentially related to the compiler codegen. To work around the issue, you can place the AdamOptimizer on the CPU, e.g. with the following patch:
diff --git a/assignment3/Tensorflow_Tutorial.py b/assignment3/Tensorflow_Tutorial.py
index b012cc9..e5e8c76 100644
--- a/assignment3/Tensorflow_Tutorial.py
+++ b/assignment3/Tensorflow_Tutorial.py
@@ -203,7 +203,8 @@ def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
     # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
     ### START CODE HERE ### (1 line)
-    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
+    with tf.device("/cpu:0"):
+        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
     ### END CODE HERE ###
     # Initialize all the variables
cc @whchung for awareness.
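For anyone who wants to sanity-check optimizer behaviour independently of any GPU kernel, here is a minimal sketch of the textbook Adam update in plain Python. It is only a reference for the math, not TensorFlow's actual kernel (which may differ in details such as how epsilon is applied). The function name `adam_minimize`, the toy objective, and the hyperparameter values are assumptions made up for illustration.

```python
import math


def adam_minimize(grad_fn, x0, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function with the textbook Adam update.

    grad_fn is a hypothetical callable returning the gradient at x.
    This is a reference sketch, not TensorFlow's implementation.
    """
    x = float(x0)
    m = 0.0  # first-moment (mean of gradients) estimate
    v = 0.0  # second-moment (mean of squared gradients) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


if __name__ == "__main__":
    # Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
    result = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
    print("minimizer found:", result)
```

If a device's optimizer diverges on a problem this simple while a CPU reference like this one converges, that points at the device implementation rather than the model code.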
Thanks @sunway513 , this lets me continue my lessons. I'll try other optimizers later. I hope I can use my GPU in the near future. Thanks :)
@YongyiZhou You are welcome! We are looking into the issue and will update you when we have a fix for it.
I saw better performance running everything on the CPU (Threadripper 1950X) than running only the optimizers on the CPU and everything else on the GPU (gfx803, Fiji, Fury X). So the communication overhead means using the GPU is not worth it at the moment.
Thanks for reaching out. gfx8 is not a supported configuration now. We do not officially support gfx8 devices with ROCm and ask you to follow the supported hardware section of the ROCm docs: https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the current behavior I'm taking the DeepLearning.ai course "Improving Deep Neural Networks" by Andrew Ng, and I have a homework assignment to train a classification model on the SIGNS dataset.
Describe the expected behavior The training in the "model()" cell is supposed to succeed. With the same code and dataset on my classmate's NVIDIA computer, the result was correct. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/NVIDA_result.png
But when I copied the dataset and code from him, the result was wrong while minimizing the cost. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/AMD_result.png
Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. https://github.com/YongyiZhou/issue-tf-rocm/blob/master/assignment3/Tensorflow%2BTutorial.ipynb
Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
The whole project, including dataset and code: https://github.com/YongyiZhou/issue-tf-rocm/tree/master/assignment3