Closed: yshen22 closed this issue 4 years ago.
Hi @yshen22 , can you try to force your GPU fan speed using rocm-smi and try again?
You can find the manual below:
https://github.com/RadeonOpenCompute/ROC-smi
/opt/rocm/bin/rocm-smi --setfan 80
would force your fan speed to 80% of the maximum possible speed.
Hi @sunway513 , unfortunately this issue still exists even after I set the fan speed to 150. My program follows the standard pipeline: it uses tf.TFRecordReader() to read data from a TFRecord file, tf.train.shuffle_batch() to build a training batch queue, tf.slim.conv to define the network computational graph, tf.losses.softmax_cross_entropy to compute the loss, and finally slim.learning.train() with tf.train.MomentumOptimizer() to train. Is there any setting I am missing that could be creating a bottleneck and slowing things down? Thank you.
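For reference, here is a minimal sketch of that pipeline written against TF 1.x / tf.contrib.slim. The TFRecord path, feature keys, input shape, number of classes, and hyperparameters below are placeholders, not the actual values from my script:

```python
# Minimal sketch of the described pipeline (TF 1.x / tf.contrib.slim).
# Paths, feature keys, shapes, and hyperparameters are illustrative placeholders.
import tensorflow as tf
import tensorflow.contrib.slim as slim

def input_pipeline(tfrecord_path, batch_size=32):
    # Queue-based input: read serialized examples from the TFRecord file.
    filename_queue = tf.train.string_input_producer([tfrecord_path])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(tf.decode_raw(features['image'], tf.float32), [256, 256, 1])
    label = tf.cast(features['label'], tf.int32)
    # Shuffle queue that produces training batches.
    return tf.train.shuffle_batch([image, label], batch_size=batch_size,
                                  capacity=1000, min_after_dequeue=500,
                                  num_threads=4)

def network(images, num_classes=2):
    # Stand-in for the UNet; only illustrates the slim.conv2d usage.
    net = slim.conv2d(images, 64, [3, 3], scope='conv1')
    net = slim.conv2d(net, 64, [3, 3], scope='conv2')
    net = tf.reduce_mean(net, axis=[1, 2])  # global average pool
    return slim.fully_connected(net, num_classes, activation_fn=None)

images, labels = input_pipeline('train.tfrecord')
logits = network(images)
loss = tf.losses.softmax_cross_entropy(tf.one_hot(labels, 2), logits)
total_loss = tf.losses.get_total_loss()
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = slim.learning.create_train_op(total_loss, optimizer)
slim.learning.train(train_op, logdir='/tmp/train_logs')
```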
Hi @yshen22 , can you provide a reduced script for me to reproduce the issue locally?
Hi @sunway513 , I have sent you my script in a private email. I hope you can find a way to resolve the slowdown issue. Thank you.
Analysis indicated that the issue may have been caused by a storage bandwidth limitation (the reported training speeds correspond to a required input bandwidth in excess of 500 MB/s).
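As a rough illustration of where that estimate comes from, here is a back-of-envelope calculation using the fast step time reported above; the batch size and input size are assumed for illustration only:

```python
# Rough bandwidth estimate (batch size and sample size are assumptions,
# not taken from the reporter's actual setup).
steps_per_sec = 1.0 / 0.0105        # ~95 steps/s at the reported fast rate
batch_size = 32                     # assumed
bytes_per_sample = 256 * 256 * 4    # assumed 256x256 float32 single-channel input
required_mb_per_s = steps_per_sec * batch_size * bytes_per_sample / 1e6
print(f"Required input bandwidth: {required_mb_per_s:.0f} MB/s")  # ~800 MB/s
```

If the storage device cannot sustain that read rate, the input queue drains after the initial prefetched batches, which matches the slowdown appearing only after the first ~50 steps.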
Closing due to inactivity.
I am training my UNet with tensorflow 1.15-ROCm 3.1 on my Radeon VII. My model is written with tf.slim. The first 50 training steps take about 0.0105 s/step on average, but the training speed then always slows down to about 0.03 s/step. This issue is not observed with tensorflow-CUDA on a 1080 Ti GPU, where the speed stays at about 0.015 s/step on average. Could anyone tell me how to resolve this issue?
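One way to narrow down whether the input queue or the compute graph is the slow part is a simple timing loop like the sketch below. It is only a sketch: it assumes the images/labels batch tensors and train_op from the pipeline sketch earlier in the thread, and it replaces the blocking slim.learning.train() call:

```python
# Timing sketch (assumes `images`, `labels`, and `train_op` from the
# pipeline sketch above, run instead of slim.learning.train()).
import time
import tensorflow as tf

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    def time_op(op, steps=100):
        start = time.perf_counter()
        for _ in range(steps):
            sess.run(op)
        return (time.perf_counter() - start) / steps

    # If dequeuing batches alone already takes ~0.03 s/step, the input
    # pipeline (disk read bandwidth) is the bottleneck rather than the GPU.
    print('input only :', time_op([images, labels]), 's/step')
    print('full step  :', time_op(train_op), 's/step')

    coord.request_stop()
    coord.join(threads)
```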