ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Training slows down after 50 steps #901

Closed: yshen22 closed this issue 4 years ago

yshen22 commented 4 years ago

I am training my UNet with tensorflow 1.15-ROCm 3.1 on my Radeon VII. The model is written with tf.slim. The first 50 training steps take about 0.0105 s per step on average, but after 50 steps the training speed consistently slows down to about 0.03 s per step. This issue does not occur with tensorflow-CUDA on a 1080 Ti GPU, where the speed stays at about 0.015 s per step on average. Could anyone tell me how to resolve this?
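
For reference, a minimal TF 1.x sketch of how a slowdown like this can be pinned to an exact step by logging wall time per training step; the tiny matmul model below is only a stand-in for the real UNet, and only the timing loop matters:

```python
# Minimal sketch (TF 1.x): log wall time per training step to see exactly
# where the slowdown starts. The tiny model is a placeholder for the real UNet.
import time
import tensorflow as tf

x = tf.random_normal([8, 1024])
w = tf.Variable(tf.random_normal([1024, 1024]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200):
        t0 = time.time()
        sess.run(train_op)
        print("step %4d: %.4f s" % (step, time.time() - t0))
```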

sunway513 commented 4 years ago

Hi @yshen22, can you try forcing your GPU fan speed with rocm-smi and then run the training again? You can find the manual here: https://github.com/RadeonOpenCompute/ROC-smi. For example, `/opt/rocm/bin/rocm-smi --setfan 80` would force your fan speed to 80% of the maximum possible speed.
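
As a quick sanity check against thermal throttling, the GPU temperature, fan speed, and clocks can also be watched while training runs; exact flags vary between rocm-smi versions, but the bare command prints a summary table:

```bash
# Force the fan (as suggested above), then watch the rocm-smi summary
# (temperature, fan, clocks, power) while the training script runs.
/opt/rocm/bin/rocm-smi --setfan 80
watch -n 1 /opt/rocm/bin/rocm-smi
```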

yshen22 commented 4 years ago

Hi @sunway513, unfortunately the issue persists even when I set the fan speed to 150. My program follows the standard pipeline: tf.TFRecordReader() reads data from a TFRecord file, tf.train.shuffle_batch() builds the training batch queue, tf.slim conv layers define the network's computational graph, tf.losses.softmax_cross_entropy computes the loss, and slim.learning.train() runs training with tf.train.MomentumOptimizer(). Is there any bottleneck I need to tune to avoid the slowdown? Thank you.
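
For reference, a minimal sketch of the pipeline described above (queue-based TFRecord input, a slim model, and slim.learning.train on TF 1.15); the filename, feature keys, shapes, and network body are placeholders rather than the actual script:

```python
# Minimal sketch of the pipeline described above (TF 1.x + tf.contrib.slim).
# Filename, feature keys, shapes and the network body are placeholders.
import tensorflow as tf

slim = tf.contrib.slim

# Queue-based input pipeline: TFRecordReader + shuffle_batch.
filename_queue = tf.train.string_input_producer(["train.tfrecord"])  # placeholder file
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={"image": tf.FixedLenFeature([], tf.string),
              "label": tf.FixedLenFeature([], tf.int64)})
image = tf.decode_raw(features["image"], tf.uint8)
image = tf.cast(tf.reshape(image, [256, 256, 1]), tf.float32) / 255.0  # placeholder shape
label = tf.cast(features["label"], tf.int32)

images, labels = tf.train.shuffle_batch(
    [image, label], batch_size=8, capacity=512,
    min_after_dequeue=128, num_threads=4)

# Placeholder network body standing in for the real UNet.
net = slim.conv2d(images, 32, [3, 3])
net = slim.conv2d(net, 2, [1, 1], activation_fn=None)
logits = tf.reduce_mean(net, axis=[1, 2])  # placeholder classification head

# Loss, optimizer, and the slim training loop.
tf.losses.softmax_cross_entropy(tf.one_hot(labels, 2), logits)
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = slim.learning.create_train_op(tf.losses.get_total_loss(), optimizer)

slim.learning.train(train_op, logdir="/tmp/unet_train",
                    number_of_steps=1000, log_every_n_steps=10)
```

One property of this pattern worth keeping in mind: shuffle_batch pre-fills its queue before training starts, so the first steps can run at full speed from buffered data, while the long-run speed reflects how fast the reader threads can actually pull records from disk.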

sunway513 commented 4 years ago

Hi @yshen22 , can you provide a reduced script for me to reproduce the issue locally?

yshen22 commented 4 years ago

Hi @sunway513, I have sent you my script by private email. I hope you can find a way to resolve the slowdown issue. Thank you.

ekuznetsov139 commented 4 years ago

Analysis indicated that the problem may have been caused by limited storage bandwidth: the reported training speeds correspond to read-bandwidth requirements in excess of 500 MB/s.
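
For context, an estimate like this comes from a back-of-the-envelope calculation (bytes read per step divided by seconds per step); the record and batch sizes below are illustrative assumptions, not the actual values from the script:

```python
# Back-of-the-envelope input-bandwidth estimate. The record and batch sizes
# here are illustrative assumptions, not the actual values from the script.
record_bytes = 256 * 256 * 4   # e.g. one 256x256 float32 image (assumption)
batch_size = 8                 # assumed batch size
seconds_per_step = 0.0105      # reported fast-phase step time

bandwidth_mb_s = record_bytes * batch_size / seconds_per_step / 1e6
print("required read bandwidth: %.1f MB/s" % bandwidth_mb_s)
# ~200 MB/s with these assumptions; larger records or batches push the
# requirement well past what many disks can sustain, consistent with the
# >500 MB/s figure above.
```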

Closing due to inactivity.