ROCm / ROCm-docker

Dockerfiles for the various software layers defined in the ROCm software platform

TensorFlow 2.0: training stuck at the very beginning if model is "too large" #60

Open aviallon opened 5 years ago

aviallon commented 5 years ago

Hello, since this is a big project, I do not know which part is responsible for the problem, which is why I am posting the issue here. Using the rocm/tensorflow:latest container pulled yesterday, if my model has too many (only three!) convolutions/deconvolutions plus merge layers, training gets stuck forever at the very beginning (even after 6 hours of running non-stop) at:

tensorflow/core/kernels/conv_grad_input_ops.cc:981] running auto-tune for Backward-Data

Here is a screenshot: Screenshot_20190916_173033

And here is the model summary from Keras:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape            Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None, None, 3)   0
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, None, None, 3)   12          input_1[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, None, None, 16)  3088        batch_normalization_1[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, None, None, 16)  64          conv2d_1[0][0]
__________________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU)       (None, None, None, 16)  0           batch_normalization_2[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, None, None, 32)  131104      leaky_re_lu_1[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, None, None, 32)  128         conv2d_2[0][0]
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, None, None, 32)  0           batch_normalization_3[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, None, None, 64)  2097216     leaky_re_lu_2[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, None, None, 64)  256         conv2d_3[0][0]
__________________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU)       (None, None, None, 64)  0           batch_normalization_4[0][0]
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, None, None, 32)  2097184     leaky_re_lu_3[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, None, None, 32)  128         conv2d_transpose_1[0][0]
__________________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU)       (None, None, None, 32)  0           batch_normalization_5[0][0]
__________________________________________________________________________________________________
average_1 (Average)             (None, None, None, 32)  0           leaky_re_lu_2[0][0]
                                                                    leaky_re_lu_4[0][0]
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, None, None, 16)  131088      average_1[0][0]
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, None, None, 16)  64          conv2d_transpose_2[0][0]
__________________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU)       (None, None, None, 16)  0           batch_normalization_6[0][0]
__________________________________________________________________________________________________
average_2 (Average)             (None, None, None, 16)  0           leaky_re_lu_1[0][0]
                                                                    leaky_re_lu_5[0][0]
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, None, None, 3)   3075        average_2[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, None, None, 3)   12          conv2d_transpose_3[0][0]
__________________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU)       (None, None, None, 3)   0           batch_normalization_7[0][0]
__________________________________________________________________________________________________
activation_1 (Activation)       (None, None, None, 3)   0           leaky_re_lu_6[0][0]
==================================================================================================
Total params: 4,463,419
Trainable params: 4,463,087
Non-trainable params: 332
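
To make reproduction easier, here is a minimal tf.keras sketch that rebuilds essentially the same layer graph. It is only a sketch: the kernel sizes and filter counts are chosen to match the parameter counts in the summary above, and the final activation is assumed to be sigmoid, so the real code may differ slightly.

```python
# Sketch only: kernel sizes / filter counts inferred from the parameter counts
# in the summary above; the final activation function is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, Model

def block(x, filters, kernel, transpose=False):
    # (de)convolution followed by BatchNormalization and LeakyReLU
    conv = layers.Conv2DTranspose if transpose else layers.Conv2D
    x = conv(filters, kernel, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

inp = layers.Input(shape=(None, None, 3))
x = layers.BatchNormalization()(inp)
e1 = block(x, 16, 8)                      # conv2d_1 block
e2 = block(e1, 32, 16)                    # conv2d_2 block
e3 = block(e2, 64, 32)                    # conv2d_3 block
d1 = block(e3, 32, 32, transpose=True)    # conv2d_transpose_1 block
m1 = layers.Average()([e2, d1])           # average_1
d2 = block(m1, 16, 16, transpose=True)    # conv2d_transpose_2 block
m2 = layers.Average()([e1, d2])           # average_2
d3 = block(m2, 3, 8, transpose=True)      # conv2d_transpose_3 block
out = layers.Activation('sigmoid')(d3)    # assumed output activation

model = Model(inp, out)
model.summary()
```

With these kernel sizes the parameter counts come out the same as in the summary (4,463,419 in total).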

I am using a custom loss function (a combination of DSSIM, MSE and MAE; a rough sketch is below), but it didn't cause any problem with the same model without the third convolution layer, nor with a model that has more convolutions but no merge layers. Could there be some kind of loop? Or is it a bug?
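
For reference, the loss looks roughly like this (the exact weighting of the three terms is not important here; equal weights are shown just for illustration):

```python
import tensorflow as tf

def dssim_mse_mae_loss(y_true, y_pred):
    # DSSIM = (1 - SSIM) / 2, computed per image in the batch
    dssim = (1.0 - tf.image.ssim(y_true, y_pred, max_val=1.0)) / 2.0
    mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2, 3])
    mae = tf.reduce_mean(tf.abs(y_true - y_pred), axis=[1, 2, 3])
    # equal weighting shown for illustration only
    return dssim + mse + mae

# model.compile(optimizer='adam', loss=dssim_mse_mae_loss)
```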

Thank you :smile:

PS: if this issue should not be posted here, please tell me where it belongs so I can close it here and open it there.