ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

call to miopenFindConvolutionBackwardDataAlgorithm failed #1110

Closed: oleid closed this issue 3 years ago

oleid commented 3 years ago

System information

Describe the current behavior

The minimal example dies with:

[...]
2020-09-10 12:38:53.692762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3796 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:08:00.0)
2020-09-10 12:38:53.695045: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Model: "sequential"                                                                                                                                                                           
_________________________________________________________________                                                                                                                             
Layer (type)                 Output Shape              Param #                                                                                                                                
=================================================================                                                                                                                             
conv1d (Conv1D)              (None, 392, 16)           64                                                                                                                                     
_________________________________________________________________                                                                                                                             
conv1d_1 (Conv1D)            (None, 196, 32)           1568                                                                                                                                   
_________________________________________________________________                                                                                                                             
conv1d_transpose (Conv1DTran (None, 392, 32)           3104                                                                                                                                   
_________________________________________________________________                                                                                                                             
conv1d_transpose_1 (Conv1DTr (None, 784, 16)           1552                                                                                                                                   
_________________________________________________________________                                                                                                                             
conv1d_transpose_2 (Conv1DTr (None, 784, 1)            17                                                                                                                                     
=================================================================                                                                                                                             
Total params: 6,305                                                                                                                                                                           
Trainable params: 6,305                                                                                                                                                                       
Non-trainable params: 0                                                                                                                                                                       
_________________________________________________________________                                                                                                                             
Epoch 1/30                                                                                                                                                                                    
2020-09-10 12:38:54.834599: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libMIOpen.so                                                  
2020-09-10 12:38:54.898485: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so                                                 
CreateModule for file /tmp/miopen-miopen-2410-f966-e003-1fe4/miopen-074d-32d2-c8a6-f3e0                                                                                                       
:1:hip_code_object.cpp      :187 : 15910570847 us: Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x8x32_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW2_GSU4_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT2_2_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WG16_4_4_WGM1
CreateModule for file /tmp/miopen-miopen-5932-c6e1-1f1b-c07a/miopen-5b41-26bb-e1d9-55ef
CreateModule for file /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-2be4-1ec9-fcfd-b999/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o
:1:devprogram.cpp           :1697: 15915650534 us: Error: runtime metadata section not present in ELF program binary                                                                          
Error: Cannot Global Var Sizes                                                                                                                                                                

Call stack generated in loadCodeObjectData at: /home/oleid/.cache/rua/build/hip-rocclr/src/HIP-rocm-3.7.0/rocclr/hip_code_object.cpp:235: /opt/rocm/hip/lib/libamdhip64.so.3(+0x82715)[0x7fb0fb30d715]
/opt/rocm/hip/lib/libamdhip64.so.3(+0x828b4)[0x7fb0fb30d8b4]
/opt/rocm/hip/lib/libamdhip64.so.3(+0x14f251)[0x7fb0fb3da251]
/opt/rocm/hip/lib/libamdhip64.so.3(hipModuleLoad+0x182)[0x7fb0fb3a8872]
/opt/rocm/lib/libMIOpen.so(+0x9288c4)[0x7fb0c46c48c4]
/opt/rocm/lib/libMIOpen.so(+0x92908a)[0x7fb0c46c508a]
/opt/rocm/lib/libMIOpen.so(+0x928d7a)[0x7fb0c46c4d7a]
/opt/rocm/lib/libMIOpen.so(_ZN6miopen12HIPOCProgramC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_bS6_S8_+0x75)[0x7fb0c46c45a5]
/opt/rocm/lib/libMIOpen.so(_ZNK6miopen6Handle11LoadProgramERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_bS8_+0x1fd)[0x7fb0c46c026d]
/opt/rocm/lib/libMIOpen.so(+0x290fa2)[0x7fb0c402cfa2]
MIOpen Error: /home/oleid/.cache/rua/build/miopen-hip/src/MIOpen-rocm-3.7.0/src/hipoc/hipoc_program.cpp:95: Failed creating module hipErrorSharedObjectInitFailed
2020-09-10 12:39:06.978512: F tensorflow/stream_executor/rocm/rocm_dnn.cc:3445] call to miopenFindConvolutionBackwardDataAlgorithm failed: miopenStatusUnknownError

Please note: I added a bunch of extra debug output (such as a backtrace) before I realized, by digging through the code, that there is an AMD_LOG_LEVEL environment variable.
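For anyone reproducing this, the logging variables can be set without patching any code. The following is a minimal sketch: `rocm_debug_env` is a hypothetical helper name, `AMD_LOG_LEVEL` is the variable mentioned above, and `MIOPEN_ENABLE_LOGGING` / `MIOPEN_LOG_LEVEL` are, as far as I understand the MIOpen documentation, the switches that produce the `Info`/`Info2` lines seen in MIOpen logs. The level values are assumptions to adjust as needed.

```python
import os

def rocm_debug_env(amd_log_level=1, miopen_log_level=None, base=None):
    """Return a copy of the environment with ROCm logging variables set.

    base defaults to the current process environment; pass a dict to
    build the environment from scratch.
    """
    env = dict(os.environ if base is None else base)
    env["AMD_LOG_LEVEL"] = str(amd_log_level)
    if miopen_log_level is not None:
        # Assumed MIOpen switches; level 6 is believed to correspond to Info2.
        env["MIOPEN_ENABLE_LOGGING"] = "1"
        env["MIOPEN_LOG_LEVEL"] = str(miopen_log_level)
    return env

# Usage (the script name is hypothetical):
#   import subprocess, sys
#   subprocess.run([sys.executable, "repro.py"], env=rocm_debug_env(1, 6))
```

Running the script in a child process keeps the variables out of the parent shell and guarantees they are set before TensorFlow loads the HIP runtime.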

As you can see, the call

  if(CL_SUCCESS != program_->build(hip::getCurrentDevice()->devices(), nullptr, nullptr, nullptr,
                                   kOptionChangeable, kNewDevProg))

in DynCO::loadCodeObjectData in hip_code_object.cpp fails.

For some reason it seems to work if I use 2D convolutions. But since I'm trying to train a 1D model, this doesn't really help :)

The big unknown for me, and the reason for this issue: what exactly does "runtime metadata section not present in ELF program binary" mean? Why might the section be missing, and how do I get it there?
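One way to narrow this down is to list the sections that the generated code object actually contains. The sketch below is only an inspection aid, not a fix: `section_names` is a hypothetical helper, it assumes the file is a little-endian ELF64 image (which I believe is the layout used for AMDGPU code objects), and it does not decode any metadata.

```python
import struct

ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, 64 bytes
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")    # ELF64 section header, 64 bytes

def section_names(data):
    """Return the section names of a little-endian ELF64 image, in header order."""
    fields = ELF_HEADER.unpack_from(data, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]

    def header(index):
        return SECTION_HEADER.unpack_from(data, shoff + index * shentsize)

    # The section-name string table is itself a section (index e_shstrndx).
    strtab_off, strtab_size = header(shstrndx)[4], header(shstrndx)[5]
    strtab = data[strtab_off:strtab_off + strtab_size]

    names = []
    for i in range(shnum):
        start = header(i)[0]              # sh_name: offset into the string table
        end = strtab.index(b"\0", start)  # names are NUL-terminated
        names.append(strtab[start:end].decode("ascii"))
    return names
```

Calling `section_names(open(path, "rb").read())` on one of the object files named in the `CreateModule for file` lines would show whether a note section is present at all. That the runtime looks for its metadata in a `.note` section is my assumption here; the log itself does not confirm it.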

Describe the expected behavior

Training should work.

Standalone code to reproduce the issue

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, Conv1DTranspose, Flatten, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image

import os
import numpy as np

img_width, img_height = 28, 28
batch_size = 16
epochs = 30
input_shape = (img_width* img_height, 1)

###########

def prepare_image(im):
  return im.reshape(-1) / 255.0

###########
def log_n(x, n=10):
    """
    Compute log_n(x), i.e. the log base `n` value of `x`.
    :param x:   Input tensor
    :param n:   Value of the log base
    :return:    Log result
    """
    log_e = tf.math.log(x)
    div_log_n = tf.math.log(tf.constant(n, dtype=log_e.dtype))
    return log_e / div_log_n

def psnr(img_a, img_b, max_img_value=255):
    """
    Compute the PSNR (Peak Signal-to-Noise Ratio) between two images.
    The Peak Signal-to-Noise Ratio (PSNR) is commonly used, as it measures the quality of a corrupted or
    recovered signal/image compared to its original version. The higher the value, the closer to the original image...
    (the value is in decibels, i.e. following a logarithmic scale).
    :param img_a:           Image A
    :param img_b:           Image B
    :param max_img_value:   Maximum possible pixel value of the images
    :return:                PSNR value
    """
    mse = tf.reduce_mean((img_a - img_b) ** 2)
    return 20 * log_n(max_img_value, 10) - 10 * log_n(mse, 10)

import functools

psnr_metrics = functools.partial(psnr, max_img_value=1.)
psnr_metrics.__name__ = 'psnr'

#####

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x_train = np.expand_dims(np.stack([prepare_image(im.reshape(28,28)) for im in x_train]), axis=-1)
x_test = np.expand_dims(np.stack([prepare_image(im.reshape(28,28)) for im in x_test]), axis=-1)

y_train = x_train
y_test = x_test

model = Sequential([
    Conv1D(16, 3, activation='relu', strides=2, input_shape=input_shape, padding="same"),
    Conv1D(32, 3, activation='relu', strides=2, padding="same"),
    Conv1DTranspose(32, 3, activation='relu', strides=2, padding="same"),
    Conv1DTranspose(16, 3, activation='relu', strides=2, padding="same"),
    Conv1DTranspose(1, 1, activation='sigmoid', strides=1, padding="same"),
])

model.compile(optimizer='adam',
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
        metrics=[psnr_metrics])

model.summary()

model.fit(
    x_train,
    y_train, 
    steps_per_epoch = len(x_train) // batch_size,
    validation_data = (x_test, y_test), 
    validation_steps = len(x_test) // batch_size,
    epochs = epochs)
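
As a sanity check that the model is built as intended, the output shapes and parameter counts printed by `model.summary()` can be reproduced by hand. This is a sketch that needs no TensorFlow; `conv1d_same` is a hypothetical helper, and it assumes the usual rules for `padding="same"`: a strided convolution yields `ceil(length / stride)` steps, a transposed one yields `length * stride`.

```python
import math

def conv1d_same(length, in_ch, filters, kernel, stride, transpose=False):
    """Return ((out_length, out_channels), param_count) for padding="same"."""
    out_len = length * stride if transpose else math.ceil(length / stride)
    params = kernel * in_ch * filters + filters  # weights plus one bias per filter
    return (out_len, filters), params

layers = [  # (filters, kernel, stride, transpose), mirroring the Sequential model
    (16, 3, 2, False),
    (32, 3, 2, False),
    (32, 3, 2, True),
    (16, 3, 2, True),
    (1, 1, 1, True),
]

shape, total, report = (28 * 28, 1), 0, []
for filters, kernel, stride, transpose in layers:
    shape, params = conv1d_same(shape[0], shape[1], filters, kernel, stride, transpose)
    total += params
    report.append((shape, params))

for layer_shape, layer_params in report:
    print(layer_shape, layer_params)
print("total", total)
```

The per-layer values come out as 64, 1568, 3104, 1552 and 17 parameters with a total of 6,305, matching the summary in the log, so the crash is not caused by an unexpected layer configuration.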

oleid commented 3 years ago

Correction: 2D convolutions don't work anymore, for an unknown reason. Dense networks, however, train fine (output with AMD_LOG_LEVEL=1):

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 28, 28, 128)       256
_________________________________________________________________
dense_1 (Dense)              (None, 28, 28, 128)       16512
_________________________________________________________________
dense_2 (Dense)              (None, 28, 28, 1)         129
=================================================================
Total params: 16,897
Trainable params: 16,897
Non-trainable params: 0
_________________________________________________________________
Epoch 1/30
2020-09-10 11:33:05.511470: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library librocblas.so
:1:hip_code_object.cpp      :187 : 11961159632 us: Cannot find the function: Cijk_Ailk_Bljk_SB_MT48x72x32_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW2_GSU2_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW2_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT6_6_TLDS0_USFGROn1_VAW1_VSn1_VW2_WSGRA0_WSGRB0_WG8_12_2_WGM1
:1:hip_code_object.cpp      :187 : 11961355426 us: Cannot find the function: Cijk_Alik_Bljk_SB_MT24x24x16_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS0_ASBE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW1_GSU4_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT3_4_TLDS0_USFGROn1_VAW1_VSn1_VW1_WSGRA0_WSGRB0_WG8_6_4_WGM1
:1:hip_code_object.cpp      :187 : 11961357147 us: Cannot find the function: Cijk_Ailk_Bjlk_SB_MT128x128x8_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS3_ASBE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW4_GSU1_ISA803_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW4_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SIA1_SU32_SUM0_SUS256_SRVW0_SVW4_SNLL0_TT8_8_TLDS0_USFGROn1_VAW1_VSn1_VW4_WSGRA0_WSGRB0_WG16_16_1_WGM1
   1/3750 [..............................] - ETA: 39s - loss: 0.6983 - psnr: 6.3381WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.
3748/3750 [============================>.] - ETA: 0s - loss: 0.0817 - psnr: 27.3319WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0
3750/3750 [==============================] - 72s 19ms/step - loss: 0.0817 - psnr: 27.3320 - val_loss: 0.0715 - val_psnr: 27.7953
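
To put the `psnr` numbers in this log into perspective: with `max_img_value=1` the first term of the PSNR formula from the repro script vanishes, so the metric reduces to `-10 * log10(mse)`. Here is a small sketch of that relationship in plain Python; the function names are hypothetical, and because Keras averages the metric over batches, converting a logged PSNR back to an MSE is only approximate.

```python
import math

def psnr_from_mse(mse, max_value=1.0):
    """PSNR in decibels, same formula as psnr() in the repro script."""
    return 20 * math.log10(max_value) - 10 * math.log10(mse)

def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the formula: the MSE that would give the stated PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10)

# An MSE of 0.01 on images scaled to [0, 1] corresponds to 20 dB.
print(psnr_from_mse(0.01))
# Rough MSE implied by the final training PSNR in the log above.
print(mse_from_psnr(27.332))
```

A PSNR around 27 dB therefore corresponds to a mean squared error of a little under 0.002 per pixel, which suggests the dense model really is learning on the GPU.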

oleid commented 3 years ago

Here is the result of the MIOpen test suite (judging by rocm-smi and the screen flickering, the GPU is being used :)):

$ make check
[...]
Scanning dependencies of target check
Test project /home/oleid/.cache/rua/build/miopen-hip/src/build/test
      Start 11: test_conv2d
 1/55 Test #11: test_conv2d ......................   Passed   35.59 sec
      Start 17: test_dropout
 2/55 Test #17: test_dropout .....................   Passed    3.28 sec
      Start 27: test_lrn_test
 3/55 Test #27: test_lrn_test ....................   Passed    1.20 sec
      Start 35: test_pooling2d
 4/55 Test #35: test_pooling2d ...................   Passed    1.09 sec
      Start 36: test_pooling3d
 5/55 Test #36: test_pooling3d ...................   Passed    0.09 sec
      Start 40: test_soft_max
 6/55 Test #40: test_soft_max ....................   Passed    1.09 sec
      Start 55: test_conv_igemm_dynamic_small
 7/55 Test #55: test_conv_igemm_dynamic_small ....   Passed    1.83 sec
      Start  1: test_activation
 8/55 Test  #1: test_activation ..................   Passed    0.85 sec
      Start  2: test_bn_3d_peract_test
 9/55 Test  #2: test_bn_3d_peract_test ...........   Passed    2.15 sec
      Start  3: test_bn_3d_spatial_test
10/55 Test  #3: test_bn_3d_spatial_test ..........   Passed    2.22 sec
      Start  4: test_bn_aux
11/55 Test  #4: test_bn_aux ......................   Passed    0.07 sec
      Start  5: test_bn_peract_test
12/55 Test  #5: test_bn_peract_test ..............   Passed    2.04 sec
      Start  6: test_bn_spatial_test
13/55 Test  #6: test_bn_spatial_test .............   Passed    2.27 sec
      Start  7: test_cache
14/55 Test  #7: test_cache .......................   Passed    0.09 sec
      Start  8: test_cba_inference
15/55 Test  #8: test_cba_inference ...............   Passed    1.42 sec
      Start  9: test_cbna_inference
16/55 Test  #9: test_cbna_inference ..............   Passed    1.41 sec
      Start 10: test_check_numerics_test
17/55 Test #10: test_check_numerics_test .........   Passed    0.74 sec
      Start 12: test_conv2d_bias
18/55 Test #12: test_conv2d_bias .................   Passed    0.74 sec
      Start 13: test_conv3d
19/55 Test #13: test_conv3d ......................   Passed   35.10 sec
      Start 14: test_conv3d_bias
20/55 Test #14: test_conv3d_bias .................   Passed    0.75 sec
      Start 15: test_ctc
21/55 Test #15: test_ctc .........................   Passed    1.20 sec
      Start 16: test_custom_allocator
22/55 Test #16: test_custom_allocator ............   Passed    0.38 sec
      Start 18: test_find_db
23/55 Test #18: test_find_db .....................***Failed   23.48 sec
MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
Starting forward find-db test.
MIOpen(HIP): Info [ForwardGetWorkSpaceSize] 
MIOpen(HIP): Info2 [HipCompilerVersionImpl] Read version information from HIP package...
MIOpen(HIP): Info [HipCompilerVersionImpl] 3.7.20364
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, MIOpen version 2.6.0.0
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] Running: '/opt/rocm/llvm/bin/clang --version'
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] clang version 11.0.0 
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] Target: x86_64-pc-linux-gnu
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] Thread model: posix
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] InstalledDir: /opt/rocm/llvm/bin
MIOpen(HIP): Info2 [ValidateGcnAssemblerImpl] 
MIOpen(HIP): Info2 [GetFindModeValueImpl] MIOPEN_FIND_MODE = HYBRID(3)
MIOpen(HIP): Info [GetForwardSolutions] 
MIOpen(HIP): Info [Measure] Db::Prefetch time: 0.01879 ms
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F in file /tmp/miopen-tmp-f51f-0c93-6672-1e9e/miopen.test.find_db
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.06925 ms
MIOpen(HIP): Info [GetForwardSolutionsFallback] Fallback path, GEMM
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm3x3U: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm1x1U: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm1x1UV2: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm5x10u2v2f1: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm7x7c3h224w224k64u2v2p3q3f1: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvAsm5x10u2v2b1: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvOclDirectFwd11x11: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvOclDirectFwdGen: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvOclDirectFwd3x3: Not applicable
MIOpen(HIP): Info2 [GetWorkspaceSize] ConvOclDirectFwd1x1: Not applicable
MIOpen(HIP): Info2 [GetPerformanceConfig] Returns: 16,16,32,32,2,2,8,2,1
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file /opt/rocm/miopen/share/miopen/db/miopen.db
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /home/oleid/.config/miopen/miopen_1.0.0.udb
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmForwardV4R4Xdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R4GenXdlopsFwdFp32: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R4GenFwdXdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmBwdDataV1R1Xdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmBwdDataV4R1Xdlops: Not applicable
MIOpen(HIP): Info [FindSolutionImpl] ConvHipImplicitGemmV4R1Fwd (db access disabled)
MIOpen(HIP): Info [EuristicInit] 16,16,4,2,2,2,2,4,2,4,4,1,16,1,4,16
MIOpen(HIP): Info [GetPerformanceConfigBase] 16,16,4,2,2,2,2,4,2,4,4,1,16,1,4,16
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R1Fwd: Success.
MIOpen(HIP): Info2 [ForwardGetWorkSpaceSize] 12979200
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 12979200
MIOpen(HIP): Info [GetForwardSolutions] 
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F in file /tmp/miopen-tmp-f51f-0c93-6672-1e9e/miopen.test.find_db
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.0325 ms
MIOpen(HIP): Info [GetForwardSolutionsFallback] Fallback path, GEMM
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F in file /tmp/miopen-tmp-f51f-0c93-6672-1e9e/miopen.test.find_db
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.0107 ms
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c192i28_28w5_5p1_1s1_1d1_1t1"
MIOpen(HIP): Info2 [AddKernel] Key: miopenIm2Col "c192i28_28w5_5p1_1s1_1d1_1t1"
MIOpen(HIP): Info2 [AddKernelDumpKernelParams] runcl MIOpenIm2d2Col.cl -k Im2d2Col -dumpilisa -r 10 if#0: if#0: if#0: iv#0 196608,1,1/256,1,1  -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=432 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /home/oleid/.cache/miopen/2.6.0.0/gfx803_32.ukdb
MIOpen(HIP): Info2 [KernDb] Database created successfully
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenIm2d2Col.cl ;args:  -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=432 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenIm2d2Col.cl.o') AND (kernel_args = ' -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=432 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.037309 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: MIOpenIm2d2Col.cl ;args:  -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=432 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] INSERT OR REPLACE INTO kern_db(kernel_name, kernel_args, kernel_blob, kernel_hash, uncompressed_size) VALUES(?, ?, ?, ?, ?);
MIOpen(HIP): Info2 [Measure] Db::StoreRecord time: 11.8474 ms
MIOpen(HIP): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 0, transB 0, m 32, n 676, k 4800, lda 4800, ldb 676, ldc 676, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType 1} 
MIOpen(HIP): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 0, transB 0, m 32, n 676, k 4800, lda 4800, ldb 676, ldc 676, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType 1} 
MIOpen(HIP): Info2 [dummy_memset] dummy gpu memset
MIOpen(HIP): Info [SetValues] 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F, content inserted: miopenConvolutionFwdAlgoGEMM:gemm,6.37189,12979200,rocBlas,<unused>
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvBinWinograd3x3U: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvBinWinogradRxSf3x2: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvBinWinogradRxSf2x3: Not applicable
MIOpen(HIP): Info [FindSolutionImpl] ConvBinWinogradRxS (not searchable)
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvBinWinogradRxS: Success.
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'conv_3x3_wheel_alpha_v9_0_15.s.o') AND (kernel_args = '-Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx803');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.742803 ms
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConvRxSU
MIOpen(HIP): Info2 [GetSolution]  N=16 C=192 H=28 W=28 K=32 n_groups=32 flags=0 R=5 S=5 pad_H=1 pad_W=1 out_H=26 out_W=26
MIOpen(HIP): Info [EvaluateInvokers] ConvBinWinogradRxS: miopenSp3AsmConvRxSU: 0.684966 < 3.40282e+38
MIOpen(HIP): Info [EvaluateInvokers] Selected: ConvBinWinogradRxS: miopenSp3AsmConvRxSU: 0.684966, workspce_sz = 0
MIOpen(HIP): Info [SetValues] 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F, content inserted: miopenConvolutionFwdAlgoWinograd:ConvBinWinogradRxS,0.684966,0,miopenConvolutionFwdAlgoWinograd,<unused>
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm3x3U: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm1x1U: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm1x1UV2: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2f1: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm7x7c3h224w224k64u2v2p3q3f1: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2b1: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvOclDirectFwd11x11: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvOclDirectFwdGen: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvOclDirectFwd3x3: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvOclDirectFwd1x1: Not applicable
MIOpen(HIP): Info2 [GetPerformanceConfig] Returns: 16,16,32,32,2,2,8,2,1
MIOpen(HIP): Info [FindSolutionImpl] ConvOclDirectFwd
MIOpen(HIP): Info2 [Prepare] SELECT solver, params FROM perf_db INNER JOIN config ON perf_db.config = config.id WHERE ( (layout = ? ) AND (data_type = ? ) AND (direction = ? ) AND (spatial_dim = ? ) AND (in_channels = ? ) AND (in_h = ? ) AND (in_w = ? ) AND (in_d = ? ) AND (fil_h = ? ) AND (fil_w = ? ) AND (fil_d = ? ) AND (out_channels = ? ) AND (batchsize = ? ) AND (pad_h = ? ) AND (pad_w = ? ) AND (pad_d = ? ) AND (conv_stride_h = ? ) AND (conv_stride_w = ? ) AND (conv_stride_d = ? ) AND (dilation_h = ? ) AND (dilation_w = ? ) AND (dilation_d = ? ) AND (bias = ? ) AND (group_count = ? ) )AND (arch = 'gfx803' ) AND (num_cu = '32');
MIOpen(HIP): Info2 [impl] [NCHW,FP32,F,2,192,28,28,1,5,5,1,32,16,1,1,0,1,1,0,1,1,0,0,1]
MIOpen(HIP): Info2 [Prepare] SELECT solver, params FROM perf_db INNER JOIN config ON perf_db.config = config.id WHERE ( (layout = ? ) AND (data_type = ? ) AND (direction = ? ) AND (spatial_dim = ? ) AND (in_channels = ? ) AND (in_h = ? ) AND (in_w = ? ) AND (in_d = ? ) AND (fil_h = ? ) AND (fil_w = ? ) AND (fil_d = ? ) AND (out_channels = ? ) AND (batchsize = ? ) AND (pad_h = ? ) AND (pad_w = ? ) AND (pad_d = ? ) AND (conv_stride_h = ? ) AND (conv_stride_w = ? ) AND (conv_stride_d = ? ) AND (dilation_h = ? ) AND (dilation_w = ? ) AND (dilation_d = ? ) AND (bias = ? ) AND (group_count = ? ) )AND (arch = 'gfx803' ) AND (num_cu = '32');
MIOpen(HIP): Info2 [impl] [NCHW,FP32,F,2,192,28,28,1,5,5,1,32,16,1,1,0,1,1,0,1,1,0,0,1]
MIOpen(HIP): Info2 [Measure] Db::Load time: 0.230188 ms
MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvOclDirectFwd
MIOpen(HIP): Info2 [GetPerformanceConfig] Returns: 16,16,32,32,2,2,8,2,1
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvOclDirectFwd: Success.
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenConvDirUni.cl ;args:  -DMLO_HW_WAVE_SZ=64 -DMLO_DIR_FORWARD=1 -DMLO_FILTER_SIZE0=5 -DMLO_FILTER_SIZE1=5 -DMLO_FILTER_PAD0=1 -DMLO_FILTER_PAD1=1 -DMLO_FILTER_STRIDE0=1 -DMLO_FILTER_STRIDE1=1 -DMLO_N_OUTPUTS=32 -DMLO_N_INPUTS=192 -DMLO_BATCH_SZ=16 -DMLO_OUT_WIDTH=26 -DMLO_OUT_HEIGHT=26 -DMLO_OUT_BATCH_STRIDE=21632 -DMLO_OUT_CHANNEL_STRIDE=676 -DMLO_OUT_STRIDE=26 -DMLO_IN_WIDTH=28 -DMLO_IN_HEIGHT=28 -DMLO_IN_BATCH_STRIDE=150528 -DMLO_IN_CHANNEL_STRIDE=784 -DMLO_IN_STRIDE=28 -DMLO_IN_TILE0=32 -DMLO_IN_TILE1=32 -DMLO_GRP_TILE0=16 -DMLO_GRP_TILE1=16 -DMLO_OUT_TILE0=2 -DMLO_OUT_TILE1=2 -DMLO_N_STACKS=1 -DMLO_N_OUT_TILES=8 -DMLO_N_OUT_TILES_PERSTACK=8 -DMLO_N_IN_TILES_PERSTACK=2 -DMLO_N_READ_PROCS=256 -DMLO_ALU_VTILE0=16 -DMLO_ALU_VTILE1=16 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMLO_CONV_BIAS=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenConvDirUni.cl.o') AND (kernel_args = ' -DMLO_HW_WAVE_SZ=64 -DMLO_DIR_FORWARD=1 -DMLO_FILTER_SIZE0=5 -DMLO_FILTER_SIZE1=5 -DMLO_FILTER_PAD0=1 -DMLO_FILTER_PAD1=1 -DMLO_FILTER_STRIDE0=1 -DMLO_FILTER_STRIDE1=1 -DMLO_N_OUTPUTS=32 -DMLO_N_INPUTS=192 -DMLO_BATCH_SZ=16 -DMLO_OUT_WIDTH=26 -DMLO_OUT_HEIGHT=26 -DMLO_OUT_BATCH_STRIDE=21632 -DMLO_OUT_CHANNEL_STRIDE=676 -DMLO_OUT_STRIDE=26 -DMLO_IN_WIDTH=28 -DMLO_IN_HEIGHT=28 -DMLO_IN_BATCH_STRIDE=150528 -DMLO_IN_CHANNEL_STRIDE=784 -DMLO_IN_STRIDE=28 -DMLO_IN_TILE0=32 -DMLO_IN_TILE1=32 -DMLO_GRP_TILE0=16 -DMLO_GRP_TILE1=16 -DMLO_OUT_TILE0=2 -DMLO_OUT_TILE1=2 -DMLO_N_STACKS=1 -DMLO_N_OUT_TILES=8 -DMLO_N_OUT_TILES_PERSTACK=8 -DMLO_N_IN_TILES_PERSTACK=2 -DMLO_N_READ_PROCS=256 -DMLO_ALU_VTILE0=16 -DMLO_ALU_VTILE1=16 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMLO_CONV_BIAS=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.040689 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: MIOpenConvDirUni.cl ;args:  -DMLO_HW_WAVE_SZ=64 -DMLO_DIR_FORWARD=1 -DMLO_FILTER_SIZE0=5 -DMLO_FILTER_SIZE1=5 -DMLO_FILTER_PAD0=1 -DMLO_FILTER_PAD1=1 -DMLO_FILTER_STRIDE0=1 -DMLO_FILTER_STRIDE1=1 -DMLO_N_OUTPUTS=32 -DMLO_N_INPUTS=192 -DMLO_BATCH_SZ=16 -DMLO_OUT_WIDTH=26 -DMLO_OUT_HEIGHT=26 -DMLO_OUT_BATCH_STRIDE=21632 -DMLO_OUT_CHANNEL_STRIDE=676 -DMLO_OUT_STRIDE=26 -DMLO_IN_WIDTH=28 -DMLO_IN_HEIGHT=28 -DMLO_IN_BATCH_STRIDE=150528 -DMLO_IN_CHANNEL_STRIDE=784 -DMLO_IN_STRIDE=28 -DMLO_IN_TILE0=32 -DMLO_IN_TILE1=32 -DMLO_GRP_TILE0=16 -DMLO_GRP_TILE1=16 -DMLO_OUT_TILE0=2 -DMLO_OUT_TILE1=2 -DMLO_N_STACKS=1 -DMLO_N_OUT_TILES=8 -DMLO_N_OUT_TILES_PERSTACK=8 -DMLO_N_IN_TILES_PERSTACK=2 -DMLO_N_READ_PROCS=256 -DMLO_ALU_VTILE0=16 -DMLO_ALU_VTILE1=16 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMLO_CONV_BIAS=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] INSERT OR REPLACE INTO kern_db(kernel_name, kernel_args, kernel_blob, kernel_hash, uncompressed_size) VALUES(?, ?, ?, ?, ?);
MIOpen(HIP): Info2 [Measure] Db::StoreRecord time: 13.6306 ms
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: MIOpenConvUni
MIOpen(HIP): Info [EvaluateInvokers] ConvOclDirectFwd: MIOpenConvUni: 2.30946 < 3.40282e+38
MIOpen(HIP): Info [EvaluateInvokers] Selected: ConvOclDirectFwd: MIOpenConvUni: 2.30946, workspce_sz = 0
MIOpen(HIP): Info [SetValues] 192-28-28-5x5-32-26-26-16-1x1-1x1-1x1-0-NCHW-FP32-F, content inserted: miopenConvolutionFwdAlgoDirect:ConvOclDirectFwd,2.30946,0,miopenConvolutionFwdAlgoDirect,<unused>
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmForwardV4R4Xdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R4GenXdlopsFwdFp32: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R4GenFwdXdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmBwdDataV1R1Xdlops: Not applicable
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmBwdDataV4R1Xdlops: Not applicable
MIOpen(HIP): Info [FindSolutionImpl] ConvHipImplicitGemmV4R1Fwd
MIOpen(HIP): Info2 [Prepare] SELECT solver, params FROM perf_db INNER JOIN config ON perf_db.config = config.id WHERE ( (layout = ? ) AND (data_type = ? ) AND (direction = ? ) AND (spatial_dim = ? ) AND (in_channels = ? ) AND (in_h = ? ) AND (in_w = ? ) AND (in_d = ? ) AND (fil_h = ? ) AND (fil_w = ? ) AND (fil_d = ? ) AND (out_channels = ? ) AND (batchsize = ? ) AND (pad_h = ? ) AND (pad_w = ? ) AND (pad_d = ? ) AND (conv_stride_h = ? ) AND (conv_stride_w = ? ) AND (conv_stride_d = ? ) AND (dilation_h = ? ) AND (dilation_w = ? ) AND (dilation_d = ? ) AND (bias = ? ) AND (group_count = ? ) )AND (arch = 'gfx803' ) AND (num_cu = '32');
MIOpen(HIP): Info2 [impl] [NCHW,FP32,F,2,192,28,28,1,5,5,1,32,16,1,1,0,1,1,0,1,1,0,0,1]
MIOpen(HIP): Info2 [Prepare] SELECT solver, params FROM perf_db INNER JOIN config ON perf_db.config = config.id WHERE ( (layout = ? ) AND (data_type = ? ) AND (direction = ? ) AND (spatial_dim = ? ) AND (in_channels = ? ) AND (in_h = ? ) AND (in_w = ? ) AND (in_d = ? ) AND (fil_h = ? ) AND (fil_w = ? ) AND (fil_d = ? ) AND (out_channels = ? ) AND (batchsize = ? ) AND (pad_h = ? ) AND (pad_w = ? ) AND (pad_d = ? ) AND (conv_stride_h = ? ) AND (conv_stride_w = ? ) AND (conv_stride_d = ? ) AND (dilation_h = ? ) AND (dilation_w = ? ) AND (dilation_d = ? ) AND (bias = ? ) AND (group_count = ? ) )AND (arch = 'gfx803' ) AND (num_cu = '32');
MIOpen(HIP): Info2 [impl] [NCHW,FP32,F,2,192,28,28,1,5,5,1,32,16,1,1,0,1,1,0,1,1,0,0,1]
MIOpen(HIP): Info2 [Measure] Db::Load time: 0.233658 ms
MIOpen(HIP): Info [FindSolutionImpl] Perf Db: record not found for: ConvHipImplicitGemmV4R1Fwd
MIOpen(HIP): Info [EuristicInit] 16,16,4,2,2,2,2,4,2,4,4,1,16,1,4,16
MIOpen(HIP): Info [GetPerformanceConfigBase] 16,16,4,2,2,2,2,4,2,4,4,1,16,1,4,16
MIOpen(HIP): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4R1Fwd: Success.
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp ;args:  -std=c++14  -DCK_PARAM_PROBLEM_N=16 -DCK_PARAM_PROBLEM_K=32 -DCK_PARAM_PROBLEM_C=192 -DCK_PARAM_PROBLEM_HI=28 -DCK_PARAM_PROBLEM_WI=28 -DCK_PARAM_PROBLEM_HO=26 -DCK_PARAM_PROBLEM_WO=26 -DCK_PARAM_PROBLEM_Y=5 -DCK_PARAM_PROBLEM_X=5 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_LEFT_PAD_H=1 -DCK_PARAM_PROBLEM_LEFT_PAD_W=1 -DCK_PARAM_PROBLEM_RIGHT_PAD_H=1 -DCK_PARAM_PROBLEM_RIGHT_PAD_W=1 -DCK_PARAM_PROBLEM_CONV_GROUP_COUNTS=1 -DCK_PARAM_PROBLEM_CONV_DIRECTION_FORWARD=1 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_DATA=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_WEIGHT=0 -DCK_PARAM_TUNABLE_BLOCK_SIZE=64 -DCK_PARAM_TUNABLE_B_PER_BLOCK=16 -DCK_PARAM_TUNABLE_K_PER_BLOCK=16 -DCK_PARAM_TUNABLE_E_PER_BLOCK=4 -DCK_PARAM_DEPENDENT_GRID_SIZE=338 -DCK_PARAM_GEMM_N_REPEAT=2 -DCK_PARAM_GEMM_M_PER_THREAD_SUB_C=2 -DCK_PARAM_GEMM_N_PER_THREAD_SUB_C=2 -DCK_PARAM_GEMM_M_LEVEL0_CLUSTER=2 -DCK_PARAM_GEMM_N_LEVEL0_CLUSTER=4 -DCK_PARAM_GEMM_M_LEVEL1_CLUSTER=2 -DCK_PARAM_GEMM_N_LEVEL1_CLUSTER=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N1=1 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_B=16 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N2=1 -DCK_PARAM_IN_BLOCK_COPY_SRC_DATA_PER_READ_B=1 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_K=16 -DCK_PARAM_WEI_BLOCK_COPY_SRC_DATA_PER_READ_E=1 -DCK_PARAM_EPACK_LENGTH=1 -DCK_THREADWISE_GEMM_USE_AMD_INLINE_ASM=0 -DCK_USE_AMD_INLINE_ASM=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_IN_BLOCK_COPY_DST_DATA_PER_WRITE_N2=2 -DCK_PARAM_WEI_BLOCK_COPY_DST_DATA_PER_WRITE_K=1 -mcpu=gfx803
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.cpp.o') AND (kernel_args = ' -std=c++14  -DCK_PARAM_PROBLEM_N=16 -DCK_PARAM_PROBLEM_K=32 -DCK_PARAM_PROBLEM_C=192 -DCK_PARAM_PROBLEM_HI=28 -DCK_PARAM_PROBLEM_WI=28 -DCK_PARAM_PROBLEM_HO=26 -DCK_PARAM_PROBLEM_WO=26 -DCK_PARAM_PROBLEM_Y=5 -DCK_PARAM_PROBLEM_X=5 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_LEFT_PAD_H=1 -DCK_PARAM_PROBLEM_LEFT_PAD_W=1 -DCK_PARAM_PROBLEM_RIGHT_PAD_H=1 -DCK_PARAM_PROBLEM_RIGHT_PAD_W=1 -DCK_PARAM_PROBLEM_CONV_GROUP_COUNTS=1 -DCK_PARAM_PROBLEM_CONV_DIRECTION_FORWARD=1 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_DATA=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_WEIGHT=0 -DCK_PARAM_TUNABLE_BLOCK_SIZE=64 -DCK_PARAM_TUNABLE_B_PER_BLOCK=16 -DCK_PARAM_TUNABLE_K_PER_BLOCK=16 -DCK_PARAM_TUNABLE_E_PER_BLOCK=4 -DCK_PARAM_DEPENDENT_GRID_SIZE=338 -DCK_PARAM_GEMM_N_REPEAT=2 -DCK_PARAM_GEMM_M_PER_THREAD_SUB_C=2 -DCK_PARAM_GEMM_N_PER_THREAD_SUB_C=2 -DCK_PARAM_GEMM_M_LEVEL0_CLUSTER=2 -DCK_PARAM_GEMM_N_LEVEL0_CLUSTER=4 -DCK_PARAM_GEMM_M_LEVEL1_CLUSTER=2 -DCK_PARAM_GEMM_N_LEVEL1_CLUSTER=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N1=1 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_B=16 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N2=1 -DCK_PARAM_IN_BLOCK_COPY_SRC_DATA_PER_READ_B=1 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_K=16 -DCK_PARAM_WEI_BLOCK_COPY_SRC_DATA_PER_READ_E=1 -DCK_PARAM_EPACK_LENGTH=1 -DCK_THREADWISE_GEMM_USE_AMD_INLINE_ASM=0 -DCK_USE_AMD_INLINE_ASM=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_IN_BLOCK_COPY_DST_DATA_PER_WRITE_N2=2 -DCK_PARAM_WEI_BLOCK_COPY_DST_DATA_PER_WRITE_K=1 -mcpu=gfx803');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.044059 ms
terminate called after throwing an instance of 'miopen::Exception'
  what():  /home/oleid/.cache/rua/build/miopen-hip/src/MIOpen-rocm-3.7.0/src/hipoc/hipoc_program.cpp:94: Failed creating module hipErrorSharedObjectInitFailed
CMake Error at test_test_find_db.cmake:7 (message):
  Test failed

      Start 19: test_fusion_aux
24/55 Test #19: test_fusion_aux ..................   Passed    0.08 sec
      Start 20: test_gru
25/55 Test #20: test_gru .........................***Failed  Error regular expression found in output. Regex=[FAILED] 11.58 sec
Empty batch sequence. Filling uniformly with batch size: 17
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru --float --batch-size 17 --seq-len 2 --vector-len 13 --hidden-size 67 --num-layers 1 --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 
FAILED: 0.0320774
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17 -m gru -k 2 -H 67 -W 13 -l 1 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 34 seqLength: 2 inputLen: 13 numLayers: 1 useDropout: 0
Backward Data GRU: 
Output dx failed verification.
Max diff: 0.017266
Mismatch at 1: -0.00755944 != -0.00755943
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru --float --batch-size 17 --seq-len 2 --vector-len 13 --hidden-size 67 --num-layers 1 --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 
FAILED: 0.00856887
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17 -m gru -k 2 -H 67 -W 13 -l 1 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 34 seqLength: 2 inputLen: 13 numLayers: 1 useDropout: 0
Backward Data GRU: 
Max diff: 0.0487628
Mismatch at 2: 0.00140701 != 0.00140701
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru --float --batch-size 17 --seq-len 2 --vector-len 13 --hidden-size 67 --num-layers 1 --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 
FAILED: 0.0545153
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17 -m gru -k 2 -H 67 -W 13 -l 1 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 34 seqLength: 2 inputLen: 13 numLayers: 1 useDropout: 0
Backward Weights GRU: 
Max diff: 0.507131
Mismatch at 48: 0.00114732 != 0.50561
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru --float --batch-size 17 --seq-len 2 --vector-len 13 --hidden-size 67 --num-layers 1 --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 
FAILED: 0.107456
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17 -m gru -k 2 -H 67 -W 13 -l 1 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 34 seqLength: 2 inputLen: 13 numLayers: 1
Forward Inference GRU: 
Output tensor output failed verification.
Max diff: 0.0694702
Mismatch at 0: 0.00283208 != 0.0151487

      Start 21: test_gru_dropout
26/55 Test #21: test_gru_dropout .................***Failed  Error regular expression found in output. Regex=[FAILED] 10.71 sec
Empty batch sequence. Filling uniformly with batch size: 17
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru_dropout --float --batch-size 17 --seq-len 23 --vector-len 13 --hidden-size 67 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 
FAILED: 0.00454052
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5 -m gru -k 23 -H 67 -W 13 -l 3 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 260 seqLength: 23 inputLen: 13 numLayers: 3 useDropout: 1
Backward Data GRU: 
Output dx failed verification.
Max diff: 0.00322931
Mismatch at 0: 0.000931848 != 0.00093185
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru_dropout --float --batch-size 17 --seq-len 23 --vector-len 13 --hidden-size 67 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 
FAILED: 4.29788e-05
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5 -m gru -k 23 -H 67 -W 13 -l 3 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 260 seqLength: 23 inputLen: 13 numLayers: 3 useDropout: 1
Backward Data GRU: 
Hidden state dhx tensor failed verification.
Max diff: 0.000362066
Mismatch at 0: -0.0204745 != -0.0204745
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_gru_dropout --float --batch-size 17 --seq-len 23 --vector-len 13 --hidden-size 67 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 
FAILED: 1.34913e-05
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5 -m gru -k 23 -H 67 -W 13 -l 3 -F 0 -r 0 -b 0 -p 0
inputMode: 0 biasMode: 0 dirMode: 0
hz: 67 batch_n: 260 seqLength: 23 inputLen: 13 numLayers: 3 useDropout: 1
Backward Data GRU: 
Max diff: 0.00129145
Mismatch at 0: -0.000566879 != -0.000566879

      Start 22: test_handle_test
27/55 Test #22: test_handle_test .................   Passed    1.09 sec
      Start 23: test_immed_conv2d
28/55 Test #23: test_immed_conv2d ................   Passed    0.07 sec
      Start 24: test_immed_conv3d
29/55 Test #24: test_immed_conv3d ................   Passed    7.70 sec
      Start 25: test_include_inliner
30/55 Test #25: test_include_inliner .............   Passed    0.10 sec
      Start 26: test_kernel_build_params
31/55 Test #26: test_kernel_build_params .........   Passed    0.07 sec
      Start 28: test_lstm
32/55 Test #28: test_lstm ........................   Passed    8.55 sec
      Start 29: test_lstm_dropout
33/55 Test #29: test_lstm_dropout ................***Failed  Error regular expression found in output. Regex=[FAILED]  9.47 sec
Empty batch sequence. Filling uniformly with batch size: 17
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.0148069
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Forward Train LSTM: 
Output tensor output failed verification.
Max diff: 0.0296305
Mismatch at 0: -0.00388325 != -0.00388324
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.00364173
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Forward Train LSTM: 
Hidden state tensor failed verification.
Max diff: 0.0058289
Mismatch at 1: 0.0453552 != 0.0453552
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.00369584
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Forward Train LSTM: 
Cell state tensor failed verification.
Max diff: 0.010714
Mismatch at 0: 0.0155845 != 0.0155845
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.000112727
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Backward Data LSTM: 
Output dx failed verification.
Max diff: 0.000107641
Mismatch at 0: 0.00030012 != 0.000300088
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 7.58906e-05
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Backward Data LSTM: 
Hidden state dhx tensor failed verification.
Max diff: 4.07901e-05
Mismatch at 0: 0.000162567 != 0.000162645
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.000162781
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Backward Data LSTM: 
Hidden cell dcx tensor failed verification.
Max diff: 0.000141347
Mismatch at 0: -0.000447281 != -0.000447214
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 0.000338308
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Backward Data LSTM: 
Workspace space tensor failed verification.
Max diff: 0.00392286
Mismatch at 0: 5.63776e-06 != 5.63691e-06
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_lstm_dropout --float --batch-size 17 --seq-len 25 --vector-len 17 --hidden-size 67 --num-layers 3 --no-hx 0 --no-dhy 0 --no-cx 0 --no-dcy 0 --no-hy 0 --no-dhx 0 --no-cy 0 --no-dcx 0 --flat-batch-fill 0 --use-dropout 1 --in-mode 0 --bias-mode 1 --dir-mode 0 --algo-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 
FAILED: 6.58243e-05
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4 -m lstm  -k 25 -H 67 -W 17 -l 3 -F 0  -r 0 -b 1 -p 0
inputMode: 0 biasMode: 1 dirMode: 0
hz: 67 batch_n: 268 seqLength: 25 inputLen: 17 numLayers: 3 useDropout: 1
Backward Weights LSTM: 
Max diff: 0.0102975
Mismatch at 299: -0.000965326 != -0.000965326

      Start 30: test_main
34/55 Test #30: test_main ........................***Failed  Error regular expression found in output. Regex=[FAILED] 24.77 sec
MIOpen(HIP): Warning [Prefetch] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx803_32.HIP.fdb.txt
FAILED: (miopenFindConvolutionForwardAlgorithm( handle, inputTensor, in_dev, convFilter, wei_dev, convDesc, outputTensor, out_dev, 1, &ret_algo_count, &perf, fwd_workspace_dev, sz_fwd_workspace, 0)) == 0: /home/oleid/.cache/rua/build/miopen-hip/src/MIOpen-rocm-3.7.0/test/main.cpp: 282
MIOpen Error: /home/oleid/.cache/rua/build/miopen-hip/src/MIOpen-rocm-3.7.0/src/hipoc/hipoc_program.cpp:94: Failed creating module hipErrorSharedObjectInitFailed
CMake Error at test_test_main.cmake:7 (message):
  Test failed

      Start 31: test_mdgraph
35/55 Test #31: test_mdgraph .....................   Passed    0.41 sec
      Start 32: test_na_inference
36/55 Test #32: test_na_inference ................   Passed    0.81 sec
      Start 33: test_na_train
37/55 Test #33: test_na_train ....................   Passed    1.30 sec
      Start 34: test_perfdb
38/55 Test #34: test_perfdb ......................   Passed    0.68 sec
      Start 37: test_rnn_vanilla
39/55 Test #37: test_rnn_vanilla .................   Passed    9.20 sec
      Start 38: test_rnn_vanilla_dropout
40/55 Test #38: test_rnn_vanilla_dropout .........***Failed  Error regular expression found in output. Regex=[FAILED] 10.63 sec
Empty batch sequence. Filling uniformly with batch size: 17
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.0134763
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Forward Train RNN vanilla: 
Output tensor output failed verification.
Max diff: 0.213538
Mismatch at 17: 0.0560553 != 0.0560552
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.00610061
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Forward Train RNN vanilla: 
Hidden state tensor failed verification.
Max diff: 0.0544878
Mismatch at 15: 0.0989682 != 0.0989682
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.00585688
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Forward Train RNN vanilla: 
Weight tensor failed verification.
Max diff: 0.441525
Mismatch at 2: 0.031029 != 0.031029
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.00518946
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Backward Data RNN vanilla: 
Output dx failed verification.
Max diff: 0.117308
Mismatch at 2: 0.00535342 != 0.00535343
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.00415193
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Backward Data RNN vanilla: 
Hidden state dhx tensor failed verification.
Max diff: 0.00748828
Mismatch at 0: -0.00334076 != -0.00334076
/home/oleid/.cache/rua/build/miopen-hip/src/build/bin/test_rnn_vanilla_dropout --float --batch-size 17 --seq-len 51 --vector-len 31 --hidden-size 127 --num-layers 3 --flat-batch-fill 0 --use-dropout 1 --in-mode 1 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 17 17 16 16 16 16 15 15 14 13 12 11 10 10 9 8 8 8 7 6 6 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
FAILED: 0.00436673
Iteration: 0
./bin/MIOpenDriver rnn -n 17,17,16,16,16,16,15,15,14,13,12,11,10,10,9,8,8,8,7,6,6,5,5,4,4,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 -m relu -k 51 -H 127 -W 127 -l 3 -F 0 -r 0 -b 1 -p 1 -U 1
Backward Data RNN vanilla: 
Reserved space tensor failed verification.
Max diff: 0.117308
Mismatch at 2: 0.00535342 != 0.00535343

      Start 39: test_sequences
41/55 Test #39: test_sequences ...................   Passed    0.01 sec
      Start 41: test_solver
42/55 Test #41: test_solver ......................   Passed    0.40 sec
      Start 42: test_sqlite_perfdb
43/55 Test #42: test_sqlite_perfdb ...............   Passed    3.65 sec
      Start 43: test_tensor_cast
44/55 Test #43: test_tensor_cast .................   Passed    0.90 sec
      Start 44: test_tensor_copy
45/55 Test #44: test_tensor_copy .................   Passed    0.92 sec
      Start 45: test_tensor_ops
46/55 Test #45: test_tensor_ops ..................   Passed    0.85 sec
      Start 46: test_tensor_scale
47/55 Test #46: test_tensor_scale ................   Passed    0.86 sec
      Start 47: test_tensor_set
48/55 Test #47: test_tensor_set ..................   Passed    0.88 sec
      Start 48: test_tensor_test
49/55 Test #48: test_tensor_test .................   Passed    0.06 sec
      Start 49: test_tensor_trans
50/55 Test #49: test_tensor_trans ................   Passed    1.23 sec
      Start 50: test_tensor_transform
51/55 Test #50: test_tensor_transform ............   Passed    0.99 sec
      Start 51: test_tensor_vec
52/55 Test #51: test_tensor_vec ..................   Passed    0.39 sec
      Start 52: test_test_errors
53/55 Test #52: test_test_errors .................   Passed    0.07 sec
      Start 53: test_type_name
54/55 Test #53: test_type_name ...................   Passed    0.01 sec
      Start 54: test_w_supertensor
55/55 Test #54: test_w_supertensor ...............   Passed    1.22 sec

89% tests passed, 6 tests failed out of 55

Total Test time (real) = 228.81 sec

The following tests FAILED:
     18 - test_find_db (Failed)
     20 - test_gru (Failed)
     21 - test_gru_dropout (Failed)
     29 - test_lstm_dropout (Failed)
     30 - test_main (Failed)
     38 - test_rnn_vanilla_dropout (Failed)
Errors while running CTest
make[3]: *** [test/CMakeFiles/check.dir/build.make:76: test/CMakeFiles/check] Error 8
make[2]: *** [CMakeFiles/Makefile2:8075: test/CMakeFiles/check.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:8082: test/CMakeFiles/check.dir/rule] Error 2
make: *** [Makefile:3622: check] Error 2
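
The failures in the log above appear to fall into two groups: tests that abort with `hipErrorSharedObjectInitFailed` (test_find_db, test_main) and RNN tests that report "failed verification" mismatches. A minimal Python sketch for grouping failed tests this way from a saved CTest log is shown here. The function name `triage`, the signature labels, and the abridged sample text are illustrative assumptions, not part of MIOpen or CTest, and the matching is a rough heuristic.

```python
import re

# Matches CTest result lines such as
# "25/55 Test #20: test_gru .........................***Failed  Error ..."
RESULT_RE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#\d+:\s+(\S+)\s+\.+\s*(\*\*\*Failed|Passed)"
)


def triage(log_text):
    """Group failed tests by a rough failure signature (heuristic labels)."""
    signatures = {}  # failed test name -> signature label
    current = None   # failed test whose output is currently being read
    for line in log_text.splitlines():
        m = RESULT_RE.match(line)
        if m:
            name, status = m.groups()
            current = name if status == "***Failed" else None
            if current:
                signatures[current] = "unclassified"
            continue
        if current is None:
            continue
        if "hipErrorSharedObjectInitFailed" in line:
            # A module load failure takes priority over other symptoms.
            signatures[current] = "module load failure"
        elif signatures[current] == "unclassified" and (
            "failed verification" in line or line.startswith("Mismatch at")
        ):
            signatures[current] = "numerical mismatch"
    groups = {}
    for name, sig in signatures.items():
        groups.setdefault(sig, []).append(name)
    return groups


# Abridged sample modelled on the log above (paths shortened for illustration).
SAMPLE = """\
24/55 Test #19: test_fusion_aux ..................   Passed    0.08 sec
25/55 Test #20: test_gru .........................***Failed  Error regular expression found in output. Regex=[FAILED] 11.58 sec
Output dx failed verification.
Mismatch at 1: -0.00755944 != -0.00755943
34/55 Test #30: test_main ........................***Failed  Error regular expression found in output. Regex=[FAILED] 24.77 sec
MIOpen Error: hipoc_program.cpp:94: Failed creating module hipErrorSharedObjectInitFailed
"""

if __name__ == "__main__":
    for signature, tests in triage(SAMPLE).items():
        print(f"{signature}: {', '.join(tests)}")
```

The sketch relies on CTest printing a test's output after its result line, as in the log above. It would miss a failed test whose result line falls outside the saved excerpt.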
oleid commented 3 years ago

Found it: it seems MIOpen was built without support for clang-offload-bundler; cf. https://github.com/rocm-arch/rocm-arch/pull/417
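
As a first check on a similar setup, one could look for the `clang-offload-bundler` binary itself. The Python sketch below is only illustrative: the helper name `find_tool` and the candidate directories are assumptions about a typical ROCm layout, and finding the tool does not prove that MIOpen was built with support for it. The actual packaging fix is in the linked pull request.

```python
import shutil
from pathlib import Path

# Candidate directories are assumptions; adjust them to the local ROCm install.
CANDIDATE_DIRS = ["/opt/rocm/llvm/bin", "/opt/rocm/bin"]


def find_tool(name="clang-offload-bundler", extra_dirs=CANDIDATE_DIRS):
    """Return the first location where `name` is found, or None."""
    # Search PATH first, then fall back to the extra directories.
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    path = find_tool()
    print(path or "clang-offload-bundler not found")
```

If the tool is missing, or is present only outside the directories the MIOpen build searched, that would be consistent with the `hipErrorSharedObjectInitFailed` errors in the log, though other causes are possible.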