System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint
TensorFlow installed from (source or binary): binary and docker
TensorFlow version (use command below): 1.14.2
Python version: 3.6.5 (and the version shipped in the Docker image)
ROCm/MIOpen version: 2.8
GPU model and memory: Radeon VII, 16 GB
Describe the current behavior
Training an RNN for sequence labeling on top of ELMo (https://github.com/allenai/bilm-tf) produces a NaN loss. The same model trains fine on an NVIDIA RTX 5000.
Confirmed on:
the latest ROCm docker image
ROCm 2.8 + TF 1.14.2 + kernel 5.2
ROCm 2.8 + TF 1.14.2 + kernel 5 (a second machine, also with a Radeon VII)
Describe the expected behavior
No NaNs; the loss should stay finite, as it does on the NVIDIA card.
Code to reproduce the issue
My dataset is not publicly available, so I've put together a reproducer with freely available data (taken from https://github.com/UniversalDependencies/UD_German-GSD). With my own data the NaN loss occurs after 3 steps; with this dataset it occurs after the first step. The RTX 5000 does not produce NaNs on either dataset. (The ELMo weights are taken from https://github.com/t-systems-on-site-services-gmbh/german-elmo-model; the NaN also occurs with my own trained weights.)
Download this archive (336MB)
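The actual script and data are in the archive above. For orientation, the graph it trains looks roughly like the sketch below. This is an illustrative sketch only, not the archive's code: the dimensions, layer sizes, and the placeholder-fed precomputed ELMo features are assumptions.

```python
import tensorflow as tf  # tensorflow-rocm 1.14, TF1-style graph API

ELMO_DIM = 1024   # illustrative: size of the (pre-weighted) ELMo representation
NUM_TAGS = 50     # illustrative: size of the tag set
HIDDEN = 256      # illustrative: LSTM size

# Precomputed ELMo features are fed in through a placeholder; the real
# reproducer builds them with bilm-tf, but that part is independent of the
# loop that prints the loss.
elmo_input = tf.placeholder(tf.float32, [None, None, ELMO_DIM], name="elmo_input")
tags = tf.placeholder(tf.int32, [None, None], name="tags")
seq_len = tf.placeholder(tf.int32, [None], name="seq_len")

# Bidirectional LSTM over the ELMo features.
fw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN)
bw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, elmo_input, sequence_length=seq_len, dtype=tf.float32)
rnn_out = tf.concat([out_fw, out_bw], axis=-1)

# Per-token tag logits and length-masked cross-entropy loss.
logits = tf.layers.dense(rnn_out, NUM_TAGS)
mask = tf.sequence_mask(seq_len, maxlen=tf.shape(tags)[1], dtype=tf.float32)
xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tags, logits=logits)
loss = tf.reduce_sum(xent * mask) / tf.reduce_sum(mask)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```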
The output is one line per batch, in the format:
number_of_batch loss
You should see the NaN after the first batch.
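The driver loop is essentially of this shape (again a minimal sketch with illustrative names, not the archive's actual code):

```python
import math

def run_training(sess, train_op, loss_op, batches):
    """Run one training step per batch, print 'number_of_batch loss' like the
    reproducer does, and stop at the first non-finite loss."""
    for step, feed_dict in enumerate(batches, start=1):
        _, loss_value = sess.run([train_op, loss_op], feed_dict=feed_dict)
        print(step, loss_value)
        if not math.isfinite(loss_value):
            print("non-finite loss at batch", step)
            break
```

On the graph side, fetching the op returned by tf.add_check_numerics_ops() together with the train op should make the run fail at the first op that produces a NaN/Inf, which may help narrow down where the ROCm path diverges from the CUDA one.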