System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint
TensorFlow installed from (source or binary): binary and docker
TensorFlow version (use command below): 1.14.2
Python version: 3.6.5 (and the version shipped in the Docker image)
ROCm/MIOpen version: 2.8
GPU model and memory: Radeon VII, 16 GB
Describe the current behavior
Training an RNN for sequence labeling on top of ELMo (https://github.com/allenai/bilm-tf) produces a NaN loss. The same model trains fine on an NVIDIA RTX 5000.
Confirmed on:
the latest ROCm docker image
ROCm 2.8 + TF 1.14.2 + kernel 5.2
ROCm 2.8 + TF 1.14.2 + kernel 5 (a second machine, also with a Radeon VII)
Describe the expected behavior
No NaNs; the loss should stay finite, as it does on the NVIDIA card.
Code to reproduce the issue
My dataset is not publicly available, so I've put together a reproducer with freely available data (taken from https://github.com/UniversalDependencies/UD_German-GSD). With my own data the NaN loss occurs after 3 steps; with this dataset it occurs after the first step. The RTX 5000 does not produce NaNs on either dataset. (The ELMo weights are taken from https://github.com/t-systems-on-site-services-gmbh/german-elmo-model; the NaN also occurs with my own trained weights.)
Download this archive (336MB)
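The actual script and data are in the archive above. For orientation, the graph it trains looks roughly like the sketch below. This is an illustrative sketch only, not the archive's code: the dimensions, layer sizes, and the placeholder-fed precomputed ELMo features are assumptions.

```python
import tensorflow as tf  # tensorflow-rocm 1.14, TF1-style graph API

ELMO_DIM = 1024   # illustrative: size of the (pre-weighted) ELMo representation
NUM_TAGS = 50     # illustrative: size of the tag set
HIDDEN = 256      # illustrative: LSTM size

# Precomputed ELMo features are fed in through a placeholder; the real
# reproducer builds them with bilm-tf, but that part is independent of the
# loop that prints the loss.
elmo_input = tf.placeholder(tf.float32, [None, None, ELMO_DIM], name="elmo_input")
tags = tf.placeholder(tf.int32, [None, None], name="tags")
seq_len = tf.placeholder(tf.int32, [None], name="seq_len")

# Bidirectional LSTM over the ELMo features.
fw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN)
bw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, elmo_input, sequence_length=seq_len, dtype=tf.float32)
rnn_out = tf.concat([out_fw, out_bw], axis=-1)

# Per-token tag logits and length-masked cross-entropy loss.
logits = tf.layers.dense(rnn_out, NUM_TAGS)
mask = tf.sequence_mask(seq_len, maxlen=tf.shape(tags)[1], dtype=tf.float32)
xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tags, logits=logits)
loss = tf.reduce_sum(xent * mask) / tf.reduce_sum(mask)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```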
The output is one line per batch, in the format:
number_of_batch loss
You should see the NaN after the first batch.
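The driver loop is essentially of this shape (again a minimal sketch with illustrative names, not the archive's actual code):

```python
import math

def run_training(sess, train_op, loss_op, batches):
    """Run one training step per batch, print 'number_of_batch loss' like the
    reproducer does, and stop at the first non-finite loss."""
    for step, feed_dict in enumerate(batches, start=1):
        _, loss_value = sess.run([train_op, loss_op], feed_dict=feed_dict)
        print(step, loss_value)
        if not math.isfinite(loss_value):
            print("non-finite loss at batch", step)
            break
```

On the graph side, fetching the op returned by tf.add_check_numerics_ops() together with the train op should make the run fail at the first op that produces a NaN/Inf, which may help narrow down where the ROCm path diverges from the CUDA one.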