ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

ELMo NaN training, works with Nvidia #660

Open twuebi opened 5 years ago

twuebi commented 5 years ago

System information

Describe the current behavior: Training an RNN for sequence labeling on top of ELMo (https://github.com/allenai/bilm-tf) produces a NaN loss. The same model runs fine on an Nvidia RTX 5000.

Confirmed on:

Describe the expected behavior: No NaNs.

Code to reproduce the issue: My dataset is not publicly available, so I've put together a reproducer with freely available data (taken from https://github.com/UniversalDependencies/UD_German-GSD). With my own data the NaN loss occurs after 3 steps; with this dataset it occurs after the first step. The RTX 5000 does not produce NaNs for either dataset. (The ELMo weights are taken from https://github.com/t-systems-on-site-services-gmbh/german-elmo-model; the NaN also occurs with my own trained weights.)

$ git clone https://github.com/allenai/bilm-tf.git
$ cd bilm-tf
$ python setup.py install

Download this archive (nan_elmo.tar.gz, 336 MB), then:

$ tar -xzf nan_elmo.tar.gz
$ python nan.py dev_ud.conll
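
For context, nan.py follows the standard bilm-tf recipe: load the pretrained biLM, collapse its layers into a single ELMo representation with weight_layers, and train a small BiLSTM tagger with softmax cross-entropy on top. The outline below is only a sketch of that setup, not the exact script; the file names, hidden size, label count, and optimizer settings are placeholders.

import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholder paths for the ELMo model files shipped with the archive above.
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'
vocab_file = 'vocab.txt'
n_tags = 54        # number of sequence labels (placeholder)
max_chars = 50     # max characters per token expected by the ELMo char CNN

# Used at feed time to turn tokenized sentences into character ids.
batcher = Batcher(vocab_file, max_chars)

# Graph inputs: character ids for the biLM and gold label ids for the tagger.
char_ids = tf.placeholder(tf.int32, shape=(None, None, max_chars))
labels = tf.placeholder(tf.int32, shape=(None, None))

# Build the pretrained biLM and the weighted-average ELMo layer on top of it.
bilm = BidirectionalLanguageModel(options_file, weight_file)
embeddings_op = bilm(char_ids)
elmo = weight_layers('input', embeddings_op, l2_coef=0.0)['weighted_op']

# A small BiLSTM tagger over the ELMo representation.
fw_cell = tf.nn.rnn_cell.LSTMCell(256)
bw_cell = tf.nn.rnn_cell.LSTMCell(256)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, elmo, dtype=tf.float32)
logits = tf.layers.dense(tf.concat([out_fw, out_bw], axis=-1), n_tags)

# Token-level cross-entropy loss (padding mask omitted for brevity) and a plain Adam step.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)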

Output is:

number_of_batch loss

You should see the NaN after the first batch.
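
To narrow down where the first non-finite value shows up (the ELMo forward pass, the BiLSTM tagger, or the loss itself), the graph can be instrumented with tf.check_numerics so the run aborts at the offending tensor instead of only printing NaN. Roughly, using the tensor names from the sketch above; the batches iterable is a stand-in for whatever input pipeline nan.py actually uses:

import tensorflow as tf  # continuing from the sketch above

# Guards inserted while building the graph, i.e. before the optimizer is created,
# so they sit on the forward path of every training step.
elmo = tf.check_numerics(elmo, 'ELMo output is non-finite')
logits = tf.check_numerics(logits, 'logits are non-finite')
loss = tf.check_numerics(loss, 'loss is non-finite')
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step, feed_dict in enumerate(batches):
        # If any guarded tensor contains NaN/Inf, this raises
        # tf.errors.InvalidArgumentError with the matching message,
        # which tells ELMo vs. tagger vs. loss apart.
        _, loss_val = sess.run([train_op, loss], feed_dict=feed_dict)
        print(step, loss_val)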

sunway513 commented 5 years ago

Thank you @twuebi, I will try to repro this issue locally and update you here.