ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Error when running this official tutorial for tensorflow 2.0beta1 #517

Open witeko opened 5 years ago

witeko commented 5 years ago

System information

Describe the current behavior
When I run this official tutorial (https://www.tensorflow.org/beta/tutorials/text/text_classification_rnn) on TF 2.0 Beta 1, I get "E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] constant folding failed: Invalid argument: Unsupported type: 21" while training the model. The training progresses but is incredibly slow: "5/Unknown - 446s 89s/step - loss: 0.6935 - acc: 0.5219". The expected speed is about 200ms/step, not ~90s/step.

Describe the expected behavior
No error, and a training speed of about 200ms/step (not 90s/step).

Code to reproduce the issue
The tutorial at https://www.tensorflow.org/beta/tutorials/text/text_classification_rnn

sunway513 commented 5 years ago

Hi @witeko, did you use our official docker image, or did you build TF 2.0 Beta 1 from source? If not, can you try the following docker image? rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2
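For reference, pulling and starting that image usually looks something like the sketch below. The `--device`/`--group-add` flags are the typical ones for ROCm containers, not something stated in this thread, so adjust them for your host.

```shell
# Pull the TF 2.0 Beta 1 image mentioned above and start an interactive container.
# The device and security flags below are the usual ROCm ones; verify against
# your ROCm installation docs before relying on them.
docker pull rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2
```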

witeko commented 5 years ago

@sunway513 I've built it from source. There's a whole bunch of examples that work perfectly fine; this one is an exception. OK, I can try the official docker image when I have some free time. Do you think that would change the result? :)

sunway513 commented 5 years ago

@witeko, thanks. Just FYI, there are two options for building the TF 2.0 Beta 1 source; if you don't pass "--config=v2" to the bazel build command, the TF 2.0 features won't be enabled. You can refer to the following two build scripts included in the release branch: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/r2.0-rocm/build_rocm_python3 https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/r2.0-rocm/build_rocm_v2
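For anyone following along, the difference between the two linked scripts essentially comes down to that one bazel flag. A rough sketch of the v2-style build (the linked scripts are authoritative; the exact flag set here is an assumption):

```shell
# Without --config=v2, the build produces 1.x-style default behavior;
# with it, the TF 2.0 APIs are enabled.
./configure
bazel build --config=opt --config=rocm --config=v2 \
    //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
```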

witeko commented 5 years ago

@sunway513 I checked again, and this time I built TF2 with the v2 option. There are no more errors, but now I get warnings: "W tensorflow/core/grappler/optimizers/implementation_selector.cc:199] Skipping optimization due to error while loading function libraries: Invalid argument: Functions 'inference___backward_cudnn_lstm_1069_1247_specialized_for_Adam_gradients_bidirectional_StatefulPartitionedCall_1_grad_StatefulPartitionedCallatinference_keras_scratch_graph_3646' and '__inference___backward_standard_lstm_2111_2613' both implement 'lstm_b69b5a0f-2942-4e5d-892b-2bf1f341b279' but their signatures do not match." Time per step is now about 5s (an improvement over 90s, but still far from 200ms). I can check the docker image when I have some free time.

sunway513 commented 5 years ago

@witeko , do you still have questions on this issue?

witeko commented 5 years ago

@sunway513 I just created this issue so that you know something works incorrectly. :)

sunway513 commented 5 years ago

Thanks @witeko. May I know the source of the expected 200ms/step speed? Have you tried the docker image? It looks like the workload now runs on the GPU, instead of on the CPU as in the initial situation.

witeko commented 5 years ago

@sunway513 The source is https://www.tensorflow.org/beta/tutorials/text/text_classification_rnn. I've done many of the tutorials, and the speed on my computer has always been consistent with the saved outputs in the examples (not scientific proof, but a strong argument).

toaomalkster commented 4 years ago

I'm having a hard time running that tutorial in Google Colaboratory too (https://www.tensorflow.org/tutorials/text/text_classification_rnn). I'm getting about 2s/step instead of the 141ms/step shown in the saved output of the tutorial notebook. Like @witeko, I generally get performance comparable to the saved outputs in the tutorials, so this stands out as doing something odd.

Epoch 1/10
391/391 [==============================] - 952s 2s/step - loss: 0.6384 - accuracy: 0.6134 - val_loss: 0.5606 - val_accuracy: 0.7036

From what I can tell, I'm running on a TPU too.

toaomalkster commented 4 years ago

I've partially figured this out: in my case, my Colab notebook had somehow defaulted to a TPU runtime. But something about the model or dataset means it cannot run on the TPU, so it falls back to the CPU. When I switch the notebook to GPU, I get the documented speeds.
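A quick way to catch this silent-fallback situation before training is to check the runtime type from inside the notebook. With TensorFlow installed you would typically inspect `tf.config.list_physical_devices("GPU")`; the stdlib-only sketch below instead checks the `COLAB_TPU_ADDR` environment variable, which classic Colab TPU runtimes export (the variable name is an assumption about the Colab environment, not something from this thread):

```python
import os

def colab_tpu_active(env=os.environ):
    """Heuristic: classic Colab TPU runtimes export COLAB_TPU_ADDR.
    Returns True if the notebook appears to be on a TPU runtime."""
    return bool(env.get("COLAB_TPU_ADDR"))

if colab_tpu_active():
    # On a TPU runtime, Keras RNN models like this tutorial's may silently
    # fall back to slow CPU kernels instead of using an accelerator.
    print("TPU runtime detected; consider switching the runtime type to GPU.")
```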

This is even easier to see when training the with/without cuDNN example here: https://www.tensorflow.org/guide/keras/rnn. If the notebook is set up to use a TPU, there's no speed difference.