ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Tensorflow Probability bayesian_neural_network hits NaNs #303

Open parallelo opened 5 years ago

parallelo commented 5 years ago

System information

Describe the current behavior

Using tensorflow-rocm, note the Loss: nan results:

Step:   0 Loss: 27.881 Accuracy: 0.109
Step: 100 Loss: nan Accuracy: 0.465
Step: 200 Loss: nan Accuracy: 0.286
Step: 300 Loss: nan Accuracy: 0.225
 ... Held-out nats: -2.302
Step: 400 Loss: nan Accuracy: 0.193
Step: 500 Loss: nan Accuracy: 0.175
Step: 600 Loss: nan Accuracy: 0.164
Step: 700 Loss: nan Accuracy: 0.156
 ... Held-out nats: -2.302
Step: 800 Loss: nan Accuracy: 0.149
Step: 900 Loss: nan Accuracy: 0.144
Step: 1000 Loss: nan Accuracy: 0.141

Describe the expected behavior

With tensorflow or tensorflow-gpu, no NaNs appear. For example:

Step:   0 Loss: 27.614 Accuracy: 0.078
Step: 100 Loss: 24.904 Accuracy: 0.514
Step: 200 Loss: 24.024 Accuracy: 0.699
Step: 300 Loss: 23.373 Accuracy: 0.776
 ... Held-out nats: -0.100
Step: 400 Loss: 22.675 Accuracy: 0.818
Step: 500 Loss: 22.166 Accuracy: 0.845
Step: 600 Loss: 21.402 Accuracy: 0.865
Step: 700 Loss: 20.810 Accuracy: 0.879
 ... Held-out nats: -0.066
Step: 800 Loss: 20.167 Accuracy: 0.890
Step: 900 Loss: 19.581 Accuracy: 0.899
Step: 1000 Loss: 18.984 Accuracy: 0.906
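To compare runs like the two logs above without eyeballing them, a small parser can report the first step at which the loss becomes NaN. This is only a sketch: the regular expression assumes the exact `Step: ... Loss: ... Accuracy: ...` line format shown in this issue, and `first_nan_step` is a hypothetical helper name, not part of TensorFlow or TensorFlow Probability.

```python
import math
import re

# Matches lines such as "Step: 100 Loss: nan Accuracy: 0.465".
# Assumption: the example keeps printing progress in this format.
STEP_RE = re.compile(r"Step:\s*(\d+)\s+Loss:\s*(\S+)\s+Accuracy:\s*(\S+)")


def first_nan_step(log_text):
    """Return the first step whose reported loss is NaN, or None if there is none."""
    for line in log_text.splitlines():
        match = STEP_RE.search(line)
        if match is None:
            continue  # skip lines like " ... Held-out nats: -2.302"
        step = int(match.group(1))
        loss = float(match.group(2))  # float("nan") parses to NaN
        if math.isnan(loss):
            return step
    return None
```

Fed the tensorflow-rocm log, this should return 100, the first logged step after step 0. Fed the tensorflow-gpu log, it should return None.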

After some further experiments on ROCm, serializing all kernels and copies appears to be a successful workaround: with the settings below, the results match other platforms.

export HCC_SERIALIZE_COPY=0x3
export HCC_SERIALIZE_KERNEL=0x3

This suggests some sort of synchronization issue inside tensorflow-rocm -- further triage work required.
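For further triage it may help to run the example back to back with and without the workaround. The sketch below builds the two environments and launches the script as a subprocess. The two variable names and the `0x3` values come from the exports above. The helper names `make_env` and `run_example` are hypothetical, as is the assumption that the script is in the current directory.

```python
import os
import subprocess
import sys

# Values taken from the exports above.
SERIALIZE_VARS = {
    "HCC_SERIALIZE_COPY": "0x3",
    "HCC_SERIALIZE_KERNEL": "0x3",
}


def make_env(serialize, base=None):
    """Copy the environment, with the serialization variables set or removed."""
    env = dict(os.environ if base is None else base)
    for name, value in SERIALIZE_VARS.items():
        if serialize:
            env[name] = value
        else:
            env.pop(name, None)  # ensure the workaround is really off
    return env


def run_example(serialize, script="./bayesian_neural_network.py"):
    """Run the example once and return its combined stdout/stderr as text."""
    result = subprocess.run(
        [sys.executable, script],
        env=make_env(serialize),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling `run_example(False)` and then `run_example(True)` from `probability/tensorflow_probability/examples` should give two logs to compare. Nothing here changes TensorFlow itself; it only controls the environment the process starts with.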

Code to reproduce the issue

Start the public ROCm TF docker image:

docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm2.0-tf1.12-python3-dev

Run the example:

pip3 install --user tensorflow-probability matplotlib
git clone https://github.com/tensorflow/probability.git
cd probability/tensorflow_probability/examples
python3 ./bayesian_neural_network.py

parallelo commented 5 years ago

Doesn't appear to be a TF regression -- hit the same loss=NaN results with the tensorflow-rocm whl packages for 1.12, 1.11, and 1.10.

dagamayank commented 5 years ago

Any update here?

parallelo commented 5 years ago

Nothing recent to report. Have been tracking down multiple other issues.

This was self-reported, so the priority has been lower than typical.

aserio commented 4 years ago

@parallelo What is the current state of this ticket?