Open parallelo opened 5 years ago
Doesn't appear to be a TF regression -- hit the same loss=NaN results with the tensorflow-rocm whl packages for 1.12, 1.11, and 1.10.
Any update here?
Nothing recent to report. Have been tracking down multiple other issues.
This was self-reported, so the priority has been lower than typical.
@parallelo What is the current state of this ticket?
System information
Describe the current behavior
Using tensorflow-rocm, note the Loss: nan results:

Describe the expected behavior
Using tensorflow or tensorflow-gpu, no NaNs. For example:

After some further experiments on ROCm... when we serialize all kernels & copies, that appears to be a successful workaround that matches other platforms. This suggests some sort of synchronization issue inside tensorflow-rocm -- further triage work required.

Code to reproduce the issue
Start the public ROCm TF docker image:
Run the example:
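The report does not say how "serialize all kernels & copies" was achieved. One generic mechanism for forcing synchronous behavior on ROCm, offered purely as a sketch, is HIP's blocking-launch environment variable; whether the author's workaround used this variable is an assumption:

```shell
# Sketch only: HIP_LAUNCH_BLOCKING=1 makes HIP kernel launches synchronous,
# analogous to CUDA_LAUNCH_BLOCKING. The report does not confirm this is the
# serialization mechanism that was actually used.
export HIP_LAUNCH_BLOCKING=1
echo "HIP_LAUNCH_BLOCKING=$HIP_LAUNCH_BLOCKING"
```

Setting the variable before starting the TensorFlow process is what matters; a synchronous run that no longer produces NaNs would support the synchronization-bug hypothesis above.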