NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Unidentified model not training reproducibly #3

Closed. invoxiaehu closed this issue 4 years ago.

invoxiaehu commented 4 years ago

or is it?

duncanriach commented 4 years ago

TensorFlow 1.5 is not supported by the patch. If you try to apply the patch to TensorFlow 1.5 then it will report an error.

Did you mean TensorFlow 1.15?

Assuming that you're referring to TF 1.15, I tried applying the patch to the GPU version of TF 1.15 in Google Colab and got the following message:

TensorFlow version 1.15.0 has been patched using tfdeterminism version 0.3.0

So, it's working to that extent, and the applied functionality has been thoroughly tested. Please share a colab notebook with me that demonstrates the problem you're seeing.
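For reference, applying the patch looks like this (a minimal sketch of the documented usage; the confirmation line above is what patch() prints):

import tensorflow as tf
from tfdeterminism import patch
patch()  # prints the "TensorFlow version ... has been patched using tfdeterminism ..." confirmation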

invoxiaehu commented 4 years ago

Yes, I meant 1.15.0, sorry. I also got the message confirming the patch. Then I set the seeds as recommended:

import os, random
import numpy as np
import tensorflow as tf
seed_value = 42  # example value; any fixed seed
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
tf.set_random_seed(seed_value)

and train with tf.keras using a Sequence. My first two training runs weren't reproducible. But if you say it works for you, I'll check again, and if it's confirmed I'll make a small demonstration notebook. Thank you for your code. Very useful.
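In outline, the training setup looks like the following minimal sketch (illustrative names and synthetic data, not the actual model), with the seeds above set first; shuffle=False and workers=0 keep batch order fixed and run the Sequence on the main thread, removing input-pipeline ordering as a variable:

import numpy as np
import tensorflow as tf

class ToySequence(tf.keras.utils.Sequence):
    """Serves fixed synthetic batches so the input order is identical every run."""
    def __init__(self, batch_size=32, num_batches=10):
        rng = np.random.RandomState(0)
        self.x = rng.rand(batch_size * num_batches, 16).astype(np.float32)
        self.y = rng.randint(0, 2, size=(batch_size * num_batches, 1)).astype(np.float32)
        self.batch_size = batch_size

    def __len__(self):
        return len(self.x) // self.batch_size

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(16,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit_generator(ToySequence(), epochs=2, shuffle=False, workers=0, verbose=2)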

invoxiaehu commented 4 years ago

Hi, unfortunately I couldn't make it work. I checked the same code on CPU and it's reproducible there, so it's not an issue with the seeds. Attached are the metrics for 20 iterations on both CPU and GPU; the GPU run is reproducible for a few steps and then diverges.

The code isn't easy to share right now because it uses generators and augmentations, but I'll try to make a basic version. [image: CPU vs. GPU training metrics over 20 iterations]
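A minimal, self-contained version of that check (illustrative model and data, not my actual code) would be to run the same tiny training twice with identical seeds and compare the recorded losses; any run-to-run divergence shows up as a mismatch:

import numpy as np
import tensorflow as tf

def train_once(seed=42):
    """Train a tiny model from a fresh graph and return its per-epoch losses."""
    tf.keras.backend.clear_session()  # start each run from a clean graph/session
    np.random.seed(seed)
    tf.set_random_seed(seed)
    x = np.random.RandomState(seed).rand(640, 16).astype(np.float32)
    y = (x.sum(axis=1, keepdims=True) > 8.0).astype(np.float32)
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(16,))])
    model.compile(optimizer='sgd', loss='binary_crossentropy')
    history = model.fit(x, y, batch_size=32, epochs=20, shuffle=False, verbose=0)
    return history.history['loss']

print(train_once() == train_once())  # True only if training is run-to-run identical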

duncanriach commented 4 years ago

Excellent. This has increased my awareness that there may be some holes in the automated integration testing; I'm addressing that. On the other hand, what you're seeing may be due to configuration (e.g. using XLA JIT) or model composition (e.g. using an op that has not yet been made deterministic).
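If XLA JIT is a suspect, a quick check (a sketch, assuming TF 1.15 in graph mode) is to confirm that auto-clustering isn't requested via the environment and that the session config leaves global JIT off:

import os
import tensorflow as tf

print(os.environ.get('TF_XLA_FLAGS'))  # e.g. '--tf_xla_auto_jit=2' would enable auto-clustering

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.OFF
tf.keras.backend.set_session(tf.Session(config=config))  # use this session for training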

Thanks for raising this issue, and I'm looking forward to seeing a simplified colab that demonstrates the non-determinism.

duncanriach commented 4 years ago

I've run more integration tests and not found an issue. In lieu of you providing a simplified test case, could you confirm that you're not using any ops based on these kernels and also that you're not doing bilinear filtering?
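As an example of how to check for the bilinear-filtering case (a sketch, not from this thread): tf.image.resize with the bilinear method appears in a TF 1.x graph as ResizeBilinear ops, and the corresponding backprop kernel was a known source of GPU non-determinism, so scanning the default graph after the model is built will show whether it's present:

import tensorflow as tf

bilinear_ops = [op.name for op in tf.get_default_graph().get_operations()
                if op.type in ('ResizeBilinear', 'ResizeBilinearGrad')]
print(bilinear_ops)  # non-empty means bilinear filtering is in the graph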

duncanriach commented 4 years ago

@invoxiaehu can you provide me with a simple-as-possible, self-contained example that demonstrates the non-determinism, so that it can be reproduced and debugged?

duncanriach commented 4 years ago

@invoxiaehu, please will you also check if you are using, either directly or indirectly, tf.nn.softmax_cross_entropy_with_logits or tf.nn.sparse_softmax_cross_entropy_with_logits? It's looking almost certain that these are injecting non-determinism (as tested in TF 2.0).
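Note that these ops are often pulled in indirectly. For example (an assumption about tf.keras internals, not something confirmed in this thread), tf.keras's sparse categorical cross-entropy loss is implemented on top of tf.nn.sparse_softmax_cross_entropy_with_logits, so a model compiled as below would use that kernel even though it never calls it directly:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(16,)),  # raw logits output
])
model.compile(optimizer='sgd',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])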

duncanriach commented 4 years ago

I'm closing this issue. Please feel free to re-open when you can provide code to reproduce the issue so that it can be debugged.

Possible causes include incorrect configuration, use of an op that is already known to be non-deterministic, or a new, as-yet-undiscovered source of non-determinism.