NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Lack of reproducibility when using Huggingface transformers library (TensorFlow version) #14

Open dmitriydligach opened 4 years ago

dmitriydligach commented 4 years ago

Dear developers,

I included in my code all the steps listed in this repository but still could not achieve reproducibility either using TF 2.1 or TF 2.0. Here's the link to my code:

https://github.com/dmitriydligach/Thyme/blob/master/Keras/et.py

Please help.

MFreidank commented 4 years ago

@dmitriydligach Did you ever get this resolved?

dmitriydligach commented 4 years ago

@MFreidank Nope. I switched to PyTorch, which has a more reliable way to enforce determinism.

MFreidank commented 4 years ago

@dmitriydligach Just to verify: your code becomes fully reproducible with pytorch?

duncanriach commented 4 years ago

PyTorch has potentially different non-deterministic ops than TensorFlow, and no general mechanism, yet, to enable deterministic op functionality. Both PyTorch and TensorFlow now have the ability to enable deterministic cuDNN functionality.

This code may use an op that happens to be non-deterministic in TensorFlow but deterministic in PyTorch.

I'm hoping to look at this code in detail soon, hopefully today.
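For context, the setup this repository recommends for TF 2.1 boils down to roughly the following sketch (not the OP's code; the `try/except` is only there so the snippet degrades gracefully on machines without TensorFlow installed):

```python
import os
import random

import numpy as np

SEED = 42

# Environment variables must be set before TensorFlow is imported.
os.environ["PYTHONHASHSEED"] = str(SEED)
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # TF 2.1+ (later TF versions also
                                          # expose tf.config.experimental.enable_op_determinism())

# Seed every RNG the training run touches.
random.seed(SEED)
np.random.seed(SEED)

try:
    import tensorflow as tf
    tf.random.set_seed(SEED)
except ImportError:
    tf = None  # sketch still runs where TensorFlow isn't installed
```

Note that all of this only removes the *known* sources of non-determinism; an op with no deterministic implementation yet will still break run-to-run reproducibility.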

dmitriydligach commented 4 years ago

@MFreidank In most cases, I get the exact same results every time I run my PyTorch code (including loss and accuracy for each epoch). In some (relatively infrequent) cases, there's still a difference, but it's not nearly as large as in the case of TensorFlow.
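For anyone comparing the two frameworks: the PyTorch-side settings usually involved are roughly the sketch below (again hedged with a `try/except` only so it runs where PyTorch is absent). Even with these, a handful of CUDA ops remain non-deterministic, which is consistent with the occasional small differences described above.

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

try:
    import torch
    torch.manual_seed(SEED)                    # seeds CPU and all CUDA RNGs
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
except ImportError:
    torch = None  # sketch still runs where PyTorch isn't installed
```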

MFreidank commented 4 years ago

> PyTorch has potentially different non-deterministic ops than TensorFlow, and no general mechanism, yet, to enable deterministic op functionality. Both PyTorch and TensorFlow now have the ability to enable deterministic cuDNN functionality.
>
> This code may use an op that happens to be non-deterministic in TensorFlow but deterministic in PyTorch.
>
> I'm hoping to look at this code in detail soon, hopefully today.

@duncanriach Thanks for your blazingly fast response! :) I would still have an interest in resolving this issue in TF 2.2 and would highly appreciate it if you could help investigate.

A helpful starting point could be my colab example.

@dmitriydligach Thanks for those additional details, that sounds like there is still a slight non-determinism in pytorch as well, but it might not affect loss/accuracy as strongly. This is valuable information for me, thank you for sharing your experience :)

duncanriach commented 4 years ago

@dmitriydligach: I'm sorry that I didn't get to sorting this out for you in time to benefit from determinism in TensorFlow.

@MFreidank: I'll prioritize taking a look at these issues. They could share the same underlying cause, or there could be different sources. Often in these kinds of problems there is an issue with setup that is easy to resolve; I intend to add better step-by-step instructions to the README for that. Sometimes a known (and not-yet-fixed) non-deterministic op is being used, and sometimes there is a new discovery: an op that we didn't know was non-deterministic. We'll figure this out.

MFreidank commented 4 years ago

@duncanriach Thanks a lot for taking the time to look into this and for your encouragement. I feel much more confident about this now, knowing that someone with your experience will be having a look.

duncanriach commented 4 years ago

Hey @dmitriydligach, it looks like we have reproducibility on issue #19 (Huggingface Transformers BERT for TensorFlow); @MFreidank is confirming. Looking at your code, I don't see any reason for there to be non-determinism. I want to repro what you're seeing so that I can debug it. I have it running, but it looks like I have to specify DATA_ROOT and provide data there. Can you give me instructions to repro with the data you're using?

MFreidank commented 4 years ago

@duncanriach Non-reproducibility of the code of @dmitriydligach may be related to him training for multiple epochs, see my update on issue #19.
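One place multi-epoch runs commonly diverge is the shuffle order of the training data. A hypothetical sketch of making the per-epoch order deterministic (the helper `make_epoch_order` is mine, not from the code under discussion):

```python
import numpy as np

def make_epoch_order(n_examples, epoch, base_seed=42):
    """Return a deterministic shuffled index order for a given epoch.

    Deriving a distinct seed per epoch keeps the shuffle different
    across epochs but identical across runs.
    """
    rng = np.random.RandomState(base_seed + epoch)
    return rng.permutation(n_examples)
```

With `tf.data`, the analogous move is passing an explicit seed to the shuffle, e.g. `dataset.shuffle(buffer_size, seed=SEED, reshuffle_each_iteration=True)`; an unseeded shuffle gives a different order every run.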

dmitriydligach commented 4 years ago

@duncanriach Thank you very much for looking into this issue.

Unfortunately, I'm not able to provide the data (this is medical data that can only be distributed via a data use agreement). However, perhaps it would help you to know that the data consists of relatively short text fragments (max_len ~ 150 word pieces)...