NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Tensorflow-determinism not working with conda tensorflow-gpu #5

Closed Eugen2525 closed 4 years ago

Eugen2525 commented 4 years ago

I have tested tensorflow-determinism on my Windows computer with tensorflow-gpu 1.14. tensorflow-determinism was installed using pip in Anaconda Prompt.

However, the output of the program is different on each run.

I ran identical code on another computer, where the only difference is that tensorflow-determinism was installed via pip in Windows CMD.

For reasons unknown to me, I cannot install TensorFlow via pip in Windows CMD on the first machine, and I think this is why it is not working.

If possible, I hope you can address this issue. Thanks

duncanriach commented 4 years ago

Hi,

I'm not sure that I fully understand what you're doing. Let me try to repeat it back to you and then you can tell me if this is correct or not.

You have a Python program (e.g. myprog.py), which uses TensorFlow, and which contains the following:

import tensorflow as tf
from tfdeterminism import patch
patch()
# Run on one GPU, set all the seeds, if using tf.data then only use one worker.
# Now do some training and print a summary of trainable variables at the end of training
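
A minimal sketch of what "set all the seeds" in the outline above could look like. The helper name set_all_seeds is hypothetical and is not part of tensorflow-determinism. NumPy and TensorFlow are seeded only if they can be imported, and the sketch assumes tf.set_random_seed for TF 1.x and tf.random.set_seed for TF 2.x.

```python
import os
import random


def set_all_seeds(seed=0):
    """Seed every random number generator that can be found (hypothetical helper)."""
    # PYTHONHASHSEED only affects interpreters started after this point;
    # hash randomization for the current process is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    seeded = ["random"]

    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import tensorflow as tf
        if hasattr(tf, "set_random_seed"):
            tf.set_random_seed(seed)   # TF 1.x graph-level seed
        else:
            tf.random.set_seed(seed)   # TF 2.x global seed
        seeded.append("tensorflow")
    except ImportError:
        pass

    return seeded
```

In TF 1.x the graph-level seed applies to the default graph, so a call like set_all_seeds(42) would go before any part of the model is built.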

You have two PCs, both running Windows. Let's call those two machines A and B.

On machine A, you do the following in Anaconda Prompt:

pip install tensorflow-gpu==1.14.0
pip install tensorflow-determinism
python myprog.py
python myprog.py

And you end up with different trainable variable summaries at the end of the two runs.

On machine B, you do exactly the same thing, but using Windows CMD rather than Anaconda Prompt. In that case you end up with the same trainable variable summaries at the end of the two runs?

Eugen2525 commented 4 years ago

Thanks for the prompt feedback.

I will continue from the machine A and B point.

On machine A, I do the following in Anaconda prompt:

conda install tensorflow-gpu==1.14.0
pip install tensorflow-determinism
python myprog.py

The outcome is not deterministic (it varies on each run). I could not install TensorFlow through CMD on machine A because the CUDA installation failed (I do not know why, even after spending almost two days trying to solve it). Hence I installed TensorFlow with conda, and tensorflow-determinism with pip, both inside Anaconda Prompt.

On machine B, I do the following in Windows CMD:

pip install tensorflow-gpu==1.14.0
pip install tensorflow-determinism
python myprog.py

The result is deterministic.

Machines have different CPU, RAM or GPU configurations (I did not check each, but I guess this should be irrelevant).

Hope it is more clear now.

AndreaPi commented 4 years ago

Hi @duncanriach, sorry for hijacking the thread, but I saw this comment from you that worried me:

# Run on one GPU, set all the seeds, if using tf.data then only use one worker.

You mention running on one GPU, so do you mean that tensorflow-determinism would not give deterministic behavior with more than one GPU? In my case I use Keras (with a TF backend) in an NGC TensorFlow container under Linux, so I don't actually use the patch() call, but I set the environment variable as described here. I perform multi-GPU training with Keras. Does this mean that I cannot expect deterministic behavior, even after setting the environment variables and all the seeds? Thanks!

duncanriach commented 4 years ago

@AndreaPi, the comment you quoted was specifically intended for debugging the OP's issue. I wanted to make sure that, for simplicity, we're dealing with his issue in a single-GPU environment.

Regarding deterministic multi-GPU training: once you have deterministic single-GPU training, multi-GPU training should be deterministic. I've only played with Horovod (and recommend it) for organizing the multi-GPU partial gradient reductions, and, as described here, you will need to disable Tensor Fusion (unless that issue has now been fixed) if training with more than two GPUs.
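
A rough sketch of one way Tensor Fusion might be disabled from inside the training script. It assumes that Horovod reads the HOROVOD_FUSION_THRESHOLD environment variable and that a threshold of 0 bytes turns fusion off; the thread itself does not name the mechanism, so check this against the Horovod documentation for the version in use.

```python
import os

# Assumption: a fusion threshold of 0 bytes disables Tensor Fusion.
# The variable has to be set before Horovod is initialized.
os.environ["HOROVOD_FUSION_THRESHOLD"] = "0"

# Hypothetical continuation, left commented out so the sketch runs without Horovod:
# import horovod.tensorflow as hvd
# hvd.init()
```

Setting the variable in the shell that launches the job would have the same effect, provided every worker process inherits it.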

duncanriach commented 4 years ago

@Eugen2525, on machine A, please share the output from both conda install tensorflow-gpu==1.14.0 and pip install tensorflow-determinism. Please also find the path to the python that is getting run. I don't know how to do that in Anaconda Prompt on Windows. It would be the equivalent of the result of entering which python in Linux.

I suspect that the TensorFlow and tensorflow-determinism packages are getting installed with different instances of python, so they can't both be imported at the same time. If this is true, then I'm surprised that your program runs at all on machine A.
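
One way to check this suspicion from inside Python, on any platform, is to print the interpreter path and the location each package would be imported from. This is only a sketch: describe_environment is a hypothetical helper, and it uses nothing beyond the standard library (sys.executable and importlib.util.find_spec). On Windows, running where python at the prompt should also list the executables found on PATH.

```python
import importlib.util
import sys


def describe_environment(packages=("tensorflow", "tfdeterminism")):
    """Report the running interpreter and where each package would be loaded from."""
    report = {"python": sys.executable, "prefix": sys.prefix}
    for name in packages:
        # find_spec locates a top-level package without importing it;
        # it returns None when the package is not visible to this interpreter.
        spec = importlib.util.find_spec(name)
        report[name] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

If the two package paths sit under different environment directories, or one of them is reported as None, the conda and pip installs are not sharing an interpreter.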

Also, at the Anaconda Prompt, enter the interactive python interpreter (by just running python, rather than running python myprog.py) and then manually enter the following, and show me what the responses are:

import tensorflow as tf
from tfdeterminism import patch
patch()

duncanriach commented 4 years ago

This issue seems to be related to mixing conda with pip installations. I don't think this is related to the tensorflow-determinism package in particular. Closing, but happy to reopen if requested.