NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Running on stock TensorFlow version >= 2.1 #22

Closed emiliocoutinho closed 4 years ago

emiliocoutinho commented 4 years ago

I am trying to run the framework on a nightly version of TensorFlow (2.2.0-dev20200428)

"Exception: tfdeterminism: No patch available for version 2.2.0-dev20200428 of TensorFlow"

Is there a workaround to run on this TensorFlow version?

duncanriach commented 4 years ago

There is currently no patch available for TensorFlow version 2.1 or later, because all previously patched functionality is included in stock TensorFlow from version 2.1 onwards. The workaround for your issue is to not apply tfdeterminism.patch. You should be able to achieve the determinism you require without patching, by setting TF_DETERMINISTIC_OPS=1 and ensuring you have the other elements of the recipe in place.
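As a rough illustration of the advice above, the version check could be sketched as follows. The helper names parse_major_minor and configure_determinism are hypothetical and are not part of tfdeterminism or TensorFlow. The only behavior taken from this thread is that stock TensorFlow 2.1 and later needs TF_DETERMINISTIC_OPS=1 rather than a patch.

```python
import os
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.2.0-dev20200428'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def configure_determinism(version, environ=os.environ):
    """Hypothetical helper, not a real tfdeterminism function.

    Sets TF_DETERMINISTIC_OPS=1 for TensorFlow >= 2.1 and returns True.
    Returns False for older versions, which would still need the patch.
    """
    if parse_major_minor(version) >= (2, 1):
        environ["TF_DETERMINISTIC_OPS"] = "1"
        return True
    return False


# Example with the nightly version from this issue, using a plain dict
# in place of the real environment.
fake_env = {}
patched_upstream = configure_determinism("2.2.0-dev20200428", fake_env)
```

In a real script one would pass tf.__version__ and leave environ at its default, so that the variable lands in os.environ before any ops are run.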

The next version (0.4.0) of the tfdeterminism package (to be called fwd9m) will include a function called enable_determinism that will apply a best effort to enable determinism for whichever version of TensorFlow you happen to be using (including future versions), so you will be able to "set it and forget it." For stock TensorFlow version 2.2, enable_determinism would (currently) set TF_DETERMINISTIC_OPS=1 and optionally also set seeds.

duncanriach commented 4 years ago

Closing this issue. Feel free to continue this discussion with me.

emiliocoutinho commented 4 years ago

@duncanriach, thanks for the feedback. Following your instructions, I am now using the following lines at the beginning of my code:

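import os
import random

import numpy as np
import tensorflow as tf

SEED = 123
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)  # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG
tf.random.set_seed(SEED)  # TensorFlow's global seed
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # request deterministic op implementations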

I am running this code on a CPU and on a GPU environment.

When I ran the same code twice on the CPU, I was able to reproduce the results. That's nice!

But when I ran the same code twice in a row in my GPU environment, I was not able to reproduce the results.
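One way to make "reproduce the results" precise is to compare a bit-exact fingerprint of the trained weights between runs. The sketch below is only an illustration: fingerprint is a hypothetical helper, and the buffers in the example are stand-ins for real weights.

```python
import hashlib
from array import array


def fingerprint(buffers):
    """Return a SHA-256 hex digest over a sequence of bytes-like objects.

    Each buffer's length is hashed before its contents, so that moving
    bytes from one buffer to the next changes the digest.
    """
    digest = hashlib.sha256()
    for buf in buffers:
        digest.update(len(buf).to_bytes(8, "little"))
        digest.update(buf)
    return digest.hexdigest()


# Stand-in data: two "runs" with identical float32 weights and one run
# that differs in a single value.
run_a = [array("f", [0.5, 1.5]).tobytes(), array("f", [2.0]).tobytes()]
run_b = [array("f", [0.5, 1.5]).tobytes(), array("f", [2.0]).tobytes()]
run_c = [array("f", [0.5, 1.5]).tobytes(), array("f", [2.25]).tobytes()]

same = fingerprint(run_a) == fingerprint(run_b)
different = fingerprint(run_a) != fingerprint(run_c)
```

With a Keras model, the buffers could come from something like (w.tobytes() for w in model.get_weights()). Printing the digest at the end of each run then shows at a glance whether two runs matched bit for bit.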

I was trying to track any source of randomness in my code, and I found 2 points:

I read in your publications that there are some other sources of randomness on GPU. Do you have any tips I could try to get reproducible results on GPU?

On the GPU environment, I have:

On the CPU environment, I have:

Best Regards.

emiliocoutinho commented 4 years ago

Dear @duncanriach

After reading the README.md from this project, I realized that you point out tf.image.resize_bilinear as a source of non-determinism. I am using tf.keras.layers.UpSampling2D, but with interpolation='nearest'.

Do you believe that PR 36243, which addresses interpolation='bilinear', could also solve the non-determinism problem with interpolation='nearest'?

I updated my TF version to 2.4.0.dev20200719, and the non-deterministic behavior still happens on GPU.

Thank you very much.

duncanriach commented 4 years ago

Yes, I'm almost certain that nearest neighbor resampling will be non-deterministic because (1) the algorithm is probably almost exactly the same as the one for bilinear and (2) resize_nearest_neighbor_op_gpu.cu.cc is listed in the section titled Other Possible GPU-Specific Sources of Non-Determinism. So, yes, this is an op that needs to have deterministic functionality added. I'll address this more in issue 24.
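The thread does not spell out the mechanism. A common cause of run-to-run differences in GPU kernels of this kind is that partial results are accumulated in an order that varies between runs, and floating-point addition is not associative. Assuming that is what happens here, the effect can be shown in plain Python, with no TensorFlow or GPU involved:

```python
def sum_in_order(values, order):
    """Add the values one at a time, following the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


# The same four numbers, accumulated in two different orders.
values = [1e16, 1.0, -1e16, 1.0]

# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost.
forward = sum_in_order(values, [0, 1, 2, 3])

# The large terms cancel first, so both 1.0 contributions survive.
reordered = sum_in_order(values, [0, 2, 1, 3])

print(forward, reordered)  # prints "1.0 2.0"
```

A deterministic implementation has to fix the accumulation order, which is presumably the kind of change the op would need.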