NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Why tensorflow-determinism and not simply determinism? #4

Closed kirk86 closed 4 years ago

kirk86 commented 4 years ago

I'm wondering why this is specific to TensorFlow. If it's related to NVIDIA atomic operations, shouldn't it work for any library that uses NVIDIA CUDA as its underlying mechanism?

duncanriach commented 4 years ago

Great question. Thank you.

First of all, there are many possible sources of true randomness when using GPUs (and non-GPUs). That said, the most common source on CUDA-running GPUs is a particular way of using CUDA atomics, which converts the truly asynchronous operation of one thread block relative to another into truly random floating-point rounding errors.
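To make the rounding-error point concrete, here is a minimal CPU-only sketch in plain Python (no CUDA involved). It assumes that each thread block contributes one partial result and that atomic adds can land in any order. The `accumulate` helper and the `partials` values are made up for illustration. Because floating-point addition is not associative, the same set of values can produce different totals depending on arrival order.

```python
from itertools import permutations


def accumulate(values):
    """Add values one at a time, in the given order, like successive atomic adds."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical partial results, one per thread block.
partials = [1e16, 1.0, -1e16, 1.0]

# Each permutation stands in for one possible arrival order of the atomic adds.
distinct_sums = sorted({accumulate(order) for order in permutations(partials)})

# More than one distinct total means the result depends on the order alone.
print(distinct_sums)
```

On a GPU the arrival order is decided by scheduling rather than by a loop, so the total can differ from run to run even though the inputs are identical.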

Secondly, this repo and this project already do, to some degree, address non-determinism in other frameworks and in relation to compute devices other than CUDA-running GPUs.

I think your question might be why the repo is called tensorflow-determinism and why the code in it is targeted at TensorFlow. The reason is that, for now, this project is focusing on addressing non-determinism on GPUs in TensorFlow. The patch that is provided addresses a TensorFlow op that is implemented in the TensorFlow code-base, using CUDA, in such a way that it currently introduces true randomness. So the patch has to be TensorFlow-specific because it changes TensorFlow code. Version 2.1 of TensorFlow will not require this patch because the changes will be implemented in the open source code. When the underlying CUDA kernel code for that op (in the stock TensorFlow GitHub repo) has been enhanced to support determinism, then that too will, of course, be TensorFlow-specific.
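As a rough sketch of what that looks like from the user's side, the snippet below picks between the two approaches based on a TensorFlow version string. Several things here are assumptions not stated above: the `enable_determinism` helper is hypothetical, `from tfdeterminism import patch` is the patch entry point as I recall it from this repo's README, and `TF_DETERMINISTIC_OPS` is the environment variable that, to my knowledge, TensorFlow 2.1 reads to select deterministic op implementations.

```python
import os


def enable_determinism(tf_version):
    """Hypothetical helper: choose a determinism mechanism from a version string."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    if (major, minor) >= (2, 1):
        # Assumption: TensorFlow 2.1+ honors this environment variable.
        os.environ["TF_DETERMINISTIC_OPS"] = "1"
        return "env-var"
    # Older versions: apply the TensorFlow-specific patch from this repo.
    # Assumption: the tensorflow-determinism distribution is installed.
    from tfdeterminism import patch
    patch()
    return "patch"
```

A call such as `enable_determinism("2.1.0")` would go near the top of a training script, before any ops are built. Either way the mechanism is tied to TensorFlow, which is the point made above.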

The non-determinism debug tool that will be released in this repo is also TensorFlow-specific, since it uses TensorFlow ops and is designed to be used with TensorFlow. Future versions might be released that work with other frameworks. However, since each framework has a different API and code-base, each must be addressed independently.

Other frameworks have varying support for determinism, and we've been working to support them in achieving it. This repo and these tools may be enhanced to support other frameworks in the future, but that's not the focus right now. In some instances, people working on other frameworks have benefited, and can continue to benefit, from the currently available information and examples presented in, and for, TensorFlow, including in this repo and its associated video and slides.

kirk86 commented 4 years ago

Great answer. Thanks!

duncanriach commented 4 years ago

This question has been answered and there is nothing more to do here. Closing.

duncanriach commented 4 years ago

Update: The repo name has now been changed to framework-determinism. In the next release, version 0.4.0, the distribution name will also be changed to framework-determinism and the package name will be changed to fwd9m. TensorFlow-related functionality will then be accessed via fwd9m.tensorflow.