NVIDIA / framework-reproducibility

Providing reproducibility in deep learning frameworks
Apache License 2.0

Determinism across GPU architectures? #28

Closed isarandi closed 3 years ago

isarandi commented 3 years ago

I manage to get deterministic results within the same GPU architecture, but not across architectures. Is this expected or is there something I can do to get the same results on all cards?

To be explicit: I get reproducible results across different computers, which is encouraging, but the results still appear to be GPU architecture-specific.

I'm using Ubuntu 18.04, NVIDIA driver 440.100, CUDA 10.0, cuDNN 7.6.1 and TensorFlow 1.15.
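
A minimal sketch of that kind of within-architecture-reproducible setup, assuming stock TensorFlow 1.15 plus the tfdeterminism patch from this repo; the seed value and surrounding training code are placeholders:

```python
# Rough sketch: single-stack reproducible setup for stock TensorFlow 1.15, using the
# tfdeterminism patch from this repo. The seed value and training code are placeholders.
import random

import numpy as np
import tensorflow as tf
from tfdeterminism import patch  # provided by the tensorflow-determinism package

patch()  # patches stock TF 1.14-2.0 so GPU ops run deterministically

SEED = 42  # placeholder seed
random.seed(SEED)
np.random.seed(SEED)
tf.compat.v1.set_random_seed(SEED)

# ... build the graph and train as usual; with identical inputs and an identical
# hardware-software stack (including GPU architecture), two runs should match bit-exactly.
```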

duncanriach commented 3 years ago

Thanks for checking-in, @isarandi. Yes, bit-exact reproducibility is not expected across GPU architectures. This is due to various factors, including different partitioning of the (massively parallelized) arithmetic workloads and potentially slightly different implementations of arithmetic operations (including the fusing of operations), all of which is done with the intention, and the effect, of increasing performance.

Note that you will also probably get a different result if you change the batch size, the number of GPUs in a multi-GPU configuration, or the multi-GPU library used (e.g. Horovod vs tf.distribute.MirroredStrategy). Further, although less likely, you may get a different result if you change the version of the NVIDIA driver, CUDA, cuDNN, the DL framework, or the multi-GPU library. If you run any significant operations on the CPU, then there is also the possibility of getting a different result if you move to a new CPU architecture.
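
Since any of these factors can change the numerics, one practical habit (just a sketch, not an API of this repo; the exact fields and the nvidia-smi query are assumptions) is to record a fingerprint of the stack alongside each run, so that results are only ever compared between runs with an identical fingerprint:

```python
# Illustrative sketch: record the hardware-software stack next to each run, so that
# results are only compared between runs with an identical fingerprint.
import json
import platform
import subprocess

import tensorflow as tf


def stack_fingerprint():
    """Collect the stack attributes that can change the numerics."""
    gpu_info = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=name,driver_version', '--format=csv,noheader']
    ).decode().strip()
    return {
        'gpu_and_driver': gpu_info,    # GPU model (architecture) and driver version
        'tensorflow': tf.__version__,  # framework version
        # CUDA/cuDNN build info is only exposed by newer TF versions
        'build_info': dict(tf.sysconfig.get_build_info()) if hasattr(tf.sysconfig, 'get_build_info') else 'n/a',
        'cpu': platform.processor(),   # relevant if significant ops run on the CPU
        'python': platform.python_version(),
    }


if __name__ == '__main__':
    print(json.dumps(stack_fingerprint(), indent=2, default=str))
```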

I'm going to close this issue now, but please feel free to continue the discussion.

isarandi commented 3 years ago

I understand. It's already a very useful step to have bit-exact reproducible results on the same setup, because in my experience there is a real chaos effect here: GPU nondeterminism alone can lead to surprisingly different test accuracy metrics after hundreds of thousands of updates, even when the entire input sequence is kept bitwise identical across two runs.
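
As an aside, a quick way to confirm the bit-exact claim for two runs is to compare the saved weights directly rather than downstream accuracy; a minimal sketch, with placeholder file names and assuming the weights were saved with np.savez:

```python
# Minimal sketch: verify that two runs were bit-exact by comparing their saved weights
# directly, rather than relying on downstream accuracy metrics. File names are placeholders.
import numpy as np

run_a = np.load('run_a_weights.npz')  # e.g. saved via np.savez(path, **{name: array})
run_b = np.load('run_b_weights.npz')

assert set(run_a.files) == set(run_b.files), 'the two runs saved different variables'
for name in run_a.files:
    # Compare raw bytes so the check is truly bit-exact (avoids float-equality quirks).
    assert run_a[name].tobytes() == run_b[name].tobytes(), f'mismatch in {name}'
print('bit-exact match across all saved variables')
```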

While it would be nice to have bitwise-reproducible results regardless of the GPU/CPU setup when releasing a model to the public, as you say that has never been the case even for CPU-based simulations, and the scientific significance of this kind of narrow reproducibility is not that great either.

I guess, based on your talk, the intended use case here is more about auditing, debugging, and root-cause analysis, where you have control over the environment and can go back to the same version of every piece of software when you do your investigation.

cbhushan commented 1 year ago

@duncanriach - You mentioned above that reproducibility is not expected across GPU architectures (even with the same software/library/driver versions).

Thanks for checking-in, @isarandi. Yes, bit-exact reproducibility is not expected across GPU architectures.

Has there been any change in this behavior since that comment, or with more recent versions of TensorFlow? I am asking because someone else mentioned that "...the models were reproducible across different gpus" with TF 2.8: https://github.com/NVIDIA/framework-determinism/issues/38#issuecomment-1018249657
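
For context, a minimal sketch of the determinism controls available in TF 2.8 and later; note that this only addresses run-to-run determinism on a fixed hardware-software stack, not reproducibility across GPU architectures:

```python
# Minimal sketch: run-to-run determinism with TF 2.8+ on a fixed hardware-software stack.
# This does not, by itself, make results match across GPU architectures.
import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds the Python, NumPy and TF RNGs
tf.config.experimental.enable_op_determinism()  # deterministic ops, or an error if unsupported

# ... build and train as usual; repeated runs on the same GPU/driver/CUDA/cuDNN stack
# with identical inputs should then be bit-identical.
```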

duncanriach commented 1 year ago

@cbhushan, closing the loop here based on the most recent interaction on #38. There has not been a change regarding between-stack-version reproducibility. As noted before, if anything changes in the hardware-software stack, then bit-exact reproducibility cannot be guaranteed. Changing hardware architecture is a significant change to the hardware-software stack.

cbhushan commented 1 year ago

Thanks!