Flag for deterministic GPU operations.

lamblin commented 9 years ago

Some GPU operation use AtomicAdd for efficiency when accumulating numbers in a buffer, for summation operations for instance. The problem is that limited-precision floating-point arithmetic is not exactly associative, and AtomicAdd does not guarantee the order in which the operations will be done. This means that the results are not exactly reproducible numerically. Ops concerned by that would be:

[ ] AdvancedIncSubtensor1_dev20: uses AtomicAdd, no way to disable it, but AdvancedIncSubtensor1 (slow) is deterministic
[ ] CuDNN convolution gradients: some algorithms (the only ones present in old versions) use AtomicAdd for the gradients. deterministic, fft, and fft_tiling (if available), at least, should be deterministic.

We should have a global theano flag to select between the two behaviours, and maybe a finer-grained control at the Op level itself if feasible.

Allow faster, but non-deterministic version:

[ ] Gemm, Gemv, maybe other BLAS Ops: they do not use AtomicAdd for the moment, but could be more efficient if they did

TEMP WORK AROUND:

To have deterministic cudnn convolution, use those Theano flag: dnn.conv.algo_bwd_filter="deterministic" or "fft" and dnn.algo_bwd_data='deterministic' or 'fft or 'fft_tiling'.
To our information there is no deterministic 3d conv grad vs inputs and 3d conv grad vs filter.
To have deterministic AdvancedIncSubtensor1 on the GPU, apply those 2 diff:

diff --git a/theano/sandbox/cuda/opt.py b/theano/sandbox/cuda/opt.py
index 7a9b953..f9d60de 100644
--- a/theano/sandbox/cuda/opt.py
+++ b/theano/sandbox/cuda/opt.py
@@ -1121,8 +1121,8 @@ def local_gpu_advanced_incsubtensor1(node):
             compute_capability = device_properties(active_device_no)['major']
             if (compute_capability < 2 or
                 x.ndim != 2 or
-                y.ndim != 2):
-
+                y.ndim != 2 or
+                config.deterministic):
                 gpu_op = GpuAdvancedIncSubtensor1(
                     set_instead_of_inc=set_instead_of_inc)
             else:
@@ -1164,7 +1164,8 @@ def local_gpu_advanced_incsubtensor1(node):
             compute_capability = device_properties(active_device_no)['major']
             if (compute_capability < 2 or
                 x.ndim != 2 or
-                y.ndim != 2):
+                y.ndim != 2 or
+                config.deterministic):
                 gpu_op = GpuAdvancedIncSubtensor1(
                     set_instead_of_inc=set_instead_of_inc)
             else:

diff --git a/theano/sandbox/gpuarray/opt.py b/theano/sandbox/gpuarray/opt.py
index 92a115c..7332607 100644
--- a/theano/sandbox/gpuarray/opt.py
+++ b/theano/sandbox/gpuarray/opt.py
@@ -639,7 +639,8 @@ def local_gpua_advanced_incsubtensor(node, context_name):

     compute_capability = device_properties(active_device_no)['major']

-    if (compute_capability < 2 or x.ndim != 2 or y.ndim != 2):
+    if (compute_capability < 2 or x.ndim != 2 or y.ndim != 2 or
+            theano.config.deterministic):
         return GpuAdvancedIncSubtensor1(
             set_instead_of_inc=set_instead_of_inc)
     else:

nouiz commented 9 years ago

cuBLAS have a way to get some of its implementation faster by AtomicAdd, we only need to provide that interface, we don't need to implement the GPU code ourself.

pdoongarwal commented 8 years ago

@nouiz I would like to would like to work on it? How should i gain background for it?

nouiz commented 8 years ago

The first step would be to make a deterministic Theano flag in theano/configdefault.py. Then use it in the opt local_gpu_advanced_incsubtensor1 and local_gpua_advanced_incsubtensor to select which implementation (point 2 above) to use.

After that, we need a way to change point 3. For this, in the cudnn op, I would check the that algo must be deterministic if that flag is true.

point 1 is a possible future case to handle if we want the extra speed from it. So nothing to do about it now.

gyom commented 8 years ago

Created a pull request to resolve this issue.

https://github.com/Theano/Theano/pull/4209

ozancaglayan commented 8 years ago

Just for information,

I tested a GRU based NMT code with and without that AdvancedIncSubtensor1_dev20 optimization. Here is the mean minibatch update time for 520 updates on a Tesla K40, CUDA 6.5, without CUDNN:

In [19]: opt.mean()
Out[19]: 0.8737307692307692

In [20]: noopt.mean()
Out[20]: 0.88592307692307692

So I decided to go without the optimization since it doesn't seem to hurt the performance in my setup.

Thanks!

nouiz commented 8 years ago

thanks for the information.

On Wed, Mar 23, 2016 at 12:17 PM, Ozan Çağlayan notifications@github.com wrote:

Just for information,

I tested a GRU based NMT code with and without that AdvancedIncSubtensor1_dev20 optimization. Here is the mean minibatch update time for 520 updates on a Tesla K40, CUDA 6.5, without CUDNN:

In [19]: opt.mean() Out[19]: 0.8737307692307692

In [20]: noopt.mean() Out[20]: 0.88592307692307692

So I decided to go without the optimization since it doesn't seem to hurt the performance in my setup.

Thanks!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/Theano/Theano/issues/3029#issuecomment-200419289

ozancaglayan commented 8 years ago

I have another comment here.

I run my NMT system 4 times with the default Theano having the non-deterministic behaviour and 2 systems by patching the sandbox/cuda/opt.py to always use the AdvancedIncSubtensor1 op.

BLEU (translation evaluation metric, higher is better) scores for 4 systems with non-deterministic behaviour:

The deterministic systems gave a BLEU of 36.88. This is a pretty significant margin in terms of BLEU score. A BLEU gain of +1 generally deserves a new paper in these days. Since we have lots of data around, I don't think that people trains 5-10 systems to average the scores or to do significance tests. So I'm kindly offering to disable this GPU opt by default to also be in sync with the CPU behaviour. What do you think?

nouiz commented 8 years ago

Good point. I'm not sure what to do. In some cases, it gave pretty good speed up and the non-deterministic behavior wasn't a problem.

I think the good solution is to switch to a default deterministic behavior for your reason, but make it much faster then the current one. So that the speed difference isn't as big.

We are working on making a simple Theano flag to easily switch between those version to make it easy.

I would like to know other people thinking on which default we should use. I don't know when we will have the time to make a new version. Maybe we need to change it now until there is a better version.

On Thu, Mar 24, 2016 at 5:25 AM, Ozan Çağlayan notifications@github.com wrote:

I have another comment here.

I run my NMT system 4 times with the default Theano having the non-deterministic behaviour and 2 systems by patching the sandbox/cuda/opt.py to always use the AdvancedIncSubtensor1 op.

BLEU (translation evaluation metric, higher is better) scores for 4 systems with non-deterministic behaviour:

35.86 36.13 36.29 35.35

The deterministic systems gave a BLEU of 36.88. This is a pretty significant margin in terms of BLEU score. A BLEU gain of +1 generally deserves a new paper in these days. Since we have lots of data around, I don't think that people trains 5-10 systems to average the scores or to do significance tests. So I'm kindly offering to disable this GPU opt by default to also be in sync with the CPU behaviour. What do you think?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/Theano/Theano/issues/3029#issuecomment-200751700

ozancaglayan commented 8 years ago

Do we have an empirical evidence about the slowness of the non-optimized op? As I mentioned above, I see no difference at all in terms of minibatch update times. I can also compare the numbers on a GTX980 + CUDA 7.5 + CuDNNv4 setup if you'd like to.

ozancaglayan commented 8 years ago

Hi nouiz,

Do we have any updates regarding this?

nouiz commented 8 years ago

It was started here:

https://github.com/Theano/Theano/pull/4212

But with NIPS deadline, I needed to put it aside. But you can use that PR with the flag to just get rid of the non-deterministic. It is just that we don't have all the options we need.

nouiz commented 7 years ago

I don't think we have time for the 0.9 release, so I bump that to 0.10 release.

wulabs commented 7 years ago

Should it be deterministic across the same architecture/environment but different machines? Currently with these patches I am able to get same results for successive runs on either GPU or CPU on a single machine but not across different machines even with similar architectures/environments.

nouiz commented 7 years ago

The word "similar" is probably your problem.

Small difference of libraries like OpenBLAS vs mkl can change the numerical results due to how float are reprensented in hardware

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

On Wed, Dec 7, 2016 at 9:42 PM, wulabs notifications@github.com wrote:

Should it be deterministic across the same architecture/environment but different machines? Currently with these patches I am able to get same results for successive runs on either GPU or CPU on a single machine but not across different machines even with similar architectures/environments.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Theano/Theano/issues/3029#issuecomment-265567490, or mute the thread https://github.com/notifications/unsubscribe-auth/AALC-9W0k86CQ1R1i-h9eMUbWaa6PYAeks5rFxpTgaJpZM4FBhkq .

nouiz commented 7 years ago

Did you set numpy random number generator seed? Probably.

Is this in Python 2 or 3? Python 3 have more non-deterministic strucuture by default.

On Wed, Dec 7, 2016 at 4:35 PM, wulabs notifications@github.com wrote:

Just to be clear, we tested a number of OSX machines with core i7's on OSX 10.12.1 with nVidia GPUs (tested both CPU and GPU) with the same XCode, Cuda, CuDNN, Python, Keras, NumPy, Theano versions. Maybe there is a lower level library or two that is not the same.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Theano/Theano/issues/3029#issuecomment-265580734, or mute the thread https://github.com/notifications/unsubscribe-auth/AALC-_ys9QSMzkl5v56FUMoXJWF4kxQYks5rFya9gaJpZM4FBhkq .

wulabs commented 7 years ago

Yes, we are using np seed at the top before any Keras imports on Python 2.

We were able to achieve the same results across GPU on different machines with the latest Xcode8 CLT, Cuda 8.0.55, CUDNN 5105, Theano TOT, Keras TOT, and the theano dnn.conv params set to deterministic. For NumPy and all other Python frameworks we use the default from pip. The machines are the same architecture (Core i7, nvidia 650/750 GT) on OSX 10.12.2.

On CPU we are able to achieve same results across CPU on different machines with theano optimizer=None. It seems the Keras MaxPool layer is what caused the divergence.

nouiz commented 7 years ago

optimizer=None is a very bad practice. You should never use it. Try with optimizer=fast_compile.

nouiz commented 7 years ago

Update, now there is the flag deterministic=more that will do what the diff in this issue does.

Theano / Theano

Flag for deterministic GPU operations. #3029