
Theano was a Python library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as PyTensor: https://www.github.com/pymc-devs/pytensor

inplace dnn convolution seems dangerously broken #2607

Closed. f0k closed this issue 9 years ago.

f0k commented 9 years ago

I've recently noticed that all my experiments run slower (which I attributed to other processes running on our server) and that I couldn't reproduce results (which I attributed to said experiments being in an early stage with possibly badly chosen hyperparameters, such that earlier runs might just have been lucky). After a while, as I couldn't get any network to give reasonable predictions and ruled out anything else I've changed in between, I've looked into my git reflog for Theano and undid the last update. Lo and behold, everything was running three times faster and networks learned as they ought to.

Bisecting, I found that the offending commit is a0964ac, the merge commit for #2559, which introduced inplace operations for cuDNN convolutions along with the necessary optimizers. Profiling that commit and the one immediately before, I get:

at commit b1963ef (before #2559):

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  12.3%    12.3%       1.670s       4.09e-03s    408   199   GpuDnnConvGradW(GpuContiguous.0, GpuContiguous.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, ScalarFromTensor.0, ScalarFromTensor.0)
   9.2%    21.4%       1.245s       3.05e-03s    408    61   GpuDnnConv{workmem='small'}(CopyOnNegativeStrides.0, GpuContiguous.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'}.0)
   6.6%    28.0%       0.897s       2.20e-03s    408   154   GpuDnnPoolGrad(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, GpuDnnPoolDesc{ws=(3, 6), stride=(3, 6), mode='max', pad=(0, 0)}.0)
   6.4%    34.4%       0.869s       2.13e-03s    408   147   GpuDnnConvGradI(GpuContiguous.0, GpuContiguous.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, ScalarFromTensor.0, ScalarFromTensor.0)

at commit a0964ac (after #2559):

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  26.2%    26.2%      11.450s       2.84e-02s    403    76   HostFromGpu(GpuElemwise{Add}[(0, 1)].0)
  25.0%    51.3%      10.921s       2.71e-02s    403    78   GpuFromHost(HostFromGpu.0)
   3.7%    54.9%       1.596s       3.96e-03s    403   201   GpuDnnConvGradW{inplace=True}(CopyOnNegativeStrides.0, GpuContiguous.0, GpuAlloc{memset_0=True}.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='cross'}.0, Constant{1.0})
   3.6%    58.5%       1.554s       3.86e-03s    403    80   GpuDnnConv{workmem='small', inplace=True}(CopyOnNegativeStrides.0, GpuContiguous.0, GpuElemwise{Composite{(Abs(i0) + i1)}}[(0, 1)].0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'}.0, Constant{0.5})
   3.0%    61.5%       1.321s       3.28e-03s    403    77   GpuDnnConv{workmem='small', inplace=True}(CopyOnNegativeStrides.0, GpuContiguous.0, GpuElemwise{Add}[(0, 1)].0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'}.0, Constant{1.0})

That is, it is copying data from the device to the host just to copy it back to the device. Furthermore, the networks trained after that commit tend to output zeros only when computing validation or test data predictions, while training seems to progress normally (judging from the training error). This indicates that something is wrong with the computations; something not caught by the tests.

For lack of time I'll just continue working with b1963ef for now. I fear I cannot easily turn my code into a minimal working example, but I'll happily answer questions that aid in debugging this (enabling/disabling optimizations, printing the graph before/after optimization, etc).

nouiz commented 9 years ago

Thanks for all that info. Which version of cudnn do you use?


f0k commented 9 years ago

Which version of cudnn do you use?

Had cuDNN v2 RC2 first, then updated to cuDNN v2 RC3, but that didn't help. (The cuDNN version cannot lead Theano to introduce GpuFromHost(HostFromGpu) anyway, I guess? But that's only part of the problem.)

nouiz commented 9 years ago

Sure, but knowing it could still help with tracking down the wrong values. The extra transfer is much easier to fix than the wrong values.

How do you create your convolution? Via the CPU conv2d? dnn_conv? Direct use of the op? Via the meta conv optimizer?


f0k commented 9 years ago

Via the standard cpu theano.nnet.conv2d(), with fully specified input and kernel shape. Meta-optimization is not turned on, all optimizers are at their defaults. The only Theano flags in use are floatX=float32,device=gpu0,allow_gc=0. The GpuElemwise{Composite{(Abs(i0) + i1)}}[(0, 1)].0 in the graph is Lasagne's supposedly faster formulation of ReLUs (abs(x) + x), the GpuElemwise{Add}[(0, 1)].0 doesn't ring a bell for me.
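
For illustration only, here is a minimal sketch (not the actual experiment code; shapes, sizes, and names are made up) of how such a graph is typically built, with the CPU conv2d carrying fully specified shapes and the Lasagne-style rectifier that shows up as the Composite elemwise in the profile:

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

x = T.tensor4('x')  # input batch, float32
W = theano.shared(np.random.randn(32, 1, 8, 8).astype('float32'), name='W')

# CPU conv2d with fully specified input and kernel shapes; with device=gpu
# and cuDNN available, the optimizer replaces this with GpuDnnConv.
c = conv2d(x, W,
           image_shape=(64, 1, 115, 100),   # (batch, channels, rows, cols), made up
           filter_shape=(32, 1, 8, 8),
           border_mode='valid')

# Lasagne-style rectifier 0.5 * (x + abs(x)); the (x + abs(x)) part is what
# appears as GpuElemwise{Composite{(Abs(i0) + i1)}} in the optimized graph.
y = 0.5 * (c + abs(c))

f = theano.function([x], y)
```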

f0k commented 9 years ago

allow_gc=0 -- could this be the culprit? I've got experiments running so I cannot swap out the Theano version right now. But checking the diff, it seems that beta is always set to 1.0 and the mode is always set to CUDNN_RESULT_ACCUMULATE now? Is it possible that the output isn't properly initialized when inplace=False? And wouldn't that also incur a performance penalty when inplace=False, because the output needs to be explicitly zeroed beforehand and then the gemm calls inside cuDNN need to add to the output instead of just overwriting it? /edit: That could also explain my observation that with a function compiled just for the forward pass the network predictions are broken, while a function compiled for training seems to work correctly.
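
To make the concern concrete, here is a small NumPy sketch (an illustration of the semantics, not Theano code) of cuDNN's accumulate behaviour, where the forward convolution computes y = alpha * conv(x, w) + beta * y:

```python
import numpy as np

def dnn_conv_like(conv_result, out_buffer, alpha=1.0, beta=1.0):
    # With beta == 1 (CUDNN_RESULT_ACCUMULATE), whatever already sits in
    # out_buffer is added to the result; if the buffer holds stale memory
    # instead of zeros, the output is wrong.
    return alpha * conv_result + beta * out_buffer

conv = np.ones((2, 2), dtype='float32')
stale = np.full((2, 2), 7.0, dtype='float32')        # uninitialized buffer
print(dnn_conv_like(conv, stale))                    # wrong: 8.0 everywhere
print(dnn_conv_like(conv, np.zeros_like(conv)))      # correct: 1.0 everywhere
```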

f0k commented 9 years ago

And the duplicated transfer could be related to the changes starting here: https://github.com/Theano/Theano/pull/2559/files#diff-bdde79e1910cddcdda2215fb1f1cd7fcL648. @abergeron might be able to explain what's going on.

lamblin commented 9 years ago

If you have the time to run a test case in DebugMode, that may help pinpoint which optimization (possibly which op) is to blame.

nouiz commented 9 years ago

Why do you think that?


nouiz commented 9 years ago

Can you dump the theano function that causes the problem?

http://deeplearning.net/software/theano/library/compile/function.html#theano.compile.function.function_dump

We merged a change that fixes a crash related to the PR that introduced the problem you see. I do not understand why it would fix the problem you have, but if you can find the time to try again after giving us the dump, you could try to update Theano.
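
For reference, a minimal sketch of how such a dump is typically produced with the documented helper linked above (the expression here is a stand-in, not the reporter's model):

```python
import theano
import theano.tensor as T

x = T.matrix('x')
cost = T.nnet.sigmoid(x).sum()   # placeholder expression

# Same arguments as theano.function, with the dump filename first; writes a
# pickle containing everything needed to recompile the function elsewhere.
theano.function_dump('val_fn.dump', [x], cost)
```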


f0k commented 9 years ago

Can you dump the theano function that cause problem?

All right. Training function (computes training cost and performs updates): http://ofai.at/~jan.schlueter/downloads/train_fn.dump.gz (about 13 MiB). Validation function (just computes validation cost): http://ofai.at/~jan.schlueter/downloads/val_fn.dump.gz (about 12 MiB).

Running with allow_gc=1 does not fix any of the two problems (redundant transfer and wrong predictions).

We merged change that fix crash related to this PR that introduced the problem you see.

Updating to the latest master (4b27770) does not fix any of the two problems either.

I'm trying to summon @abergeron again; last time I did it in an edit and I don't know whether GitHub sends out a notification in that case. The truth must be in the diff, and I guess I'm on the right track in https://github.com/Theano/Theano/issues/2607#issuecomment-78591798 and https://github.com/Theano/Theano/issues/2607#issuecomment-78592326; I just don't feel enough at home in that code to track down the problem. I'm also still quite convinced that using CUDNN_RESULT_ACCUMULATE will pose a performance penalty in some cases.

nouiz commented 9 years ago

Thanks. I'm looking into it.


abergeron commented 9 years ago

I just want you to know that I've read your comments, but I was working on another issue until now.

nouiz commented 9 years ago

@f0k, I'm not able to reproduce the extra transfer with this version of Theano:

cc463fb921b3d593aaacd4dbb03d5aea48ff49e2

Do you remember with which version of Theano you had this problem the last time?

When I compile your functions, how can I call them to help me reproduce the bad-value problem?


f0k commented 9 years ago

Do you remember with which version of Theano you had this problem the last time?

It occurred from a0964ac up to at least 4b27770 (cc463fb wasn't up yet in my last comment). How is it possible that it doesn't include the duplicate transfer? Isn't the function compiled on my machine and then dumped? If not, then I'll probably have to try investigating closer here.

When I compile your function, how can I call them to help me reproduce the bad value problem.

Hmm, that will be a tad difficult because it includes an untrained network now. You can try calling the second function with two inputs of shape (64, 11500) and (64, 1), the first one with np.random.randn and the second with np.random.rand > .5 or so. It will output the validation cost, which isn't very helpful, I guess (sorry, I didn't know what the dumped functions would be used for). My suspicion is that it reuses variables without zeroing them, so calling it multiple times would give different outputs.

Would it help to do a debugprint of the graph or the compiled function on my machine?
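
For clarity, a sketch of the call pattern just described, assuming val_fn is the compiled validation function obtained from the dump (the name and the exact threshold are placeholders):

```python
import numpy as np

# Inputs of shape (64, 11500) and (64, 1), as described above.
X = np.random.randn(64, 11500).astype('float32')
t = (np.random.rand(64, 1) > 0.5).astype('float32')

cost1 = val_fn(X, t)
cost2 = val_fn(X, t)
# If an output buffer is reused without being zeroed, repeated calls on the
# very same inputs may return different values.
print(cost1, cost2)
```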

nouiz commented 9 years ago

The fix for the transfer is merged.

We tried to remove some GpuContiguous ops by being less strict on the strides. But it seems this isn't always safe with cuDNN. We hit cases we didn't expect that raised "not supported", and it could be that some cases aren't supported but cuDNN doesn't raise any error. So we went back to the safe behaviour of making sure the input is C-contiguous. This is also merged. But as you are currently the only one with the error, can you update Theano and test again?

thanks


f0k commented 9 years ago

The fix for the transfer is merged.

Confirmed, the redundant transfers are gone. Thanks, one problem less!

We tried to remove some GpuContiguous ops by being less strict on the strides. But it seems this isn't always safe with cuDNN.

This was not related to my problem, though. In b1963ef the cp_if_negative_strides was present already and worked correctly for me. It might nevertheless be a good idea to wait for cuDNN to mature a bit more (did you report the case which was not supported although it should have been?).

But as currently you are the only one with the error, can you update Theano and tests again?

The outputs for the validation function are still wrong: the validation error is three orders of magnitude larger than the training error, and it's instantly fine when I switch back to b1963ef. Furthermore, b1963ef is about 30% faster than the latest master. And this is not because of the gpu_contiguous, but because of the convolutions:

latest master:
  23.7%    23.7%      78.702s       2.83e-03s     C     27810        4   GpuDnnConv{workmem='small', inplace=False}
  19.5%    43.2%      64.861s       2.33e-03s     C     27810        4   GpuDnnConv{workmem='small', inplace=True}
   8.2%    51.5%      27.354s       9.84e-04s     C     27810        4   GpuElemwise{Add}[(0, 1)]

my favorite (b1963ef):
  24.6%    24.6%      57.012s       2.03e-03s     C     28136        4   GpuDnnConv{workmem='small'}
  15.4%    40.0%      35.659s       2.76e-03s     C     12902        2   GpuDnnConvGradW
   7.6%    47.7%      17.651s       7.72e-04s     C     22851        3   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]

I'd still assume that this is because of the inplace operations, as this is the most directly related change of #2559. @abergeron, @nouiz, can you please explain to me the following: From the documentation, I see that self.destroy_map in the cuDNN convolution ops indicates to Theano that the third input will be destroyed when computing the output, if inplace=True. inplace=True is only set by one of the optimizers, and used to add the result of the convolution to the third input. In all instantiations of the op with inplace=False, the third input is either set to something.zeros_like() or to gpu_alloc(_zero.clone(), shape1, ...), and then the destroy map is not set. What exactly happens in this case? Will the constant tensor of zeros be stored on the GPU and copied whenever the op runs, so cuDNN can add the convolution results to it? Where does that happen? Is it possible that something goes wrong there? And wouldn't it be more performant to let the op behave as before the change, i.e., let Theano allocate a result tensor once and then let cuDNN overwrite it whenever the op runs, if inplace=False?
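
For readers less familiar with the mechanism under discussion, here is a minimal, hypothetical sketch of how a Theano Op declares an in-place output via destroy_map (it is not the actual GpuDnnConv implementation):

```python
import theano.tensor as T
from theano.gof import Apply, Op

class AccumulateInto(Op):
    """Toy op computing a + out_buf; the in-place variant overwrites input 1."""
    __props__ = ('inplace',)

    def __init__(self, inplace=False):
        self.inplace = inplace
        if inplace:
            # Output 0 destroys (reuses) input 1: no other node in the graph
            # may rely on input 1's old value after this op has run.
            self.destroy_map = {0: [1]}

    def make_node(self, a, out_buf):
        a = T.as_tensor_variable(a)
        out_buf = T.as_tensor_variable(out_buf)
        return Apply(self, [a, out_buf], [out_buf.type()])

    def perform(self, node, inputs, output_storage):
        a, out_buf = inputs
        if not self.inplace:
            out_buf = out_buf.copy()   # keep the caller's buffer intact
        out_buf += a                   # accumulate into the (possibly copied) buffer
        output_storage[0][0] = out_buf
```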

abergeron commented 9 years ago

There is no constant tensor of zeros since the alloc operations are never constant folded.

Also, inside the op, in the inplace=False case the zeros (or whatever else is the output buffer) will be copied to a new output buffer (that is allocated inside the op) and this is used to run the convolution inplace.

Since the interface change, the old behavior is not always valid because the output buffer can be something other than zeros, so we can't just ignore it. This is because there is another optimization that tries to identify additions that happen after the convolution and combine them using the output buffer.

In the graph you dumped the convolutions are combined with one addition that follows so the timing comparison should reflect that too. However the timings you post don't seem to come from the same graph since it uses GpuDnnConv in one case and GpuDnnConvGradW in the other. The changes I made don't touch that (unless perhaps you are using the meta-optimizer and the timings changed).
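
To make the rewrite abergeron describes concrete, here is a simplified numerical illustration (plain NumPy, not the actual optimizer code): the graph conv(img, kern) + b is replaced by a single convolution op that accumulates into the buffer already holding b, so the op can no longer assume its output buffer is zeros:

```python
import numpy as np

conv_result = np.array([[1.0, 2.0], [3.0, 4.0]], dtype='float32')
b = np.full((2, 2), 0.5, dtype='float32')

unfused = conv_result + b        # original graph: conv(...) + b
out = b.copy()                   # output buffer preloaded with b
out = 1.0 * conv_result + out    # fused op: alpha * conv + existing buffer contents
assert np.allclose(unfused, out)
```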

nouiz commented 9 years ago

The profile probably does not include all convolutions. That could explain the change in the number of convolutions.

I'll try to run the valid function in debugmode to catch the wrong value.


f0k commented 9 years ago

There is no constant tensor of zeros since the alloc operations are never constant folded.

Okay, good. This also applies to zeros_like(), I assume?

Also, inside the op, in the inplace=False case the zeros (or whatever else is the output buffer) will be copied to a new output buffer (that is allocated inside the op) and this is used to run the convolution inplace.

And this copy is done every time the convolution is run? Or can something go wrong when calling the Op multiple times? (Based on my limited understanding, this is what might be happening in my case, and couldn't be caught in the tests because they run each function only once.)

Since the interface change, the old behavior is not always valid because the output buffer can be something other than zeros, so we can't just ignore it. This is because there is another optimization that tries to identify additions that happen after the convolution and combine them using the output buffer.

Ah, I see. I'm not sure I understand the purpose of that bit, though -- couldn't this optimizer just create a copy and use the inplace=True version on that? The interface for the inplace=False version could be reduced to GpuDnnConv(inplace=False)(img, kern, desc, alpha) then, so it doesn't take an output buffer element at all and can use CUDNN_RESULT_NO_ACCUMULATE again.

However the timings you post don't seem to come from the same graph since it uses GpuDnnConv in one case and GpuDnnConvGradW in the other.

The profile probably does not include all convolutions. That could explain the change in the number of convolutions.

Yes, I've just posted the top three operations for the two cases to illustrate the difference. With the current master, GpuDnnConv makes up 142 seconds of the running time (adding the inplace=True and inplace=False instances), but with the previous version it only accounted for 57 seconds. Both versions have GpuDnnConvGradW in there. And I'm currently not using the meta-optimizer.

In the graph you dumped the convolutions are combined with one addition that follows so the timing comparison should reflect that too.

I see, so that could explain part of the difference of the time spent in GpuDnnConv? It doesn't explain how the total runtime differs by 30%, though.

I'll try to run the valid function in debugmode to catch the wrong value.

I can also try some things tomorrow. I'll try to investigate how the prediction function behaves, that may be easier to assess than the validation cost function.

nouiz commented 9 years ago

The validation function passed DebugMode. Can you run the function that gives wrong answers with these flags:

device=gpu,floatX=float32,mode=DebugMode,DebugMode.patience=1,DebugMode.check_py=False


f0k commented 9 years ago

I'll try to investigate how the prediction function behaves, that may be easier to assess than the validation cost function.

All right, got it nailed down a bit now. I've computed predictions with a properly trained network once with Theano directly before #2559 (commit b1963ef) and once with the latest master (commit d2c39c4). Otherwise it's the same code, the same inputs, the same pretrained network (i.e., I'm just computing the predictions, not training it).

Please check the following download: http://ofai.at/~jan.schlueter/downloads/theano_issue_2607.tgz (24 MiB)

  • {good,latest}_predict_fn.dump: Dump of the prediction function. Expects inputs of (64, 11500) and (64, 1), returns two outputs of (64, 1)
  • {good,latest}_predictions.h5: Network outputs. One matrix, times, is just a vector of timestamps, nothing wrong with that. The other, outputs, is the network output. For b1963ef, everything looks good, but with the latest master, it just gives a vector of ones.
  • {good,latest}_profile.txt: The profile output. The latest master is a lot slower.

device=gpu,floatX=float32,mode=DebugMode,DebugMode.patience=1,DebugMode.check_py=False

Will do in a minute.

nouiz commented 9 years ago

I can't check it now, but here is a new hypothesis: @abergeron, is there a test for the inplace=False case? Maybe it is that case that has the problem. Normally it should not happen in a normal CNN model, I think.


f0k commented 9 years ago

Had to reduce the batchsize to not get a memory error, but now it found something:

Using gpu device 0: GeForce GTX 780 Ti
Traceback (most recent call last):
  [...]
  File "/raid/user/jan/install/Theano-git/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
  File "/raid/user/jan/install/Theano-git/theano/compile/debugmode.py", line 2151, in deco
    return f()
  File "/raid/user/jan/install/Theano-git/theano/compile/debugmode.py", line 2079, in f
    r_vals)
  File "/raid/user/jan/install/Theano-git/theano/compile/debugmode.py", line 958, in _find_bad_optimizations0
    new_graph=new_graph_str)
theano.compile.debugmode.BadOptimization: BadOptimization Error 
  Variable: id 140046519237840 GpuDnnConv{workmem='small', inplace=False}.0
  Op GpuDnnConv{workmem='small', inplace=False}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'}.0, Constant{0.5})
  Value Type: <type 'CudaNdarray'>
  Old Value shape, dtype, strides: (8, 32, 108, 95) float32 (328320, 10260, 95, 1)

  Old Value:  <CudaNdarray object at 0x7f5f1a487af0>
  New Value shape, dtype, strides: (8, 32, 108, 95) float32 (328320, 10260, 95, 1)

  New Value:  <CudaNdarray object at 0x7f5f1a487870>
  Max Abs Diff:  10.1399
  Mean Abs Diff:  1.35135
  Median Abs Diff:  0.831596
  Std Abs Diff:  1.51666
  Max Rel Diff:  1.0
  Mean Rel Diff:  0.566683
  Median Rel Diff:  0.613413
  Std Rel Diff:  0.433405

  Reason:  local_dnn_conv_alpha_merge
  Old Graph:
  GpuElemwise{mul,no_inplace} [@A] ''   
   |GpuFromHost [@B] ''   
   | |TensorConstant{(1, 1, 1, 1) of 0.5} [@C]
   |GpuFromHost [@D] ''   
     |HostFromGpu [@E] ''   
       |GpuDnnConv{workmem='small', inplace=False} [@F] ''   
         |GpuContiguous [@G] ''   
         | |GpuReshape{4} [@H] ''   
         |GpuContiguous [@I] ''   
         | |GpuFromHost [@J] ''   
         |GpuContiguous [@K] ''   
         | |GpuElemwise{add,no_inplace} [@L] ''   
         |GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'} [@M] ''   
         | |MakeVector [@N] ''   
         | |MakeVector [@O] ''   
         |Constant{1.0} [@P]

  New Graph:
  GpuDnnConv{workmem='small', inplace=False} [@A] ''   
   |GpuContiguous [@B] ''   
   | |GpuReshape{4} [@C] ''   
   |   |GpuFromHost [@D] ''   
   |   | |HostFromGpu [@E] 'melspect'   
   |   |   |GpuFromHost [@F] ''   
   |   |TensorConstant{[  8   1 115  -1]} [@G]
   |GpuContiguous [@H] ''   
   | |GpuFromHost [@I] ''   
   |   |HostFromGpu [@J] ''   
   |     |W [@K]
   |GpuContiguous [@L] ''   
   | |GpuElemwise{add,no_inplace} [@M] ''   
   |   |GpuFromHost [@N] ''   
   |   | |HostFromGpu [@O] ''   
   |   |   |GpuElemwise{Abs,no_inplace} [@P] ''   
   |   |GpuFromHost [@Q] ''   
   |     |HostFromGpu [@R] ''   
   |       |GpuContiguous [@S] ''   
   |GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv'} [@T] ''   
   | |MakeVector [@U] ''   
   | | |TensorConstant{8} [@V]
   | | |TensorConstant{1} [@W]
   | | |TensorConstant{115} [@X]
   | | |Elemwise{floor_div,no_inplace} [@Y] ''   
   | |   |Elemwise{mul,no_inplace} [@Z] ''   
   | |   | |Shape_i{0} [@BA] ''   
   | |   | |Shape_i{1} [@BB] ''   
   | |   |TensorConstant{920} [@BC]
   | |MakeVector [@BD] ''   
   |   |Shape_i{0} [@BE] ''   
   |   | |W [@K]
   |   |Shape_i{1} [@BF] ''   
   |   | |W [@K]
   |   |Shape_i{2} [@BG] ''   
   |   | |W [@K]
   |   |Shape_i{3} [@BH] ''   
   |     |W [@K]
   |ScalarFromTensor [@BI] ''   
     |Elemwise{mul,no_inplace} [@BJ] ''   
       |DimShuffle{} [@BK] ''   
       | |TensorConstant{(1, 1, 1, 1) of 0.5} [@BL]
       |TensorFromScalar [@BM] ''   
         |Constant{1.0} [@BN]

Hint: relax the tolerance by setting tensor.cmp_sloppy=1
  or even tensor.cmp_sloppy=2 for less-strict comparison

So something's wrong with the local_dnn_conv_alpha_merge.

/edit: As a side note, the multiplication with 0.5 probably stems from Lasagne's formulation of rectified linear units as 0.5 * (x + abs(x)), where x would be the convolution result (based on anecdotal evidence that this is faster than T.maximum(0, x)). Maybe this got transformed to 0.5 * x + abs(0.5 * x) and is then alpha-merged.

f0k commented 9 years ago

And regarding the performance difference, are the optimizations executed late enough so it's sure there are no other clients in the graph that need the convolution results? Maybe some convolutions are computed twice now because they cannot be merged any more with a different alpha or an inplace version? (Just throwing in some ideas.)

abergeron commented 9 years ago

@nouiz I have no idea what you are asking me.

As far as I know the inplace optimizations are done late, so that shouldn't impact mergeability. The alpha and output merge could cause multiple convolutions to not be merged, though.

Also I think I have an idea of what the problem is. The alpha_merge optimization is broken since it's converting essentially alpha * (conv(...) + b) into (alpha * conv(...)) + b. So we would need to distribute the alpha properly for the results to be good. I'll get a PR to do that.
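
A simplified numerical check of the bug described above (not the optimizer code itself): the broken alpha_merge turned alpha * (conv + b) into alpha * conv + b instead of distributing alpha over both terms:

```python
import numpy as np

alpha = np.float32(0.5)
conv = np.float32(3.0)   # stand-in for the convolution result
b = np.float32(2.0)      # stand-in for the term added after the convolution

correct = alpha * (conv + b)           # 2.5
broken = alpha * conv + b              # 3.5: what the faulty rewrite computed
fixed = alpha * conv + alpha * b       # 2.5: alpha distributed, as in the fix
assert np.isclose(correct, fixed) and not np.isclose(correct, broken)
```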

abergeron commented 9 years ago

I've also discovered that the gradient is broken for alpha != 1.0 (which is not the usual case in the gradient step, but still).

abergeron commented 9 years ago

I did my own tests, but if you can confirm that #2621 fixes the wrong output problem it would help.

This may or may not have an impact on the execution speed, I didn't check.

abergeron commented 9 years ago

Actually the alpha and output merge could duplicate convolutions in the graph if they have more than one client. This is something I don't check for and could disallow. But I'll do that another day, I'm too tired now.

f0k commented 9 years ago

All right, the wrong results are fixed with #2621 now, but profiling again, it's still a bit slower than b1963ef (my previous timings were influenced by other processes running on the GPU): 2.33 seconds vs. 2.20 seconds for a small forward pass test, and very consistently so. Disabling the optimizers with optimizer_excluding=local_dnn_conv_inplace,local_dnn_convgw_inplace,local_dnn_convgi_inplace,local_dnn_conv_alpha_merge,local_dnn_convw_alpha_merge,local_dnn_convi_alpha_merge,local_dnn_conv_output_merge,local_dnn_convw_output_merge,local_dnn_convi_output_merge doesn't help. Is there anything else I could try to disable, or is there any other reason for why it could be 6% slower than before?

f0k commented 9 years ago

Is there anything else I could try to disable, or is there any other reason for why it could be 6% slower than before?

OK, found it. Will send a PR after lunch.

f0k commented 9 years ago

OK, found it. Will send a PR after lunch.

No, it was a red herring. I thought the (now unnecessary) zeros_like() in dnn.py were the culprit, but they didn't matter because the respective ops were converted to inplace ops anyway. Still, the difference in the profile is fully accounted for by 490 GpuAlloc{memset_0=True} calls (one for each call of the compiled prediction function). Is it possible that the former code was better suited to THEANO_FLAGS=allow_gc=0? Is there a way to make the new code as fast as the old one? Why is there no version of GpuAlloc that leaves the memory uninitialized?

abergeron commented 9 years ago

Can you try with #2629? It may help a bit.

f0k commented 9 years ago

Can you try with #2629? It may help a bit.

I doubt it, since I've already tested with all new optimizations disabled, and the only timing difference in my latest profile was caused by the GpuAlloc ops. I guess that because of the GpuAlloc{memset_0=True}, Theano will reallocate and zero the output for GpuDnnConv{inplace=True} even if beta=0, while before your change it would allocate the output once and leave it in place when allow_gc=0. Is that correct? The increase in runtime is not that large (3% in my case), but keep in mind that this will also increase the energy usage of the entire Theano user base that uses CNNs by the same amount (or make them run fewer experiments). Think green :)

f0k commented 9 years ago

Is that correct?

Reading the code myself, I see that it will not reallocate the output each time, but it will unnecessarily zero it each time before applying GpuDnnConv. Before #2559, the output was only allocated and then overwritten by the convolution, without zeroing it before. I've created issue #2632 as a possible solution.

nouiz commented 9 years ago

I think your analysis is right.


nouiz commented 9 years ago

This issue is fixed, and for the speed there is another open issue (#2632), so I am closing this one.