
Newly introduced vector-vector product bug #1240

Closed: jlowin closed this issue 11 years ago

jlowin commented 11 years ago

I'm experiencing what appears to be a serious bug. According to git bisect, the recent commit 14af439 is the culprit, merged in gh-1202.

The following code now gives the wrong result on my machine. It should print 30.0, but after this commit it prints 0.0.

import numpy as np

import theano
import theano.tensor as tt
v = tt.vector()
f = theano.function([v], tt.dot(v, v))
testval = np.arange(5).astype(theano.config.floatX)
print f(testval)  # expected 30.0; prints 0.0 after commit 14af439
jlowin commented 11 years ago

More details:

jlowin commented 11 years ago

This only appears to affect symbolic vectors. The nose tests don't use symbolic vectors; they convert numpy arrays to tensor variables, and I suspect that's why the tests pass.

For example, this code works when the symbolic version above fails:

import numpy as np
import theano
import theano.tensor as tt

testval = np.arange(5).astype(theano.config.floatX)
testvar = tt.as_tensor_variable(testval)
f_val = theano.function([], tt.dot(testvar, testvar))
print f_val()
nouiz commented 11 years ago

For me, both versions work.

Can you tell me which BLAS version you are using? Your blas.ldflags Theano flag? Your OS version? Your python/numpy/scipy versions?

Can you run your test with this line added, and give me its output:

theano.printing.debugprint(f, print_type=True)

I have this output:

>>> theano.printing.debugprint(f, print_type=True)
InplaceDimShuffle{} [@A] <TensorType(float32, scalar)> ''   2
 |CGemv{no_inplace} [@B] <TensorType(float32, (True,))> ''   1
   |TensorConstant{(1,) of 0.0} [@C] <TensorType(float32, (True,))>
   |TensorConstant{1.0} [@D] <TensorType(float32, scalar)>
   |InplaceDimShuffle{x,0} [@E] <TensorType(float32, row)> ''   0
   | |<TensorType(float32, vector)> [@F] <TensorType(float32, vector)>
   |<TensorType(float32, vector)> [@F] <TensorType(float32, vector)>
   |TensorConstant{0.0} [@G] <TensorType(float32, scalar)>
jlowin commented 11 years ago

Sure --

debug output:

InplaceDimShuffle{} [@A] <TensorType(float32, scalar)> ''   2
 |CGemv{no_inplace} [@B] <TensorType(float32, (True,))> ''   1
   |TensorConstant{(1,) of 0.0} [@C] <TensorType(float32, (True,))>
   |TensorConstant{1.0} [@D] <TensorType(float32, scalar)>
   |InplaceDimShuffle{x,0} [@E] <TensorType(float32, row)> ''   0
   | |<TensorType(float32, vector)> [@F] <TensorType(float32, vector)>
   |<TensorType(float32, vector)> [@F] <TensorType(float32, vector)>
   |TensorConstant{0.0} [@G] <TensorType(float32, scalar)>

OS: Mac OS X 10.8

Theano flags: floatX = float32

blas.ldflags: -lblas

numpy version: 1.8.0.dev-c932553, scipy version: 0.11.0. The BLAS is the built-in Apple Accelerate framework. Using otool to look at the linked file:

>>> otool -L /usr/local/lib/python2.7/site-packages/numpy/core/_dotblas.so
/usr/local/lib/python2.7/site-packages/numpy/core/_dotblas.so:
    /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

I'm going to see if it's possibly a numpy incompatibility, since I'm using a dev version, by uninstalling and reinstalling numpy/scipy/theano. I'll update once that's done.

lamblin commented 11 years ago

What is the error message in DebugMode? It might be informative.

jlowin commented 11 years ago

Good point -- this is the DebugMode error:

>>> v = tt.vector()
>>> f = theano.function([v], tt.dot(v, v), mode='DebugMode')
>>> testval = np.arange(5).astype(theano.config.floatX)
>>> print f(testval)
WARNING: ('Stride mismatch', ((1, 5), (1, 5), (20, 4), (4, 4), 'DimShuffle{1,0}'))
---------------------------------------------------------------------------
BadThunkOutput                            Traceback (most recent call last)
<ipython-input-1-e15c4b233b89> in <module>()
      2 f = theano.function([v], tt.dot(v, v), mode='DebugMode')
      3 testval = np.arange(5).astype(theano.config.floatX)
----> 4 print f(testval)

/Users/jlowin/git/Theano/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    578         t0_fn = time.time()
    579         try:
--> 580             outputs = self.fn()
    581         except Exception:
    582             if hasattr(self.fn, 'position_of_error'):

/Users/jlowin/git/Theano/theano/compile/debugmode.pyc in deco()
   2065                 TensorType.filter_checks_isfinite = self.maker.mode.check_isfinite
   2066                 try:
-> 2067                     return f()
   2068                 finally:
   2069                     # put back the filter_checks_isfinite

/Users/jlowin/git/Theano/theano/compile/debugmode.pyc in f()
   1944                                             thunk1='perform', val1=r_vals[r],
   1945                                             thunk2='c_code', val2=storage_map[r][0],
-> 1946                                             inputs_val=inputs_val)
   1947                             else:
   1948                                 #print >> sys.stderr, i, "DEBUGMODE storing reference output %x" % id(storage_map[r][0])

BadThunkOutput: BadThunkOutput
  variable    : CGemv{no_inplace}.0
  Outputs Type: TensorType(float32, (True,))
  Outputs Shape: (1,)
  Outputs Strides: (4,)
  Inputs Type : [TensorType(float32, (True,)), TensorType(float32, scalar), TensorType(float32, row), TensorType(float32, vector), TensorType(float32, scalar)]
  Inputs Shape: [(1,), (), (1, 5), (5,), ()]
  Inputs Strides: [(4,), (), (20, 4), (4,), ()]
  Apply   : CGemv{no_inplace}(TensorConstant{(1,) of 0.0}, TensorConstant{1.0}, InplaceDimShuffle{x,0}.0, <TensorType(float32, vector)>, TensorConstant{0.0})
  thunk1  : perform
  thunk2  : c_code
  val1    : [ 30.]
  val2    : [ 0.]
  op      : <class 'theano.tensor.blas_c.CGemv'>
  Value 1 : shape, dtype, strides, min, max, n_inf, n_nan: (1,) float32 (4,) 30.0 30.0 0 0
  Value 2 : shape, dtype, strides, min, max, n_inf, n_nan: (1,) float32 (4,) 0.0 0.0 0 0
  Max Abs Diff:  30.0
  Mean Abs Diff:  30.0
  Median Abs Diff:  30.0
  Std Abs Diff:  0.0
  Max Rel Diff:  1.0
  Mean Rel Diff:  1.0
  Median Rel Diff:  1.0
  Std Rel Diff:  0.0
jlowin commented 11 years ago

I returned to stable NumPy 1.7 but am still having this issue.

nouiz commented 11 years ago

So from my understanding, the problem is related to your BLAS library. To confirm this, can you test with this Theano flag?

blas.ldflags=

This forces Theano to use NumPy calls instead of calling BLAS directly, so it should work. Do you have another BLAS available? I see two possible causes: either your BLAS doesn't like how we call it and generates an error that we don't catch, or your BLAS library itself is buggy.

In any case, we will need to add your code as a test, to make sure this is detected when people run Theano's tests.
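
For reference, one way to set this flag persistently (a sketch; setting the THEANO_FLAGS environment variable also works) is via ~/.theanorc, in the same format as the config file shown later in this thread:

[global]
floatX = float32

[blas]
ldflags =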

jlowin commented 11 years ago

So the good news is that with that flag, it does work.

The bad news is that my BLAS is the standard one Apple ships with every Mac as part of the Accelerate framework, so I've never touched it.

lamblin commented 11 years ago

It may be because we are using BLAS calls with undefined behaviour that happen to work on the machines where we tested, but are in general incorrect. For instance, I'm not sure the values we pass as Sx1 and Sx0 are correct for a call to dot_, because they were adjusted to be acceptable for gemv_ in the case where x was a row or column vector. I'll try to get access to a machine that can reproduce the bug. Otherwise, could you help me test instrumented versions of the code, so we understand exactly what's happening?

jlowin commented 11 years ago

Of course, let me know anything you want to run. I will try to find some other Macs as well.

I should note that on the two machines I am testing on right now, check_blas.py runs about 20% slower with blas.ldflags=, so if it's possible to find a better solution, that's definitely preferable.

nouiz commented 11 years ago

It is normal that this flag slows things down. It means we don't call BLAS directly, but always go through the NumPy Python interface.

If you can tell us the exact values of the parameters passed to sdot_(), it would help, as @lamblin thinks this could be the issue.

jlowin commented 11 years ago

Sure -- I'm not too familiar with C, though. Maybe you could write a line here for me to paste into blas_c at the appropriate spot, and I'll tell you the result? Sorry for the inconvenience.

nouiz commented 11 years ago

What about this line?

You will also need to modify the c_code_cache_version method of the same op to return an empty tuple. This means it is always recompiled, so if needed you can change more than that line without bumping the version each time.

printf("?dot_ params: %%d %%p %%d %%p %%d\\n", &Nx1, (float*)(PyArray_DATA(%(xx)s)), &Sx1, (float*)yy_data, &Sy);

The doubled %% and \\ are needed because Python interpolates the line before it gets passed to g++.

lamblin commented 11 years ago

It would also be nice to have the actual PyArray_STRIDES(%(xx)s)[0] and [1], in addition to Sx0 and Sx1.

jlowin commented 11 years ago

This is the output, if it helps:

?dot_ params: 1430203228 0x7f8c5ad0e9f0 1430203220 0x7f8c5ad0e9f0 1430203212
?dot_ strides: 20 4
nouiz commented 11 years ago

I made an error in my line, can you use this one:

 printf("?dot_ params: %%d %%p %%d %%p %%d\\n", Nx1, (float*)(PyArray_DATA(%(xx)s)), Sx1, (float*)yy_data, Sy);

I got:

?dot_ params: 5 0x4025560 1 0x4025560 1
?dot_ strides: 20 4
jlowin commented 11 years ago

I have:

?dot_ params: 5 0x7fb38a4475a0 1 0x7fb38a4475a0 1
?dot_ strides: 20 4
nouiz commented 11 years ago

@lamblin do you see a problem with the function inputs? I don't. The pointers are aligned and the other parameters seem good.

Does one of you know how to check for BLAS errors?

lamblin commented 11 years ago

Regarding BLAS errors, I remember that you can declare a function with an appropriate name, and then this function will be used instead of the default BLAS error-checking one (which does not do much). I remember having tried to do that in the past, but it's quite blurry in my mind. Regarding the values, let me think about it for a while. The problem might be that the memory is shared between both arguments of dot_.

nouiz commented 11 years ago

@jlowin can you test with two different vectors? Does that fix this case?

jlowin commented 11 years ago

Unfortunately I still get 0.0 with this code:

import numpy as np

import theano
import theano.tensor as tt
v = tt.vector()
u = tt.vector()
f = theano.function([v, u], tt.dot(v, u))
testval = np.arange(5).astype(theano.config.floatX)
print f(testval, testval)
lamblin commented 11 years ago

The error handling function seems to be xerbla. You may have to define it as xerbla_. @jlowin Can you also try with something other than range(5)? Maybe range(3), range(7), range(12). Also, I think it would be worth testing with two different ndarrays, for instance f(testval, testval+1).
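
Overriding the handler lamblin mentions would look roughly like this (a hedged sketch: the reference-BLAS xerbla takes the routine name and the 1-based index of the bad argument, but the exact symbol and signature, including a possible hidden string-length argument, vary by Fortran compiler):

#include <stdio.h>

/* Defining our own xerbla_ shadows the library's no-op handler, so a
   bad argument is reported instead of silently ignored. Fortran strings
   are not null-terminated, hence the %.6s width limit. */
void xerbla_(char *srname, int *info)
{
    fprintf(stderr, "BLAS: parameter %d of routine %.6s is invalid\n",
            *info, srname);
}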

jlowin commented 11 years ago

@lamblin here are more test cases. I get 0.0 for all of them with floatX=='float32'.

import numpy as np

import theano
import theano.tensor as tt

v = tt.vector()
u = tt.vector()
f = theano.function([v, u], tt.dot(v, u))

for n in [3, 5, 7, 12]:
    testval1 = np.arange(n).astype(theano.config.floatX)
    testval2 = 10 + np.arange(n).astype(theano.config.floatX)
    print f(testval1, testval2)
lamblin commented 11 years ago

OK, thanks. I tried to reproduce your problem on a Mac, but I have not succeeded yet. What is in theano.config.blas.ldflags exactly? Did you change anything else in theano.config (besides floatX, I guess)?

jlowin commented 11 years ago

Hm, so strange. One of the machines I'm testing on is almost brand new; I can't think of anything that would have affected the BLAS, but I will keep looking for anything unusual. FWIW, I'm using Homebrew to set up my environment.

theano.config.blas.ldflags is simply '-lblas'

At the moment the only active flag in my .theanorc is floatX:

[global]
floatX = float32

(and there are a couple other flags, like device, that I've commented out while working on this problem)

nouiz commented 11 years ago

@lamblin how did you install python/numpy/blas? Can you make sure it uses Apple's BLAS?

I think that if you used EPD, it would work, as it uses another BLAS.

lamblin commented 11 years ago

@nouiz: It's not EPD, but I'm not sure how it was installed. theano.config.blas.ldflags is -lblas too, and I don't know how to check how it gets resolved. @jlowin: can you add the following printf statements, to help us understand where things go wrong? The first ones should be the same as before.

printf("?dot_ params: %%d %%p %%d %%p %%d\\n", Nx1, (float*)(PyArray_DATA(%(xx)s)), Sx1, (float*)yy_data, Sy);
printf("?dot_ strides: %%d %%d\\n", PyArray_STRIDES(%(xx)s)[0], PyArray_STRIDES(%(xx)s)[1]);
printf("?dot_ dims: %%d %%d\\n", PyArray_DIMS(%(xx)s)[0], PyArray_DIMS(%(xx)s)[1]);
printf("?dot_ alpha: %%f fbeta %%f dbeta\\n", alpha, fbeta, dbeta);
float dot_result = sdot_(&Nx1, (float*)(PyArray_DATA(%(xx)s)), &Sx1, (float*)yy_data, &Sy);
printf("?dot_ result: %%f\\n", dot_result);
lamblin commented 11 years ago

@nouiz: so, "otool -L" (the equivalent of ldd) gives, in particular:

/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib (compatibility version 1.0.0, current version 219.0.0)

which looks like Apple's BLAS, but it's probably an older one.

jlowin commented 11 years ago

@lamblin here's the output for the four test cases

?dot_ params: 3 0x7fac3e1c5800 1 0x7fac3e1d13c0 1
?dot_ strides: 12 4
?dot_ dims: 1 3
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
0.0

?dot_ params: 5 0x7fac3e1b66a0 1 0x7fac3e1b4400 1
?dot_ strides: 20 4
?dot_ dims: 1 5
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
0.0

?dot_ params: 7 0x7fac3e1b4fd0 1 0x7fac3e1b6c10 1
?dot_ strides: 28 4
?dot_ dims: 1 7
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
0.0

?dot_ params: 12 0x7fac3e09abf0 1 0x7fac3e1e7cf0 1
?dot_ strides: 48 4
?dot_ dims: 1 12
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
0.0
lamblin commented 11 years ago

OK, so the call to sdot_ itself returns 0.0, which is puzzling. I'll try to come up with a pure C file that you can compile and run, to see if it reproduces the problem.

In the meantime, can you add the following print statement? I just want to check that the actual data being used is not 0.

printf("?dot_ data[:3]: %%f %%f %%f\\n", ((float*)(PyArray_DATA(%(xx)s)))[0], ((float*)(PyArray_DATA(%(xx)s)))[1], ((float*)(PyArray_DATA(%(xx)s)))[2]);
jlowin commented 11 years ago

Sure:

?dot_ params: 3 0x7fc644500570 1 0x7fc644500c80 1
?dot_ strides: 12 4
?dot_ dims: 1 3
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ data[:3]: 10.000000 11.000000 12.000000
0.0

?dot_ params: 5 0x7fc644500c60 1 0x7fc644503430 1
?dot_ strides: 20 4
?dot_ dims: 1 5
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ data[:3]: 10.000000 11.000000 12.000000
0.0

?dot_ params: 7 0x7fc6445014f0 1 0x7fc644501460 1
?dot_ strides: 28 4
?dot_ dims: 1 7
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ data[:3]: 10.000000 11.000000 12.000000
0.0

?dot_ params: 12 0x7fc644500550 1 0x7fc644500000 1
?dot_ strides: 48 4
?dot_ dims: 1 12
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ data[:3]: 10.000000 11.000000 12.000000
0.0
lamblin commented 11 years ago

Sorry about the iterative debugging process; I did not realize that you still have x != y. Could you add this last print statement?

printf("?dot_ y data[:3]: %%f %%f %%f\\n", ((float*)yy_data)[0], ((float*)yy_data)[1], ((float*)yy_data)[2]);
jlowin commented 11 years ago

That's ok, I appreciate the help. Here's the output. It appears that the data is there.

?dot_ params: 3 0x7fc6ee01d2e0 1 0x7fc6ee0c4380 1
?dot_ strides: 12 4
?dot_ dims: 1 3
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ x data[:3]: 10.000000 11.000000 12.000000
?dot_ y data[:3]: 0.000000 1.000000 2.000000
0.0

?dot_ params: 5 0x7fc6ee442270 1 0x7fc6ee44b220 1
?dot_ strides: 20 4
?dot_ dims: 1 5
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ x data[:3]: 10.000000 11.000000 12.000000
?dot_ y data[:3]: 0.000000 1.000000 2.000000
0.0

?dot_ params: 7 0x7fc6ee32fc70 1 0x7fc6ee3d61d0 1
?dot_ strides: 28 4
?dot_ dims: 1 7
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ x data[:3]: 10.000000 11.000000 12.000000
?dot_ y data[:3]: 0.000000 1.000000 2.000000
0.0

?dot_ params: 12 0x7fc6ee443200 1 0x7fc6ee44b220 1
?dot_ strides: 48 4
?dot_ dims: 1 12
?dot_ alpha: 1.000000 fbeta 0.000000 dbeta
?dot_ result: 0.000000
?dot_ x data[:3]: 10.000000 11.000000 12.000000
?dot_ y data[:3]: 0.000000 1.000000 2.000000
0.0
lamblin commented 11 years ago

Thanks! So, there really seems to be something wrong with your sdot_ function.

Here is a small C program that you can compile with gcc blas_dot_test.c -o blas_dot_test -lblas. It is completely independent of Python and Theano. When I compile and run it, it prints r: 30.000000. Can you check whether it fails for you? If so, is there any message printed on stdout or stderr?

#include <stdio.h>
float sdot_(int*, float*, int*, float*, int*);

int main(int argc, char** argv)
{
    int Nx = 5;
    int Sx = 1;
    float x[5] = {0, 1, 2, 3, 4};
    float r = sdot_(&Nx, x, &Sx, x, &Sx);
    printf("r: %f\n", r);
    return 0;
}
lamblin commented 11 years ago

It may be related to http://stackoverflow.com/questions/6887229/problem-on-multiplying-a-matrix-and-a-vector-with-veclib-framework-of-mac-os-x-1. So maybe we should not use sdot_ on MacOS, or maybe there is a lib other than veclib that we should link against first.

jlowin commented 11 years ago

Thank you -- so this fails; it returns 0.000000. It looks like you're definitely right that there is a problem with sdot.

With these clues, I found other people experiencing something similar with sdot in recent versions of Mac OS. See for example https://github.com/mxcl/homebrew/issues/6649 and http://www.macresearch.org/lapackblas-fortran-106. This is all a little over my head, but here's one response in particular (from the second link) regarding Apple's implementation of SDOT:

The issue with SDOT on 64-bit Intel is a legitimate bug with the BLAS not properly obeying the g77/f2c calling conventions. Specifically, the g77/f2c convention is for single precision results to be returned as double precision, but the OS X BLAS is incorrectly returning the result in single precision. When that single precision result is interpreted as a double precision value by the calling routine, the result is a very tiny (denormal) double-precision value, which rounds to exactly zero when converted to single.

So I don't know whether that's right, but it seems to fit the symptoms. Maybe some unique circumstances make this affect only a small number of people. I will continue to see what I can learn.

lamblin commented 11 years ago

Thanks, that would make sense. The obvious solution is to stop using sdot_. If we want to keep using it, we'll have to figure out how to make it work with recent MacOS versions. Does it work if you replace the prototype at the top of the C file with the following?

double sdot_(int*, float*, int*, float*, int*);
jlowin commented 11 years ago

Sorry for the delay -- yes, declaring it as returning a double does work. Maybe that confirms the description of the problem above.
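
For what it's worth, the arithmetic of that failure mode can be reproduced without BLAS at all. Below is a minimal model (my illustration, assuming x86-64 with little-endian byte order, not Apple's or Theano's actual code) of why a double 30.0 read through a float prototype comes out as exactly 0.0: the low 32 bits of the IEEE-754 double 30.0, 0x403E000000000000, are all zero.

#include <stdio.h>
#include <string.h>

int main(void)
{
    double returned = 30.0;  /* what an f2c-convention sdot_ hands back */
    float seen;
    /* A caller whose prototype says float reads only the low 4 bytes
       of the returned value; memcpy models that. */
    memcpy(&seen, &returned, sizeof seen);
    printf("caller sees: %f\n", seen);  /* prints 0.000000 */
    return 0;
}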

lamblin commented 11 years ago

So, if I try the same thing (declaring sdot_ as returning a double) on Linux, I get a "symbol not defined" error. If I try it on the Mac I have access to (Mac OS X 10.6.8), I actually get 0.00... One solution would be to test which prototype should be used, at each run, by compiling and running code similar to the sample above. Simply reverting the commit in question would not be enough, because sdot_ is also used in other Ops, where it does not have an easy alternative. Another solution (if it works for you) would be to use sdsdot instead of sdot, and remove the prototype for sdot_ from the BLAS headers included in Theano.

lamblin commented 11 years ago

So, when you have time, can you try the following?

#include <stdio.h>
float sdsdot_(int*, float*, float*, int*, float*, int*);

int main(int argc, char** argv)
{
    int Nx = 5;
    int Sx = 1;
    float x[5] = {0, 1, 2, 3, 4};
    float z = 0;
    float r = sdsdot_(&Nx, &z, x, &Sx, x, &Sx);
    printf("z: %f, r: %f\n", z, r);
    return 0;
}

z should still be 0 at the end. Edit: corrected sdot_ -> sdsdot_.

jlowin commented 11 years ago

Sorry for the delay, I'm travelling today and tomorrow. This did not work (still 0). Also, if you're running 10.6.8, that supports the idea that this behaviour changed in Lion (10.7).

Based on the suggestions in the stackoverflow post you mentioned, as well as here, I tried calling the Accelerate framework's own sdot, and it did actually work, returning 30.0 (you need to add #include <Accelerate/Accelerate.h> and call cblas_sdot(Nx, x, Sx, x, Sx)). Now, I suppose that might be a Mac-only solution, and perhaps this isn't useful at all, but I'll mention it just in case.
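
As a standalone check, the Accelerate variant of lamblin's earlier test program would look like this (a sketch based on jlowin's description; cblas_sdot takes its arguments by value and returns its float result directly, sidestepping the Fortran return convention):

#include <stdio.h>
#include <Accelerate/Accelerate.h>

int main(int argc, char** argv)
{
    int Nx = 5;
    int Sx = 1;
    float x[5] = {0, 1, 2, 3, 4};
    /* CBLAS interface: arguments by value, result returned as float. */
    float r = cblas_sdot(Nx, x, Sx, x, Sx);
    printf("r: %f\n", r);  /* expected: 30.000000 */
    return 0;
}

jlowin confirms below that the same gcc ... -lblas link line was enough; no extra flags were needed.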

lamblin commented 11 years ago

That seems like a good solution to me. We could change the BLAS headers for Mac to #include <Accelerate/Accelerate.h> and redefine sdot_ to call cblas_sdot, instead of just having the forward declaration.
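
Such a shim might look roughly like this (a sketch of the idea only, not the actual Theano patch; whether this collides with declarations pulled in by Accelerate.h, or with the symbol exported by -lblas, would need checking):

#include <Accelerate/Accelerate.h>

/* Forward Fortran-style sdot_ calls to the CBLAS entry point, which
   returns its float result by value; static keeps this definition
   from clashing with the library's exported symbol. */
static float sdot_(int* n, float* x, int* sx, float* y, int* sy)
{
    return cblas_sdot(*n, x, *sx, y, *sy);
}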

lamblin commented 11 years ago

However, we only want to do that when the specified BLAS flags are exactly "-lblas", because other versions of BLAS don't have that problem, and we don't want to use Accelerate in those cases.

lamblin commented 11 years ago

Also, do you have to specify different linking flags to use the cblas interface, like -lcblas? That would make it more difficult to integrate cleanly.

jlowin commented 11 years ago

No -- I used the '-lblas' flag, exactly the same as your original compile instruction.

lamblin commented 11 years ago

@jlowin Cool, then I have a plan to solve the problem in an unobtrusive way. I'll start a development branch for that on Monday. Thanks for your help!

jlowin commented 11 years ago

Great to hear! Thanks for your help too!

lamblin commented 11 years ago

@jlowin: When you have time, can you check out the branch at gh-1250, and tell me if it correctly solves your problem? I'm still not able to test the fix here.

jlowin commented 11 years ago

@lamblin -- unfortunately it's not there yet (the assertion fails and kills my Python session). I'm tracing it as best I can -- it looks like the first code snippet compiles but gives the wrong answer (as expected), while the second code snippet fails to compile. Do we need to link to the Accelerate library explicitly?