Closed andreasnoack closed 11 years ago
Cc: @xianyi
Might be a stack issue. Wouldn't be surprised.
Is it an openblas bug, or something in our windows port?
Can't tell yet. I can give you a windows VM if you want to debug.
I should add that this seems to be limited to older Windows machines. The example is from Windows Server 2003.
Don't have a test machine for that yet. Will set up a couple of VMs for it on julia.mit.edu
I am able to get the same crash on a Windows Server 2008 but to do so I need a 65x65 matrix. I cannot crash Julia on Windows 8.
What's your compiler version? GCC 4.7?
Xianyi
+1. Running on Vista, lu(randn(33,33)) is ok but lu(randn(34,34)) breaks.
@xianyi Is there a way to get openblas to work reliably on Windows? Any specific compiler versions that you recommend?
@vtjnash Do you think we can build julia 0.1 with ATLAS as a backup, until some of these things are sorted out?
can someone tell me how to run the equivalent of lud(rand(33)) on the current version of julia? i'll bundle something that works when I make the 0.1 binaries for windows. however, this shouldn't be critical for the ubuntu release.
lufact(rand(33,33))
Hi @ViralBShah ,
We are in Chinese New Year holiday. I think we can address this issue next week.
Xianyi
Ok. Have fun! Let me know if I should file this as an issue on openblas.
I think we should ship the Windows version with Reference BLAS, if we can't get ATLAS working in the meanwhile, and until OpenBLAS can be stabilized.
@andreasnoackjensen We should probably add some of these Windows crashes as tests in test/linalg.jl once we resolve them.
Hi all,
I don't know why it calls csyr in lufact(rand(33,33)). I think the input is a double-precision real matrix.
I just uploaded a simple dgetrf sample as a gist (https://gist.github.com/xianyi/4771129). It works fine with the OpenBLAS develop branch (gcc-4.7) on my Win7 64-bit box.
Xianyi
No, it is not obvious why csyr is called. However, the problem again seems to be related to multithreading. If I set the number of threads to one, I don't get the error.
Hi @andreasnoackjensen ,
Is it 32 bit or 64 bit? Could you try OpenBLAS develop branch?
Could you try my dgetrf test https://gist.github.com/xianyi/4771129 ?
Thank you
Xianyi
Hi @xianyi,
It was on a Windows Server 2008 64 bit machine, but I don't know much about the Windows build of Julia. Therefore I cannot try a build with the develop branch. Maybe @loladiro and @vtjnash can help here. I'll see if I can run your example, but I don't have access to a Windows machine with privileges to install programs.
i added comments to xianyi's gist.
current workaround for julia may be to add export OPENBLAS_NUM_THREADS=1 to prepare-julia-env.bat
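For anyone who wants to try the single-thread workaround by hand, here is a minimal sketch, assuming a POSIX-style shell such as MSYS. The launch line is hypothetical and left commented out. Note that `export` is shell syntax; inside an actual `.bat` file the equivalent would be `set OPENBLAS_NUM_THREADS=1`.

```shell
# Limit OpenBLAS to a single thread for every process started from this shell.
export OPENBLAS_NUM_THREADS=1

# Confirm the setting before launching Julia.
echo "OPENBLAS_NUM_THREADS=${OPENBLAS_NUM_THREADS}"

# Hypothetical launch line; adjust the path to your own Julia install.
# ./julia
```

The variable has to be set before libopenblas is loaded, so it belongs in the environment of the process that starts Julia rather than in the Julia session itself.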
@xianyi I've narrowed this down to the stack being corrupted by the line in your gist: LAPACK_dgetrf(&N, &N, m, &LDA,ipiv, &info); somewhere in _zpotrf. the apparent stack trace is
Program received signal SIGSEGV, Segmentation fault.
0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
(gdb) bt
#0 0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#1 0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#2 0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#3 0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#4 0x6cc2c9f6 in zupmtr_ () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#5 0x6c4d6996 in libopenblas!DLANSB () from c:\users\jameson\desktop\julia-64966d6e8c\libopenblas.dll
#6 0x0028fdf0 in ?? ()
#7 0x004013fa in __tmainCRTStartup ()
#8 0x749033aa in KERNEL32!BaseCleanupAppcompatCacheSupport () from C:\Windows\syswow64\kernel32.dll
#9 0x0028ffd4 in ?? ()
#10 0x77149ef2 in ntdll!RtlpNtSetValueKey () from C:\Windows\system32\ntdll.dll
#11 0x7efde000 in ?? ()
#12 0x77149ec5 in ntdll!RtlpNtSetValueKey () from C:\Windows\system32\ntdll.dll
#13 0x004014e0 in WinMainCRTStartup ()
#14 0x7efde000 in ?? ()
#15 0x00000000 in ?? ()
0x6d7e9243 in zupmtr_ () from c:\users\jameson\desktop\julia-
(gdb) info reg
eax 0x3440 13376
ecx 0x92b80 600960
edx 0x8 8
ebx 0x8 8
esp 0xf6b74 0xf6b74
ebp 0xf6bb8 0xf6bb8
esi 0x28fdf0 2686448
edi 0xffffc000 -16384
eip 0x6d7e9243 0x6d7e9243 <zupmtr_+13701139>
eflags 0x10202 [ IF RF ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x53 83
gs 0x2b 43
Does this happen only in LU, or does it happen for other decompositions too?
I have tested the other factorizations and the problem seems to be for LU only. However, that includes the solution of a general linear system which also crashes Julia.
@vtjnash Let's set the number of threads to 1 on Windows if that will solve the immediate release issue.
@vtjnash ,
I also added the comment in my gist. You narrowed down this issue to dgetrf function. Do you include cblas.h and lapacke.h?
Xianyi
CBLAS does get linked into the openblas used by julia.
Bumping to post 0.1.
@xianyi Would it be possible to fix this in a few days? If so, we can build julia windows binaries with openblas now that we have released 0.1.
@zchothia Could you investigate this issue? Thank you.
Hi @vtjnash ,
I read your comments in my gist. However, when I built OpenBLAS on Linux and test_dgetrf on Windows, I didn't meet the SEGFAULT bug on Windows.
What's the i686-w64-mingw32-gcc version on Linux and gcc version on Windows?
Thank you
Xianyi
$ i686-w64-mingw32-gcc --version
i686-w64-mingw32-gcc (GCC) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
built with max OPENBLAS_NUM_THREADS of 80
tested with
$ /c/MinGW64/bin/gcc --version
gcc.exe (Built by MinGW-builds project) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
and
$ gcc --version
gcc.exe (GCC) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
and (which is really the same as the first one):
$ /c/MinGW64/bin/x86_64-w64-mingw32-gcc --version -m32
x86_64-w64-mingw32-gcc.exe (Built by MinGW-builds project) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Oh, and my machine is a VMware instance with 2 processors (sometimes 4) running on a Core i7 2620m with 4 processors (all x86_64 / 64-bit).
are any of these make flags for openblas potentially at fault (or insufficient)?
make CC="i686-w64-mingw32-gcc" FC="i686-w64-mingw32-gfortran" RANLIB="i686-w64-mingw32-ranlib" \
CFLAGS="-g" FFLAGS="-g -O2 " USE_THREAD=1 TARGET= DYNAMIC_ARCH=1 OSNAME=WINNT \
CROSS=1 BINARY=32
Your i686-w64-mingw32-gcc is version 4.6. Did you use gcc 4.6 on Windows? I remember that 4.6 and 4.7 have different calling conventions on Windows.
Xianyi
IIUC, it appears that only the calling convention of C++11 changed: http://gcc.gnu.org/gcc-4.7/changes.html. I tried all three compilers mentioned above (4.6.1-i386, 4.7.2-i386, 4.7.2-x86_64). I am putting together a virtual machine for more testing.
Hi @vtjnash ,
Please give me the access to the VM. I cannot reproduce this bug on my machine :(
Xianyi
@xianyi I haven't started it yet (I think I need to find my windows install disk). However, I just identified the problem as stack overflow. The default stack on windows is 1MB, increasing it to 16MB fixes the problem (-Wl,--stack,16777216). Any idea what a good size would be and why this was a problem? (default stack on linux is 8MB, IIRC)
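To make the number in that flag less magic, here is a small sketch that derives the byte count from a size in MiB and builds the linker option. The variable names are only illustrative, and the compile line in the final comment is hypothetical.

```shell
# Stack size in MiB; 16 is the value reported above to avoid the crash.
STACK_MB=16

# The linker's --stack option takes a byte count.
STACK_BYTES=$((STACK_MB * 1024 * 1024))

# -Wl,... forwards the option from the gcc driver to the linker.
STACK_LDFLAG="-Wl,--stack,${STACK_BYTES}"
echo "$STACK_LDFLAG"

# Hypothetical use when linking a test program:
#   i686-w64-mingw32-gcc test_dgetrf.c -o test_dgetrf.exe $STACK_LDFLAG ...
```

Changing STACK_MB to 8 gives the Linux-like default discussed here, which makes it easy to check which size is actually sufficient.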
Julia itself can use quite a bit of stack space; can we bump the default to 8MB on windows (if that's enough to fix this)?
16MB was enough to bump the max number of openblas threads up to somewhere between 10 and 60, then we run into some other segfault (which appears to be caused by a null pointer)
note: fixing #1971 converted this segfault into a julia stack overflow exception for OPENBLAS_NUM_THREADS<30 (or so) at which point it turns into a MemoryError (or an openblas/lapack crash?)
(edit: now named lufact(rand(33,33)))
julia> lud(rand(33))
ok! but (see also #1543)