Closed drupol closed 2 months ago
Strange. Is this with a recent version of OpenBLAS ? The only obvious source of a floating point exception in the gemm_driver code that I can think of was fixed years ago. Also, what is your cpu type please ?
I just added some information, I really hope this helps. Feel free to let me know if I can do something else.
Thanks, so basically a quad-core Haswell cpu without hyperthreading. The one patch should be irrelevant to the problem.
Do you happen to know if the same code worked with earlier versions of OpenBLAS ? (There was one fairly recent change, #4655, which maybe should simply not be applied when the number of threads is so small, but right now it is far from obvious to me that the computations done there would cause a floating point exception)
Do you happen to know if the same code worked with earlier versions of OpenBLAS ?
No, I don't know at all. However, I can change the library in Nix and see. Do you recommend a particular version?
(There was one fairly recent change, #4655, which maybe should simply not be applied when the number of threads is so small, but right now it is far from obvious to me that the computations done there would cause a floating point exception)
That was added on the 18th of April; the current version we have in Nix, 0.3.27, is from the 4th of April.
Do you want me to cherry-pick that patch and test?
Regarding the CPU I use, it has Hyper-Threading capabilities; here is more information from dmidecode:
Oh, right, I was one (upcoming) release ahead again, sorry... the changes immediately preceding that should only affect OpenMP builds, and going back a full year ago, #3843 (appearing in 0.3.24) should only matter for DYNAMIC_ARCH builds. Perhaps it would already help to use an OpenBLAS built with DEBUG=1 to get line number information in the backtrace. (Looks like I misread your screenshot and landed on the GPU line - Xeon E3-1200v3 may have been the server version of the i7-47xx. In any case it is Haswell which makes this all the more strange - together with all its refreshes, I'd think this must be the most widespread cpu in use with OpenBLAS, and a full-blown crash in GEMM should have come up much earlier. Unless it is some aggressive compiler omitting a conditional or something)
I will build it with DEBUG=1 and keep this thread up to date as soon as I have more information.
Thanks!
I compiled it with DEBUG=1; it took a while because it's the first time I've done that...
Here is further info:
From what it looks like, it would be an Arithmetic exception (division by zero) from here: https://github.com/OpenMathLib/OpenBLAS/blob/1ba1b9c357fb7d5916a56a6c388b4aea47aad395/common_x86_64.h#L216
Thank you - can you go one step "up" there please so that it is clear which of the calls to blas_quickdivide() it is (frustrating that this is not printed in the backtrace already) ? (Though obviously none of them should have a y argument that evaluates to zero, and the x looks a bit suspicious too - maybe this is a stack trashed by something going wrong elsewhere in OpenBLAS or the calling program)
Here's a more complete backtrace:
Thank you - this remains strange, it appears there is a weird corner condition where the loop that distributes the workload onto threads does an extra iteration although the number of available threads has already reached zero. I did not think this was possible, and somehow nobody managed to hit this situation before. The fix would be to add if (num_parts == nthreads_m) break; after the num_parts++, but there may be more to this story - looks like I'll have to learn ollama and open-webui
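To make the corner case concrete, here is a minimal sketch of a work-splitting loop with the proposed guard. It is not the OpenBLAS source: the function name `partition_rows`, the `granule` parameter, the rounding step, and the handling of leftover rows are all assumptions made for illustration. Only the `num_parts == nthreads_m` check comes from the comment above.

```c
#include <assert.h>

/*
 * Hypothetical, simplified stand-in for a loop that splits m rows across
 * nthreads_m threads. Names and the rounding rule are assumed, not taken
 * from OpenBLAS. range must have room for nthreads_m + 1 entries.
 * Returns the number of parts actually created.
 */
static int partition_rows(long m, int nthreads_m, long granule, long *range)
{
    int num_parts = 0;

    assert(m >= 0 && nthreads_m > 0 && granule > 0);
    range[0] = 0;

    while (m > 0) {
        long remaining = nthreads_m - num_parts;       /* divisor used below */
        long width = (m + remaining - 1) / remaining;  /* ceil(m / remaining) */

        /* Assumed rounding step: trim to a multiple of granule. This can
           leave rows unassigned after the last thread has been served. */
        if (width > granule) width -= width % granule;

        m -= width;
        range[num_parts + 1] = range[num_parts] + width;
        num_parts++;

        /* Guard from the thread: without it, the next pass would compute
           remaining == 0 and divide by zero. */
        if (num_parts == nthreads_m) break;
    }

    /* Sketch-only choice: hand any leftover rows to the last part. */
    range[num_parts] += m;
    return num_parts;
}
```

With `m = 11`, `nthreads_m = 2` and `granule = 4`, both parts are trimmed to 4 rows, so 3 rows are still unassigned when the thread count is exhausted. In this sketch, removing the guard would make the third pass divide by zero, which is the kind of arithmetic exception seen in the backtrace.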
We can make a live demo at some point if you want to get started.
Different angle - do you know approximately how big your arrays are ? (Background: Will array indexes still fit into a 32bit int, or would you need 64bit integers (build option INTERFACE64), also will the default size of the buffer for exchanging submatrix data between threads be sufficient (build option BUFFERSIZE, less likely to cause problems unless the nix build uses a smaller value than the default "25"))
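A rough way to answer the 32-bit question is to check whether the largest linear element index of a matrix still fits in a signed 32-bit integer. The helper below is a hypothetical sketch, not part of OpenBLAS, and it assumes that `rows * cols` is the quantity that has to fit; a real call also has leading dimensions and strides to consider.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if a rows x cols matrix can be indexed
 * with signed 32-bit integers, 0 if 64-bit integers would be needed.
 * The product is never formed directly, so the check itself cannot overflow.
 */
static int fits_in_int32(int64_t rows, int64_t cols)
{
    if (rows < 0 || cols < 0) return 0;
    if (rows > INT32_MAX || cols > INT32_MAX) return 0;
    if (rows == 0 || cols == 0) return 1;
    return rows <= INT32_MAX / cols;
}
```

For example, a 4096 x 4096 matrix passes this check, while a 50000 x 50000 matrix does not.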
Let's see...
In the derivation file that produces openblas, it is defined as such:
makeFlags = mkMakeFlagsFromConfig (config // {
FC = "${stdenv.cc.targetPrefix}gfortran";
CC = "${stdenv.cc.targetPrefix}${if stdenv.cc.isClang then "clang" else "cc"}";
PREFIX = placeholder "out";
OPENBLAS_INCLUDE_DIR = "${placeholder "dev"}/include";
NUM_THREADS = 64;
INTERFACE64 = blas64;
There's no trace of BUFFERSIZE anywhere. However, INTERFACE64 depends on blas64, which is initialized as null by default:
# Most packages depending on openblas expect integer width to match
# pointer width, but some expect to use 32-bit integers always
# (for compatibility with reference BLAS).
, blas64 ? null
With further exploration, I found the parameters and information that were used to build the openblas library on my system:
"blas64": "1"
"makeFlags": "BINARY=64 CC=cc CROSS=0 DYNAMIC_ARCH=1 FC=gfortran HOSTCC=cc INTERFACE64=1 MAKE_NB_JOBS=0 NO_AVX512=1 NO_BINARY_MODE= NO_SHARED=0 NO_STATIC=1 NUM_THREADS=64 OPENBLAS_INCLUDE_DIR=/02qcpld1y6xhs5gz9bchpxaw0xdhmsp5dv88lh25r2ss44kh8dxz/include PREFIX=/1rz4g4znpzjwh1xymhjpm42vipw92pr73vdgl6xs1hycac8kf2n9 TARGET=ATHLON USE_OPENMP=1 DEBUG=1",
I was able to raise NUM_THREADS to 128, just for testing, but the issue is still there, as expected.
However, I was not able to raise BUFFERSIZE to pretty much anything, because of:
DTBSV PASSED THE COMPUTATIONAL TESTS ( 1153 CALLS)
DTPSV PASSED THE TESTS OF ERROR-EXITS
DTPSV PASSED THE COMPUTATIONAL TESTS ( 289 CALLS)
DGER PASSED THE TESTS OF ERROR-EXITS
DGER PASSED THE COMPUTATIONAL TESTS ( 484 CALLS)
DSYR PASSED THE TESTS OF ERROR-EXITS
DSYR PASSED THE COMPUTATIONAL TESTS ( 145 CALLS)
DSPR PASSED THE TESTS OF ERROR-EXITS
DSPR PASSED THE COMPUTATIONAL TESTS ( 145 CALLS)
DSYR2 PASSED THE TESTS OF ERROR-EXITS
******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
EXPECTED RESULT COMPUTED RESULT
1 0.604173 0.254176
THESE ARE THE RESULTS FOR COLUMN 1
******* DSYR2 FAILED ON CALL NUMBER:
18: DSYR2 ('U', 1, 1.0, X, 2, Y, 2, A, 2) .
DSPR2 PASSED THE TESTS OF ERROR-EXITS
******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
EXPECTED RESULT COMPUTED RESULT
1 0.604173 0.254176
THESE ARE THE RESULTS FOR COLUMN 1
******* DSPR2 FAILED ON CALL NUMBER:
18: DSPR2 ('U', 1, 1.0, X, 2, Y, 2, AP) .
END OF TESTS
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
malloc(): invalid size (unsorted)
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7ffff7a51f3f in ???
#1 0x7ffff7aa1efc in __pthread_kill_implementation
#2 0x7ffff7a51e95 in raise
#3 0x7ffff7a3a934 in abort
#4 0x7ffff7a3b7e5 in __libc_message_impl.cold
#5 0x7ffff7aabbd4 in malloc_printerr
#6 0x7ffff7aaed8b in _int_malloc
#7 0x7ffff7ab0089 in __GI___libc_malloc
#8 0x7ffff7ab72ce in __GI___strndup
#9 0x7ffff7e7313f in data_transfer_init
#10 0x4085eb in cmvch_
#11 0x40a89e in cchk6_
#12 0x41941f in MAIN__
#13 0x40806e in main
/nix/store/agkxax48k35wdmkhmmija2i2sxg8i7ny-bash-5.2p26/bin/bash: line 1: 65568 Aborted (core dumped) OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
make[1]: *** [Makefile:112: level2] Error 134
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/build/source/test'
make: *** [Makefile:167: tests] Error 2
Do you think I should change something else?
Oops, sorry, this looks suspiciously like the effective buffersize value going negative (overflowing into the sign bit). Suspect an "L" or better yet "UL" suffix got dropped in common_x86_64.h like
--- a/common_x86_64.h
+++ b/common_x86_64.h
@@ -253,7 +253,7 @@ static __inline unsigned int blas_quickdivide(unsigned int x, unsigned int y){
#ifndef BUFFERSIZE
#define BUFFER_SIZE (32 << 22)
#else
-#define BUFFER_SIZE (32 << BUFFERSIZE)
+#define BUFFER_SIZE (32UL << BUFFERSIZE)
#endif
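The effect of the missing suffix can be sketched by doing the same shift in 32 and in 64 bits. The function names below are hypothetical. Unsigned types are used on purpose, because overflowing a signed `int` shift is undefined behaviour in C, so the sketch only models the wrap-around rather than reproducing the original macro.

```c
#include <stdint.h>

/*
 * 32 << shift with the arithmetic done in 32 bits. The result wraps
 * modulo 2^32. Shifts of 32 or more are undefined in C, so they are
 * mapped to 0 here, which is also the wrapped mathematical result.
 */
static uint32_t buffer_size_32bit(unsigned shift)
{
    return shift < 32 ? (uint32_t)32 << shift : 0;
}

/*
 * The same shift carried out in 64 bits, as a UL suffix would do on a
 * platform where unsigned long is 64 bits wide. For shift >= 59 the
 * result modulo 2^64 is 0.
 */
static uint64_t buffer_size_64bit(unsigned shift)
{
    return shift < 59 ? (uint64_t)32 << shift : 0;
}
```

In this model a shift of 26 already yields 2^31, which has the sign bit set when read as a signed 32-bit value, and a shift of 27 wraps to 0, while the 64-bit version keeps the full value.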
The patch seems to work, at least I don't have the error (I created a PR to facilitate the use of the patch on my server at https://github.com/OpenMathLib/OpenBLAS/pull/4769).
However, it has now been running for more than 90 minutes with the following values:
NUM_THREADS=128
BUFFERSIZE=35
Compiling OpenBLAS usually takes 8 minutes on that machine, so maybe those changes are too much. Do you recommend some other values to test?
The corresponding Nix code (for the record) is:
I gave the code another look and noticed that BUFFERSIZE=35 was way too large. Therefore, I've reduced it to BUFFERSIZE=2 and gave it another run... but there's still an issue:
I also tested with BUFFERSIZE=3 and got the same result.
Please stay closer to the default value of 25 - this is a left shift on a binary number (for "historic" reasons) so 35 probably corresponds to some terabyte value while 3 is far too small to be useful.
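For a sense of scale, the shift can be turned into a human-readable size. The helper below is a hypothetical sketch that assumes the size is exactly `32 << BUFFERSIZE` bytes, as in the macro quoted earlier in the thread; the function name and the unit table are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes 32 << shift as a power-of-two size string
 * (for example "128 MiB") into out. Since 32 == 2^5, the size is
 * 2^(shift + 5) bytes.
 */
static void format_buffer_size(unsigned shift, char *out, size_t len)
{
    static const char *units[] = { "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB" };
    unsigned exponent, unit;
    uint64_t value;

    assert(shift < 59);                 /* keeps the exponent below 64 */
    exponent = shift + 5;
    unit = exponent / 10;               /* each unit step is a factor of 2^10 */
    value = (uint64_t)1 << (exponent - 10 * unit);
    snprintf(out, len, "%llu %s", (unsigned long long)value, units[unit]);
}
```

Under that assumption, a shift of 3 gives 256 B, 25 gives 1 GiB, and 35 gives 1 TiB, which matches the "terabyte" estimate above.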
OK, testing with 27 at the moment, I'll keep you posted.
Same issue with BUFFERSIZE=27 and NUM_THREADS=128:
so it went from SIGFPE to SIGBUS, still indicative of overflowing/overwritten memory. guess I'll really need to reproduce this with your entire ollama/webui context as I'm running out of ideas for quick checks. does your setup run normally when you swap out openblas for the reference blas/lapack?
does your setup run normally when you swap out openblas for the reference blas/lapack?
I think it's my first time using openblas; I didn't really know that library before encountering the current issue. So far, I have had absolutely no other issue on my server, which has been running NixOS for years now.
guess I'll really need to reproduce this with your entire ollama/webui context
Feel free to email me if you want some assistance to get you started quickly
Thanks!
This appears to be no longer reproducible with the latest Open WebUI, so may not have been an OpenBLAS bug
Indeed, this is an extremely curious case. I'm now struggling to reproduce the bug, without luck so far.
I would like to thank you for your availability and kindness with this matter!
Keep up the good work.
Hello,
I'm a Nix package maintainer and I'm using open-webui with Ollama on my own server for testing and evaluating LLMs.
While using Open-WebUI, I noticed that enabling the Hybrid search feature makes Open-WebUI crash. I must restart it to get it working again.
Here are the logs:
By checking the coredump info, I can find:
Here is some information about the computer:
Here's a dump of lscpu:
I'm using openblas-0.3.27; you can find the Nix recipe to build it at https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/libraries/science/math/openblas/default.nix
We are using one patch: https://github.com/NixOS/nixpkgs/blob/e6e4cd92ad886b91a6de120ce61c81b7e6072530/pkgs/development/libraries/science/math/openblas/default.nix#L157
The list of Make flags is at https://github.com/NixOS/nixpkgs/blob/e6e4cd92ad886b91a6de120ce61c81b7e6072530/pkgs/development/libraries/science/math/openblas/default.nix#L200
On Matrix, I've been told that the issue comes from sgemm. At this point, I don't know what else I could add to this thread to help you find the problem... but feel free to suggest some ideas and I'll update this thread.