JuliaLang / LinearAlgebra.jl

Julia Linear Algebra standard library

Segmentation fault in Hermitian `eigen`/`eigvals` on nightly #1086

Closed: jishnub closed this issue 3 months ago

jishnub commented 3 months ago
julia> using LinearAlgebra

julia> H = Hermitian(rand(1024, 1024));

julia> eigen(H);
[1]    56499 segmentation fault (core dumped)  julia +nightly

Similarly,

julia> using LinearAlgebra

julia> H = Hermitian(rand(1024, 1024));

julia> eigvals(H);
[1]    56649 segmentation fault (core dumped)  julia +nightly

Version info:

julia> versioninfo()
Julia Version 1.12.0-DEV.1038
Commit 9d222b87d77 (2024-08-12 01:13 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
Environment:
  JULIA_EDITOR = subl

I suspect this is an OpenBLAS issue, as there is no error when using MKL.

Some testing shows that this error occurs for matrices of size 64x64 and larger; there is no error for smaller matrices.

Probably https://github.com/JuliaLang/julia/commit/9d222b87d77d8a76f806fb296e33916dde8c9411 is what introduced this, as there is no error on the commit before it.
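
A quick way to check which backend is active, and to swap in MKL for comparison (a sketch; assumes the MKL.jl package is installed):

using LinearAlgebra
BLAS.get_config()    # shows which library libblastrampoline forwards to

using MKL            # swaps OpenBLAS for MKL via libblastrampoline
H = Hermitian(rand(1024, 1024));
eigen(H);            # completes without a segfault on the MKL backend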

ViralBShah commented 3 months ago

We just bumped openblas yesterday, so it is quite likely an upstream bug.

On my Mac I'm not getting a segmentation fault, but it seems to run endlessly. We should report this upstream.

ViralBShah commented 3 months ago

Works fine at size 63x63 for me and stops working from 64x64 onwards.
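
A quick sketch to scan sizes around the boundary (the last size printed before the crash is the first failing one):

using LinearAlgebra
for n in 60:68
    H = Hermitian(rand(n, n))
    println("n = ", n); flush(stdout)   # flush so the output survives a segfault
    eigvals(H)
end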

jishnub commented 3 months ago

For the record, I am able to call dsyev from C using OpenBLAS's master branch when compiling with gcc directly, e.g. using the following code:

#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main() {
    int n = 64;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));

    // Check if the memory has been successfully
    // allocated by malloc or not
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Fill the upper triangle, which is what uplo='U' references in row-major layout
    for(int row=0; row<n; row++){
        for(int col=row; col<n; col++){
            a[row*n + col] = (double)rand()/(double)(RAND_MAX);
        }
    }

    // Allocate space for eigenvalues
    double *w;
    w = (double*) malloc(n * sizeof(double));

    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Leading dimension of the matrix
    int lda = n;

    // Compute eigenvalues and eigenvectors
    LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'N', 'U', n, a, lda, w);

    // Print eigenvalues
    printf("Eigenvalues:\n");
    for (int i = 0; i < n; i++) {
        printf("%f\n", w[i]);
    }

    free(a);
    free(w);

    return 0;
}

Compiling with

gcc -o hermeig hermeig.c -Iopenblasinstallation/include -Lopenblasinstallation/lib -lopenblas

This runs correctly and prints the eigenvalues.

jishnub commented 3 months ago

I've noticed that this error occurs in Julia when I call BLAS.set_num_threads(4) or set a higher thread count. There is no segfault with 3 or fewer threads.

rr trace attached: https://julialang-dumps.s3.amazonaws.com/reports/2024-08-13T11-47-58-jishnub.tar.zst
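
For reference, a sketch of such a scan (on my machine the crash first appears at 4 threads):

using LinearAlgebra
H = Hermitian(rand(64, 64))
for nthreads in 1:4
    BLAS.set_num_threads(nthreads)
    println("threads = ", nthreads); flush(stdout)   # flush before a potential crash
    eigvals(H)
end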

giordano commented 3 months ago

For the record, I am able to call dsyev from C using OpenBLAS's master branch when compiling with gcc directly.

It'd be useful to make sure you can reproduce the segfault when linking to Julia's libopenblas build: if not, that test may not be useful.

ViralBShah commented 3 months ago

It doesn't segfault when linking to our OpenBLAS with 1 or 4 threads. I used the slightly modified driver below:

➜  lib git:(master) gcc -o hermeig hermeig.c -Iopenblasinstallation/include -Lopenblasinstallation/lib -L. -lopenblas64_ -I../include -Wno-implicit-function-declaration
ld: warning: search path 'openblasinstallation/lib' not found
ld: warning: reexported library with install name '@rpath/libunwind.1.dylib' found at '/Users/viral/julia/usr/lib/libunwind.1.0.dylib' couldn't be matched with any parent library and will be linked directly
➜  lib git:(master) OPENBLAS_NUM_THREADS=4 DYLD_LIBRARY_PATH=. ./hermeig
Eigenvalues:
0.000008
0.022556
0.036368
0.050084
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main() {
    long n = 64;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));

    openblas_set_num_threads64_(4);

    // Check if the memory has been successfully
    // allocated by malloc or not
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Fill the upper triangle, which is what uplo='U' references in row-major layout
    for(long row=0; row<n; row++){
        for(long col=row; col<n; col++){
            a[row*n + col] = (double)rand()/(double)(RAND_MAX);
        }
    }

    // Allocate space for eigenvalues
    double *w;
    w = (double*) malloc(n * sizeof(double));

    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Leading dimension of the matrix
    long lda = n;

    // Compute eigenvalues and eigenvectors
    LAPACKE_dsyev64_(LAPACK_ROW_MAJOR, 'N', 'U', n, a, lda, w);

    // Print eigenvalues
    printf("Eigenvalues:\n");
    for (int i = 0; i < n; i++) {
        printf("%f\n", w[i]);
    }

    free(a);
    free(w);

    return 0;
}
giordano commented 3 months ago

Then that doesn't seem to be a good reproducer 🙂

ViralBShah commented 3 months ago

Or perhaps there could be a bug on the Julia side?

The LAPACKE interface seems to hide the lwork workspace query, so this reproducer never exercises that part of the call sequence.
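
For reference, a minimal Julia sketch of the explicit two-call pattern that lapack.jl performs (looking up the ILP64 symbol dsyev_64_ via dlopen/dlsym here is an illustrative assumption, not how LinearAlgebra actually dispatches):

using Libdl

# Look up the ILP64 dsyev symbol in the already-loaded OpenBLAS
lib   = dlopen("libopenblas64_")
dsyev = dlsym(lib, :dsyev_64_)

n = 64
A = rand(n, n); A = (A + A') / 2          # real symmetric input
w     = Vector{Float64}(undef, n)         # eigenvalues
work  = Vector{Float64}(undef, 1)
lwork = Ref{Int64}(-1)                    # -1 triggers the workspace query
info  = Ref{Int64}(0)

for _ in 1:2                              # first pass queries, second computes
    ccall(dsyev, Cvoid,
          (Ref{UInt8}, Ref{UInt8}, Ref{Int64}, Ptr{Float64}, Ref{Int64},
           Ptr{Float64}, Ptr{Float64}, Ref{Int64}, Ref{Int64}, Clong, Clong),
          UInt8('N'), UInt8('U'), n, A, n, w, work, lwork, info, 1, 1)
    lwork[] = Int64(work[1])              # optimal lwork is returned in work[1]
    resize!(work, Int(lwork[]))
end
@show info[] w[1:3]                       # info[] == 0 on success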

giordano commented 3 months ago

Or perhaps there could be a bug on the Julia side?

Perhaps, but I'd spend more time making sure the C reproducer is 100% faithful to what we're doing on the Julia side. Also, if you want to build OpenBLAS locally to reproduce the Yggdrasil build, I'd suggest compiling OpenBLAS with something like

make USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=400 NO_AFFINITY=1 NO_STATIC=1 NUM_THREADS=512 INTERFACE64=1 SYMBOLSUFFIX=64_ DYNAMIC_ARCH=1 TARGET=GENERIC all
giordano commented 3 months ago

Also, do we have a backtrace with gdb?

ViralBShah commented 3 months ago

@jishnub The default algorithm seems to call syevd and not syev, based on the docs. But all three algorithms are failing. :-(

jishnub commented 3 months ago

The segfault happens for all the algorithms

ViralBShah commented 3 months ago

svd also fails. The failures are only on Hermitian matrices 64x64 and larger. Other factorizations seem to be working fine on Hermitians.

Of course, svd failing may not be additional data, since it probably uses the same internal LAPACK routines.

ViralBShah commented 3 months ago

@giordano I could try to get a gdb trace, but on an M1 Mac it just goes into an infinite loop (should I Ctrl-C?), not a segfault. @jishnub Which platform are you seeing the segfault on?

giordano commented 3 months ago

Also, do we have a backtrace with gdb?

julia> eigen(H);

Thread 9 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd23f9640 (LWP 3380628)]
0x00007fffde86df6c in dgemm_itcopy_ZEN () from /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libopenblas64_.so
(gdb) where
#0  0x00007fffde86df6c in dgemm_itcopy_ZEN () from /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libopenblas64_.so
#1  0x00007fffdda5f70f in dsyr2k_UN () from /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libopenblas64_.so
#2  0x00007fffddb99cd8 in exec_threads () from /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libopenblas64_.so
#3  0x00007fffddb99efc in blas_thread_server () from /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libopenblas64_.so
#4  0x00007ffff7e08ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007ffff7e9a850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

So it looks to me like the segmentation fault is in OpenBLAS. Certainly the issue could be in what Julia passes to OpenBLAS, but that again calls for a more faithful C reproducer.

jishnub commented 3 months ago

x86_64-linux-gnu

ViralBShah commented 3 months ago

So, if I set the BLAS thread count to 1, there is no crash on the Hermitian 64x64 case.

julia> H = Hermitian(rand(64,64))

julia> BLAS.set_num_threads(4)

julia> eigen(H)
^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^CERROR: ^C^C^C^C^C^CInterruptException:

julia> ^C

julia> BLAS.set_num_threads(1)

julia> eigen(H)
Eigen{Float64, Float64, Matrix{Float64}, Vector{Float64}}
values:
64-element Vector{Float64}:
 -4.105758427499317
 -3.971213922931517
ViralBShah commented 3 months ago

I believe this is a more faithful reproducer, using the direct Fortran API with the explicit workspace query, which does not crash:

➜  lib git:(vs/55471) ✗ gcc -g -o hermeig hermeig.c -L. -lopenblas64_ -I../include -Wno-implicit-function-declaration && DYLD_LIBRARY_PATH=. ./hermeig
ld: warning: reexported library with install name '@rpath/libunwind.1.dylib' found at '/Users/viral/julia/usr/lib/libunwind.1.0.dylib' couldn't be matched with any parent library and will be linked directly
First 5 Eigenvalues:
-4.396484
-4.053237
-3.783926
-3.699819
-3.197730
#include <stdio.h>
#include <stdlib.h>

int main() {
    long n = 64;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));

    openblas_set_num_threads64_(4);

    // Check if the memory has been successfully                                                                                                     
    // allocated by malloc or not                                                                                                                    
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    for(long row=0; row<n; row++){
        for(long col=0; col<=row; col++){
            a[row*n + col] = (double)rand()/(double)(RAND_MAX);
        }
    }

    // Allocate space for eigenvalues                                                                                                                
    double *w;
    w = (double*) malloc(n * sizeof(double));

    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Leading dimension of the matrix                                                                                                               
    long lda = n;

    // Allocate work                                                                                                                                 
    long lwork = -1;
    double *work = (double*) malloc(sizeof(double));
    long info = -100;

    char nchar = 'N';
    char uchar = 'U';

    // Compute eigenvalues and eigenvectors                                                                                                          
    dsyev_64_(&nchar, &uchar, &n, a, &lda, w, work, &lwork, &info);
    lwork = (long)work[0];
    work = (double*)realloc(work, lwork*sizeof(double));
    dsyev_64_(&nchar, &uchar, &n, a, &lda, w, work, &lwork, &info);

    // Print eigenvalues                                                                                                                             
    printf("First 5 Eigenvalues:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f\n", w[i]);
    }

    free(a);
    free(w);

    return 0;
}
giordano commented 3 months ago

@jishnub can you try with https://github.com/giordano/OpenBLAS_jll.jl/releases/download/OpenBLAS-v0.3.28%2B1/OpenBLAS.v0.3.28.x86_64-linux-gnu-libgfortran5.tar.gz?

I get no crashes:

julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
5037

julia> H = Hermitian(ones(1024, 1024));

julia> eigen(H);

julia>

This was compiled with GCC 12. As I mentioned in https://github.com/OpenMathLib/OpenBLAS/issues/4868#issuecomment-2286464170, the segmentation fault appears to go away depending on the version of GCC used (but I can't tell whether it's a bug in OpenBLAS which is hidden or surfaced by some compiler versions, or a genuine compiler bug).

jishnub commented 3 months ago

Oddly, I seem to still face the issue using this:

julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
5037

julia> H = Hermitian(ones(1024, 1024));

julia> eigen(H);

julia> BLAS.get_num_threads()
1

julia> BLAS.set_num_threads(4)

julia> eigen(H);

[88907] signal 11 (1): Segmentation fault
in expression starting at REPL[7]:1
Allocations: 2069146 (Pool: 2069051; Big: 95); GC: 4
[1]    88907 segmentation fault (core dumped)  julia +nightly
giordano commented 3 months ago

Aaah, right: when forwarding to a new BLAS library, the number of threads is reset.
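
So the thread count has to be re-applied after forwarding; a minimal sketch (same library path as above):

using LinearAlgebra
BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
BLAS.set_num_threads(4)    # lbt_forward resets this, so set it afterwards
eigen(Hermitian(ones(1024, 1024)));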

giordano commented 3 months ago

which does not crash:

I think we need two cases:

  1. code which, when linked to Julia's build of OpenBLAS, crashes in the same way as within Julia;
  2. the exact same code which doesn't crash when linked to a different build of OpenBLAS, together with full information about the compiler used and the build configuration.

Any other combination of reproducers is probably not very useful for narrowing down the issue, because it doesn't necessarily show much.

giordano commented 3 months ago

@jishnub can you please test again https://github.com/giordano/OpenBLAS_jll.jl/releases/download/OpenBLAS-v0.3.28%2B1/OpenBLAS.v0.3.28.x86_64-linux-gnu-libgfortran5.tar.gz (same URL as before, but it's a new build, which includes https://github.com/OpenMathLib/OpenBLAS/pull/4871)? It seems to work for me now:

julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
5037

julia> BLAS.set_num_threads(4)

julia> H = Hermitian(ones(1024, 1024));

julia> eigen(H);

julia>
jishnub commented 3 months ago

This works for me as well, thanks!

giordano commented 3 months ago

A more minimal Julia reproducer is

using LinearAlgebra
LAPACK.syevd!('V', 'U', ones(1024, 1024));

We're entering this function: https://github.com/JuliaLang/julia/blob/e1aefebe1e3c62339be4b46043625170ec538137/stdlib/LinearAlgebra/src/lapack.jl#L5432. The LAPACK function being called is DSYEVD. Can someone please confirm this is a sensible program (note: I updated the code compared to a previous version of this message):

#include <stdio.h>
#include <stdlib.h>

int main() {
    long n = 1024;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));

    openblas_set_num_threads64_(64);

    // Check if the memory has been successfully                                                                                                     
    // allocated by malloc or not                                                                                                                    
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    for(long row=0; row<n; row++){
        for(long col=0; col<=row; col++){
            a[row*n + col] = 1.0;
        }
    }

    // Allocate space for eigenvalues                                                                                                                
    double *w;
    w = (double*) malloc(n * sizeof(double));

    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Leading dimension of the matrix                                                                                                               
    long lda = n;

    // Allocate work                                                                                                                                 
    double *work = (double*) malloc(sizeof(double));
    long lwork = -1;
    long *iwork = (long*) malloc(sizeof(long));
    long liwork = -1;
    long info = -100;

    char jobz = 'V';
    char uplo = 'U';

    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);

    // Print eigenvalues                                                                                                                             
    printf("First 5 Eigenvalues:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f\n", w[i]);
    }

    free(a);
    free(w);

    return 0;
}

But this doesn't crash for me:

$ gcc -o test test.c -L${HOME}/.julia/juliaup/julia-nightly/lib/julia -lopenblas64_ -Wl,-rpath,${HOME}/.julia/juliaup/julia-nightly/lib/julia
test.c: In function ‘main’:
test.c:9:5: warning: implicit declaration of function ‘openblas_set_num_threads64_’ [-Wimplicit-function-declaration]
     openblas_set_num_threads64_(64);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
test.c:46:5: warning: implicit declaration of function ‘dsyevd_64_’ [-Wimplicit-function-declaration]
     dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);
     ^~~~~~~~~~
$ ./test
First 5 Eigenvalues:
0.000000
0.000000
0.000000
0.000000
0.000000
ViralBShah commented 3 months ago

You have to call dsyevd twice: the first time for the workspace query; then take that value, allocate the work array, and actually call it again. It should segfault in the second call, where the computation actually happens.

#include <stdio.h>
#include <stdlib.h>

int main() {
    long n = 1024;
    double *a;
    a = (double*) malloc(n*n*sizeof(double));

    openblas_set_num_threads64_(64);

    // Check if the memory has been successfully
    // allocated by malloc or not
    if (a == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    for(long row=0; row<n; row++){
        for(long col=0; col<=row; col++){
            a[row*n + col] = 1.0;
        }
    }

    // Allocate space for eigenvalues
    double *w;
    w = (double*) malloc(n * sizeof(double));

    if (w == NULL) {
        printf("Memory not allocated.\n");
        exit(0);
    }

    // Leading dimension of the matrix
    long lda = n;

    // workspace query
    double *work = (double*) malloc(sizeof(double));
    long lwork = -1;
    long *iwork = (long*) malloc(sizeof(long));
    long liwork = -1;
    long info = -100;

    char jobz = 'V';
    char uplo = 'U';

    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);

    // Workspace allocation (reuse the query buffers rather than leaking them)
    lwork = (long)work[0];
    work = (double *) realloc(work, lwork * sizeof(double));
    liwork = iwork[0];
    iwork = (long *) realloc(iwork, liwork * sizeof(long));

    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1, 1);

    // Print eigenvalues
    printf("First 5 Eigenvalues:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f\n", w[i]);
    }

    free(a);
    free(w);
    free(work);
    free(iwork);

    return 0;
}
giordano commented 3 months ago

Thanks. Sadly, adding

    lwork = (long)work[0];
    work = (double*)realloc(work, lwork*sizeof(double));
    liwork = (long)iwork[0];
    iwork = (long*)realloc(iwork, liwork*sizeof(long));
    dsyevd_64_(&jobz, &uplo, &n, a, &lda, w, work, &lwork, iwork, &liwork, &info, 1L, 1L);

after the first dsyevd_64_ call doesn't seem to change the outcome 😞 However, it looks like https://github.com/OpenMathLib/OpenBLAS/pull/4879 works for me on the machines where I was still observing a segmentation fault after the first patch. Hopefully we'll be good after that, but I'm still puzzled by the fact that we can't prepare a C reproducer. I also tried using dlopen/dlsym, to even more closely match what we do in Julia, to no avail.

ViralBShah commented 3 months ago

The bug is real, but it seems that Julia is just better at triggering it than C, whether that is due to our allocator, LBT, or something else.

ViralBShah commented 3 months ago

The C script does crash for me with the second call to dsyevd on mac.

giordano commented 3 months ago

The C script does crash for me with the second call to dsyevd on mac.

Ah, interesting, I can confirm it segfaults for me as well on an M1 with OpenBLAS from Julia nightly, but not with OpenBLAS from Julia v1.10.4:

% clang -Wno-implicit-function-declaration -o test test.c -L${HOME}/repo/julia/usr/lib -Wl,-rpath,${HOME}/repo/julia/usr/lib -lopenblas64_
% ./test
zsh: segmentation fault  ./test
% clang -Wno-implicit-function-declaration -o test test.c -L${HOME}/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/lib/julia -Wl,-rpath,${HOME}/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/lib/julia -lopenblas64_
% ./test
First 5 Eigenvalues:
-0.000000
-0.000000
-0.000000
-0.000000
-0.000000

However, it never segfaults for me on an x86_64-linux-gnu machine where I could reproduce the segfault in Julia.