TRIQS / nda

C++ library for multi-dimensional arrays
https://triqs.github.io/nda

MKL ABI problems #67

Closed hmenke closed 3 weeks ago

hmenke commented 2 months ago

Description

When I configure nda with -DBLA_VENDOR=Intel10_64lp_seq to use a particular version of MKL, the LAPACK tests segfault inside MKL, sometimes accompanied by a stack-smashing report. When I configure with -DBLA_VENDOR=Intel10_64_dyn (which is the default), everything works fine.

[==========] Running 13 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 13 tests from lapack
[ RUN      ] lapack.gtsv
[       OK ] lapack.gtsv (198 ms)
[ RUN      ] lapack.zgtsv
[       OK ] lapack.zgtsv (0 ms)
[ RUN      ] lapack.cgtsv
[       OK ] lapack.cgtsv (0 ms)
[ RUN      ] lapack.gesvd
[       OK ] lapack.gesvd (0 ms)
[ RUN      ] lapack.zgesvd
[       OK ] lapack.zgesvd (0 ms)
[ RUN      ] lapack.geqp3_tall
[       OK ] lapack.geqp3_tall (0 ms)
[ RUN      ] lapack.zgeqp3_tall
[       OK ] lapack.zgeqp3_tall (0 ms)
[ RUN      ] lapack.geqp3_wide
[       OK ] lapack.geqp3_wide (0 ms)
[ RUN      ] lapack.zgeqp3_wide
[       OK ] lapack.zgeqp3_wide (0 ms)
[ RUN      ] lapack.gelss
[       OK ] lapack.gelss (0 ms)
[ RUN      ] lapack.zgelss

Program received signal SIGSEGV, Segmentation fault.
0x00007fffecf1f1cb in mkl_blas_avx512_xzdotc () from /mpcdf/soft/SLE_15/packages/x86_64/intel_oneapi/2024.0/mkl/latest/lib/libmkl_avx512.so.2
Missing separate debuginfos, use: zypper install libz1-debuginfo-1.2.11-150000.3.48.1.x86_64
(gdb) bt
#0  0x00007fffecf1f1cb in mkl_blas_avx512_xzdotc () from /mpcdf/soft/SLE_15/packages/x86_64/intel_oneapi/2024.0/mkl/latest/lib/libmkl_avx512.so.2
#1  0x00007ffff688049c in zdotc_ () from /mpcdf/soft/SLE_15/packages/x86_64/intel_oneapi/2024.0/mkl/latest/lib/libmkl_intel_lp64.so.2
#2  0x00000000004579e0 in nda::blas::f77::dotc (M=0, x=0x502ab0, incx=1, Y=0x502ab0, incy=1)
    at /home/abuild/nda_src/c++/nda/blas/interface/cxx_interface.cpp:81
#3  0x0000000000442ad4 in nda::blas::dotc<nda::basic_array<std::complex<double>, 1, nda::C_layout, (char)86, nda::heap_basic<nda::mem::mallocator<(nda::mem::AddressSpace)1> > >, nda::basic_array<std::complex<double>, 1, nda::C_layout, (char)86, nda::heap_basic<nda::mem::mallocator<(nda::mem::AddressSpace)1> > > > (x=
..., y=...) at /home/abuild/nda_src/c++/nda/lapack/../blas/dot.hpp:74
#4  0x0000000000443619 in nda::norm<nda::basic_array<std::complex<double>, 1, nda::C_layout, (char)86, nda::heap_basic<nda::mem::mallocator<(nda::mem::AddressSpace)1> > > > (x=..., p=2) at /home/abuild/nda_src/c++/nda/lapack/../linalg/norm.hpp:46
#5  0x00000000004347f0 in nda::lapack::gelss_worker<std::complex<double> >::operator() (this=0x7fffffffc920, B=...)
    at /home/abuild/nda_src/c++/nda/lapack/gelss_worker.hpp:93
#6  0x000000000041e4dd in test_gelss<std::complex<double> > () at /home/abuild/nda_src/test/c++/nda_lapack.cpp:200
#7  0x000000000040a197 in lapack_zgelss_Test::TestBody (this=0x502850) at /home/abuild/nda_src/test/c++/nda_lapack.cpp:213
#8  0x0000000000495393 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x502850,
    method=&virtual testing::Test::TestBody(), location=0x4ad883 "the test body") at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2635
#9  0x000000000048df9d in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x502850,
    method=&virtual testing::Test::TestBody(), location=0x4ad883 "the test body") at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2671
#10 0x000000000046abb4 in testing::Test::Run (this=0x502850) at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2710
#11 0x000000000046b509 in testing::TestInfo::Run (this=0x5019d0) at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2856
#12 0x000000000046bda1 in testing::TestSuite::Run (this=0x500890) at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:3034
#13 0x000000000047b372 in testing::internal::UnitTestImpl::RunAllTests (this=0x5004c0)
    at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:5964
#14 0x0000000000496255 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x5004c0,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x47afba <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x4ae3c8 "auxiliary test code (environments or event listeners)") at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2635
#15 0x000000000048f013 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x5004c0,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x47afba <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x4ae3c8 "auxiliary test code (environments or event listeners)") at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:2671
#16 0x0000000000479c55 in testing::UnitTest::Run (this=0x4edd60 <testing::UnitTest::GetInstance()::instance>)
    at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest.cc:5543
#17 0x0000000000459958 in RUN_ALL_TESTS () at /home/abuild/nda_src/_build/deps/GTest_src/googletest/include/gtest/gtest.h:2334
#18 0x0000000000459944 in main (argc=1, argv=0x7fffffffd788) at /home/abuild/nda_src/_build/deps/GTest_src/googletest/src/gtest_main.cc:64

Steps to Reproduce

  1. cmake -S . -B _build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=install -DBLA_VENDOR=Intel10_64lp_seq
  2. cmake --build _build/
  3. ./_build/test/c++/nda_lapack

Expected behavior: No segfault

Actual behavior: Segfault

Versions

https://github.com/TRIQS/nda/commit/a3a3119bae0b684de35a23439c4fae52c2306437

MKL 2024.0

hmenke commented 2 months ago

CMake defaults to (and forces) the Intel ABI when there is no GNU Fortran compiler loaded ^1. The difference between the two ABIs is how complex*16 function results are returned: with the GNU ABI they are returned by value, whereas with the Intel ABI the caller passes a pointer to the result as an implicit first argument, i.e.

// GNU
double complex zdotc_(const int *N, const void *ZX, const int *INCX, const void *ZY, const int *INCY);
// Intel
void zdotc_(void *res, const int *N, const void *ZX, const int *INCX, const void *ZY, const int *INCY);
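
To make the mismatch concrete, here is a minimal sketch of the two conventions using stand-in functions. The names dotc_return_by_value and dotc_hidden_result are hypothetical and are not MKL symbols; they only mimic the two argument layouts. If a caller compiled against the GNU-style prototype ends up in an Intel-style callee, the address of N lands in the slot where the callee expects the result pointer and every later argument is shifted by one, which would be consistent with a crash inside zdotc_.

```cpp
#include <cassert>
#include <complex>

using cplx = std::complex<double>;

// Hypothetical stand-in for the GNU convention:
// the complex result is returned by value.
cplx dotc_return_by_value(const int *n, const cplx *x, const int *incx,
                          const cplx *y, const int *incy) {
  assert(*n >= 0 && *incx > 0 && *incy > 0);
  cplx acc{0.0, 0.0};
  // zdotc computes sum_i conj(x_i) * y_i
  for (int i = 0; i < *n; ++i) acc += std::conj(x[i * *incx]) * y[i * *incy];
  return acc;
}

// Hypothetical stand-in for the Intel convention:
// the result is written through a leading pointer argument,
// so all remaining arguments sit one slot later.
void dotc_hidden_result(cplx *res, const int *n, const cplx *x, const int *incx,
                        const cplx *y, const int *incy) {
  *res = dotc_return_by_value(n, x, incx, y, incy);
}
```

Both functions compute the same value when called through their own prototype. The problem in this issue only arises when the declaration used by the caller and the convention implemented by the loaded library disagree.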

Now one might be tempted to say “Intel ABI is not supported, you have to use Intel10_64_dyn and we just select GNU ABI at runtime ^2”. However, this does not work when the Python distribution is Anaconda, because Anaconda builds NumPy against MKL with the Intel ABI. If a TRIQS Python script contains import numpy, this loads MKL with the Intel ABI, and the zdotc_ symbol is then resolved in that library, irrespective of whether an MKL with another ABI is linked somewhere else. The result is still a segfault. The fix will definitely be non-trivial.

hmenke commented 2 months ago

Intel actually has some documentation on this:

https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-1/call-blas-funcs-return-complex-values-in-c-code.html

Their recommendation is “use the CBLAS interface”.
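
A minimal sketch of why that recommendation sidesteps the problem: in the CBLAS interface the complex dot product is exposed as cblas_zdotc_sub, which always writes its result through an explicit trailing pointer, so nothing depends on how a Fortran function returns a complex value. To keep the example self-contained, ref_zdotc_sub below is a hypothetical reference routine that only mirrors the cblas_zdotc_sub parameter order, and dotc is a hypothetical wrapper, not nda's actual code. In a real build the call would go to cblas_zdotc_sub from the CBLAS header of the chosen vendor instead.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical reference routine with the same parameter order as
// cblas_zdotc_sub: sizes by value, data as void*, result via the last pointer.
void ref_zdotc_sub(const int n, const void *x, const int incx,
                   const void *y, const int incy, void *dotc) {
  assert(n >= 0 && incx > 0 && incy > 0);
  const cplx *px = static_cast<const cplx *>(x);
  const cplx *py = static_cast<const cplx *>(y);
  cplx acc{0.0, 0.0};
  for (int i = 0; i < n; ++i) acc += std::conj(px[i * incx]) * py[i * incy];
  *static_cast<cplx *>(dotc) = acc;
}

// Hypothetical wrapper: the result pointer is explicit in the signature,
// so the call looks the same regardless of the Fortran return convention.
cplx dotc(const std::vector<cplx> &x, const std::vector<cplx> &y) {
  assert(x.size() == y.size());
  cplx result{0.0, 0.0};
  ref_zdotc_sub(static_cast<int>(x.size()), x.data(), 1, y.data(), 1, &result);
  return result;
}
```

Whether this is the approach taken in the eventual fix is not stated in this thread; the sketch only illustrates the calling pattern that Intel's guide recommends.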

Wentzell commented 3 weeks ago

Fixed by #69