OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Something weird with the serial version of OpenBLAS #191

Closed susilehtola closed 11 years ago

susilehtola commented 11 years ago

Hi,

I've recently been trying to benchmark how much faster my code, ERKALE (http://erkale.googlecode.com), becomes when it's linked against OpenBLAS instead of ATLAS.

However, I've run into a seriously strange bug. Some calculations run through just fine, but others give weird results with the serial version of the library. The same program linked against the OpenMP version of OpenBLAS gives the correct results.

My program mostly uses just matrix-vector multiplication (maybe some matrix-matrix as well), dot products, and most importantly eigenvector analysis (for symmetric matrices). I currently have no idea what the problem really is, but maybe you could look into this?

The system I've been running on is a Sandy Bridge (Intel(R) Core(TM) i7-2600).

xianyi commented 11 years ago

Hi,

Thank you for the feedback.

Are you using the OpenBLAS develop branch? Is it a 32-bit or a 64-bit OS? Which compiler do you use? I think we need a minimal working set to reproduce your error.

Xianyi

susilehtola commented 11 years ago

I use version 0.2.5, and the OS is 64-bit Fedora 18. The compiler is gcc (GCC) 4.7.2 20121109 (Red Hat 4.7.2-8).

xianyi commented 11 years ago

Hi @jussilehtola,

Could you try the OpenBLAS develop branch https://github.com/xianyi/OpenBLAS/archive/develop.zip ?

Xianyi

susilehtola commented 11 years ago

Doesn't work with the development version either.

xianyi commented 11 years ago

We will try to reproduce this bug.

xianyi commented 11 years ago

@zchothia, could you investigate this issue? I think we need to narrow down which BLAS function is at fault.

zchothia commented 11 years ago

Hello @jussilehtola.

Could you please provide some details on what to run, preferably a simple test, so we know what results are expected? I tried running examples/01-scf.sh but that didn't work:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not find basis set aug-cc-pVTZ!

Sidenote: you may want to update compile.sh. The current link gives a 404: export HDF5VER="1.8.10-patch1"

--Zaheer

susilehtola commented 11 years ago

Hi, you need to set the environment variable ERKALE_LIBRARY to the basis/ directory of the source tree.

You can use the test program that is compiled to src/test/erkale_tests (or erkale_tests_omp if compiled with OpenMP) of the build tree.
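For anyone scripting this, the setup described above can be sketched as follows. This is only an illustration: the helper name `erkale_test_command` and the `/path/to/...` directories are placeholders, not part of ERKALE. Only the `ERKALE_LIBRARY` variable, the `basis/` directory and the `src/test/erkale_tests` / `erkale_tests_omp` binaries come from the description above.

```python
import os
from pathlib import Path


def erkale_test_command(source_tree, build_tree, openmp=False):
    """Return (argv, env) for running the ERKALE test program.

    source_tree and build_tree are placeholder paths supplied by the caller.
    """
    # The OpenMP build produces erkale_tests_omp instead of erkale_tests.
    binary = "erkale_tests_omp" if openmp else "erkale_tests"
    argv = [str(Path(build_tree) / "src" / "test" / binary)]

    # ERKALE_LIBRARY must point at the basis/ directory of the source tree.
    env = dict(os.environ)
    env["ERKALE_LIBRARY"] = str(Path(source_tree) / "basis")
    return argv, env


argv, env = erkale_test_command("/path/to/erkale", "/path/to/build")
print(argv[0], env["ERKALE_LIBRARY"])
# To actually run the tests: subprocess.run(argv, env=env, check=True)
```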

The odd thing is that a lot of the tests pass successfully even with the serial version of the library, but things always seem to go wrong when a certain method is used (density fitting, in density_fitting.cpp). The test set fails at

Water, PBEPBE/cc-pVTZ, E=417.896605 fail, dp=5.584827 fail, orbital energies 0 ok, 58 failed (18.63 s)
Relative difference of total energy is -6.471783e+00, difference in dipole moment is 4.852591e+00.
Maximum difference of orbital energy is 2.674047e+02.

and before that the same calculation without the use of the routines in density_fitting.cpp goes through successfully.

susilehtola commented 11 years ago

Well, I stumbled on something that might guide you in the right direction. I was debugging something in my program and had forgotten that the scripts still linked against OpenBLAS. It seems that the problem is caused by some kind of out-of-bounds memory read, which is less likely to happen in parallel mode:

Serial trace:
==11846== Invalid read of size 8
==11846==    at 0x5F53255: dswap_k_SANDYBRIDGE (swap_sse2.S:440)
==11846==    by 0x4D43A0F: dswap_ (swap.c:90)
==11846==    by 0x62D71F2: dsteqr (in /usr/lib64/libopenblas-r0.2.5.so)
==11846==    by 0x62DDAEB: dsyev (in /usr/lib64/libopenblas-r0.2.5.so)
==11846==    by 0x498567: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (lapack_wrapper.hpp:144)
==11846==    by 0x495888: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==11846==    by 0x497DDF: BasOrth(arma::Mat const&, bool) (linalg.cpp:153)
==11846==    by 0x45B111: RHF(std::vector<bf_t, std::allocator > const&, int, rscf_t&, convergence_t, bool, bool) (solvers.cpp:302)
==11846==    by 0x429070: main (main.cpp:178)
==11846==  Address 0x6b497f8 is 0 bytes after a block of size 9,800 alloc'd
==11846==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==11846==    by 0x460E7A: arma::Mat::operator=(arma::Mat const&) (memory.hpp:63)
==11846==    by 0x4984A9: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (auxlib_meat.hpp:1255)
==11846==    by 0x495888: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==11846==    by 0x497DDF: BasOrth(arma::Mat const&, bool) (linalg.cpp:153)
==11846==    by 0x45B111: RHF(std::vector<bf_t, std::allocator > const&, int, rscf_t&, convergence_t, bool, bool) (solvers.cpp:302)
==11846==    by 0x429070: main (main.cpp:178)
==11846==

Serial trace, part II:
==11846== Invalid read of size 8
==11846==    at 0x5F53221: dswap_k_SANDYBRIDGE (swap_sse2.S:424)
==11846==    by 0x4D43A0F: dswap_ (swap.c:90)
==11846==    by 0x62D71F2: dsteqr (in /usr/lib64/libopenblas-r0.2.5.so)
==11846==    by 0x62DDAEB: dsyev (in /usr/lib64/libopenblas-r0.2.5.so)
==11846==    by 0x498567: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (lapack_wrapper.hpp:144)
==11846==    by 0x495888: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==11846==    by 0x4A63BE: DIIS::solve(arma::Mat&, bool) (diis.cpp:146)
==11846==    by 0x45BE25: RHF(std::vector<bf_t, std::allocator > const&, int, rscf_t&, convergence_t, bool, bool) (solvers.cpp:383)
==11846==    by 0x429070: main (main.cpp:178)
==11846==  Address 0x6d89018 is 0 bytes after a block of size 200 alloc'd
==11846==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==11846==    by 0x460E7A: arma::Mat::operator=(arma::Mat const&) (memory.hpp:63)
==11846==    by 0x4984A9: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (auxlib_meat.hpp:1255)
==11846==    by 0x495888: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==11846==    by 0x4A63BE: DIIS::solve(arma::Mat&, bool) (diis.cpp:146)
==11846==    by 0x45BE25: RHF(std::vector<bf_t, std::allocator > const&, int, rscf_t&, convergence_t, bool, bool) (solvers.cpp:383)
==11846==    by 0x429070: main (main.cpp:178)

Parallel trace:
==11848== Thread 8:
==11848== Invalid read of size 8
==11848==    at 0x6113255: dswap_k_SANDYBRIDGE (swap_sse2.S:440)
==11848==    by 0x50817AD: ??? (blas_server_omp.c:116)
==11848==    by 0x31FB608829: gomp_thread_start (team.c:116)
==11848==    by 0x31E4A07D14: start_thread (pthread_create.c:308)
==11848==    by 0x31E42F246C: clone (clone.S:114)
==11848==  Address 0x6d3a808 is 0 bytes after a block of size 9,800 alloc'd
==11848==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==11848==    by 0x461DDA: arma::Mat::operator=(arma::Mat const&) (memory.hpp:63)
==11848==    by 0x499089: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (auxlib_meat.hpp:1255)
==11848==    by 0x496468: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==11848==    by 0x4989BF: BasOrth(arma::Mat const&, bool) (linalg.cpp:153)
==11848==    by 0x45C071: RHF(std::vector<bf_t, std::allocator > const&, int, rscf_t&, convergence_t, bool, bool) (solvers.cpp:302)
==11848==    by 0x42A050: main (main.cpp:178)
==11848==

When I link against ATLAS these kinds of errors do not appear.
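One way to read these traces: "Invalid read of size 8 ... 0 bytes after a block of size N" means a single 8-byte value was read starting exactly at the end of the allocated block. The block sizes reported above also look like whole square matrices of doubles, which fits an eigendecomposition of a symmetric matrix. The sketch below checks that second point; it assumes 8-byte doubles and square matrices, and the helper name `square_dim` is made up for illustration.

```python
import math

DOUBLE_BYTES = 8  # assumed size of one double-precision element


def square_dim(block_bytes):
    """Return n if block_bytes holds exactly n*n doubles, otherwise None."""
    elems, rem = divmod(block_bytes, DOUBLE_BYTES)
    if rem:
        return None
    n = math.isqrt(elems)
    return n if n * n == elems else None


# Block sizes reported by valgrind in the traces above.
for size in (9800, 200):
    print(size, "bytes ->", square_dim(size), "x", square_dim(size), "doubles")
```

Under these assumptions the 9,800-byte block is a 35 x 35 matrix and the 200-byte block a 5 x 5 matrix.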

susilehtola commented 11 years ago

So, any progress? This invalid memory read might well be the root cause of the problem.

xianyi commented 11 years ago

Hi @jussilehtola,

I think this happens in the dswap function:
==11846== by 0x4D43A0F: dswap_ (swap.c:90)

Could you provide the arguments for dswap, including N, the addresses of x and y, incx and incy?

Xianyi
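For readers less familiar with the BLAS calling convention, here is a minimal reference sketch of what a swap with arguments N, x, incx, y, incy is allowed to touch. It is not OpenBLAS code (the real kernel is SSE2 assembly), and unlike real BLAS it only handles positive increments. The point is that the last element accessed is at index (N-1)*inc, so a correct kernel never needs to read past it.

```python
def dswap_ref(n, x, incx, y, incy):
    """Swap n elements of x and y in place, BLAS-style.

    Illustrative only: positive increments, Python lists instead of pointers.
    """
    if n <= 0:
        return
    if incx <= 0 or incy <= 0:
        raise ValueError("this sketch handles positive increments only")

    # Highest index each vector is allowed to touch.
    last_x = (n - 1) * incx
    last_y = (n - 1) * incy
    if last_x >= len(x) or last_y >= len(y):
        raise IndexError("dswap would touch elements outside x or y")

    for i in range(n):
        ix, iy = i * incx, i * incy
        x[ix], y[iy] = y[iy], x[ix]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
dswap_ref(3, x, 1, y, 1)
print(x, y)
```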

susilehtola commented 11 years ago

Well, here's a couple of entries.

Calling swap with arguments (18, 0, 0, 0.000000, 0x4d853f0, 1, 0x4d85480, 1, (nil), 0)
Calling swap with arguments (18, 0, 0, 0.000000, 0x4d85510, 1, 0x4d855a0, 1, (nil), 0)
==7456== Invalid read of size 8
==7456==    at 0x645A21: dswap_k (swap_sse2.S:424)
==7456==    by 0x73F8FC: dswap (swap.c:91)
==7456==    by 0x73C512: dsteqr (dsteqr.f:562)
==7456==    by 0x738DAB: dsyev (dsyev.f:264)
==7456==    by 0x4869B7: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (lapack_wrapper.hpp:144)
==7456==    by 0x483CD8: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==7456==    by 0x5507DE: DIIS::solve(arma::Mat&, bool) (diis.cpp:146)
==7456==    by 0x4B270F: SCF::ROHF(uscf_t&, int, int, convergence_t) const (scf-solvers.cpp.in:438)
==7456==    by 0x4F0740: atomic_guess(BasisSet const&, arma::Mat&, arma::Mat&, bool) (guess.cpp:137)
==7456==    by 0x49D993: calculate(BasisSet const&, Settings&) (scf-base.cpp:926)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==  Address 0x576ead8 is 0 bytes after a block of size 200 alloc'd
==7456==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==7456==    by 0x45FE36: arma::Mat::init_warm(unsigned int, unsigned int) (memory.hpp:63)
==7456==    by 0x4605EA: arma::Mat::operator=(arma::Mat const&) (Mat_meat.hpp:627)
==7456==    by 0x4868F9: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (auxlib_meat.hpp:1255)
==7456==    by 0x483CD8: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==7456==    by 0x5507DE: DIIS::solve(arma::Mat&, bool) (diis.cpp:146)
==7456==    by 0x4B270F: SCF::ROHF(uscf_t&, int, int, convergence_t) const (scf-solvers.cpp.in:438)
==7456==    by 0x4F0740: atomic_guess(BasisSet const&, arma::Mat&, arma::Mat&, bool) (guess.cpp:137)
==7456==    by 0x49D993: calculate(BasisSet const&, Settings&) (scf-base.cpp:926)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==
Calling swap with arguments (5, 0, 0, 0.000000, 0x576ea10, 1, 0x576ea38, 1, (nil), 0)
Calling swap with arguments (5, 0, 0, 0.000000, 0x576ea88, 1, 0x576eab0, 1, (nil), 0)

Calling swap with arguments (41, 0, 0, 0.000000, 0x6729518, 1, 0x672a0a0, 1, (nil), 0)
Calling swap with arguments (41, 0, 0, 0.000000, 0x6729660, 1, 0x672a478, 1, (nil), 0)
Calling swap with arguments (41, 0, 0, 0.000000, 0x67297a8, 1, 0x6729e10, 1, (nil), 0)
Calling
==7456== Invalid read of size 8
==7456==    at 0x6459D1: dswap_k (swap_sse2.S:400)
==7456==    by 0x73F8FC: dswap (swap.c:91)
==7456==    by 0x73C512: dsteqr (dsteqr.f:562)
==7456==    by 0x738DAB: dsyev (dsyev.f:264)
==7456==    by 0x4869B7: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (lapack_wrapper.hpp:144)
==7456==    by 0x483CD8: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==7456==    by 0x49A819: form_NOs(arma::Mat const&, arma::Mat const&, arma::Mat&, arma::Mat&, arma::Col&) (scf-base.cpp:354)
==7456==    by 0x4F1280: atomic_guess(BasisSet const&, arma::Mat&, arma::Mat&, bool) (guess.cpp:189)
==7456==    by 0x49D993: calculate(BasisSet const&, Settings&) (scf-base.cpp:926)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==  Address 0x672aeb8 is 0 bytes after a block of size 13,448 alloc'd
==7456==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==7456==    by 0x45FE36: arma::Mat::init_warm(unsigned int, unsigned int) (memory.hpp:63)
==7456==    by 0x4605EA: arma::Mat::operator=(arma::Mat const&) (Mat_meat.hpp:627)
==7456==    by 0x4868F9: bool arma::auxlib::eig_sym<double, arma::Mat >(arma::Col&, arma::Mat&, arma::Base<double, arma::Mat > const&) (auxlib_meat.hpp:1255)
==7456==    by 0x483CD8: eig_sym_ordered(arma::Col&, arma::Mat&, arma::Mat const&) (fn_eig.hpp:121)
==7456==    by 0x49A819: form_NOs(arma::Mat const&, arma::Mat const&, arma::Mat&, arma::Mat&, arma::Col&) (scf-base.cpp:354)
==7456==    by 0x4F1280: atomic_guess(BasisSet const&, arma::Mat&, arma::Mat&, bool) (guess.cpp:189)
==7456==    by 0x49D993: calculate(BasisSet const&, Settings&) (scf-base.cpp:926)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==
done (9.58 s)

Initializing density fitting calculation, requiring 542 ki memory ...
==7456== Invalid read of size 8
==7456==    at 0x6459D1: dswap_k (swap_sse2.S:400)
==7456==    by 0x73F8FC: dswap (swap.c:91)
==7456==    by 0x73356D: dgetri_ (dgetri.f:253)
==7456==    by 0x46C54F: bool arma::auxlib::inv_inplace_lapack(arma::Mat&) (lapack_wrapper.hpp:76)
==7456==    by 0x51FFE5: DensityFit::fill(BasisSet const&, BasisSet const&, bool, double, bool) (auxlib_meat.hpp:47)
==7456==    by 0x4963D6: SCF::SCF(BasisSet const&, Settings const&, Checkpoint&) (scf-base.cpp:226)
==7456==    by 0x49BE74: calculate(BasisSet const&, Settings&) (scf-base.cpp:958)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==  Address 0x65cd7f8 is 0 bytes after a block of size 273,800 alloc'd
==7456==    at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==7456==    by 0x520064: DensityFit::fill(BasisSet const&, BasisSet const&, bool, double, bool) (memory.hpp:63)
==7456==    by 0x4963D6: SCF::SCF(BasisSet const&, Settings const&, Checkpoint&) (scf-base.cpp:226)
==7456==    by 0x49BE74: calculate(BasisSet const&, Settings&) (scf-base.cpp:958)
==7456==    by 0x4327C4: main (main.cpp:100)
==7456==
Calling swap with arguments (185, 0, 0, 0.000000, 0x65caf80, 1, 0x65ccc68, 1, (nil), 0)
Calling swap with arguments (185, 0, 0, 0.000000, 0x65ca9b8, 1, 0x65cd230, 1, (nil), 0)
Calling swap with arguments (185, 0, 0, 0.000000, 0x65c9e28, 1, 0x65cd230, 1, (nil), 0)
Calling swap with arguments (185, 0, 0, 0.000000, 0x65c9860, 1, 0x65c9e28, 1, (nil), 0)
Calling swap with arguments (185, 0, 0, 0.000000, 0x65c9298, 1, 0x65ccc68, 1, (nil), 0)
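A quick address check on the log above, offered as a sketch rather than a diagnosis. The program output and the valgrind output are interleaved, so it is not certain which logged call triggered which report. Assuming 8-byte doubles and taking the first number in each "Calling swap" line as N, the swaps whose y vector is the last column of the matrix end exactly at the address valgrind flags. That would be consistent with the kernel reading one element beyond the N it was asked to swap. The helper name `end_of_vector` is made up for illustration.

```python
DOUBLE_BYTES = 8  # assumed element size


def end_of_vector(addr, n, inc=1):
    """First byte past the last element of a strided n-vector starting at addr."""
    return addr + ((n - 1) * inc + 1) * DOUBLE_BYTES


# (N, y address from a logged call, address valgrind reported as invalid)
cases = [
    (5, 0x576EAB0, 0x576EAD8),    # 200-byte block
    (185, 0x65CD230, 0x65CD7F8),  # 273,800-byte block
]
for n, y, bad in cases:
    end = end_of_vector(y, n)
    print(n, hex(end), "matches reported address:", end == bad)
```

The N = 41 case cannot be checked the same way, because the matching "Calling" line is cut off in the log.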

xianyi commented 11 years ago

Hi @jussilehtola

I reproduced this error in build/erkale/serial/src/test/ as follows.

Water, PBEPBE/cc-pVTZ, E=417.896605 fail, dp=5.584827 fail, orbital energies 0 ok, 58 failed (3 min 11.10 s)
Relative difference of total energy is -6.471783e+00, difference in dipole moment is 4.852591e+00.
Maximum difference of orbital energy is 2.674047e+02.

I modified the test code to run only this test. How can I rebuild only the test?

Could you give me the minimal working set and codes that trigger this error "==7456== at 0x645A21: dswap_k (swap_sse2.S:424)"? Then I can debug the library.

Xianyi

susilehtola commented 11 years ago

Umm, to rebuild just the test you can simply run make in the src/test subdirectory of the build tree.

I'm not sure what you mean by the minimal working set and codes.

opoplawski commented 11 years ago

We appear to be holding off releasing openblas into Fedora and EPEL until this is fixed, so it would be nice to see it addressed. @xianyi - do you still need anything?

xianyi commented 11 years ago

@opoplawski,

I have a project deadline next week. After that, I will address this issue.

Xianyi

susilehtola commented 11 years ago

Any progress?

xianyi commented 11 years ago

Sorry, I haven't started debugging it yet. I have a fever.

On 2013/4/16, Susi Lehtola wrote:

Any progress?


susilehtola commented 11 years ago

Sorry to hear that. Get well soon!

susilehtola commented 11 years ago

Any progress..?

xianyi commented 11 years ago

I have replaced OpenBLAS functions with netlib BLAS to narrow down the error.

susilehtola commented 11 years ago

And? :)

xianyi commented 11 years ago

@susilehtola, OpenBLAS/GotoBLAS implements some LAPACK functions itself, including LU and Cholesky factorization. I found that this is a bug in the OpenBLAS LAPACK implementation. When I replace those functions with the netlib reference implementation, the error is gone.

However, I haven't had enough time to debug this error. I have a project deadline at the beginning of July.

Xianyi
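The narrowing-down approach described here (swap optimized routines for reference ones until the failure disappears) can be organized as a bisection. The sketch below is a generic illustration, not the procedure actually used in this issue. It assumes exactly one faulty routine. The routine list is only an example of standard LAPACK names, not a statement of which routines OpenBLAS overrides, and `pretend_bad` is invented for the demo because the thread does not say which routine was at fault.

```python
def find_culprit(routines, test_passes):
    """Bisect for the single routine whose optimized version breaks the test.

    test_passes(replaced) must return True when the test succeeds with the
    routines in `replaced` swapped for their reference implementations.
    Assumes a non-empty list containing exactly one faulty routine.
    """
    suspects = list(routines)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if test_passes(set(half)):
            suspects = half                  # replacing this half fixed it
        else:
            suspects = suspects[len(half):]  # culprit is in the other half
    return suspects[0]


# Demo with a stand-in for "rebuild the library and rerun the failing test".
routines = ["dgetrf", "dgetrs", "dpotrf", "dtrtri"]  # illustrative names only
pretend_bad = "dpotrf"                               # hypothetical, for the demo
print(find_culprit(routines, lambda replaced: pretend_bad in replaced))
```

In practice each call to `test_passes` would mean rebuilding the library with the chosen routines replaced and rerunning the failing test, so halving the candidate set keeps the number of rebuilds small.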

susilehtola commented 11 years ago

Thanks for the info. I've removed the special LAPACK functions from the Fedora package for the time being, so that I can finally submit the package to the stable distribution.

susilehtola commented 11 years ago

Cool, fixed in 0.2.7.