Closed by susilehtola 11 years ago
Hi,
Thank you for the feedback.
Are you using the OpenBLAS develop branch? Is the OS 32-bit or 64-bit? Which compiler do you use? I think we need a minimal working set to reproduce your error.
Xianyi
I use version 0.2.5, and the OS is 64-bit Fedora 18. The compiler is gcc (GCC) 4.7.2 20121109 (Red Hat 4.7.2-8).
Hi @jussilehtola,
Could you try the OpenBLAS develop branch? https://github.com/xianyi/OpenBLAS/archive/develop.zip
Xianyi
Doesn't work with the development version either.
We will try to reproduce this bug.
@zchothia, could you investigate this issue? I think we need to narrow down the BLAS function.
Hello @jussilehtola.
Could you please provide some details on what to run, preferably a simple test so we know what results are expected. I tried running examples/01-scf.sh but that didn't work:
terminate called after throwing an instance of 'std::runtime_error'
what(): Could not find basis set aug-cc-pVTZ!
Sidenote: you may want to update compile.sh. The current link gives a 404:
export HDF5VER="1.8.10-patch1"
--Zaheer
Hi, you need to set the environment variable ERKALE_LIBRARY to the basis/ directory of the source tree.
You can use the test program that is compiled to src/test/erkale_tests (or erkale_tests_omp if compiled with OpenMP) of the build tree.
The odd thing is that a lot of the tests pass successfully even with the serial version of the library, but things always seem to go wrong when a certain method is used (density fitting, in density_fitting.cpp). The test set fails at
Water, PBEPBE/cc-pVTZ, E=417.896605 fail, dp=5.584827 fail, orbital energies 0 ok, 58 failed (18.63 s) Relative difference of total energy is -6.471783e+00, difference in dipole moment is 4.852591e+00. Maximum difference of orbital energy is 2.674047e+02.
and before that, the same calculation without the routines in density_fitting.cpp goes through successfully.
Well, I stumbled on something that might point you in the right direction. I was debugging something in my program and had forgotten that the scripts still linked against OpenBLAS. It seems that the problem is caused by some kind of out-of-bounds memory read, which is less likely to happen in parallel mode:
Serial trace:
==11846== Invalid read of size 8
==11846== at 0x5F53255: dswap_k_SANDYBRIDGE (swap_sse2.S:440)
==11846== by 0x4D43A0F: dswap_ (swap.c:90)
==11846== by 0x62D71F2: dsteqr_ (in /usr/lib64/libopenblas-r0.2.5.so)
==11846== by 0x62DDAEB: dsyev_ (in /usr/lib64/libopenblas-r0.2.5.so)
==11846== by 0x498567: bool arma::auxlib::eig_sym<double, arma::Mat
Serial trace, part II
==11846== Invalid read of size 8
==11846== at 0x5F53221: dswap_k_SANDYBRIDGE (swap_sse2.S:424)
==11846== by 0x4D43A0F: dswap_ (swap.c:90)
==11846== by 0x62D71F2: dsteqr_ (in /usr/lib64/libopenblas-r0.2.5.so)
==11846== by 0x62DDAEB: dsyev_ (in /usr/lib64/libopenblas-r0.2.5.so)
==11846== by 0x498567: bool arma::auxlib::eig_sym<double, arma::Mat
Parallel trace:
==11848== Thread 8:
==11848== Invalid read of size 8
==11848== at 0x6113255: dswap_k_SANDYBRIDGE (swap_sse2.S:440)
==11848== by 0x50817AD: ??? (blas_server_omp.c:116)
==11848== by 0x31FB608829: gomp_thread_start (team.c:116)
==11848== by 0x31E4A07D14: start_thread (pthread_create.c:308)
==11848== by 0x31E42F246C: clone (clone.S:114)
==11848== Address 0x6d3a808 is 0 bytes after a block of size 9,800 alloc'd
==11848== at 0x4A07A2F: operator new[](unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:394)
==11848== by 0x461DDA: arma::Mat
When I link against ATLAS these kinds of errors do not appear.
So, any progress? This invalid memory read might well be the root cause of the problem.
Hi @jussilehtola ,
I think this happened in the dswap function: ==11846== by 0x4D43A0F: dswap_ (swap.c:90)
Could you provide the arguments passed to dswap, including N, the addresses of x and y, incx and incy?
Xianyi
Well, here's a couple of entries.
Calling swap with arguments (18, 0, 0, 0.000000, 0x4d853f0, 1, 0x4d85480, 1, (nil), 0)
Calling swap with arguments (18, 0, 0, 0.000000, 0x4d85510, 1, 0x4d855a0, 1, (nil), 0)
==7456== Invalid read of size 8
==7456== at 0x645A21: dswap_k (swap_sse2.S:424)
==7456== by 0x73F8FC: dswap_ (swap.c:91)
==7456== by 0x73C512: dsteqr_ (dsteqr.f:562)
==7456== by 0x738DAB: dsyev_ (dsyev.f:264)
==7456== by 0x4869B7: bool arma::auxlib::eig_sym<double, arma::Mat
Calling swap with arguments (41, 0, 0, 0.000000, 0x6729518, 1, 0x672a0a0, 1, (nil), 0)
Calling swap with arguments (41, 0, 0, 0.000000, 0x6729660, 1, 0x672a478, 1, (nil), 0)
Calling swap with arguments (41, 0, 0, 0.000000, 0x67297a8, 1, 0x6729e10, 1, (nil), 0)
Calling==7456== Invalid read of size 8
==7456== at 0x6459D1: dswap_k (swap_sse2.S:400)
==7456== by 0x73F8FC: dswap_ (swap.c:91)
==7456== by 0x73C512: dsteqr_ (dsteqr.f:562)
==7456== by 0x738DAB: dsyev_ (dsyev.f:264)
==7456== by 0x4869B7: bool arma::auxlib::eig_sym<double, arma::Mat
Initializing density fitting calculation, requiring 542 ki memory ... ==7456== Invalid read of size 8
==7456== at 0x6459D1: dswap_k (swap_sse2.S:400)
==7456== by 0x73F8FC: dswap_ (swap.c:91)
==7456== by 0x73356D: dgetri_ (dgetri.f:253)
==7456== by 0x46C54F: bool arma::auxlib::inv_inplace_lapack
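For reference when reading the logged calls: assuming the fifth and seventh logged values are the addresses of x and y, the first two entries have pointers that are 0x90 = 144 bytes apart, which is 18 doubles. Each call therefore appears to swap two adjacent 18-element vectors, for example neighbouring columns of a column-major matrix (that reading is an assumption). A correct unit-stride swap should touch exactly n elements of each vector. The sketch below is a plain C reference, not the OpenBLAS kernel; the names ref_dswap_unit and gap_in_doubles are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference unit-stride swap (incx = incy = 1).
 * Reads and writes exactly n elements of x and of y, nothing beyond. */
void ref_dswap_unit(size_t n, double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double tmp = x[i];
        x[i] = y[i];
        y[i] = tmp;
    }
}

/* Distance between two logged addresses, expressed in doubles.
 * Requires y_addr >= x_addr. */
size_t gap_in_doubles(uintptr_t x_addr, uintptr_t y_addr)
{
    assert(y_addr >= x_addr);
    return (size_t)(y_addr - x_addr) / sizeof(double);
}
```

Running the same inputs through a reference like this under valgrind is one way to tell whether an invalid read comes from the caller's arguments or from an optimized kernel reading past element n.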
Hi @jussilehtola
I reproduced this error in build/erkale/serial/src/test/ as follows.
Water, PBEPBE/cc-pVTZ, E=417.896605 fail, dp=5.584827 fail, orbital energies 0 ok, 58 failed (3 min 11.10 s) Relative difference of total energy is -6.471783e+00, difference in dipole moment is 4.852591e+00. Maximum difference of orbital energy is 2.674047e+02.
I modified the test code to run only this test. How can I rebuild only the test?
Could you give me a minimal working set and code that triggers this error "==7456== at 0x645A21: dswap_k (swap_sse2.S:424)"? Then I can debug the library.
Xianyi
Umm, to rebuild just the test you can run make in the src/test subdirectory of the binary tree.
I'm not sure what you mean by the minimal working set and codes.
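For context, a "minimal working set" here presumably means a small standalone program that triggers the error without the rest of ERKALE. One ingredient of such a program could be a check that does not depend on any BLAS library. The sketch below computes the residual of an eigenpair of a dense symmetric matrix, assuming column-major storage as LAPACK uses; the eigenvalue and eigenvector would come from whichever solver is under test (for example dsyev). The function name eig_residual is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute component of (A v - lambda v) for a dense n-by-n
 * matrix A stored column-major: element (i, j) lives at a[i + j * n].
 * A value near zero means (lambda, v) is an eigenpair of A. */
double eig_residual(size_t n, const double *a, double lambda, const double *v)
{
    assert(a != NULL && v != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = -lambda * v[i];
        for (size_t j = 0; j < n; ++j)
            r += a[i + j * n] * v[j];
        double abs_r = r < 0.0 ? -r : r;
        if (abs_r > worst)
            worst = abs_r;
    }
    return worst;
}
```

A test program could fill a small matrix, call the eigensolver once against OpenBLAS and once against a reference library, and compare the residuals; a large residual in only one build would point at that library.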
We appear to be holding off releasing openblas into Fedora and EPEL until this is fixed, so it would be nice to see it addressed. @xianyi - do you still need anything?
@opoplawski ,
I have a project deadline next week. After that, I will address this issue.
Xianyi
Any progress?
Sorry, I didn't start to debug it. I got a fever.
On 2013/4/16, Susi Lehtola wrote: Any progress?
Sorry to hear that. Get well soon!
Any progress..?
I have replaced OpenBLAS functions with netlib BLAS to narrow down the error.
And? :)
@susilehtola, OpenBLAS/GotoBLAS implements some LAPACK functions itself, including LU and Cholesky factorization. I found that this is a bug in the OpenBLAS LAPACK implementation. When I replace those functions with the netlib reference implementation, the error goes away.
However, I haven't had enough time to debug this error. I have a project deadline at the beginning of July.
Xianyi
Thanks for the info. I've removed the special LAPACK functions from the Fedora package for the time being, so that I can finally submit it to the stable distribution.
Cool, fixed in 0.2.7.
Hi,
I've recently been trying to benchmark how much faster my code, ERKALE (http://erkale.googlecode.com), becomes when it's linked against OpenBLAS instead of ATLAS.
However, I've run into a seriously strange bug. Some calculations run through just fine, but others give weird results with the serial version of the library. However, the same program linked against the OpenMP version of OpenBLAS gives the correct results.
My program mostly uses just matrix-vector multiplication (maybe some matrix-matrix as well), dot products, and most importantly eigenvector analysis (for symmetric matrices). I currently have no idea what the problem really is, but maybe you could look into this?
The system I've been running on is a Sandy Bridge (Intel(R) Core(TM) i7-2600).