evangelistalab / forte

http://www.evangelistalab.org
GNU Lesser General Public License v3.0

DMRG tests not passing in GHA but passing locally #400

Closed brianz98 closed 1 month ago

brianz98 commented 1 month ago

DMRG tests are now temporarily disabled in GHA because they're not consistently passing. Forte is still compiled with Block2 in GHA.

See, for example, https://github.com/evangelistalab/forte/actions/runs/10255130288/job/28371416030, which gives the following stack trace:

/home/runner/block2-bin/lib/libblock2.so(+0xca58c3) [0x7fc1fa0a58c3]
/home/runner/block2-bin/lib/libblock2.so(+0xca5cbc) [0x7fc1fa0a5cbc]
/home/runner/block2-bin/lib/libblock2.so(_ZN6block212SparseMatrixINS_11SU2LongLongEdE18right_canonicalizeERKSt10shared_ptrIS2_E+0x2ab) [0x7fc1f9fc313b]
/home/runner/block2-bin/lib/libblock2.so(_ZN6block23MPSINS_11SU2LongLongEdE26random_canonicalize_tensorEi+0x568) [0x7fc1f9b22158]
/home/runner/block2-bin/lib/libblock2.so(_ZN6block23MPSINS_11SU2LongLongEdE19random_canonicalizeEv+0x25) [0x7fc1f9b221e5]
/home/runner/block2-bin/lib/libblock2.so(_ZNK6block210DMRGDriverINS_11SU2LongLongEdE14get_random_mpsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjiiS1_iRKSt6vectorIdSaIdEEbS1_+0x52d) [0x7fc1f9775bad]
/home/runner/work/forte/forte/forte/_forte.cpython-311-x86_64-linux-gnu.so(+0x31a9bf) [0x7fc1fa71a9bf]
/home/runner/work/forte/forte/forte/_forte.cpython-311-x86_64-linux-gnu.so(+0x263609) [0x7fc1fa663609]
/home/runner/work/forte/forte/forte/_forte.cpython-311-x86_64-linux-gnu.so(+0x20ccbf) [0x7fc1fa60ccbf]
/home/runner/work/forte/forte/forte/_forte.cpython-311-x86_64-linux-gnu.so(+0x1b0ad9) [0x7fc1fa5b0ad9]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x201b06) [0x55d701698b06]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyObject_MakeTpCall+0x253) [0x55d7016778b3]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyEval_EvalFrameDefault+0x716) [0x55d7016853b6]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyFunction_Vectorcall+0x181) [0x55d7016a8981]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyEval_EvalFrameDefault+0x4a44) [0x55d7016896e4]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyFunction_Vectorcall+0x181) [0x55d7016a8981]
/usr/share/miniconda/envs/forte/bin/python3.11(PyObject_Call+0x130) [0x55d7016b26b0]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyEval_EvalFrameDefault+0x4a44) [0x55d7016896e4]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2a5a8d) [0x55d70173ca8d]
/usr/share/miniconda/envs/forte/bin/python3.11(PyEval_EvalCode+0x9f) [0x55d70173c11f]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2c408a) [0x55d70175b08a]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2bfc13) [0x55d701756c13]
/usr/share/miniconda/envs/forte/bin/python3.11(PyRun_StringFlags+0x62) [0x55d70174b3c2]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2bc797) [0x55d701753797]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x1fafbf) [0x55d701691fbf]
/usr/share/miniconda/envs/forte/bin/python3.11(PyObject_Vectorcall+0x2c) [0x55d701691eac]
/usr/share/miniconda/envs/forte/bin/python3.11(_PyEval_EvalFrameDefault+0x716) [0x55d7016853b6]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2a5a8d) [0x55d70173ca8d]
/usr/share/miniconda/envs/forte/bin/python3.11(PyEval_EvalCode+0x9f) [0x55d70173c11f]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2c408a) [0x55d70175b08a]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2bfc13) [0x55d701756c13]
/usr/share/miniconda/envs/forte/bin/python3.11(+0x2d4fb0) [0x55d70176bfb0]
brianz98 commented 1 month ago

@hczhai is this something that you've seen before?

hczhai commented 1 month ago

This is likely caused by a problem with using MKL on AMD CPUs. You can try either of the following methods:

  1. Make sure that block2 is not linked with MKL. Even with -DUSE_MKL=OFF, CMake may still pick up MKL as the LAPACK/BLAS library when no other BLAS library is available. See the "Found BLAS" line in https://github.com/evangelistalab/forte/actions/runs/10255130288/job/28371416030#step:12:93. You can use environment variables such as BLAS_ROOT to control this; see https://github.com/block-hczhai/block2-preview/blob/p0.5.3rc16/.github/workflows/build.yml#L143. You may need to install OpenBLAS, which works well on AMD CPUs.
  2. If you do intend to use MKL, you need to redefine a function in the MKL library to make it work on AMD CPUs. To do this, set LD_PRELOAD before running any binaries or Python modules linked with libblock2.so (nothing needs to change when compiling block2). See https://github.com/block-hczhai/block2-preview/blob/p0.5.3rc16/.github/workflows/build.yml#L103-L109 and https://github.com/block-hczhai/block2-preview/blob/p0.5.3rc16/.github/workflows/build.yml#L232-L235.
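
To check whether a given runner is actually in the "MKL on an AMD CPU" situation described above, something like the following Python sketch might help. It is not part of Forte or block2; the function names are made up for illustration, and it assumes a Linux host where /proc/cpuinfo and /proc/<pid>/maps have their usual layout. It only reports which BLAS libraries are mapped into a process and what the CPU vendor is; it does not apply either workaround.

```python
import re
from pathlib import Path
from typing import Optional, Set


def cpu_vendor(cpuinfo_text: str) -> Optional[str]:
    """Return the first vendor_id value in /proc/cpuinfo-style text, if any."""
    match = re.search(r"^vendor_id\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def mapped_blas_libraries(maps_text: str) -> Set[str]:
    """Return file names of MKL/OpenBLAS shared objects in /proc/<pid>/maps-style text."""
    names = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            name = Path(fields[5].strip()).name
            if re.match(r"lib(mkl|openblas)", name):
                names.add(name)
    return names


def mkl_on_amd(cpuinfo_text: str, maps_text: str) -> bool:
    """True when the CPU vendor is AMD and at least one MKL library is mapped."""
    uses_mkl = any(n.startswith("libmkl") for n in mapped_blas_libraries(maps_text))
    return cpu_vendor(cpuinfo_text) == "AuthenticAMD" and uses_mkl
```

In a CI step, one would first import the Python module that links libblock2.so, then pass the contents of /proc/cpuinfo and /proc/self/maps to `mkl_on_amd`. A `True` result suggests that one of the two methods above is needed; a `False` result does not rule out other causes of the crash.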
brianz98 commented 1 month ago

The second option worked great! Thanks so much, again:)

hczhai commented 2 weeks ago

Hi @brianz98,

I see that this problem is not yet solved in GitHub Actions. The following changes should work: https://github.com/hczhai/forte/commit/0a4be4f7e0758841fba2d18c1eb1a250a85ae936. Sorry for the inconvenience, and let me know if there are additional issues.