ORNL-CEES / mfmg

MFMG is an open-source library implementing matrix-free multigrid methods.
https://mfmg.readthedocs.io
BSD 3-Clause "New" or "Revised" License

tests failing on gpusys (RHEL 7 system) #118

Open wdj opened 5 years ago

wdj commented 5 years ago

gpusys$ cat /proc/meminfo | head -n1
MemTotal:        3859908 kB

spack install

cd /usr/local/src
git clone https://github.com/spack/spack.git
chmod -R a+rX spack

in user .bashrc

export SPACK_ROOT=/usr/local/src/spack
. $SPACK_ROOT/share/spack/setup-env.sh

spack installs

spack install gcc
spack compiler add $(spack location -i gcc@8.2.0)
spack install dealii@develop %gcc@8.2.0
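A quick sanity check of what those installs produced (a hedged sketch, not part of the original report; the specs gcc@8.2.0 and dealii@develop are taken from the commands above):

# List the installed specs with their hashes and variants
spack find -lv gcc dealii arpack-ng
# Show the concretized dependency tree of the dealii spec; arpack-ng should appear in it
spack spec dealii@develop %gcc@8.2.0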

in user .bashrc

GCCROOT=$(spack location --install-dir gcc)
export LD_LIBRARY_PATH="${GCCROOT}/lib:${GCCROOT}/lib64"
PATH="${GCCROOT}/bin:${PATH}"
MPIROOT=$(spack location --install-dir mpi)
PATH="${MPIROOT}/bin:${PATH}"
CMAKEROOT=$(spack location --install-dir cmake)
PATH="${CMAKEROOT}/bin:${PATH}"
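After re-sourcing the .bashrc, a minimal sanity check (a sketch, not something from the thread) that the spack-built toolchain is the one actually being picked up:

source ~/.bashrc
which gcc mpiexec cmake        # all three should resolve under /usr/local/src/spack/opt/spack/...
gcc --version | head -n1       # expect the spack-built gcc 8.2.0
mpiexec --version | head -n1   # expect the spack-built Open MPI 3.1.3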

cmake/make commands

DEAL_II_DIR=$(spack location --install-dir dealii)
BOOST_ROOT=$(spack location --install-dir boost)
cmake \
  -D CMAKE_BUILD_TYPE=Debug \
  -D MFMG_ENABLE_TESTS=ON \
  -D MFMG_ENABLE_CUDA=OFF \
  -D BOOST_ROOT=${BOOST_ROOT} \
  -D DEAL_II_DIR=${DEAL_II_DIR} \
  ../mfmg
make

test command

env DEAL_II_NUM_THREADS=1 make test ARGS=-V
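For reference, a single failing case can also be reproduced outside the test harness (a hedged sketch; the build directory path is taken from the logs below, and mpiexec is assumed to be the spack-built one already on the PATH):

cd ~/mfmg_project/build/tests
# run one instance of the hierarchy test binary directly
env DEAL_II_NUM_THREADS=1 mpiexec -n 1 ./test_hierarchy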

partial test output

7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
7: Test timeout computed to be: 1500
7: Running 23 test cases...
7: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
7: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
7: --------------------------------------------------------------------------
7: Primary job terminated normally, but 1 process returned
7: a non-zero exit code. Per user-direction, the job has been aborted.
7: --------------------------------------------------------------------------
7: --------------------------------------------------------------------------
7: mpiexec detected that one or more processes exited with non-zero status, thus causing
7: the job to be terminated. The first process to do so was:
7:
7: Process name: [[55908,1],0]
7: Exit code: 2
7: --------------------------------------------------------------------------
7/20 Test #7: test_hierarchy_1 .................***Failed    4.07 sec
test 8
    Start 8: test_hierarchy_2

8: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "2" "./test_hierarchy"
8: Test timeout computed to be: 1500
8: Running 23 test cases...
8: Running 23 test cases...
8: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
8: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
8: --------------------------------------------------------------------------
8: Primary job terminated normally, but 1 process returned
8: a non-zero exit code. Per user-direction, the job has been aborted.
8: --------------------------------------------------------------------------
8: unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
8: --------------------------------------------------------
8: An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
8: void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
8: The violated condition was:
8: status == UMFPACK_OK
8: Additional information:
8: UMFPACK routine umfpack_dl_numeric returned error status 1.
8:
8: A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.
8:
8: That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.
8:
8: The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.
8:
8: The other common situation is that you run out of memory. On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
8: --------------------------------------------------------
8:
8: /home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
8: --------------------------------------------------------------------------
8: mpiexec detected that one or more processes exited with non-zero status, thus causing
8: the job to be terminated. The first process to do so was:
8:
8: Process name: [[55924,1],0]
8: Exit code: 2
8: --------------------------------------------------------------------------
8/20 Test #8: test_hierarchy_2 .................***Failed    2.91 sec

from Testing/Temporary/LastTest.log

7/20 Testing: test_hierarchy_1
7/20 Test: test_hierarchy_1
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "1" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_1" start time: Jan 21 20:05 EST
Output:
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[55908,1],0]
Exit code: 2
--------------------------------------------------------------------------
Test time =   4.07 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_1" end time: Jan 21 20:05 EST
"test_hierarchy_1" time elapsed: 00:00:04
----------------------------------------------------------

8/20 Testing: test_hierarchy_2
8/20 Test: test_hierarchy_2
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "2" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_2" start time: Jan 21 20:05 EST
Output:
----------------------------------------------------------
Running 23 test cases...
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
--------------------------------------------------------
An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
The violated condition was:
status == UMFPACK_OK
Additional information:
UMFPACK routine umfpack_dl_numeric returned error status 1.

A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.

That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.

The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.

The other common situation is that you run out of memory. On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
--------------------------------------------------------
/home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[55924,1],0]
Exit code: 2
--------------------------------------------------------------------------
Test time =   2.91 sec
----------------------------------------------------------
Test Failed.
"test_hierarchy_2" end time: Jan 21 20:05 EST
"test_hierarchy_2" time elapsed: 00:00:02
----------------------------------------------------------
Rombur commented 5 years ago

Any idea @aprokop? There seems to be a problem with arpack, and as a result the coarse matrix becomes singular, which trips UMFPACK.

aprokop commented 5 years ago

Not sure off the top of my head. From the log, it is clear that arpack is trying to write to the lout stream, which is negative. This typically indicates that the corresponding file was not opened properly. However, without a backtrace it is hard to tell where it is trying to write.

In general, I'm not sure what's happening here. Why is arpack being referenced from spack-stage? If the spack package was installed properly, it should have been moved out of the stage area. See, for example, how the openmpi command in the log is invoked:

7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"

So that was properly installed in /usr/local/src/spack/. But arpack is being referenced as /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f, which baffles me.

wdj commented 5 years ago

Am I missing some kind of spack (post-build) install step?

FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as regular user --

Rombur commented 5 years ago

Am I missing some kind of spack (post-build) install step?

No, you should be good.

FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as regular user

I am not sure if that's a problem. I always build spack as a regular user and then load the modules that were created.

Instead of using make test, can you try ctest? I doubt it will help, but that's the way we usually run the tests.
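A minimal sketch of what that would look like from the build directory (path and DEAL_II_NUM_THREADS setting assumed from the report above):

cd ~/mfmg_project/build
env DEAL_II_NUM_THREADS=1 ctest -V                     # all tests, verbose
env DEAL_II_NUM_THREADS=1 ctest -R test_hierarchy -V   # only the failing hierarchy tests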

Rombur commented 5 years ago

@wdj can you show the output of spack location --install-dir arpack-ng?

wdj commented 5 years ago

gpusys$ spack location --install-dir arpack-ng

/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/arpack-ng-3.6.3-uqfbppbahrwiobzqglsrfl3pdkphprll

I will try building the spack stuff in user space --

aprokop commented 5 years ago

You could also try removing spack-stage.

wdj commented 5 years ago

Oddly, removing the stage dir doesn't change the behavior. I'm guessing the /tmp/root/spack-stage/... path is baked into the object code at compile time and is irrelevant at runtime --
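One way to check that guess (a hedged sketch; the install prefix comes from the spack location output above, and the library may sit in lib or lib64):

ARPACK_DIR=$(spack location --install-dir arpack-ng)
# gfortran error messages carry the source path recorded at compile time, so finding
# the stage path embedded in the installed library would support the "baked in" guess
strings "${ARPACK_DIR}"/lib*/libarpack.so* | grep -m1 spack-stage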

Rombur commented 5 years ago

I tried with a fresh clone of spack and I have the same problem. I have a working version using spack and it uses the same version of arpack, so that's not the problem. One strange thing: on Ubuntu, arpack was installed in lib, but on RHEL it is installed in lib64, and lib only contains a bunch of CMake files. I don't know whether spack is doing something different or whether it is because of the OS. I checked other libraries and they don't have a lib64.
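A minimal way to see the layout difference described above (a sketch, assuming the arpack-ng prefix reported by spack location --install-dir arpack-ng):

ARPACK_DIR=$(spack location --install-dir arpack-ng)
ls "${ARPACK_DIR}"          # on RHEL: expect both lib/ and lib64/
ls "${ARPACK_DIR}"/lib      # reportedly only CMake config files here
ls "${ARPACK_DIR}"/lib64    # the actual libarpack library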

Let's talk about it at the meeting.

wdj commented 5 years ago

FWIW, when building as a regular user rather than as root, I got the following dealii build error. I must have something different in my environment, but I haven't found it yet --

Regardless, I am moving forward with the Lanczos integration. I have the code and a unit test working, but have not yet interfaced the Lanczos solver to the mfmg algorithm proper.

I have it on a branch ("lanczos") I've pushed to the repo --

######################################################################## 100.0%
==> Staging archive: /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2.tgz
==> Created stage in /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow
==> No patches needed for netlib-scalapack
==> Building netlib-scalapack [CMakePackage]
==> Executing phase: 'cmake'
==> Error: ProcessError: Command exited with status 1:
    'cmake' '/home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2' '-G' 'Unix Makefiles'
    '-DCMAKE_INSTALL_PREFIX:PATH=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow'
    '-DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo' '-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON' '-DCMAKE_INSTALL_RPATH_USE_LINK_PATH:BOOL=FALSE'
    '-DCMAKE_INSTALL_RPATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib64;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/hwloc-1.11.11-lbhqpuejkjid7uarmzqeavfvx6ps6ifu/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libpciaccess-0.13.5-qcb7t3uk6lfo2km5mu3xwjjrh6amgb2r/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libxml2-2.9.8-fi5emr4twy4kogxov4t7hx4yydeuaga4/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libiconv-1.15-zv3vs247p4445x5dbgxlgsqch3bsgbta/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/xz-5.2.4-bcielpo4hqmmyorbqx3lhfdb63sqe4i6/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/zlib-1.2.11-hyog4nvfq25emh5taua53slpjeplgwm2/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/numactl-2.0.12-olbib5og26swgq3r4j2oe3vzrqzjiruz/lib'
    '-DCMAKE_PREFIX_PATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/cmake-3.13.3-5prvjs5duzkuido454kgmro7czi3e46q;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep'
    '-DBUILD_SHARED_LIBS:BOOL=ON' '-DBUILD_STATIC_LIBS:BOOL=OFF' '-DLAPACK_FOUND=true'
    '-DLAPACK_INCLUDE_DIRS=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/include'
    '-DLAPACK_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so'
    '-DBLAS_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so'

1 error found in build log:
     22    -- --> C Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpicc
     23    -- --> MPI Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
     24    -- --> Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
     25    -- Reducing RELEASE optimization level to O2
     26    -- =========
     27    -- Compiling and Building BLACS INSTALL Testing to set correct variables
     28    CMake Error at CMAKE/FortranMangling.cmake:27 (MESSAGE):
     29      Configure in the BLACS INSTALL directory FAILED
     30    Call Stack (most recent call first):
     31      CMakeLists.txt:122 (COMPILE)
     32
     33
     34    -- Configuring incomplete, errors occurred!

See build log for details: /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2/spack-build.out
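Since the same specs built fine as root, a few hedged things worth comparing between the two environments (a sketch, not something tried in the thread; the compilers.yaml path may differ by spack version):

spack compilers                                # confirm the user-space spack picked up the same gcc@8.2.0
env | grep -iE '^(CC|CXX|FC|F77|F90)=|FLAGS'   # stray compiler/flag variables can break configure checks
cat ~/.spack/linux/compilers.yaml              # compare against the root install's compiler config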

aprokop commented 5 years ago

What comes to mind is spack/spack#764. The line that sticks out in the package is the following:

options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix)

It was originally introduced to fix an issue on macOS, but I wonder if it breaks things on Red Hat.
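For completeness, a hedged sketch of how one could test that locally (the edit/rebuild flow is assumed, not prescribed by the thread):

spack edit arpack-ng                       # opens the arpack-ng package.py; change or drop the
                                           # -DCMAKE_INSTALL_NAME_DIR line there
spack uninstall --dependents -y arpack-ng  # remove arpack-ng and everything built on top of it
spack install dealii@develop %gcc@8.2.0    # rebuild arpack-ng and its dependents with the edited package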

Rombur commented 5 years ago

Changing/removing the line options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix) doesn't change anything.