Open wdj opened 5 years ago
Any idea @aprokop? There seems to be a problem with arpack and so the coarse matrix becomes singular which trips UMFPACK.
Not sure off the top of my head. From the log, it is clear that arpack is trying to write into the lout stream, which is negative. This typically indicates that the corresponding file was not opened properly. However, without a backtrace it is hard to understand where it is trying to write to.
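As a minimal sketch (assuming the gfortran toolchain used in this build), the failure mode can be reproduced directly: writing to a negative unit that was never opened aborts with exactly this runtime error, and compiling with -fbacktrace makes the runtime print a call trace alongside it.

```shell
# Reproduce the same class of gfortran runtime error seen in dvout.f.
# All paths/names below are illustrative, not part of the original thread.
cat > /tmp/neg_unit.f90 <<'EOF'
program neg_unit
  integer :: lout = -1   ! mirrors arpack's lout when no log unit was set up
  write (lout, *) 'x'    ! aborts: unit is negative and was never opened
end program neg_unit
EOF
if command -v gfortran >/dev/null 2>&1; then
  gfortran -g -fbacktrace -o /tmp/neg_unit /tmp/neg_unit.f90
  /tmp/neg_unit || true  # -fbacktrace adds a backtrace to the error message
fi
```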
In general, I'm not sure what's happening here. Why is arpack being called from spack-stage? If the spack package was installed properly, it should have been moved out of stage. See, for example, how the openmpi command in the log is being called:
7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
So that was properly installed in /usr/local/src/spack/. But arpack is being referenced as /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f, which baffles me.
Am I missing some kind of spack (post-build) install step?
FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as a regular user.
Am I missing some kind of spack (post-build) install step?
No, you should be good.
FWIW, I'm doing the spack builds as root but the mfmg configure/make/run as regular user
I am not sure if that's a problem. I always build spack as a regular user and then load the modules that were created.
Instead of using make test, can you try ctest? I doubt it will help but that's the way we usually run the tests.
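For reference, the equivalent direct invocation (a sketch; the build-tree path and the -R filter are assumptions based on the log above) would look like:

```shell
# Run only the hierarchy tests through ctest, verbose, with deal.II
# forced to a single thread as in the make test invocation.
cd /home/wjd/mfmg_project/build/tests   # hypothetical: build tree from the log
env DEAL_II_NUM_THREADS=1 ctest -V -R test_hierarchy
```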
@wdj can you show the output of spack location --install-dir arpack-ng
gpusys$ spack location --install-dir arpack-ng
/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/arpack-ng-3.6.3-uqfbppbahrwiobzqglsrfl3pdkphprll
I will try building the spack stuff in user space.
You could also try removing spack-stage.
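A minimal sketch of clearing the stage areas (spack has a built-in command for this; the /tmp path is the one reported in the error output):

```shell
# Two equivalent ways to drop the staged source trees:
spack clean --stage            # removes all of spack's stage directories
rm -rf /tmp/root/spack-stage   # the location seen in the test failure
```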
Oddly, removing the stage dir doesn't change the behavior. I'm guessing the /tmp/root/spack-stage/... path must be baked into the object code at compile time, so it's irrelevant at runtime.
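One way to confirm the path is baked in at compile time (gfortran embeds the source file name it reports in runtime error messages) would be to scan the installed library; a hedged sketch, assuming the lib64 layout seen on this box:

```shell
# If this prints matches, the stage path is embedded in the binary,
# so deleting the stage directory cannot affect the error message.
ARPACK_PREFIX=$(spack location -i arpack-ng)
strings "$ARPACK_PREFIX"/lib*/libarpack.so | grep spack-stage
```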
I tried with a fresh clone of spack and I have the same problem. I have a working version using spack and it uses the same version of arpack, so that's not the problem. Something strange is that on Ubuntu, arpack was installed in lib but on RHEL it is installed in lib64. In lib, there are a bunch of cmake files. I don't know if it is spack that is doing something different or if it is because of the OS. I checked other libraries and they don't have lib64.
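A quick way to compare the layouts (if I understand CMake's GNUInstallDirs defaults, CMake-based packages pick lib64 on Red Hat-style 64-bit systems unless CMAKE_INSTALL_LIBDIR is pinned, while Ubuntu defaults to lib):

```shell
# List whatever lib directories the arpack-ng prefix actually has;
# on this RHEL box it should show libarpack in lib64 and cmake files in lib.
ls "$(spack location -i arpack-ng)"/lib*
```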
Let's talk about it at the meeting.
FWIW, when building as a regular user, not root, I got the following dealii build error. I must have something different in my environment, but I haven't found it yet.
Regardless, I am moving forward with the lanczos integration. I have the code and a unit test working but have not yet interfaced the lanczos solver to the mfmg algorithm proper. I have it on a branch ("lanczos") I've pushed to the repo.
######################################################################## 100.0%
==> Staging archive: /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2.tgz
==> Created stage in /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow
==> No patches needed for netlib-scalapack
==> Building netlib-scalapack [CMakePackage]
==> Executing phase: 'cmake'
==> Error: ProcessError: Command exited with status 1:
'cmake' '/home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2' '-G' 'Unix Makefiles' '-DCMAKE_INSTALL_PREFIX:PATH=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow' '-DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo' '-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON' '-DCMAKE_INSTALL_RPATH_USE_LINK_PATH:BOOL=FALSE' '-DCMAKE_INSTALL_RPATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/lib64;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/hwloc-1.11.11-lbhqpuejkjid7uarmzqeavfvx6ps6ifu/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libpciaccess-0.13.5-qcb7t3uk6lfo2km5mu3xwjjrh6amgb2r/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libxml2-2.9.8-fi5emr4twy4kogxov4t7hx4yydeuaga4/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libiconv-1.15-zv3vs247p4445x5dbgxlgsqch3bsgbta/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/xz-5.2.4-bcielpo4hqmmyorbqx3lhfdb63sqe4i6/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/zlib-1.2.11-hyog4nvfq25emh5taua53slpjeplgwm2/lib;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/numactl-2.0.12-olbib5og26swgq3r4j2oe3vzrqzjiruz/lib' '-DCMAKE_PREFIX_PATH:STRING=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/cmake-3.13.3-5prvjs5duzkuido454kgmro7czi3e46q;/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep' '-DBUILD_SHARED_LIBS:BOOL=ON' '-DBUILD_STATIC_LIBS:BOOL=OFF' '-DLAPACK_FOUND=true' '-DLAPACK_INCLUDE_DIRS=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/include' '-DLAPACK_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so' '-DBLAS_LIBRARIES=/home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openblas-0.3.5-5jxfkb63psesbtsu7qwu2iwrrwqolyep/lib/libopenblas.so'
1 error found in build log:
     22    -- --> C Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpicc
     23    -- --> MPI Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
     24    -- --> Fortran Compiler : /home/wjd/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpif90
     25    -- Reducing RELEASE optimization level to O2
     26    -- =========
     27    -- Compiling and Building BLACS INSTALL Testing to set correct variables
     28    CMake Error at CMAKE/FortranMangling.cmake:27 (MESSAGE):
     29      Configure in the BLACS INSTALL directory FAILED
     30    Call Stack (most recent call first):
     31      CMakeLists.txt:122 (COMPILE)
     32
     33
     34    -- Configuring incomplete, errors occurred!
See build log for details: /home/wjd/spack/var/spack/stage/netlib-scalapack-2.0.2-e46zkg5p3ffv6ymcipit354xk5jdf6ow/scalapack-2.0.2/spack-build.out
The thing that comes to mind is spack/spack#764. What sticks out is the following string in the package:
options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix)
It was originally introduced to fix some Mac thing, but I wonder if it breaks Redhat.
Changing/removing the line options.append('-DCMAKE_INSTALL_NAME_DIR:PATH=%s/lib' % prefix) doesn't change anything.
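Another thing worth checking (hypothetical commands, run from the mfmg build tree seen in the logs) is which libarpack the test binary actually resolves at runtime and what RPATH it carries; a stale path pointing at the stage or at the wrong lib/lib64 directory would show up here:

```shell
cd /home/wjd/mfmg_project/build/tests     # build tree from the log above
ldd ./test_hierarchy | grep -i arpack     # resolved library path at runtime
readelf -d ./test_hierarchy | grep -Ei 'rpath|runpath'
```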
gpusys$ cat /proc/meminfo | head -n1
MemTotal:        3859908 kB
spack install
cd /usr/local/src
git clone https://github.com/spack/spack.git
chmod -R a+rX spack
in user .bashrc
export SPACK_ROOT=/usr/local/src/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack installs
spack install gcc
spack compiler add $(spack location -i gcc@8.2.0)
spack install dealii@develop %gcc@8.2.0

in user .bashrc
GCCROOT=$(spack location --install-dir gcc)
export LD_LIBRARY_PATH="${GCCROOT}/lib:${GCCROOT}/lib64"
PATH="${GCCROOT}/bin:${PATH}"
MPIROOT=$(spack location --install-dir mpi)
PATH="${MPIROOT}/bin:${PATH}"
CMAKEROOT=$(spack location --install-dir cmake)
PATH="${CMAKEROOT}/bin:${PATH}"
cmake/make commands
DEAL_II_DIR=$(spack location --install-dir dealii)
BOOST_ROOT=$(spack location --install-dir boost)
cmake \
  -D CMAKE_BUILD_TYPE=Debug \
  -D MFMG_ENABLE_TESTS=ON \
  -D MFMG_ENABLE_CUDA=OFF \
  -D BOOST_ROOT=${BOOST_ROOT} \
  -D DEAL_II_DIR=${DEAL_II_DIR} \
  ../mfmg
make
test command
env DEAL_II_NUM_THREADS=1 make test ARGS=-V
partial test output
7: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "1" "./test_hierarchy"
7: Test timeout computed to be: 1500
7: Running 23 test cases...
7: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
7: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
7: --------------------------------------------------------------------------
7: Primary job terminated normally, but 1 process returned
7: a non-zero exit code. Per user-direction, the job has been aborted.
7: --------------------------------------------------------------------------
7: --------------------------------------------------------------------------
7: mpiexec detected that one or more processes exited with non-zero status, thus causing
7: the job to be terminated. The first process to do so was:
7:
7: Process name: [[55908,1],0]
7: Exit code: 2
7: --------------------------------------------------------------------------
 7/20 Test #7: test_hierarchy_1 .................***Failed    4.07 sec
test 8
    Start 8: test_hierarchy_2
8: Test command: /usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec "-n" "2" "./test_hierarchy"
8: Test timeout computed to be: 1500
8: Running 23 test cases...
8: Running 23 test cases...
8: At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
8: Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
8: --------------------------------------------------------------------------
8: Primary job terminated normally, but 1 process returned
8: a non-zero exit code. Per user-direction, the job has been aborted.
8: --------------------------------------------------------------------------
8: unknown location(0): fatal error: in "benchmark<mfmg__DealIIMeshEvaluator<2>>": dealii::SparseDirectUMFPACK::ExcUMFPACKError:
8: --------------------------------------------------------
8: An error occurred in line <291> of file </usr/local/src/spack/var/spack/stage/dealii-develop-c34vncl5qn7fkr4afiohu5cqe5i4kd5x/dealii/source/lac/sparse_direct.cc> in function
8: void dealii::SparseDirectUMFPACK::factorize(const Matrix&) [with Matrix = dealii::SparseMatrix]
8: The violated condition was:
8: status == UMFPACK_OK
8: Additional information:
8: UMFPACK routine umfpack_dl_numeric returned error status 1.
8:
8: A complete list of error codes can be found in the file <bundled/umfpack/UMFPACK/Include/umfpack.h>.
8:
8: That said, the two most common errors that can happen are that your matrix cannot be factorized because it is rank deficient, and that UMFPACK runs out of memory because your problem is too large.
8:
8: The first of these cases most often happens if you forget terms in your bilinear form necessary to ensure that the matrix has full rank, or if your equation has a spatially variable coefficient (or nonlinearity) that is supposed to be strictly positive but, for whatever reasons, is negative or zero. In either case, you probably want to check your assembly procedure. Similarly, a matrix can be rank deficient if you forgot to apply the appropriate boundary conditions. For example, the Laplace equation without boundary conditions has a single zero eigenvalue and its rank is therefore deficient by one.
8:
8: The other common situation is that you run out of memory. On a typical laptop or desktop, it should easily be possible to solve problems with 100,000 unknowns in 2d. If you are solving problems with many more unknowns than that, in particular if you are in 3d, then you may be running out of memory and you will need to consider iterative solvers instead of the direct solver employed by UMFPACK.
8: --------------------------------------------------------
8:
8: /home/wjd/mfmg_project/mfmg/tests/test_hierarchy.cc(114): last checkpoint: "benchmark" entry.
8: --------------------------------------------------------------------------
8: mpiexec detected that one or more processes exited with non-zero status, thus causing
8: the job to be terminated. The first process to do so was:
8:
8: Process name: [[55924,1],0]
8: Exit code: 2
8: --------------------------------------------------------------------------
 8/20 Test #8: test_hierarchy_2 .................***Failed    2.91 sec
from Testing/Temporary/LastTest.log
7/20 Testing: test_hierarchy_1
7/20 Test: test_hierarchy_1
Command: "/usr/local/src/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/openmpi-3.1.3-ib5tya3erlk4gxgepkmge7ugk6ea6uip/bin/mpiexec" "-n" "1" "./test_hierarchy"
Directory: /home/wjd/mfmg_project/build/tests
"test_hierarchy_1" start time: Jan 21 20:05 EST
Output:
Running 23 test cases...
At line 51 of file /tmp/root/spack-stage/spack-stage-4kar8p/arpack-ng-3.6.3/UTIL/dvout.f
Fortran runtime error: Unit number is negative and unit was not already opened with OPEN(NEWUNIT=...)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55908,1],0]
Exit code: 2