GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

Compilation issues (Intel oneapi), please help? #272

Closed jdandrade-gmx closed 1 year ago

jdandrade-gmx commented 1 year ago

Hi all.

I'm having difficulty compiling GlobalArrays (GA, latest version) for use with GAMESS-US, so that I can run GPGPU (CUDA) calculations.

First, let me point out that gamess-us with gpgpu officially supports only two mpi "flavors", mvapich (which happens to be recommended by nvidia too) and intel mpi.

I had absolutely no success using gcc 7.5.0 with mvapich2.3.3 to compile GA. So I decided to give Intel (oneAPI 2022.1.0) a try, and had partial success. I need some help to move forward and get a working build.

For instance, if I run configure with LIBS=-mkl or LIBS=-lmkl, I always end up with the following error:

$./configure --prefix=/usr/local/chem/ga --enable-cuda-mem --enable-gparrays --with-mpi LIBS=-mkl MPICC=/opt/intel/oneapi/mpi/latest/bin/mpicc MPICXX=/opt/intel/oneapi/mpi/latest/bin/mpiicpc MPIF77=/opt/intel/oneapi/mpi/latest/bin/mpiifort CC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/icc FC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/ifort
...
checking for style of include used by make... GNU
configure: WARNING: MPI compilers desired, MPICC and CC are set, and MPICC!=CC.
configure: WARNING: Choosing MPICC as main compiler.
configure: WARNING: CC will be assumed as the unwrapped MPI compiler.
checking whether the C compiler works... no
configure: error: in `/home/johannes/src/ga-5.8.1':
configure: error: C compiler cannot create executables
See `config.log' for more details

On the other hand, if I drop LIBS and set both --with-blas=-mkl and --with-lapack=-mkl, configuration goes smoothly and make does not seem to show any errors, but make check ends as follows:

$./configure --prefix=/usr/local/chem/ga --enable-cuda-mem --enable-gparrays --with-mpi --with-blas=-mkl --with-lapack=-mkl MPICC=/opt/intel/oneapi/mpi/latest/bin/mpicc MPICXX=/opt/intel/oneapi/mpi/latest/bin/mpiicpc MPIF77=/opt/intel/oneapi/mpi/latest/bin/mpiifort CC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/icc FC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/ifort
$make
$make check
...
/bin/sh ./libtool  --tag=CC   --mode=link /opt/intel/oneapi/mpi/latest/bin/mpicc   -fno-aggressive-loop-optimizations          -o ma/testc.x ma/testc.o libga.la -lm
libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc -fno-aggressive-loop-optimizations -o ma/testc.x ma/testc.o  ./.libs/libga.a -L/opt/intel/oneapi/vpl/2022.1.0/lib -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8 -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib -L/opt/intel/oneapi/mpi/2021.6.0//lib/release -L/opt/intel/oneapi/mpi/2021.6.0//lib -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64 -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64 -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp -L/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/../../compiler/lib/intel64_lin -L/usr/lib64/gcc/x86_64-suse-linux/7/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64 -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/ -L/lib/../lib64 -L/lib/../lib64/ -L/usr/lib/../lib64 -L/usr/lib/../lib64/ -L/opt/intel/oneapi/vpl/2022.1.0/lib/ -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8/ -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/release/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/ -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64/ -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib/ -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib/ -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/lib/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../ -L/lib64 -L/lib/ -L/usr/lib64 -L/usr/lib 
/home/johannes/src/ga-5.8.1/comex/.libs/libarmci.a -lblas -limf -lifport -lifcoremt -lsvml -lipgo -lirc -lpthread -lirc_s -ldl -lm -mkl
gcc: error: unrecognized command line option '-mkl'
make[3]: *** [Makefile:5861: ma/testc.x] Error 1
make[3]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[2]: *** [Makefile:7951: check-am] Error 2
make[2]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[1]: *** [Makefile:7344: check-recursive] Error 1
make[1]: Leaving directory '/home/johannes/src/ga-5.8.1'
make: *** [Makefile:7954: check] Error 2

If I run configure from scratch again, then manually edit the Makefile to change both entries from "-mkl" to "-lmkl", and run make and make check again, I get:

$./configure --prefix=/usr/local/chem/ga --enable-cuda-mem --enable-gparrays --with-mpi --with-blas=-mkl --with-lapack=-mkl MPICC=/opt/intel/oneapi/mpi/latest/bin/mpicc MPICXX=/opt/intel/oneapi/mpi/latest/bin/mpiicpc MPIF77=/opt/intel/oneapi/mpi/latest/bin/mpiifort CC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/icc FC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/ifort
$vi Makefile
$make
$make check
...
libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc -fno-aggressive-loop-optimizations -o ma/testc.x ma/testc.o  ./.libs/libga.a -L/opt/intel/oneapi/vpl/2022.1.0/lib -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8 -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib -L/opt/intel/oneapi/mpi/2021.6.0//lib/release -L/opt/intel/oneapi/mpi/2021.6.0//lib -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64 -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64 -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp -L/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/../../compiler/lib/intel64_lin -L/usr/lib64/gcc/x86_64-suse-linux/7/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64 -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/ -L/lib/../lib64 -L/lib/../lib64/ -L/usr/lib/../lib64 -L/usr/lib/../lib64/ -L/opt/intel/oneapi/vpl/2022.1.0/lib/ -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8/ -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/release/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/ -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64/ -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib/ -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib/ -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/lib/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../ -L/lib64 -L/lib/ -L/usr/lib64 -L/usr/lib -lmkl 
/home/johannes/src/ga-5.8.1/comex/.libs/libarmci.a -lblas -limf -lifport -lifcoremt -lsvml -lipgo -lirc -lpthread -lirc_s -ldl -lm
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: cannot find -lmkl
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:5861: ma/testc.x] Error 1
make[3]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[2]: *** [Makefile:7951: check-am] Error 2
make[2]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[1]: *** [Makefile:7344: check-recursive] Error 1
make[1]: Leaving directory '/home/johannes/src/ga-5.8.1'
make: *** [Makefile:7954: check] Error 2

And if I use "-lmkl" from the start, I get:

$./configure --prefix=/usr/local/chem/ga --enable-cuda-mem --enable-gparrays --with-mpi --with-blas=-lmkl --with-lapack=-lmkl MPICC=/opt/intel/oneapi/mpi/latest/bin/mpicc MPICXX=/opt/intel/oneapi/mpi/latest/bin/mpiicpc MPIF77=/opt/intel/oneapi/mpi/latest/bin/mpiifort CC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/icc FC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/ifort
$make
$make check
...
/bin/sh ./libtool  --tag=CC   --mode=link /opt/intel/oneapi/mpi/latest/bin/mpicc   -fno-aggressive-loop-optimizations          -o ma/testc.x ma/testc.o libga.la -lm
libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc -fno-aggressive-loop-optimizations -o ma/testc.x ma/testc.o  ./.libs/libga.a -L/opt/intel/oneapi/vpl/2022.1.0/lib -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8 -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib -L/opt/intel/oneapi/mpi/2021.6.0//lib/release -L/opt/intel/oneapi/mpi/2021.6.0//lib -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64 -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64 -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64 -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64 -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp -L/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/../../compiler/lib/intel64_lin -L/usr/lib64/gcc/x86_64-suse-linux/7/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64 -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/ -L/lib/../lib64 -L/lib/../lib64/ -L/usr/lib/../lib64 -L/usr/lib/../lib64/ -L/opt/intel/oneapi/vpl/2022.1.0/lib/ -L/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8/ -L/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/release/ -L/opt/intel/oneapi/mpi/2021.6.0//lib/ -L/opt/intel/oneapi/mkl/2022.1.0/lib/intel64/ -L/opt/intel/oneapi/ipp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib/ -L/opt/intel/oneapi/dal/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin/ -L/opt/intel/oneapi/compiler/2022.1.0/linux/lib/ -L/opt/intel/oneapi/clck/2021.6.0/lib/intel64/ -L/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/lib/ -L/usr/lib64/gcc/x86_64-suse-linux/7/../../../ -L/lib64 -L/lib/ -L/usr/lib64 -L/usr/lib -llapack 
/home/johannes/src/ga-5.8.1/comex/.libs/libarmci.a -lblas -limf -lifport -lifcoremt -lsvml -lipgo -lirc -lpthread -lirc_s -ldl -lm
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: ./.libs/libga.a(ma.o): in function `MA_init':
ma.c:(.text+0x2409): undefined reference to `cudaMallocManaged'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:5861: ma/testc.x] Error 1
make[3]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[2]: *** [Makefile:7951: check-am] Error 2
make[2]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[1]: *** [Makefile:7344: check-recursive] Error 1
make[1]: Leaving directory '/home/johannes/src/ga-5.8.1'
make: *** [Makefile:7954: check] Error 2

Does anybody have any clue what I might be doing wrong here? :(

Thanks a lot in advance for any help! :)

jeffhammond commented 1 year ago

I'll break down each error so you can see how to debug this in the future.

First, the error below is because the mpicc script provided by Intel MPI invokes GCC not ICC, and GCC does not know about the special MKL flag.

libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc ...
...
gcc: error: unrecognized command line option '-mkl'

You can solve this by invoking mpiicc, which will in turn use ICC, and the -mkl options will work. Aside: You should use -qmkl, which is the newer version of this flag.
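
To check which compiler a given wrapper actually drives, MPICH-derived wrappers (Intel MPI's included) accept a -show option that prints the underlying command line without running it. A minimal sketch, where wrapper_backend is a hypothetical helper name and the list of wrappers is only an example:

```shell
#!/bin/sh
# Print the first word of the command line an MPI wrapper would run.
# Assumes an MPICH-style wrapper that understands -show.
wrapper_backend() {
    wrapper=$1
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "not found"
        return 0
    fi
    "$wrapper" -show 2>/dev/null | awk 'NR == 1 { print $1 }'
}

# Example wrappers; adjust the list to whatever is installed.
for w in mpicc mpiicc mpiicpc mpiifort; do
    printf '%s -> %s\n' "$w" "$(wrapper_backend "$w")"
done
```

If mpicc reports gcc while mpiicc reports icc, that matches the link failure above.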

Second, you chose to add --enable-cuda-mem, which assumes either you are using a compiler that knows where the CUDA libraries are, or that you add the right information via LDFLAGS.

libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc ...
...
ma.c:(.text+0x2409): undefined reference to `cudaMallocManaged'

Are you sure that you want all MA allocations backed by CUDA managed memory? That makes sense if you are using NWChem and using MA as the heap manager, but the majority of other GA codes do not use MA like NWChem does, so I am skeptical you need it in GAMESS.

If you want MA to use CUDA managed memory, then figure out the path to the CUDA runtime and set LDFLAGS=-L/usr/local/cuda/lib -lcuda or something like that.
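
For reference, cudaMallocManaged is part of the CUDA runtime API, so the library that normally resolves it is libcudart, and on many Linux installs it sits under lib64 rather than lib. A minimal sketch for building the flags, assuming a standard toolkit layout; cuda_link_flags is a hypothetical helper and /usr/local/cuda is only a default guess:

```shell
#!/bin/sh
# Print linker flags for the CUDA runtime found under a CUDA root directory.
# Assumption: the toolkit keeps libcudart.so in <root>/lib64 or <root>/lib.
cuda_link_flags() {
    root=${1:-/usr/local/cuda}
    for dir in "$root/lib64" "$root/lib"; do
        if [ -e "$dir/libcudart.so" ]; then
            echo "-L$dir -lcudart"
            return 0
        fi
    done
    echo "libcudart.so not found under $root" >&2
    return 1
}

# Example call; prints a hint instead of failing when CUDA is elsewhere.
cuda_link_flags "${CUDA_HOME:-/usr/local/cuda}" || echo "adjust the CUDA root for your system"
```

The -L part would go into LDFLAGS and the -lcudart part into LIBS on the configure line.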

Finally, do you actually use --enable-gparrays in GAMESS or are you just enabling everything for the fun of it?

jdandrade-gmx commented 1 year ago

Hi @jeffhammond !

I'll break down each error so you can see how to debug this in the future.

Really, thanks for your time and for the lessons! :)

First, the error below is because the mpicc script provided by Intel MPI invokes GCC not ICC, and GCC does not know about the special MKL flag.

libtool: link: /opt/intel/oneapi/mpi/latest/bin/mpicc ...
...
gcc: error: unrecognized command line option '-mkl'

You can solve this by invoking mpiicc, which will in turn use ICC, and the -mkl options will work. Aside: You should use -qmkl, which is the newer version of this flag.

Thanks! It seems I missed mpiicc (I did use the right wrappers for the Fortran and C++ compilers, although that was also by chance while searching for answers to the errors I was encountering; I'll admit I'm "flying blind" with GA here).

Also, I was really not aware of the -qmkl flag, thanks for that too! :)

Finally, I added both "--enable-cuda-mem" and "--enable-gparrays" not quite "for the fun of it", but rather because of the "lack of fun that came with previous failed compilations". I just wanted to make sure that this build would use the power of the GPU as much as possible.

I was also expecting that, if those options were too much, the error message would be clear enough and easy to debug (my mistake, clearly, since the libtool messages now seem very clear).

But, in order to make everything work better, I removed both: that seems to be the best choice, right?

Anyway, I moved forward; however, several tests in "make check" now fail:

$./configure --prefix=/usr/local/chem/ga --with-mpi --with-blas=-qmkl --with-lapack=-qmkl LIBS=-qmkl MPICC=/opt/intel/oneapi/mpi/latest/bin/mpiicc MPICXX=/opt/intel/oneapi/mpi/latest/bin/mpiicpc MPIF77=/opt/intel/oneapi/mpi/latest/bin/mpiifort CC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/icc FC=/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64/ifort
$make
$make check
...
make[3]: Leaving directory '/home/johannes/src/ga-5.8.1'
make  check-TESTS
make[3]: Entering directory '/home/johannes/src/ga-5.8.1'
make[4]: Entering directory '/home/johannes/src/ga-5.8.1'
PASS: ma/test-coalesce.x
PASS: ma/test-inquire.x
XFAIL: ma/testf.x
PASS: global/testing/elempatch.x
PASS: global/testing/getmem.x
PASS: global/testing/mtest.x
PASS: global/testing/mulmatpatchc.x
PASS: global/testing/normc.x
PASS: global/testing/matrixc.x
PASS: global/testing/ntestc.x
PASS: global/testing/nbtestc.x
PASS: global/testing/ntestfc.x
PASS: global/testing/packc.x
PASS: global/testing/patch_enumc.x
PASS: global/testing/print.x
PASS: global/testing/scan_addc.x
PASS: global/testing/scan_copyc.x
PASS: global/testing/testc.x
PASS: global/testing/testmatmultc.x
PASS: global/testing/testmult.x
PASS: global/testing/testmultrect.x
PASS: global/testing/gemmtest.x
PASS: global/testing/read_only.x
PASS: global/testing/cache_test.x
PASS: global/testing/unpackc.x
PASS: global/testing/bin.x
FAIL: global/testing/blktest.x
PASS: global/testing/g2test.x
PASS: global/testing/g3test.x
PASS: global/testing/ga_lu.x
PASS: global/testing/ga_shift.x
PASS: global/testing/ghosts.x
PASS: global/testing/jacobi.x
FAIL: global/testing/mir_perf2.x
FAIL: global/testing/mmatrix.x
FAIL: global/testing/mulmatpatch.x
PASS: global/testing/nbtest.x
PASS: global/testing/nb2test.x
FAIL: global/testing/ndim.x
FAIL: global/testing/patch.x
PASS: global/testing/patch2.x
PASS: global/testing/patch_enumf.x
FAIL: global/testing/perfmod.x
FAIL: global/testing/perform.x
FAIL: global/testing/perf.x
PASS: global/testing/perf2.x
FAIL: global/testing/pg2test.x
FAIL: global/testing/pgtest.x
PASS: global/testing/scan.x
PASS: global/testing/simple_groups.x
PASS: global/testing/sparse.x
PASS: global/testing/sprsmatmult.x
PASS: global/testing/stride.x
FAIL: global/testing/testeig.x
FAIL: global/testing/testmatmult.x
FAIL: global/testing/testsolve.x
FAIL: global/testing/test.x
FAIL: global/testing/overlay.x
PASS: global/testing/simple_groups_comm.x
PASS: global/testing/ga-mpi.x
PASS: global/testing/lock.x
PASS: global/testing/simple_groups_commc.x
FAIL: global/testing/nga-onesided.x
PASS: global/testing/nga-patch.x
FAIL: global/testing/nga-periodic.x
FAIL: global/testing/nga-scatter.x
PASS: global/testing/nga-util.x
FAIL: global/testing/ngatest.x
PASS: global/examples/lennard-jones/lennard.x
PASS: global/examples/boltzmann/boltz.x
PASS: global/testing/thread_perf_contig.x
PASS: global/testing/thread_perf_strided.x
PASS: global/testing/threadsafec.x
=================================
21 of 74 tests failed
See ./test-suite.log
Please report to hpctools@pnl.gov
=================================
make[4]: *** [Makefile:7473: test-suite.log] Error 1
make[4]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[3]: *** [Makefile:7553: check-TESTS] Error 2
make[3]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[2]: *** [Makefile:7952: check-am] Error 2
make[2]: Leaving directory '/home/johannes/src/ga-5.8.1'
make[1]: *** [Makefile:7344: check-recursive] Error 1
make[1]: Leaving directory '/home/johannes/src/ga-5.8.1'
make: *** [Makefile:7954: check] Error 2

Does anybody have any additional suggestions on how to improve this?

Thanks a lot again for all help already provided! :D

jeffhammond commented 1 year ago

--enable-gparrays has nothing to do with GPUs. It means "global pointer arrays" or something like that and was a research feature a few years ago aimed at enabling data structures outside of dense linear algebra.

--enable-cuda-mem only impacts the behavior of MA. MA may be used inside of GA in some places but there is no benefit in those places to using CUDA managed memory. The only reason for this is when I was porting NWChem CCSD(T), but it isn't required anymore, because that port now uses device memory.

I recommend not enabling either option. They will have no impact on GAMESS.

jdandrade-gmx commented 1 year ago

Hi @jeffhammond , thanks again for the prompt answer! :)

--enable-gparrays has nothing to do with GPUs. It means "global pointer arrays" or something like that and was a research feature a few years ago aimed at enabling data structures outside of dense linear algebra.

--enable-cuda-mem only impacts the behavior of MA. MA may be used inside of GA in some places but there is no benefit in those places to using CUDA managed memory. The only reason for this is when I was porting NWChem CCSD(T), but it isn't required anymore, because that port now uses device memory.

I recommend not enabling either option. They will have no impact on GAMESS.

Ok, I will keep both of them disabled forever. ;)

Concerning the 21 failed tests (listed below), do you or anybody else have any suggestions on how to proceed and make them work properly?

XFAIL: ma/testf.x
FAIL: global/testing/blktest.x
FAIL: global/testing/mir_perf2.x
FAIL: global/testing/mmatrix.x
FAIL: global/testing/mulmatpatch.x
FAIL: global/testing/ndim.x
FAIL: global/testing/patch.x
FAIL: global/testing/perfmod.x
FAIL: global/testing/perform.x
FAIL: global/testing/perf.x
FAIL: global/testing/pg2test.x
FAIL: global/testing/pgtest.x
FAIL: global/testing/testeig.x
FAIL: global/testing/testmatmult.x
FAIL: global/testing/testsolve.x
FAIL: global/testing/test.x
FAIL: global/testing/overlay.x
FAIL: global/testing/nga-onesided.x
FAIL: global/testing/nga-periodic.x
FAIL: global/testing/nga-scatter.x
FAIL: global/testing/ngatest.x

Thanks a lot in advance for any help! ;)

jeffhammond commented 1 year ago

Please provide logs for perf.x or test.x, either by finding them in the tree or by running those tests manually. It should be pretty obvious what's wrong once we have the output.

jeffhammond commented 1 year ago

./test-suite.log might have details. can't remember and am not connected to any linux machines right now.
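
To pull just the unexpected failures out of saved "make check" output, a small filter is enough. A minimal sketch; failed_tests is a hypothetical helper and check.out is an assumed file name for the captured output:

```shell
#!/bin/sh
# List the tests that the automake harness reported as unexpected failures.
# XFAIL lines (expected failures) are deliberately not matched.
failed_tests() {
    grep '^FAIL: ' "$1" | sed 's/^FAIL: //'
}

# Typical use (commented out so the snippet has no side effects):
#   make check 2>&1 | tee check.out
#   failed_tests check.out
```

With a standard automake harness, a single test can usually be rerun with something like make check TESTS=global/testing/test.x; whether GA's Makefile honors that override is an assumption here.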

jdandrade-gmx commented 1 year ago

Hi @jeffhammond, how are you?

First of all, let me apologize for taking so long to answer: I took a well-deserved and healthy two days offline (it should have been three, actually), so I could not answer you right away.

Moving towards the file requested:

./test-suite.log might have details. can't remember and am not connected to any linux machines right now.

Since I found the test-suite.log file right away, with several error messages in it, and did not manage to find specific logs for perf.x or test.x, here is test-suite.log:

test-suite.log.gz

As for the error messages, they initially seemed to point to the following as the root cause of the segfaults:

(...)
 TESTING nga_acc
    - Data Type: double precision
    - Dimension: 1
    - Running on                     4 processes (processors)
[prigogine:980  :0:980] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ac806335100)
==== backtrace ====
[prigogine:981  :0:981] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ac806ecdc40)
==== backtrace ====
[prigogine:978  :0:978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ac8067961f0)
==== backtrace ====
[prigogine:979  :0:979] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ac805e744c0)
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x1b55c) [0x7fe3880c355c]
    1  /usr/lib64/libucs.so.0(+0x1b712) [0x7fe3880c3712]
    2  /opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_def.so.2(+0x3f745f) [0x7fe37dfff45f]
===================
    0  /usr/lib64/libucs.so.0(+0x1b55c) [0x7f20525f255c]
    1  /usr/lib64/libucs.so.0(+0x1b712) [0x7f20525f2712]
    2  /opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_def.so.2(+0x3f745f) [0x7f204852e45f]
===================

I found it very strange that it did not find those libraries, so I checked LD_LIBRARY_PATH, and for some odd reason /usr/lib64 was initially not included in it. I've updated the .bashrc file to force its inclusion, as you can see below:

$echo $LD_LIBRARY_PATH
/usr/local/chem/gmx20.6/lib64:/opt/intel/oneapi/vpl/2022.1.0/lib:/opt/intel/oneapi/tbb/2021.6.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.6.0//libfabric/lib:/opt/intel/oneapi/mpi/2021.6.0//lib/release:/opt/intel/oneapi/mpi/2021.6.0//lib:/opt/intel/oneapi/mkl/2022.1.0/lib/intel64:/opt/intel/oneapi/itac/2021.6.0/slib:/opt/intel/oneapi/ipp/2021.6.0/lib/intel64:/opt/intel/oneapi/ippcp/2021.6.0/lib/intel64:/opt/intel/oneapi/ipp/2021.6.0/lib/intel64:/opt/intel/oneapi/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/debugger/2021.6.0/gdb/intel64/lib:/opt/intel/oneapi/debugger/2021.6.0/libipt/intel64/lib:/opt/intel/oneapi/debugger/2021.6.0/dep/lib:/opt/intel/oneapi/dal/2021.6.0/lib/intel64:/opt/intel/oneapi/compiler/2022.1.0/linux/lib:/opt/intel/oneapi/compiler/2022.1.0/linux/lib/x64:/opt/intel/oneapi/compiler/2022.1.0/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/ccl/2021.6.0/lib/cpu_gpu_dpcpp:/usr/lib64:/usr/local/cuda/lib64:/usr/lib64/mpi/gcc/mvapich2/lib64

After that, I ran everything again starting from configure (actually, I untarred the distribution file again). Unfortunately, the same 21 tests failed (see the file provided above), still with the same error messages... :(

As usual, thanks again for your help: I'll be looking forward to your answer.

jeffhammond commented 1 year ago

I think there's a problem with how MKL is called.

These errors are usually associated with mixing LP64 (INTEGER is 32b) and ILP64 (INTEGER is 64b) calling conventions in Fortran.

Intel MKL ERROR: Parameter 6 was incorrect on entry to DSYGV .
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 13 was incorrect on entry to DGEMM .
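
To see how widespread these messages are in a log, they can be tallied per routine. A minimal sketch; mkl_errors_by_routine is a hypothetical helper, and it assumes the messages keep the exact wording shown above:

```shell
#!/bin/sh
# Count "Intel MKL ERROR: Parameter N was incorrect on entry to X" lines per routine X.
mkl_errors_by_routine() {
    sed -n 's/.*Intel MKL ERROR: Parameter [0-9][0-9]* was incorrect on entry to \([A-Za-z0-9_]*\).*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use (commented out; the log path is only an example):
#   mkl_errors_by_routine test-suite.log
```

Errors spread across several BLAS and LAPACK routines point at a calling-convention problem rather than at one broken test.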

Please try this, and report back:

. /opt/intel/oneapi/setvars.sh --force
cd $GA_DIRECTORY
git clean -dfx
git reset --hard
git fetch --all
git checkout develop
./autogen.sh
mkdir build-intel
cd build-intel
../configure CC=icx CXX=icpx F77=ifort FC=ifort MPICC="mpiicc -cc=icx" MPICXX="mpiicpc -cxx=icpx" MPIF77=mpiifort MPIFC=mpiifort
make -j`nproc`
make -j`nproc` checkprogs
make check

This recipe worked for me. Not building GA tests with MKL does not prevent you from doing so in your application, although you of course must make sure that your integer sizes line up. See the following GA configure options for details if you need to force the size, although I do not recommend this since MKL supports both LP64 and ILP64.

  --with-blas[=ARG]       use external BLAS library; attempt to detect
                          sizeof(INTEGER)
  --with-blas4[=ARG]      use external BLAS library compiled with
                          sizeof(INTEGER)==4
  --with-blas8[=ARG]      use external BLAS library compiled with
                          sizeof(INTEGER)==8
  --with-lapack=[ARG]     use external LAPACK library
  --with-scalapack=[ARG]  use ScaLAPACK library compiled with
                          sizeof(INTEGER)==4
  --with-scalapack8=[ARG] use ScaLAPACK library compiled with
                          sizeof(INTEGER)==8
  --with-elpa=[ARG]       use ELPA library compiled with sizeof(INTEGER)==4
  --with-elpa8=[ARG]      use ELPA library compiled with sizeof(INTEGER)==8
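
If the integer size ever does have to be forced, the argument would name the matching MKL interface layer explicitly instead of relying on -qmkl. A rough sketch only: ga_blas_args is a hypothetical helper, the sequential threading layer is an arbitrary choice, and the MKL library directory is assumed to be on the linker search path already:

```shell
#!/bin/sh
# Print a GA configure argument that pins the BLAS integer size.
# 4 selects MKL's LP64 interface layer, 8 selects the ILP64 one.
ga_blas_args() {
    case $1 in
        4) iface=mkl_intel_lp64 ;;
        8) iface=mkl_intel_ilp64 ;;
        *) echo "usage: ga_blas_args 4|8" >&2; return 1 ;;
    esac
    printf '%s="-l%s -lmkl_sequential -lmkl_core -lpthread -lm -ldl"\n' "--with-blas$1" "$iface"
}

ga_blas_args 4
ga_blas_args 8
```

The application that links against GA then has to use the same interface layer.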

jeffhammond commented 1 year ago

This also works:

../configure CC=icx CXX=icpx F77=ifort FC=ifort \
MPICC="mpiicc -cc=icx" MPICXX="mpiicpc -cxx=icpx" \
MPIF77=mpiifort MPIFC=mpiifort --with-blas="-qmkl" \
 --with-lapack="-qmkl" && \
make -j8 checkprogs && \
make check

jdandrade-gmx commented 1 year ago

I'm back here.

Done as proposed, and since it worked (THANKS A LOT!) I decided to move on to several tests to better understand the behavior.

1) Setting "LIBS=-qmkl" leads to an error every time.
2) I wasn't aware of icx and icpx as the new compilers (especially because they are one directory level below all the other executables), but since I managed to make both work, I now have two options to try with GAMESS to see which one is better. Probably the new ones (icx and icpx), because they seem to be more suitable for GPU use.
3) I chose to add --with-blas="-qmkl" and --with-lapack="-qmkl" since they do not lead to any errors.
4) Interestingly, "--with-mpi" compiles fine with icc and icpc; however, it fails miserably with icx and icpx.
5) The "--force" for the Intel setvars script was unnecessary as long as setvars is already properly sourced in the .bashrc file.
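
For anyone else unsure which of these compilers and wrappers their environment exposes after sourcing setvars, a quick check along these lines may help. A minimal sketch; which_tools is a hypothetical helper and the tool list is just the set of names mentioned in this thread:

```shell
#!/bin/sh
# Report where each named tool resolves on the current PATH.
which_tools() {
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            printf '%s: %s\n' "$tool" "$path"
        else
            printf '%s: not on PATH\n' "$tool"
        fi
    done
}

which_tools icc icpc icx icpx ifort mpiicc mpiicpc mpiifort
```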

I'll now generate two versions ("icc+icpc" and "icx+icpx") of the library and try to link both with gamess. If I either succeed or fail I'll come back to report any issues (or hopefully the benchmarks compared to CPUs).

I also decided, at least at this moment, to follow your suggestion and not to try to force the size of the integers.

Finally, I haven't yet retried the recipe against the downloaded 5.8.1, but does the need for the git development version mean that there is some issue with the "main" version?

Thanks a lot for all help! :)

jeffhammond commented 1 year ago

Your issues are invariant to recent releases. I merely used the develop branch because that's what I always use.