xyuan opened this issue 2 years ago
@grnydawn already has E3SM building with GNU on Crusher. So first: does this issue appear when you extend his machine files with hipcc, or are you using your own?
Also, please confirm your branch details: are you on the latest master or something else? Which case are you trying to build?
The following run script is used for the HIP test case:
```sh
CASE_ROOT=$(pwd)
E3SM=/ccs/home/yuanx/e3sm
OUTPUT=/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS
DATA=/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS

COMPSET=F2010-MMF1
RES=ne4pg2_ne4pg2
COMPILER=gnugpu
MACH=crusher
PROJ=cli133
PELAYOUT=1x1
CASE=${COMPSET}.${RES}.${MACH}.${COMPILER}.${PELAYOUT}

echo
echo ${CASE}
echo

${E3SM}/cime/scripts/create_newcase --case ${CASE_ROOT}/${CASE} --compset ${COMPSET} --res ${RES} --mach ${MACH} --compiler ${COMPILER} --pecount ${PELAYOUT} --project ${PROJ} --output-root ${OUTPUT} --handle-preexisting-dirs r

cd ${CASE_ROOT}/${CASE}

cat > user_nl_eam << 'eof'
transport_alg = 0
hypervis_subcycle_q = 1
dt_tracer_factor = 2
eof

./case.setup
./case.build

./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=1
./xmlchange CONTINUE_RUN=FALSE
./xmlchange JOB_WALLCLOCK_TIME=02:00
./xmlchange REST_OPTION=never
./xmlchange CHARGE_ACCOUNT=$PROJ

cp -rf ${DATA}/data/ ${CASE_ROOT}/${CASE}/run

echo
echo ${CASE}
echo
```
This test can be run with the latest E3SM master, with some changes to add HIP support in config_machines.xml and gnugpu_crusher.cmake; please copy those files from my working branch. A sketch of one way to do that follows.
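For reference, one way to pull just those two machine files into a local E3SM checkout is to fetch the working branch and check out only those paths. This is a minimal sketch: the branch URL is taken from later in this thread, and the file locations are the usual E3SM ones, so both are assumptions that may need adjusting to your checkout.

```sh
# Sketch only: branch URL and file paths are assumptions from context,
# not confirmed by this thread.
cd ${E3SM}
git remote add xyuan https://github.com/xyuan/e3sm_p3_shoc
git fetch xyuan e3sm_p3_shoc_hip
# Copy only the two HIP-enabled machine files onto the current branch.
git checkout xyuan/e3sm_p3_shoc_hip -- \
    cime_config/machines/config_machines.xml \
    cime_config/machines/cmake_macros/gnugpu_crusher.cmake
```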
Even after setting up the MPICH environment (LD_LIBRARY_PATH and INCLUDE_PATH) correctly, the FindMPI.cmake test is still unable to find the libmpi_gnu_91.so library in the MPICH library directory.
Information from CMakeError.log:
```
/opt/rocm-4.5.2/bin/hipcc -mcmodel=medium -O -O2 -I/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1/include -I/opt/rocm-4.5.2/include -DTIMING -DCNL -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DLINUX -DNDEBUG -DHAVE_MPI -DMCT_INTERFACE -DPIO2 -DHAVE_SLASHPROC -D_PNETCDF -DATM_PRESENT -DICE_PRESENT -DLND_PRESENT -DOCN_PRESENT -DROF_PRESENT -DGLC_PRESENT -DWAV_PRESENT -DESP_PRESENT -DMED_PRESENT -DPIO2 -I. -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/finclude -I/opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/crayclang/10.0/include -I/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1/include -I/opt/cray/pe/parallel-netcdf/1.12.1.7/crayclang/10.0/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/include CMakeFiles/cmTC_ca4c9.dir/test_mpi.c.o -o cmTC_ca4c9 /opt/cray/pe/hdf5-parallel/1.12.0.7/crayclang/10.0/lib/libhdf5_hl_parallel.a /opt/cray/pe/hdf5-parallel/1.12.0.7/crayclang/10.0/lib/libhdf5_parallel.a /opt/cray/pe/parallel-netcdf/1.12.1.7/crayclang/10.0/lib/libpnetcdf.a /opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/crayclang/10.0/lib/libnetcdf.a /opt/cray/pe/libsci/21.08.1.2/GNU/9.1/x86_64/lib/libsci_gnu_82_mpi.a /opt/cray/pe/libsci/21.08.1.2/GNU/9.1/x86_64/lib/libsci_gnu_82.a /usr/lib64/libdl.a -lMPI_mpi_gnu_91_LIBRARY-NOTFOUND /opt/cray/pe/mpich/8.1.12/gtl/lib/libmpi_gtl_hsa.a /opt/cray/pe/dsmml/0.2.2/dsmml/lib/libdsmml.a

ld.lld: error: unable to find library -lMPI_mpi_gnu_91_LIBRARY-NOTFOUND
clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
gmake[2]: *** [CMakeFiles/cmTC_ca4c9.dir/build.make:109: cmTC_ca4c9] Error 1
gmake[2]: Leaving directory '/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/pio/pio2/CMakeFiles/CMakeTmp'
gmake[1]: *** [Makefile:127: cmTC_ca4c9/fast] Error 2
gmake[1]: Leaving directory '/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/pio/pio2/CMakeFiles/CMakeTmp'
```
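The `-lMPI_mpi_gnu_91_LIBRARY-NOTFOUND` token means FindMPI parsed the library name `mpi_gnu_91` out of the link line but could not resolve its `MPI_mpi_gnu_91_LIBRARY` cache entry to an actual file. One workaround sketch (my suggestion from the log above, not a verified fix) is to check that the library really exists and pre-seed that cache entry when the PIO cmake step is configured:

```sh
# Sketch only: assumes cray-mpich/8.1.12 with the GNU 9.1 ABI, so the library
# FindMPI wants is ${MPICH_DIR}/lib/libmpi_gnu_91.so.
export MPICH_DIR=/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1
ls ${MPICH_DIR}/lib/libmpi_gnu_91.so   # confirm the file actually exists

# FindMPI keeps one MPI_<name>_LIBRARY cache entry per parsed library name;
# pre-seeding it skips the failing search. Illustrative cmake invocation only;
# in a real E3SM build this would need to reach PIO's cmake cache.
cmake -DMPI_mpi_gnu_91_LIBRARY=${MPICH_DIR}/lib/libmpi_gnu_91.so ...
```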
Once the E3SM master branch is working, we can move to the branch https://github.com/xyuan/e3sm_p3_shoc/tree/e3sm_p3_shoc_hip to test P3 and SHOC on the Crusher GPUs.
To test P3 and SHOC, we simply need to set COMPSET=F2010-MMF2.
Digging through my old notes: please try whether setting the following CMake flag helps:

```
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath,${GCC_X86_64}/lib64"
```
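One place this could be applied persistently is the gnugpu Crusher cmake macro file mentioned earlier in the thread. The sketch below appends it there; the file path and the reliance on `GCC_X86_64` (exported by the Cray gcc module) are assumptions from context, not something this thread confirms.

```sh
# Minimal sketch, assuming gnugpu_crusher.cmake sits in the usual E3SM
# cmake_macros location; verify GCC_X86_64 is set before rebuilding.
echo ${GCC_X86_64}   # exported by the Cray gcc/11.2.0 module
cat >> ${E3SM}/cime_config/machines/cmake_macros/gnugpu_crusher.cmake << 'EOF'
string(APPEND CMAKE_EXE_LINKER_FLAGS " -Wl,-rpath,$ENV{GCC_X86_64}/lib64")
EOF
```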
I can't access your case dir above. Copy your branch to a shared location if you still have an issue after trying the above.
@xyuan Can you post the current status of this? Did the above flag help? Are you still waiting on Kitware?
This linker flag has already been used/tested, and the issue is the same.
There is an issue when building the full E3SM with the GNU compiler, but there is also a separate issue related to Cray MPICH: CMake's find_package is unable to find the MPI package and produces the message below.
The cray-mpich module is loaded, but CMake is unable to find mpirun, and the searched $PATH does not include MPICH_DIR either:
```
-- Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS)
CMake Error at /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND C)
Call Stack (most recent call first):
  /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindMPI.cmake:1748 (find_package_handle_standard_args)
  src/clib/CMakeLists.txt:32 (find_package)
```

```
[yuanx@login2.crusher pio2]$ module list

Currently Loaded Modules:
  1) craype-x86-trento
  2) libfabric/1.15.0.0
  3) craype-network-ofi
  4) perftools-base/22.05.0
  5) xpmem/2.3.2-2.2_7.8__g93dd7ee.shasta
  6) cray-pmi/6.1.2
  7) cray-pmi-lib/6.0.17
  8) gcc/11.2.0
  9) craype/2.7.15
 10) cray-dsmml/0.2.2
 11) PrgEnv-gnu/8.2.0
 12) xalt/1.3.0
 13) DefApps/default
 14) craype-accel-amd-gfx90a
 15) rocm/4.5.2
 16) cray-mpich/8.1.12
 17) cray-python/3.9.4.2
 18) git/2.31.1
 19) cmake/3.21.3
 20) zlib/1.2.11
 21) cray-libsci/21.08.1.2
 22) cray-hdf5-parallel/1.12.0.7
 23) cray-netcdf-hdf5parallel/4.7.4.7
 24) cray-parallel-netcdf/1.12.1.7
```

```
PATH=/sw/crusher/xalt/1.3.0/bin:/opt/cray/pe/parallel-netcdf/1.12.1.7/bin:/opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/bin:/opt/cray/pe/hdf5-parallel/1.12.0.7/bin:/opt/cray/pe/hdf5/1.12.0.7/bin:/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/bin:/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/git-2.31.1-fstqbk5vpdu22xo7w2ohtegoqq3y7lmb/bin:/opt/cray/pe/python/3.9.4.2/bin:/opt/rocm-4.5.2/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/gcc/11.2.0/bin:/opt/cray/pe/perftools/22.05.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.15.0.0/bin:/sw/summit/python/3.7/anaconda3/5.3.0/condabin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/ccs/home/yuanx/.local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/opt/c3/bin:/usr/lib/mit/bin:/opt/puppetlabs/bin:/sbin:/opt/cray/pe/bin
```
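Since the module environment clearly has cray-mpich loaded, one thing worth trying (my suggestion based on the symptoms above, not a confirmed fix) is to hand FindMPI its hints explicitly instead of relying on mpirun being discoverable in $PATH:

```sh
# Sketch only: MPI_<lang>_COMPILER and MPIEXEC_EXECUTABLE are standard FindMPI
# hint variables; cc/ftn are the Cray compiler wrappers, which already know the
# MPICH include and library paths. Using srun as the launcher is an assumption
# for Crusher.
export MPICH_DIR=/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1
cmake \
  -DMPI_C_COMPILER=cc \
  -DMPI_Fortran_COMPILER=ftn \
  -DMPIEXEC_EXECUTABLE=$(which srun) \
  ...
```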