E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Crusher GNU compiler with hipcc cmake FindMPI issue #4976

Open xyuan opened 2 years ago

xyuan commented 2 years ago

There is an issue when building full E3SM with the GNU compiler, and a separate issue related to Cray MPI: CMake's find_package cannot find the MPI package and reports the message below.

The mpich module is loaded, but CMake cannot find mpirun, and the $PATH environment variable does not include MPICH_DIR either.

[yuanx@login2.crusher pio2]$ module list
-- Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS)
CMake Error at /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND C)
Call Stack (most recent call first):
  /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /autofs/nccs-svm1_sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/share/cmake-3.21/Modules/FindMPI.cmake:1748 (find_package_handle_standard_args)
  src/clib/CMakeLists.txt:32 (find_package)

Currently Loaded Modules:
  1) craype-x86-trento
  2) libfabric/1.15.0.0
  3) craype-network-ofi
  4) perftools-base/22.05.0
  5) xpmem/2.3.2-2.2_7.8__g93dd7ee.shasta
  6) cray-pmi/6.1.2
  7) cray-pmi-lib/6.0.17
  8) gcc/11.2.0
  9) craype/2.7.15
  10) cray-dsmml/0.2.2
  11) PrgEnv-gnu/8.2.0
  12) xalt/1.3.0
  13) DefApps/default
  14) craype-accel-amd-gfx90a
  15) rocm/4.5.2
  16) cray-mpich/8.1.12
  17) cray-python/3.9.4.2
  18) git/2.31.1
  19) cmake/3.21.3
  20) zlib/1.2.11
  21) cray-libsci/21.08.1.2
  22) cray-hdf5-parallel/1.12.0.7
  23) cray-netcdf-hdf5parallel/4.7.4.7
  24) cray-parallel-netcdf/1.12.1.7

PATH=/sw/crusher/xalt/1.3.0/bin:/opt/cray/pe/parallel-netcdf/1.12.1.7/bin:/opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/bin:/opt/cray/pe/hdf5-parallel/1.12.0.7/bin:/opt/cray/pe/hdf5/1.12.0.7/bin:/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/cmake-3.21.3-ldjnovu5ttqqaxltrk352yratvegkqae/bin:/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/gcc-11.2.0/git-2.31.1-fstqbk5vpdu22xo7w2ohtegoqq3y7lmb/bin:/opt/cray/pe/python/3.9.4.2/bin:/opt/rocm-4.5.2/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/gcc/11.2.0/bin:/opt/cray/pe/perftools/22.05.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.15.0.0/bin:/sw/summit/python/3.7/anaconda3/5.3.0/condabin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/ccs/home/yuanx/.local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/opt/c3/bin:/usr/lib/mit/bin:/opt/puppetlabs/bin:/sbin:/opt/cray/pe/bin
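The PATH observation above can be checked mechanically. The following is a minimal sketch, not part of the original report: `path_contains_dir` is a hypothetical helper, and it assumes the cray-mpich module exports `MPICH_DIR`. Cray MPICH installations typically rely on the `cc`/`CC`/`ftn` compiler wrappers and `srun` rather than `mpirun`, so a missing `mpirun` may be expected on this system.

```shell
#!/bin/bash
# Return 0 if directory $2 is an entry of the colon-separated list $1.
# (Hypothetical helper, not part of E3SM or CIME.)
path_contains_dir() {
    local path_list="$1" dir="${2%/}"
    case ":${path_list}:" in
        *":${dir}:"*|*":${dir}/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: MPICH_DIR is set by the cray-mpich module.
mpich_bin="${MPICH_DIR:-/nonexistent}/bin"
if path_contains_dir "${PATH}" "${mpich_bin}"; then
    echo "${mpich_bin} is on PATH"
else
    echo "${mpich_bin} is not on PATH"
fi

# Report which launchers and compiler wrappers are visible in this shell.
for tool in mpirun mpiexec srun cc CC ftn; do
    if command -v "${tool}" >/dev/null 2>&1; then
        echo "${tool}: $(command -v "${tool}")"
    else
        echo "${tool}: not found"
    fi
done
```

Running this in the same shell that launches the build shows whether FindMPI has any launcher or wrapper to probe.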

sarats commented 2 years ago

@grnydawn already has E3SM building with GNU on Crusher. So first: do you see this issue after extending his machine files with hipcc, or are you using your own machine files?

sarats commented 2 years ago

Also confirm your branch details - latest master or something else? What's the case you are trying to build?

xyuan commented 2 years ago

The following run script is used for the HIP test case:

#!/bin/bash

CASE_ROOT=$(pwd)
E3SM=/ccs/home/yuanx/e3sm
OUTPUT=/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS
DATA=/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS

COMPSET=F2010-MMF1

COMPSET=F-MMFXX-P3

RES=ne4pg2_ne4pg2
COMPILER=gnugpu
MACH=crusher
PROJ=cli133
PELAYOUT=1x1

CASE=${COMPSET}.${RES}.${MACH}.${COMPILER}.${PELAYOUT}

echo
echo ${CASE}
echo

${E3SM}/cime/scripts/create_newcase -case ${CASE_ROOT}/${CASE} \
    -compset ${COMPSET} -res ${RES} -mach ${MACH} -compiler ${COMPILER} \
    -pecount ${PELAYOUT} -project ${PROJ} \
    --output-root ${OUTPUT} --handle-preexisting-dirs r

cd ${CASE_ROOT}/${CASE}

./xmlchange --append -id CAM_CONFIG_OPTS -val " -crm_dt 10 "

./xmlchange ATM_NCPL=144

cat > user_nl_eam << 'eof'
transport_alg=0
hypervis_subcycle_q=1
dt_tracer_factor = 2
eof

./case.setup
./case.build

./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=1
./xmlchange CONTINUE_RUN=FALSE
./xmlchange JOB_WALLCLOCK_TIME=02:00
./xmlchange REST_OPTION=never
./xmlchange CHARGE_ACCOUNT=$PROJ

cp -rf ${DATA}/data/ ${CASE_ROOT}/${CASE}/run

./case.submit

echo
echo ${CASE}
echo

xyuan commented 2 years ago

This test branch can be used with the latest E3SM master. It needs some changes to add HIP support in config_machine.xml and gnugpu_crusher.cmake; please copy them from my working branch.

xyuan commented 2 years ago

After setting up the MPICH environment (LD_LIBRARY_PATH and INCLUDE_PATH) correctly, the FindMPI.cmake test case is still unable to find the libmpi_gnu_91.so library in the MPICH library directory.
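Before debugging FindMPI itself, it can help to confirm that the library file exists where it is expected. The sketch below is an illustration only: `find_lib_file` is a hypothetical helper, `MPICH_DIR` is assumed to be exported by the cray-mpich module, and the library name `mpi_gnu_91` is taken from the linker error reported in this thread.

```shell
#!/bin/bash
# Print the first of lib<name>.so or lib<name>.a found in <dir>.
# Return 1 if neither file exists. (Hypothetical helper.)
find_lib_file() {
    local dir="$1" name="$2" ext
    for ext in so a; do
        if [ -f "${dir}/lib${name}.${ext}" ]; then
            echo "${dir}/lib${name}.${ext}"
            return 0
        fi
    done
    return 1
}

# Assumption: MPICH_DIR is set by the cray-mpich module.
lib_dir="${MPICH_DIR:-/nonexistent}/lib"
if lib_path="$(find_lib_file "${lib_dir}" mpi_gnu_91)"; then
    echo "found ${lib_path}"
else
    echo "no libmpi_gnu_91.so or libmpi_gnu_91.a in ${lib_dir}"
fi
```

If the file is present but CMake still reports it as missing, the search paths CMake uses are the more likely culprit than the module environment.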

xyuan commented 2 years ago

Information from CMakeError.log:

/opt/rocm-4.5.2/bin/hipcc -mcmodel=medium -O -O2 -I/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1/include -I/opt/rocm-4.5.2/include -DTIMING -DCNL -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DLINUX -DNDEBUG -DHAVE_MPI -DMCT_INTERFACE -DPIO2 -DHAVE_SLASHPROC -D_PNETCDF -DATM_PRESENT -DICE_PRESENT -DLND_PRESENT -DOCN_PRESENT -DROF_PRESENT -DGLC_PRESENT -DWAV_PRESENT -DESP_PRESENT -DMED_PRESENT -DPIO2 -I. -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/finclude -I/opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/crayclang/10.0/include -I/opt/cray/pe/mpich/8.1.12/ofi/gnu/9.1/include -I/opt/cray/pe/parallel-netcdf/1.12.1.7/crayclang/10.0/include -I/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/include CMakeFiles/cmTC_ca4c9.dir/test_mpi.c.o -o cmTC_ca4c9 /opt/cray/pe/hdf5-parallel/1.12.0.7/crayclang/10.0/lib/libhdf5_hl_parallel.a /opt/cray/pe/hdf5-parallel/1.12.0.7/crayclang/10.0/lib/libhdf5_parallel.a /opt/cray/pe/parallel-netcdf/1.12.1.7/crayclang/10.0/lib/libpnetcdf.a /opt/cray/pe/netcdf-hdf5parallel/4.7.4.7/crayclang/10.0/lib/libnetcdf.a /opt/cray/pe/libsci/21.08.1.2/GNU/9.1/x86_64/lib/libsci_gnu_82_mpi.a /opt/cray/pe/libsci/21.08.1.2/GNU/9.1/x86_64/lib/libsci_gnu_82.a /usr/lib64/libdl.a -lMPI_mpi_gnu_91_LIBRARY-NOTFOUND /opt/cray/pe/mpich/8.1.12/gtl/lib/libmpi_gtl_hsa.a /opt/cray/pe/dsmml/0.2.2/dsmml/lib/libdsmml.a

ld.lld: error: unable to find library -lMPI_mpi_gnu_91_LIBRARY-NOTFOUND

clang-13: error: linker command failed with exit code 1 (use -v to see invocation)

gmake[2]: *** [CMakeFiles/cmTC_ca4c9.dir/build.make:109: cmTC_ca4c9] Error 1

gmake[2]: Leaving directory '/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/pio/pio2/CMakeFiles/CMakeTmp'

gmake[1]: *** [Makefile:127: cmTC_ca4c9/fast] Error 2

gmake[1]: Leaving directory '/gpfs/alpine/cli115/scratch/yuanx/ACME_SIMULATIONS/F-MMFXX.ne4pg2_ne4pg2.crusher.gnugpu.1x1/bld/gnugpu/mpich/nodebug/nothreads/mct/pio/pio2/CMakeFiles/CMakeTmp'
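The `-lMPI_mpi_gnu_91_LIBRARY-NOTFOUND` token in the link line above is what an unresolved CMake `find_library` result looks like when it leaks into a command line: FindMPI stores each library location in a cache variable named `MPI_<lib_name>_LIBRARY`, and a failed lookup leaves the value `<variable>-NOTFOUND`. The following sketch, using a hypothetical helper `unresolved_mpi_libs`, extracts the affected library names from such a log.

```shell
#!/bin/bash
# Read log text on stdin and print each <name> that appears in a token
# of the form -lMPI_<name>_LIBRARY-NOTFOUND. (Hypothetical helper.)
unresolved_mpi_libs() {
    grep -oE -- '-lMPI_[A-Za-z0-9_+.-]+_LIBRARY-NOTFOUND' \
        | sed -E 's/^-lMPI_(.*)_LIBRARY-NOTFOUND$/\1/'
}

# Example with a shortened version of the failing link line.
echo 'libdl.a -lMPI_mpi_gnu_91_LIBRARY-NOTFOUND libmpi_gtl_hsa.a' \
    | unresolved_mpi_libs
```

For each name printed, one thing to try (an assumption, not verified on Crusher) is passing the location explicitly, for example `-DMPI_mpi_gnu_91_LIBRARY=<path to libmpi_gnu_91.so>`, so that FindMPI does not have to search for it.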

xyuan commented 2 years ago

Once the E3SM master branch is working, we can move to the branch https://github.com/xyuan/e3sm_p3_shoc/tree/e3sm_p3_shoc_hip for the P3 and SHOC tests on Crusher GPUs.

To test P3 and SHOC, we simply need to use COMPSET=F2010-MMF2.

sarats commented 2 years ago

Digging through my old notes, please check whether setting the following CMake flag helps: -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath,${GCC_X86_64}/lib64"
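To see what this flag expands to in a given environment, a small sketch follows. `rpath_linker_flag` is a hypothetical helper, and `GCC_X86_64` is assumed to be set by the loaded gcc module; the thread does not say where in the E3SM machine files the flag should be added, so the sketch only prints it.

```shell
#!/bin/bash
# Build the CMake argument from a GCC install prefix ($1).
# (Hypothetical helper; it only formats the string.)
rpath_linker_flag() {
    local gcc_prefix="$1"
    printf -- '-DCMAKE_EXE_LINKER_FLAGS=-Wl,-rpath,%s/lib64\n' "${gcc_prefix}"
}

# Assumption: GCC_X86_64 comes from the gcc module; the fallback is a placeholder.
rpath_linker_flag "${GCC_X86_64:-/path/to/gcc}"
```

If `GCC_X86_64` is unset at configure time, the rpath collapses to `/lib64`, which would make the flag ineffective without producing an error.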

sarats commented 2 years ago

I can't access your case dir above. Copy your branch to a shared location if you still have an issue after trying the above.

sarats commented 2 years ago

@xyuan Can you post the current status of this? Did the above flag help? Are you still waiting on Kitware?

xyuan commented 2 years ago

This linker flag has already been used and tested, and the issue is the same.