E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/

Some bugs and questions about running SCREAM on Intel PVC GPU #2736

Open lulu1599 opened 4 months ago

lulu1599 commented 4 months ago

I'm trying to run SCREAM on an Intel PVC GPU, but I'm hitting some errors and am looking for help, thanks! Here's my test environment:

  1. SCREAM code: latest master (https://github.com/E3SM-Project/scream); not sure whether this is the correct version to test on an Intel GPU?
  2. Machine: Intel CPU + Intel PVC GPU (1100)
  3. Compiler and MPI: Intel oneAPI 2024 (icx, ifx, icpx; mpiicx, mpiifx, mpiicpx)

Here's my config files:

  1. scream/cime_config/machines/config_machines.xml

    <machine MACH="PVC1100">
    <DESC>HPC, 6430 CPU + 1100 PVC(56core, 48GB)</DESC>
    <NODENAME_REGEX/>
    <OS>LINUX</OS>
    <COMPILERS>oneapi-ifx,oneapi-ifxgpu,gnu</COMPILERS>
    <MPILIBS>impi,openmpi,mpich</MPILIBS>
    <SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
    <CIME_OUTPUT_ROOT>/home/lujingyu/E3SM/SCREAM/cases/scratch/$CASE</CIME_OUTPUT_ROOT>
    <DIN_LOC_ROOT>/home/lujingyu/E3SM/SCREAM/inputdata</DIN_LOC_ROOT>
    <DIN_LOC_ROOT_CLMFORC>/home/lujingyu/E3SM/SCREAM/inputdata/atm/datm7</DIN_LOC_ROOT_CLMFORC>
    <DOUT_S_ROOT>/home/lujingyu/E3SM/SCREAM/cases/archive/$CASE</DOUT_S_ROOT>
    <!--BASELINE_ROOT>/lus/gila/projects/CSC249ADSE15_CNDA/baselines/$COMPILER</BASELINE_ROOT-->
    <!--CCSM_CPRNC>/lus/gila/projects/CSC249ADSE15_CNDA/tools/cprnc/cprnc</CCSM_CPRNC-->
    <GMAKE_J>16</GMAKE_J>
    <TESTS>e3sm_developer</TESTS>
    <NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
    <BATCH_SYSTEM>none</BATCH_SYSTEM>
    <SUPPORTED_BY>e3sm</SUPPORTED_BY>
    <MAX_TASKS_PER_NODE>56</MAX_TASKS_PER_NODE>
    <MAX_TASKS_PER_NODE compiler="oneapi-ifx">208</MAX_TASKS_PER_NODE>
    <MAX_TASKS_PER_NODE compiler="oneapi-ifxgpu">56</MAX_TASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE>56</MAX_MPITASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE compiler="oneapi-ifx">64</MAX_MPITASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE compiler="oneapi-ifxgpu">56</MAX_MPITASKS_PER_NODE>
    <PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
    <mpirun mpilib="impi">
      <executable>mpirun</executable>
      <arguments>
        <arg name="num_tasks"> -np {{ total_tasks }}</arg>
      </arguments>
    </mpirun>
    <module_system type="none"/>
     <RUNDIR>$CIME_OUTPUT_ROOT/$CASE/run</RUNDIR>
     <EXEROOT>$CIME_OUTPUT_ROOT/$CASE/bld</EXEROOT>
     <environment_variables>
        <env name="NETCDF_PATH">/home/lujingyu/nc_pnc2023_intel2024</env>
        <!--env name="PNETCDF_PATH">/home/lujingyu/nc_pnc2023_intel2024</env-->
        <!--env name="MKL_PATH">/opt/intel/oneapi/mkl/2024.0/</env-->
        <env name="LD_LIBRARY_PATH">/home/lujingyu/nc_pnc2023_intel2024/lib:$ENV{LD_LIBRARY_PATH} </env> <!-- -lnetcdf -lnetcdff -lpnetcdf-->
        <env name="PATH">/home/lujingyu/nc_pnc2023_intel2024/bin:$ENV{PATH}</env>
     </environment_variables>
     <environment_variables mpilib="impi">
        <env name="I_MPI_DEBUG">10</env> <!-- debug verbosity level -->
        <env name="I_MPI_OFFLOAD">1</env>
        <!-- <env name="I_MPI_PIN_DOMAIN">omp</env> domain used for Intel MPI process pinning -->
        <!-- <env name="I_MPI_PIN_ORDER">spread</env> pinning order; spread scatters ranks across CPU cores -->
        <!-- <env name="I_MPI_PIN_CELL">unit</env> bind ranks to the basic execution unit (usually a CPU core) -->
     </environment_variables>
     <environment_variables compiler="oneapi-ifxgpu"> 
        <env name="ONEAPI_DEVICE_SELECTOR">"opencl:gpu;level_zero:gpu"</env> 
        <!-- <env name="ONEAPI_MPICH_GPU">NO_GPU</env> tell the oneAPI MPICH library not to use the GPU -->
        <!-- <env name="MPIR_CVAR_ENABLE_GPU">0</env> disable GPU support in the MPICH library -->
        <!-- <env name="romio_cb_read">disable</env> disable ROMIO (MPI I/O) collective buffering -->
        <!-- <env name="romio_cb_write">disable</env> -->
        <env name="SYCL_CACHE_PERSISTENT">1</env> <!-- persistence of the SYCL kernel cache: 1 enables it -->
        <env name="GATOR_INITIAL_MB">4000MB</env>
        <env name="GATOR_DISABLE">0</env>
        <!-- <env name="GPU_TILE_COMPACT">/soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh</env> --> <!-- wrapper script that maps ranks compactly onto GPU tiles -->
        <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
        <env name="FI_CXI_CQ_FILL_PERCENT">20</env>
    </environment_variables>
    <environment_variables compiler="oneapi-ifx">
        <env name="LIBOMPTARGET_DEBUG">0</env><!--default 0, max 5 -->
        <env name="OMP_TARGET_OFFLOAD">DISABLED</env><!--default OMP_TARGET_OFFLOAD=MANDATORY-->
        <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
        <env name="FI_CXI_CQ_FILL_PERCENT">20</env>
        <env name="MPIR_CVAR_ENABLE_GPU">0</env>
        <env name="GPU_TILE_COMPACT"> </env>
    </environment_variables>
    <resource_limits>
        <resource name="RLIMIT_STACK">-1</resource>
    </resource_limits>
    </machine>
  2. scream/cime_config/machines/cmake_macros/oneapi-ifxgpu.cmake

    if (compile_threaded)
    string(APPEND CMAKE_C_FLAGS   " -qopenmp")
    string(APPEND CMAKE_Fortran_FLAGS   " -qopenmp")
    string(APPEND CMAKE_CXX_FLAGS " -qopenmp")
    string(APPEND CMAKE_EXE_LINKER_FLAGS  " -qopenmp")
    endif()
    string(APPEND CMAKE_C_FLAGS_RELEASE   " -O2")
    string(APPEND CMAKE_Fortran_FLAGS_RELEASE   " -O2")
    string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2")
    string(APPEND CMAKE_Fortran_FLAGS_DEBUG   " -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created")
    string(APPEND CMAKE_C_FLAGS_DEBUG   " -O0 -g")
    string(APPEND CMAKE_CXX_FLAGS_DEBUG " -O0 -g")
    string(APPEND CMAKE_C_FLAGS   " -traceback -fp-model precise -std=gnu99")
    string(APPEND CMAKE_CXX_FLAGS " -traceback -fp-model precise")
    string(APPEND CMAKE_Fortran_FLAGS   " -traceback -convert big_endian -assume byterecl -assume realloc_lhs -fp-model precise ")
    string(APPEND CPPDEFS " -DFORTRANUNDERSCORE -DNO_R16 -DCPRINTEL -DHAVE_SLASHPROC -DHIDE_MPI")
    string(APPEND CMAKE_Fortran_FORMAT_FIXED_FLAG " -fixed -132")
    string(APPEND CMAKE_Fortran_FORMAT_FREE_FLAG " -free")
    set(HAS_F2008_CONTIGUOUS "TRUE")
    set(MPIFC "mpiifx")
    set(MPICC "mpiicx")
    set(MPICXX "mpiicpx")
    set(SCC "icx")
    set(SCXX "icpx")
    set(SFC "ifx")
    string(APPEND CMAKE_EXE_LINKER_FLAGS " -fiopenmp -fopenmp-targets=spir64 ") 
    set(USE_SYCL "TRUE")
    set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "") 
    string(APPEND SYCL_FLAGS " -fsycl -fsycl-targets=spir64  ") #-linux-intel_gpu_pvc -Xsycl-target-backend Xe-MAX  -sycl-std=121 
    string(APPEND KOKKOS_OPTIONS " -DKokkos_ARCH_INTEL_PVC=On -DKokkos_ENABLE_SYCL=On -DCMAKE_CXX_STANDARD=17")
  3. scream/components/eamxx/cmake/machine-files/PVC1100.cmake

    include(${CMAKE_CURRENT_LIST_DIR}/common.cmake)
    common_setup()
    # Load all kokkos settings from Ekat's mach file
    include (${EKAT_MACH_FILES_PATH}/kokkos/intel-pvc.cmake)
  4. scream/externals/ekat/cmake/machine-files/PVC1100.cmake

    # Load PVC arch with SYCL backend for kokkos
    include (${CMAKE_CURRENT_LIST_DIR}/kokkos/intel-pvc.cmake)

    Here's my case (is F2000-SCREAM-SA at ne30pg2_ne30pg2 the best test case?):

    ./create_newcase --case test1 --compset F2000-SCREAM-SA --res ne30pg2_ne30pg2 --mach PVC1100 --compiler oneapi-ifxgpu --mpilib impi 

And here are all my log files (bld.zip); the error is:

/opt/intel/oneapi/compiler/2024.0/bin/compiler/../../include/sycl/types.hpp:2382:17: error: ambiguous partial specializations of 'is_device_copyable<const Kokkos::Experimental::Impl::SYCLFunctionWrapper<Kokkos::Impl::ViewCopy<Kokkos::View<double *****, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::SYCL, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0>>, Kokkos::View<const double *****, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0>>, Kokkos::LayoutLeft, Kokkos::Experimental::SYCL, 5, long>, Kokkos::Experimental::Impl::SYCLInternal::USMObjectMem<sycl::usm::alloc::host>>>'
 2382 |   static_assert(is_device_copyable<FieldT>::value ||

I'm new to Kokkos and SYCL, so I'm confused about the link between Kokkos and SYCL, and about the backend target for the Intel PVC GPU. Is there anything wrong in my config files that leads to this error? Looking forward to a reply, thanks again!

Also, I have found some small bugs:

  1. in scream/externals/ekat/extern/kokkos/core/src/../../tpls/desul/include/desul/atomics/SYCLConversions.hpp, line 23: the namespace seems like it should be ::sycl instead of ::sycl::ext::oneapi when using Intel oneAPI 2024;
  2. in scream/components/homme/src/share/compose/cedr_kokkos.hpp, line 21: an unexpected > appears in typedef Kokkos::Experimental::SYCL> CedrGpuSpace;
bartgol commented 4 months ago

The version of Kokkos used in EAMxx is not the most up to date. We are in the process of updating to kokkos v4.2, but testing is still underway. You can try to checkout the branch bartgol/eamxx/kokkos-4.2, and see if it fixes your issues. However, I can't guarantee anything.

lulu1599 commented 4 months ago

Thanks for your quick reply! I suspect Kokkos may be the reason, too. I'll try kokkos-4.2 and see what happens.

lulu1599 commented 4 months ago

> The version of Kokkos used in EAMxx is not the most up to date. We are in the process of updating to kokkos v4.2, but testing is still underway. You can try to checkout the branch bartgol/eamxx/kokkos-4.2, and see if it fixes your issues. However, I can't guarantee anything.

So my config options are correct? (^▽^)

mt5555 commented 4 months ago

SCREAM has not yet been run on Intel PVC, so the correct config options are unknown! Just to warn you, this will probably be a lot of work. But if you do get it running, please let us know what was needed.

lulu1599 commented 4 months ago

I modified cmake_macros/oneapi-ifxgpu.cmake, but some other Fortran flags get appended. For example, if I set set(CMAKE_Fortran_FLAGS "-O2 -Mnovect"), I end up with -O2 -Mnovect -cpp -Wall -fast -O3 -O3 -module theta-l_kokkos_4_72_10_modules. I don't know where these extra flags come from. Thanks!

lulu1599 commented 3 months ago

Hi again! I have done some work on compiling SCREAM on the Intel PVC GPU, including handling some virtual and external functions called from SYCL kernels. I've successfully reached 50% of the compilation progress, but I've encountered some errors, all related to the Pack class, as indicated by the message:

static assertion failed due to requirement '!ekat::OnGpu<Kokkos::Experimental::SYCL>::value || pack_size<ekat::Pack<double, 16>>() == 1': Error! Do not use PackSize>1 on GPU

It seems pack_size<ScalarT>() must be 1, i.e. the type should be ekat::Pack<double, 1>, when running on a GPU device. Could you please provide some guidance on how this issue might be resolved? Thank you for your assistance.

Here's the complete log file. e3sm.bldlog.240403-110444.txt

mt5555 commented 3 months ago

Correct - pack_size is for vectorization on CPU systems. It should be 1 on GPU systems.

some documentation: doi: 10.5194/gmd-12-1423-2019

lulu1599 commented 3 months ago

Thanks for the documentation. To solve this, can I just force pack_size to be 1 in the code? Or is there another way to change it when I build for GPU? Thanks again!

bartgol commented 3 months ago

> Thanks for the documentation. To solve this, can I just force pack_size to be 1 in the code? Or is there another way to change it when I build for GPU? Thanks again!

On GPU it should already get set to 1. But since we are not handling SYCL (yet), it doesn't. You should probably do something similar to what is done here, but using SYCL instead of CUDA. You should also modify something in components/eamxx/src/physics/rrtmgp/CMakeLists.txt, adding a SYCL equivalent of what is already done for CUDA/HIP.
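A sketch of what the CMake side of that might look like (hedged: the SCREAM_PACK_SIZE/SCREAM_SMALL_PACK_SIZE cache variables are taken from eamxx's build options and should be checked against your checkout; guarding on Kokkos_ENABLE_SYCL is the assumption here):

```cmake
# Force scalar packs whenever the SYCL backend is enabled, mirroring the
# existing CUDA/HIP branches.
if (Kokkos_ENABLE_SYCL)
  set(SCREAM_PACK_SIZE 1 CACHE STRING "Pack size must be 1 on GPU" FORCE)
  set(SCREAM_SMALL_PACK_SIZE 1 CACHE STRING "Pack size must be 1 on GPU" FORCE)
endif()
```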

lulu1599 commented 3 months ago

Thanks, I will try it.

lulu1599 commented 3 months ago

Thanks for your helpful guidance. After solving a couple of bugs, I have finally reached the linking stage. Here's an error I have been trying to solve for 3 days, but I still don't know what to do. I searched for llvm-foreach: Aborted (core dumped) and exit code 254 but found nothing useful. Maybe it's an IntelLLVM or environment error? I really don't know. Do you have any ideas?

Build succeeded. Compilation from IR - skipping loading of FCL
Build succeeded. Compilation from IR - skipping loading of FCL
Build succeeded.
llvm-foreach: Aborted (core dumped)
...
icpx: error: gen compiler command failed with exit code 254 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.0/bin/compiler/../icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).

Here's the log files. csm_share.bldlog.240423-085832.txt e3sm.bldlog.240423-085832.txt gptl.bldlog.240423-085832.txt kokkos.bldlog.240423-085832.txt mct.bldlog.240423-085832.txt spio.bldlog.240423-085832.txt

bartgol commented 3 months ago

Sorry, this is beyond my knowledge. :/

lulu1599 commented 2 months ago

Great news: e3sm.exe has finally been built successfully (by linking with the ifx compiler). Now the issue is that I am not sure whether this "nsplit=-1" is related to the error; should it be changed to >=1 manually? And how do I fix the error

Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME)
terminate called after throwing an instance of 'sycl::_V1::exception'

My create-case command is ./create_newcase --case pvc1450_test_fixlinker9 --compset F2000-SCREAM-SA --res ne30pg2_ne30pg2 --mach PVC --compiler oneapi-ifxgpu --mpilib impi, without --ngpus-per-node, --gpu-type, and --gpu-offload, because there are no proper options for them yet; is this right? I have set the environment with export KOKKOS_VISIBLE_DEVICES=0.

bartgol commented 2 months ago

The nsplit warning is benign. Homme doesn't know whether it's used in EAMxx or not. For use in EAMxx, it is expected that nsplit is not known until runtime.

lulu1599 commented 2 months ago

OK, I get it. Another thing I want to ask: are --ngpus-per-node, --gpu-type, and --gpu-offload necessary for an Intel PVC GPU when I run ./create_newcase? I noticed that Intel PVC is not supported yet.


bartgol commented 2 months ago

As you noticed, we don't officially support Intel GPUs. We can't really provide help with that, due to limited funding/resources. You are welcome to try to debug and maybe make it work on Intel, and contribute back with a PR.

Sorry, I know this is not very satisfactory, but unfortunately the E3SM policy is to not offer "customer support" to the general public.

lulu1599 commented 2 months ago

Your answer is very helpful, and I believe we are not far from success. The Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME) error seems to be caused by multiple definitions of functions; the only thing I can do is locate that function.

  1. I tried to use -traceback, but this flag is not supported on the spir64_gen device.
  2. Then I tried ./case.submit --debug, entering the pdb tool; however, nothing useful was printed.

So do you have any advice that could help me find the function causing the error? Thanks a lot.
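For errors like PI_ERROR_INVALID_KERNEL_NAME, one generic way to get more detail out of the SYCL runtime is to trace the plugin-interface calls so the failing API shows up right before the abort. This is a hedged suggestion: these are DPC++ runtime tracing switches, not SCREAM settings, and should be verified against your oneAPI version.

```shell
# Trace all SYCL plugin-interface (PI) calls with their arguments; the last
# call before the failure usually names the kernel the runtime cannot find.
export SYCL_PI_TRACE=2
# Restrict to a single backend to rule out opencl/level_zero mismatches.
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
# Then rerun the case, capturing the trace:
# ./e3sm.exe 2>&1 | tee run_trace.log
```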

lulu1599 commented 2 months ago

Or what is your method for debugging e3sm/scream issues?

bartgol commented 2 months ago

I have never debugged on Intel GPUs, and I've never run into this kind of error, sorry.

lulu1599 commented 2 months ago

Hi again, now I'm trying to build SCREAM on an NVIDIA GPU, but I'm encountering a linking error:

../../eamxx/src/doubly-periodic/libscream_theta-l_kokkos_4_72_10.a(eamxx_homme_process_interface.cpp.o): In function `scream::HommeDynamics::initialize_homme_state()':
/The2ndTechGroup/lujingyu/SCREAM/scream_with_kokkos4.2/scream/components/eamxx/src/dynamics/homme/eamxx_homme_process_interface.cpp:1141: undefined reference to `ekat::WorkspaceManager<ekat::Pack<double, 1>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::GPU_DEFAULT_OVERPROVISION_FACTOR'

GPU_DEFAULT_OVERPROVISION_FACTOR is defined in externals/ekat/src/ekat/ekat_workspace.hpp. The code seems right, so it may be a compiler issue. I'm using the nv-hpcx toolkit (version 23.9) and CUDA (version 12.2). May I ask how you build SCREAM with an NVIDIA GPU? And have you run into the same error?


bartgol commented 2 months ago

Based on the fact that you have the folder components/eamxx/src/doubly-periodic, I can infer that this is a quite old version of EAMxx. I can't really help you debug, as I see different code at those lines.

I would recommend rebasing your branch on current master. Or, since you are using kokkos 4.2 (inferring it from the folder name), you may wait a few days. We're trying to merge #2835, which should bring kokkos 4.2 into current master.

lulu1599 commented 2 months ago

I see... Thanks for your explanation!

lulu1599 commented 1 month ago

Excited to share that we have successfully managed to run SCREAM on an Intel PVC GPU. However, we are only running it with 1 process on 1 GPU, and the SYPD is 0.04 for a 1-nday run, which is quite low performance.

I am seeking your assistance on how to run more than 1 process on 1 GPU and more than 1 thread on 1 GPU (Kokkos has been compiled with both the OpenMP and SYCL backends). Your guidance and support would be greatly appreciated.
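On the multi-rank question, a hedged sketch (these are generic oneAPI/Level Zero environment knobs, not a SCREAM-tested recipe, and the values are assumptions to adapt): several MPI ranks can simply share device 0, while on multi-stack PVC cards each rank is usually pinned to one stack via the Level Zero affinity mask in a per-rank wrapper script:

```shell
# Hypothetical per-rank wrapper: pin each MPI rank to one GPU stack.
# ZE_AFFINITY_MASK=0.0 means "device 0, stack 0"; on a single-stack card
# all ranks would simply share ZE_AFFINITY_MASK=0.
RANK=${MPI_LOCALRANKID:-0}              # Intel MPI per-node rank id (assumed)
export ZE_AFFINITY_MASK=0.$((RANK % 2)) # alternate ranks between two stacks
echo "rank $RANK pinned to $ZE_AFFINITY_MASK"
# exec ./e3sm.exe   # uncomment inside a real wrapper, launched e.g. as:
#                   # mpirun -np 2 ./gpu_pin.sh
```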