E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/

Some bugs and questions about running SCREAM on Intel PVC GPU #2736

Open lulu1599 opened 4 months ago

lulu1599 commented 4 months ago

I'm trying to run SCREAM on an Intel PVC GPU, but I'm hitting some errors and am looking for help, thanks! Here's my test environment:

  1. SCREAM code: latest master (https://github.com/E3SM-Project/scream); not sure whether this is the correct version to test on an Intel GPU?
  2. Machine: Intel CPU + Intel PVC GPU (1100)
  3. Compiler and MPI: Intel oneAPI 2024 (icx, ifx, icpx; mpiicx, mpiifx, mpiicpx)

Here's my config files:

  1. scream/cime_config/machines/config_machines.xml

    <machine MACH="PVC1100">
    <DESC>HPC, 6430 CPU + 1100 PVC(56core, 48GB)</DESC>
    <NODENAME_REGEX/>
    <OS>LINUX</OS>
    <COMPILERS>oneapi-ifx,oneapi-ifxgpu,gnu</COMPILERS>
    <MPILIBS>impi,openmpi,mpich</MPILIBS>
    <SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
    <CIME_OUTPUT_ROOT>/home/lujingyu/E3SM/SCREAM/cases/scratch/$CASE</CIME_OUTPUT_ROOT>
    <DIN_LOC_ROOT>/home/lujingyu/E3SM/SCREAM/inputdata</DIN_LOC_ROOT>
    <DIN_LOC_ROOT_CLMFORC>/home/lujingyu/E3SM/SCREAM/inputdata/atm/datm7</DIN_LOC_ROOT_CLMFORC>
    <DOUT_S_ROOT>/home/lujingyu/E3SM/SCREAM/cases/archive/$CASE</DOUT_S_ROOT>
    <!--BASELINE_ROOT>/lus/gila/projects/CSC249ADSE15_CNDA/baselines/$COMPILER</BASELINE_ROOT-->
    <!--CCSM_CPRNC>/lus/gila/projects/CSC249ADSE15_CNDA/tools/cprnc/cprnc</CCSM_CPRNC-->
    <GMAKE_J>16</GMAKE_J>
    <TESTS>e3sm_developer</TESTS>
    <NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
    <BATCH_SYSTEM>none</BATCH_SYSTEM>
    <SUPPORTED_BY>e3sm</SUPPORTED_BY>
    <MAX_TASKS_PER_NODE>56</MAX_TASKS_PER_NODE>
    <MAX_TASKS_PER_NODE compiler="oneapi-ifx">208</MAX_TASKS_PER_NODE>
    <MAX_TASKS_PER_NODE compiler="oneapi-ifxgpu">56</MAX_TASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE>56</MAX_MPITASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE compiler="oneapi-ifx">64</MAX_MPITASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE compiler="oneapi-ifxgpu">56</MAX_MPITASKS_PER_NODE>
    <PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
    <mpirun mpilib="impi">
      <executable>mpirun</executable>
      <arguments>
        <arg name="num_tasks"> -np {{ total_tasks }}</arg>
      </arguments>
    </mpirun>
    <module_system type="none"/>
     <RUNDIR>$CIME_OUTPUT_ROOT/$CASE/run</RUNDIR>
     <EXEROOT>$CIME_OUTPUT_ROOT/$CASE/bld</EXEROOT>
     <environment_variables>
        <env name="NETCDF_PATH">/home/lujingyu/nc_pnc2023_intel2024</env>
        <!--env name="PNETCDF_PATH">/home/lujingyu/nc_pnc2023_intel2024</env-->
        <!--env name="MKL_PATH">/opt/intel/oneapi/mkl/2024.0/</env-->
        <env name="LD_LIBRARY_PATH">/home/lujingyu/nc_pnc2023_intel2024/lib:$ENV{LD_LIBRARY_PATH} </env> <!-- -lnetcdf -lnetcdff -lpnetcdf-->
        <env name="PATH">/home/lujingyu/nc_pnc2023_intel2024/bin:$ENV{PATH}</env>
     </environment_variables>
     <environment_variables mpilib="impi">
        <env name="I_MPI_DEBUG">10</env> <!-- debug verbosity level -->
        <env name="I_MPI_OFFLOAD">1</env>
        <!-- <env name="I_MPI_PIN_DOMAIN">omp</env> domain used for Intel MPI process pinning -->
        <!-- <env name="I_MPI_PIN_ORDER">spread</env> pinning order; spread scatters ranks across CPU cores -->
        <!-- <env name="I_MPI_PIN_CELL">unit</env> bind ranks to the basic execution unit (usually a CPU core) -->
     </environment_variables>
     <environment_variables compiler="oneapi-ifxgpu"> 
        <env name="ONEAPI_DEVICE_SELECTOR">"opencl:gpu;level_zero:gpu"</env> 
        <!-- <env name="ONEAPI_MPICH_GPU">NO_GPU</env> tell the oneAPI MPICH library not to use the GPU -->
        <!-- <env name="MPIR_CVAR_ENABLE_GPU">0</env> disable GPU support in the MPICH library -->
        <!-- <env name="romio_cb_read">disable</env> disable ROMIO (MPI I/O) collective buffering -->
        <!-- <env name="romio_cb_write">disable</env> -->
        <env name="SYCL_CACHE_PERSISTENT">1</env> <!-- persistence of the SYCL kernel cache: 1 enables it -->
        <env name="GATOR_INITIAL_MB">4000MB</env>
        <env name="GATOR_DISABLE">0</env>
        <!-- <env name="GPU_TILE_COMPACT">/soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh</env> --> <!-- wrapper script that maps ranks compactly onto GPU tiles -->
        <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
        <env name="FI_CXI_CQ_FILL_PERCENT">20</env>
    </environment_variables>
    <environment_variables compiler="oneapi-ifx">
        <env name="LIBOMPTARGET_DEBUG">0</env><!--default 0, max 5 -->
        <env name="OMP_TARGET_OFFLOAD">DISABLED</env><!--default OMP_TARGET_OFFLOAD=MANDATORY-->
        <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
        <env name="FI_CXI_CQ_FILL_PERCENT">20</env>
        <env name="MPIR_CVAR_ENABLE_GPU">0</env>
        <env name="GPU_TILE_COMPACT"> </env>
    </environment_variables>
    <resource_limits>
        <resource name="RLIMIT_STACK">-1</resource>
    </resource_limits>
    </machine>
  2. scream/cime_config/machines/cmake_macros/oneapi-ifxgpu.cmake

    if (compile_threaded)
    string(APPEND CMAKE_C_FLAGS   " -qopenmp")
    string(APPEND CMAKE_Fortran_FLAGS   " -qopenmp")
    string(APPEND CMAKE_CXX_FLAGS " -qopenmp")
    string(APPEND CMAKE_EXE_LINKER_FLAGS  " -qopenmp")
    endif()
    string(APPEND CMAKE_C_FLAGS_RELEASE   " -O2")
    string(APPEND CMAKE_Fortran_FLAGS_RELEASE   " -O2")
    string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2")
    string(APPEND CMAKE_Fortran_FLAGS_DEBUG   " -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created")
    string(APPEND CMAKE_C_FLAGS_DEBUG   " -O0 -g")
    string(APPEND CMAKE_CXX_FLAGS_DEBUG " -O0 -g")
    string(APPEND CMAKE_C_FLAGS   " -traceback -fp-model precise -std=gnu99")
    string(APPEND CMAKE_CXX_FLAGS " -traceback -fp-model precise")
    string(APPEND CMAKE_Fortran_FLAGS   " -traceback -convert big_endian -assume byterecl -assume realloc_lhs -fp-model precise ")
    string(APPEND CPPDEFS " -DFORTRANUNDERSCORE -DNO_R16 -DCPRINTEL -DHAVE_SLASHPROC -DHIDE_MPI")
    string(APPEND CMAKE_Fortran_FORMAT_FIXED_FLAG " -fixed -132")
    string(APPEND CMAKE_Fortran_FORMAT_FREE_FLAG " -free")
    set(HAS_F2008_CONTIGUOUS "TRUE")
    set(MPIFC "mpiifx")
    set(MPICC "mpiicx")
    set(MPICXX "mpiicpx")
    set(SCC "icx")
    set(SCXX "icpx")
    set(SFC "ifx")
    string(APPEND CMAKE_EXE_LINKER_FLAGS " -fiopenmp -fopenmp-targets=spir64 ") 
    set(USE_SYCL "TRUE")
    set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "") 
    string(APPEND SYCL_FLAGS " -fsycl -fsycl-targets=spir64  ") #-linux-intel_gpu_pvc -Xsycl-target-backend Xe-MAX  -sycl-std=121 
    string(APPEND KOKKOS_OPTIONS " -DKokkos_ARCH_INTEL_PVC=On -DKokkos_ENABLE_SYCL=On -DCMAKE_CXX_STANDARD=17")
  3. scream/components/eamxx/cmake/machine-files/PVC1100.cmake

    include(${CMAKE_CURRENT_LIST_DIR}/common.cmake)
    common_setup()
    # Load all kokkos settings from Ekat's mach file
    include (${EKAT_MACH_FILES_PATH}/kokkos/intel-pvc.cmake)
  4. scream/externals/ekat/cmake/machine-files/PVC1100.cmake

    # Load PVC arch with SYCL backend for kokkos
    include (${CMAKE_CURRENT_LIST_DIR}/kokkos/intel-pvc.cmake)

    Here's my case (is F2000-SCREAM-SA at ne30pg2_ne30pg2 the best test case?):

    ./create_newcase --case test1 --compset F2000-SCREAM-SA --res ne30pg2_ne30pg2 --mach PVC1100 --compiler oneapi-ifxgpu --mpilib impi 

And here are all my log files (bld.zip); the error is:

/opt/intel/oneapi/compiler/2024.0/bin/compiler/../../include/sycl/types.hpp:2382:17: error: ambiguous partial specializations of 'is_device_copyable<const Kokkos::Experimental::Impl::SYCLFunctionWrapper<Kokkos::Impl::ViewCopy<Kokkos::View<double *****, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::SYCL, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0>>, Kokkos::View<const double *****, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0>>, Kokkos::LayoutLeft, Kokkos::Experimental::SYCL, 5, long>, Kokkos::Experimental::Impl::SYCLInternal::USMObjectMem<sycl::usm::alloc::host>>>'
 2382 |   static_assert(is_device_copyable<FieldT>::value ||

I'm new to Kokkos and SYCL, so I'm confused about the link between Kokkos and SYCL, and about the backend target for the Intel PVC GPU. Is there anything wrong in my config files that leads to this error? Looking forward to a reply, thanks again!

Also, I have found some small bugs:

  1. in scream/externals/ekat/extern/kokkos/core/src/../../tpls/desul/include/desul/atomics/SYCLConversions.hpp, line 23: the namespace seems like it should be ::sycl instead of ::sycl::ext::oneapi when using Intel oneAPI 2024;
  2. in scream/components/homme/src/share/compose/cedr_kokkos.hpp, line 21: an unexpected > appears in typedef Kokkos::Experimental::SYCL> CedrGpuSpace;
bartgol commented 4 months ago

The version of Kokkos used in EAMxx is not the most up to date. We are in the process of updating to kokkos v4.2, but testing is still underway. You can try to checkout the branch bartgol/eamxx/kokkos-4.2, and see if it fixes your issues. However, I can't guarantee anything.

lulu1599 commented 4 months ago

Thanks for your quick reply! I suspect Kokkos may be the reason, too. I'll try kokkos-4.2 and see what happens.

lulu1599 commented 4 months ago

> The version of Kokkos used in EAMxx is not the most up to date. We are in the process of updating to kokkos v4.2, but testing is still underway. You can try to checkout the branch bartgol/eamxx/kokkos-4.2, and see if it fixes your issues. However, I can't guarantee anything.

So my config options are correct? (^▽^)

mt5555 commented 4 months ago

SCREAM has not yet been run on Intel PVC, so the correct config options are unknown! Just to warn you, this will probably be a lot of work. But if you do get it running, please let us know what was needed.

lulu1599 commented 4 months ago

I modified cmake_macros/oneapi-ifxgpu.cmake, but some other Fortran flags get appended. For example, if I set set(CMAKE_Fortran_FLAGS "-O2 -Mnovect"), I end up with -O2 -Mnovect -cpp -Wall -fast -O3 -O3 -module theta-l_kokkos_4_72_10_modules. I don't know where these extra flags come from. Thanks!

lulu1599 commented 3 months ago

Hi again! I have done some work on compiling SCREAM on the Intel PVC GPU, including handling some virtual and external functions called from SYCL kernels. I've successfully reached 50% of the compilation progress, but I've encountered some errors, all related to the Pack class, as indicated by the message:

static assertion failed due to requirement '!ekat::OnGpu<Kokkos::Experimental::SYCL>::value || pack_size<ekat::Pack<double, 16>>() == 1': Error! Do not use PackSize>1 on GPU

It seems pack_size<ScalarT>() must be 1, i.e. the type should be ekat::Pack<double, 1>, when running on a GPU device. Could you please provide some guidance on how this issue might be resolved? Thank you for your assistance.

Here's the complete log file. e3sm.bldlog.240403-110444.txt

mt5555 commented 3 months ago

Correct - pack_size is for vectorization on CPU systems. It should be 1 on GPU systems.

some documentation: doi: 10.5194/gmd-12-1423-2019

lulu1599 commented 3 months ago

Thanks for the documentation. To solve this, can I just force pack_size to be 1 in the code? Or is there another way to change it when I build for GPU? Thanks again!

bartgol commented 3 months ago

> Thanks for the documentation. To solve this, can I just force pack_size to be 1 in the code? Or is there another way to change it when I build for GPU? Thanks again!

On GPU it should already get set to 1. But since we are not handling SYCL (yet), it doesn't. You should probably do something similar to what is done here, but using SYCL instead of CUDA. You should also modify something in components/eamxx/src/physics/rrtmgp/CMakeLists.txt, adding a SYCL equivalent of what is already done for CUDA/HIP.
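A sketch of what the CMake side of that might look like (hedged: the SCREAM_PACK_SIZE/SCREAM_SMALL_PACK_SIZE cache variables are taken from eamxx's build options and should be checked against your checkout; guarding on Kokkos_ENABLE_SYCL is the assumption here):

```cmake
# Force scalar packs whenever the SYCL backend is enabled, mirroring the
# existing CUDA/HIP branches.
if (Kokkos_ENABLE_SYCL)
  set(SCREAM_PACK_SIZE 1 CACHE STRING "Pack size must be 1 on GPU" FORCE)
  set(SCREAM_SMALL_PACK_SIZE 1 CACHE STRING "Pack size must be 1 on GPU" FORCE)
endif()
```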

lulu1599 commented 3 months ago

Thanks, I will try it.

lulu1599 commented 3 months ago

Thanks for your helpful guidance. After solving a couple of bugs, I have finally reached the linking stage. Here's an error I have been trying to solve for 3 days, but I still don't know what to do. I searched for llvm-foreach: Aborted (core dumped) and exit code 254 but found nothing useful. Maybe it's an IntelLLVM or environment error? I really don't know. Do you have any ideas?

Build succeeded. Compilation from IR - skipping loading of FCL
Build succeeded. Compilation from IR - skipping loading of FCL
Build succeeded.
llvm-foreach: Aborted (core dumped)
...
icpx: error: gen compiler command failed with exit code 254 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.0/bin/compiler/../icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).

Here's the log files. csm_share.bldlog.240423-085832.txt e3sm.bldlog.240423-085832.txt gptl.bldlog.240423-085832.txt kokkos.bldlog.240423-085832.txt mct.bldlog.240423-085832.txt spio.bldlog.240423-085832.txt

bartgol commented 3 months ago

Sorry, this is beyond my knowledge. :/

lulu1599 commented 2 months ago

Great news: e3sm.exe has finally been built successfully (by linking with the ifx compiler). Now the issue is that I am not sure whether this "nsplit=-1" is related to the error; should it be changed to >=1 manually? And how do I fix the error

Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME)
terminate called after throwing an instance of 'sycl::_V1::exception'

My create-case command is ./create_newcase --case pvc1450_test_fixlinker9 --compset F2000-SCREAM-SA --res ne30pg2_ne30pg2 --mach PVC --compiler oneapi-ifxgpu --mpilib impi, without --ngpus-per-node, --gpu-type, and --gpu-offload, because there are no proper options for them yet; is this right? I have set the environment with export KOKKOS_VISIBLE_DEVICES=0.

bartgol commented 2 months ago

The nsplit warning is benign. Homme doesn't know whether it's used in EAMxx or not. For use in EAMxx, it is expected that nsplit is not known until runtime.

lulu1599 commented 2 months ago

OK, I get it. Another thing I want to ask: are --ngpus-per-node, --gpu-type, and --gpu-offload necessary for an Intel PVC GPU when I run ./create_newcase? I noticed that Intel PVC is not supported yet.


bartgol commented 2 months ago

As you noticed, we don't officially support Intel GPUs. We can't really provide help with that, due to limited funding/resources. You are welcome to try to debug and maybe make it work on Intel, and contribute back with a PR.

Sorry, I know this is not very satisfactory, but unfortunately the E3SM policy is to not offer "customer support" to the general public.

lulu1599 commented 2 months ago

Your answer is very helpful, and I believe we are not far from success. The Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME) error seems to be caused by multiple definitions of functions; the only thing I can do is locate that function.

  1. I tried to use -traceback, but this flag is not supported on the spir64_gen device.
  2. Then I tried ./case.submit --debug, entering the pdb tool; however, nothing useful was printed.

So do you have any advice that could help me find the function causing the error? Thanks a lot.
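For errors like PI_ERROR_INVALID_KERNEL_NAME, one generic way to get more detail out of the SYCL runtime is to trace the plugin-interface calls so the failing API shows up right before the abort. This is a hedged suggestion: these are DPC++ runtime tracing switches, not SCREAM settings, and should be verified against your oneAPI version.

```shell
# Trace all SYCL plugin-interface (PI) calls with their arguments; the last
# call before the failure usually names the kernel the runtime cannot find.
export SYCL_PI_TRACE=2
# Restrict to a single backend to rule out opencl/level_zero mismatches.
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
# Then rerun the case, capturing the trace:
# ./e3sm.exe 2>&1 | tee run_trace.log
```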

lulu1599 commented 2 months ago

Or what is your method for debugging e3sm/scream issues?

bartgol commented 2 months ago

I have never debugged on Intel GPUs, and I've never run into this kind of error, sorry.

lulu1599 commented 2 months ago

Hi again, now I'm trying to build SCREAM on an NVIDIA GPU, but I'm encountering a linking error:

../../eamxx/src/doubly-periodic/libscream_theta-l_kokkos_4_72_10.a(eamxx_homme_process_interface.cpp.o): In function `scream::HommeDynamics::initialize_homme_state()':
/The2ndTechGroup/lujingyu/SCREAM/scream_with_kokkos4.2/scream/components/eamxx/src/dynamics/homme/eamxx_homme_process_interface.cpp:1141: undefined reference to `ekat::WorkspaceManager<ekat::Pack<double, 1>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::GPU_DEFAULT_OVERPROVISION_FACTOR'

GPU_DEFAULT_OVERPROVISION_FACTOR is defined in externals/ekat/src/ekat/ekat_workspace.hpp. The code seems right, so it may be a compiler issue. I'm using the nv-hpcx toolkit (version 23.9) and CUDA (version 12.2). May I ask how you build SCREAM with an NVIDIA GPU? And have you run into the same error?


bartgol commented 2 months ago

Based on the fact that you have the folder components/eamxx/src/doubly-periodic, I can infer that this is a quite old version of EAMxx. I can't really help you debug, as I see different code at those lines.

I would recommend rebasing your branch on current master. Or, since you are using kokkos 4.2 (inferring it from the folder name), you may wait a few days. We're trying to merge #2835, which should bring kokkos 4.2 into current master.

lulu1599 commented 2 months ago

I see... Thanks for your explanation!

lulu1599 commented 1 month ago

Excited to share that we have successfully managed to run SCREAM on an Intel PVC GPU. However, we are only running it with 1 process on 1 GPU, and the SYPD is 0.04 for a 1-nday run, which is quite low performance.

I am seeking your assistance on how to run more than 1 process on 1 GPU and more than 1 thread on 1 GPU (Kokkos has been compiled with both the OpenMP and SYCL backends). Your guidance and support would be greatly appreciated.
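On the multi-rank question, a hedged sketch (these are generic oneAPI/Level Zero environment knobs, not a SCREAM-tested recipe, and the values are assumptions to adapt): several MPI ranks can simply share device 0, while on multi-stack PVC cards each rank is usually pinned to one stack via the Level Zero affinity mask in a per-rank wrapper script:

```shell
# Hypothetical per-rank wrapper: pin each MPI rank to one GPU stack.
# ZE_AFFINITY_MASK=0.0 means "device 0, stack 0"; on a single-stack card
# all ranks would simply share ZE_AFFINITY_MASK=0.
RANK=${MPI_LOCALRANKID:-0}              # Intel MPI per-node rank id (assumed)
export ZE_AFFINITY_MASK=0.$((RANK % 2)) # alternate ranks between two stacks
echo "rank $RANK pinned to $ZE_AFFINITY_MASK"
# exec ./e3sm.exe   # uncomment inside a real wrapper, launched e.g. as:
#                   # mpirun -np 2 ./gpu_pin.sh
```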