Open lulu1599 opened 4 months ago
The version of Kokkos used in EAMxx is not the most up to date. We are in the process of updating to Kokkos v4.2, but testing is still underway. You can try checking out the branch bartgol/eamxx/kokkos-4.2 and see if it fixes your issues. However, I can't guarantee anything.
Thanks for your quick reply! I also suspected that Kokkos might be the cause. I'll try this kokkos-4.2 branch to see what happens.
so my config options are correct? (^▽^)
SCREAM has not yet been run on Intel PVC, so the correct config options are unknown! Just to warn you, this will probably be a lot of work. But if you do get it running, please let us know what was needed.
I modified cmake_macros/oneapi-ifxgpu.cmake, but some other Fortran flags get appended after mine. For example, if I set set(CMAKE_Fortran_FLAGS "-O2 -Mnovect"), what I actually get is
-O2 -Mnovect -cpp -Wall -fast -O3 -O3 -module theta-l_kokkos_4_72_10_modules
I don't know where these extra flags come from. Thanks!
Hi again! I have done some work on compiling SCREAM for the Intel PVC GPU, including handling some virtual and external functions called from SYCL kernels.
I've reached 50% of the compilation progress, but I've run into errors that are all related to the Pack class, as indicated by the message:
'static assertion failed due to requirement '!ekat::OnGpu<Kokkos::Experimental::SYCL>::value || pack_size<ekat::Pack<double, 16>>() == 1': Error! Do not use PackSize>1 on GPU'
It seems the type passed to pack_size<ScalarT>() must be ekat::Pack<double, 1> or similar when running on a GPU device. Could you please provide some guidance on how this issue might be resolved? Thank you for your assistance.
Here's the complete log file. e3sm.bldlog.240403-110444.txt
Correct - pack_size is for vectorization on CPU systems. It should be 1 on GPU systems.
some documentation: doi: 10.5194/gmd-12-1423-2019
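To illustrate the point about packs, here is a minimal, self-contained sketch of the idea. It is not the ekat implementation: PackSketch, pack_size_sketch, check_pack_for_device, num_packs and the SKETCH_ON_GPU macro are all hypothetical names invented for this example. A pack bundles N scalars so that a CPU compiler can vectorize element-wise loops, while a compile-time guard rejects N > 1 when the build targets a GPU.

```cpp
#include <cassert>

// Hypothetical build-time switch for this sketch only. Real code would derive
// this from the Kokkos execution space rather than from a hand-set macro.
#ifndef SKETCH_ON_GPU
#define SKETCH_ON_GPU 0
#endif

// A fixed-size bundle of N scalars that are processed together.
template <typename T, int N>
struct PackSketch {
  static constexpr int n = N;
  T d[N];
};

// Element-wise addition: a short fixed-length loop the compiler can vectorize.
template <typename T, int N>
PackSketch<T, N> operator+(const PackSketch<T, N>& a, const PackSketch<T, N>& b) {
  PackSketch<T, N> r{};
  for (int i = 0; i < N; ++i) r.d[i] = a.d[i] + b.d[i];
  return r;
}

template <typename P>
constexpr int pack_size_sketch() { return P::n; }

// Compile-time guard in the spirit of the assertion quoted in the thread.
template <typename P>
void check_pack_for_device() {
  static_assert(!SKETCH_ON_GPU || pack_size_sketch<P>() == 1,
                "Do not use PackSize>1 on GPU");
}

// Number of packs needed to hold nlev vertical levels.
constexpr int num_packs(int nlev, int pack_size) {
  return (nlev + pack_size - 1) / pack_size;
}

inline void demo() {
  PackSketch<double, 4> a{{1, 2, 3, 4}};
  PackSketch<double, 4> b{{10, 20, 30, 40}};
  PackSketch<double, 4> c = a + b;
  assert(c.d[0] == 11.0 && c.d[3] == 44.0);
  check_pack_for_device<PackSketch<double, 16>>();  // fine while SKETCH_ON_GPU is 0
}
```

Compiling the same sketch with -DSKETCH_ON_GPU=1 makes the check on the 16-wide pack fail at compile time, which mirrors the kind of error reported above.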
Thanks for the documentation. To solve this, can I just force pack_size to be 1 in the code? Or is there another way to change it when building for GPU? Thanks again!
On GPU it should already get set to 1. But since we are not handling SYCL (yet), it doesn't. You should probably do something similar to what is done here, but using SYCL instead of CUDA. You should also modify something in components/eamxx/src/physics/rrtmgp/CMakeLists.txt, adding a SYCL equivalent of what is already done for CUDA/HIP.
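The actual change suggested here most likely lives in the build configuration, which this thread does not show. As a rough C++ analogue of "treat SYCL like CUDA/HIP", the sketch below picks a pack size from the backend macros that Kokkos defines when a backend is enabled (KOKKOS_ENABLE_CUDA, KOKKOS_ENABLE_HIP, KOKKOS_ENABLE_SYCL). The constants SKETCH_PACK_SIZE and SKETCH_GPU_BUILD, and the CPU value of 16, are assumptions made up for this example.

```cpp
// In a real Kokkos build these macros come from the Kokkos configuration header.
// Nothing defines them in this standalone sketch, so the CPU branch is taken.
#if defined(KOKKOS_ENABLE_CUDA) || defined(KOKKOS_ENABLE_HIP) || defined(KOKKOS_ENABLE_SYCL)
constexpr bool SKETCH_GPU_BUILD = true;
constexpr int SKETCH_PACK_SIZE = 1;   // GPU backends: no packing
#else
constexpr bool SKETCH_GPU_BUILD = false;
constexpr int SKETCH_PACK_SIZE = 16;  // hypothetical CPU vector width
#endif

// Leaving SYCL out of the condition above is the kind of omission that lets a
// pack size larger than 1 reach a GPU build.
static_assert(!SKETCH_GPU_BUILD || SKETCH_PACK_SIZE == 1,
              "GPU builds must use a pack size of 1");
```

Compiling with -DKOKKOS_ENABLE_SYCL switches this sketch to the GPU branch, which is the effect a SYCL-aware configuration would need to have.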
Thanks, I will try it.
Thanks for your useful guidance. After fixing a couple of bugs, I have finally reached the linking stage. Here's an error I have been trying to solve for 3 days, but I still don't know what to do. I searched for "llvm-foreach: Aborted (core dumped)" and "exit code 254" but found nothing useful. Maybe it's an IntelLLVM or environment error? I really don't know. Do you have any ideas?
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
Compilation from IR - skipping loading of FCL
Build succeeded.
llvm-foreach: Aborted (core dumped)
...
icpx: error: gen compiler command failed with exit code 254 (use -v to see invocation)
Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.0/bin/compiler/../icpx.cfg
icpx: note: diagnostic msg: Error generating preprocessed source(s).
Here's the log files. csm_share.bldlog.240423-085832.txt e3sm.bldlog.240423-085832.txt gptl.bldlog.240423-085832.txt kokkos.bldlog.240423-085832.txt mct.bldlog.240423-085832.txt spio.bldlog.240423-085832.txt
Sorry, this is beyond my knowledge. :/
Great news: e3sm.exe has finally been built successfully (by linking with the ifx compiler). Now the issue is
I am not sure whether this "nsplit=-1" is related to the error. Should it be changed to >=1 manually?
And how can I fix this error?
Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME) terminate called after throwing an instance of 'sycl::_V1::exception'
My create case command is
./create_newcase --case pvc1450_test_fixlinker9 --compset F2000-SCREAM-SA --res ne30pg2_ne30pg2 --mach PVC --compiler oneapi-ifxgpu --mpilib impi
without --ngpus-per-node, --gpu-type, and --gpu-offload, because there are no proper options for them yet. Is this right? I have set the environment with export KOKKOS_VISIBLE_DEVICES=0.
The nsplit warning is benign. Homme doesn't know whether it's used in EAMxx or not. For use in EAMxx, it is expected that nsplit is not known until runtime.
OK, I get it. Another thing I want to ask: are --ngpus-per-node, --gpu-type and --gpu-offload necessary on the Intel PVC GPU when I run ./create_newcase? I noticed that Intel PVC is not supported yet.
As you noticed, we don't officially support Intel GPUs. We can't really provide help with that, due to limited funding/resources. You are welcome to try to debug and maybe make it work on Intel, and contribute back with a PR.
Sorry, I know this is not very satisfactory, but unfortunately the e3sm policy is to not offer "customer support" to the general public.
Your answer is very helpful, and I believe that we are not far from success. I think the error Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME) is caused by multiple definitions of a function, so what I need to do now is locate that function. What I have tried so far:
-traceback, but this flag is not supported on the spir64_gen device.
./case.submit --debug, which enters the pdb tool; however, nothing useful is printed, like this.
So do you have any advice that could help me find the function that causes the error? Thanks a lot.
Or what is your method for debugging E3SM/SCREAM issues?
I have never debugged on Intel GPUs, and I've never run into this kind of error, sorry.
Hi again, now I'm trying to build SCREAM on an NVIDIA GPU, but I'm encountering a linking error:
../../eamxx/src/doubly-periodic/libscream_theta-l_kokkos_4_72_10.a(eamxx_homme_process_interface.cpp.o): In function 'scream::HommeDynamics::initialize_homme_state()': /The2ndTechGroup/lujingyu/SCREAM/scream_with_kokkos4.2/scream/components/eamxx/src/dynamics/homme/eamxx_homme_process_interface.cpp:1141: undefined reference to 'ekat::WorkspaceManager<ekat::Pack<double, 1>, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::GPU_DEFAULT_OVERPROVISION_FACTOR'
GPU_DEFAULT_OVERPROVISION_FACTOR is defined in the file externals/ekat/src/ekat/ekat_workspace.hpp. The code looks right to me, so it seems to be a compiler issue. I'm using the nv-hpcx toolkit (version 23.9) and CUDA (version 12.2). May I ask how you build SCREAM with nvidiagpu? And have you run into the same error?
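One general C++ cause of an "undefined reference" to a static constexpr data member, which may or may not be what is happening in this build, is that the member is odr-used (for example bound to a const reference) while no out-of-class definition exists. Before C++17 such a member needs a separate definition; from C++17 on it is implicitly inline. The sketch below shows one way to sidestep the problem by reading the constant through a function that returns it by value. WorkspaceSketch, OVERPROVISION_FACTOR, its value of 1.25 and padded_size are hypothetical and are not taken from ekat.

```cpp
#include <algorithm>

template <typename T>
struct WorkspaceSketch {
  // Made-up value for illustration only.
  static constexpr double OVERPROVISION_FACTOR = 1.25;

  // Returning by value reads the constant without needing its address,
  // so no out-of-class definition of the member is required.
  static constexpr double overprovision_factor() { return OVERPROVISION_FACTOR; }
};

inline double padded_size(double requested) {
  // std::max takes its arguments by const reference. Passing
  // WorkspaceSketch<double>::OVERPROVISION_FACTOR directly would odr-use the
  // member, which can fail to link under a pre-C++17 standard. Passing the
  // temporary returned by the accessor avoids that.
  return requested * std::max(1.0, WorkspaceSketch<double>::overprovision_factor());
}
```

With the made-up factor above, padded_size(100.0) evaluates to 125.0. Whether this pattern applies depends on the language standard and compiler in use, which the log excerpt does not show.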
Based on the fact that you have the folder components/eamxx/src/doubly-periodic, I can infer that this is quite an old version of EAMxx. I can't really help you debug, as I see different code at those lines.
I would recommend rebasing your branch on current master. Or, since you are using Kokkos 4.2 (inferring it from the folder name), you may wait a few days. We're trying to merge #2835, which should bring Kokkos 4.2 into current master.
I see... Thanks for your explanation!
Excited to share that we have successfully managed to run SCREAM on an Intel PVC GPU. However, it only runs with 1 process on 1 GPU, and the SYPD is 0.04 for 1 nday, which is quite low performance.
When attempting to run with more than 1 process or more than 1 thread, the following errors occur:
forrtl: severe (154): array index out of bounds.
Segmentation fault
I am seeking your assistance on how to run >1 process on 1 GPU and >1 thread on 1 GPU (Kokkos has been compiled with both the OPENMP and SYCL backends). Your guidance and support would be greatly appreciated.
I'm trying to run SCREAM on an Intel PVC GPU but am running into some errors. I'm looking for help, thanks! Here's my testing environment:
SCREAM code: latest code of the master branch (https://github.com/E3SM-Project/scream). I'm not sure whether this is the correct version for testing on an Intel GPU.
Machine: Intel CPU + Intel PVC GPU (1100)
Compiler and MPI: Intel oneAPI 2024 (icx, ifx, icpx; mpiicx, mpiifx, mpiicpx)
Here's my config files:
scream/cime_config/machines/config_machines.xml
scream/cime_config/machines/cmake_macros/oneapi-ifxgpu.cmake
scream/components/eamxx/cmake/machine-files/PVC1100.cmake
scream/externals/ekat/cmake/machine-files/PVC1100.cmake
Here's my case: (is F2000-SCREAM-SA at ne30pg2_ne30pg2 the best test case?)
And here are all my log files, bld.zip, and the error is:
I'm new to Kokkos and SYCL, so I'm confused about how Kokkos and SYCL fit together and about the backend target for the Intel PVC GPU. Is there anything wrong in my config files that leads to this error? Looking forward to your reply, thanks again!
Also, I have found some small bugs:
::sycl instead of ::sycl::ext::oneapi, using Intel oneAPI 2024;
> appeared here: typedef Kokkos::Experimental::SYCL> CedrGpuSpace;