hi @Tissot11, apologies for the late reply. btw, the shock setup should work in the new dev branch (it will be released as 1.1.0 shortly).
regarding your issue -- CUDA actually has very limited support for Intel compilers. you're essentially trying to pair the latest CUDA with the latest Intel compiler, which might be a recipe for incompatibility issues. i would try gcc instead; starting from version 9, it works fine with all CUDA versions.
Hi @haykh, thanks for replying. I went with Intel because on one HPC machine they didn't have hdf5, either as a standalone module or as a module built with gnu and openmpi. Anyway, I have now managed to compile on another machine without any problems:
1. module load lib/hdf5/1.14.4-gnu-13.3-openmpi-5.0 devel/cuda/12.4
2. cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_AMPERE80=ON -D Kokkos_ENABLE_OPENMP=ON
3. cmake --build build -j 8
However, at runtime I get errors. I presume this requires setting some variables in the submit script. I have asked the technical team, but since they might take time to reply, I also attach the error files here. Could you please have a look and tell me what the problem could be, or whether you have encountered this sort of issue before? I also paste the content of the job script below.
I'm extremely happy to hear about the shock setup in the upcoming version of Entity! I don't mean to be impatient, but do you have any rough idea of when it might be released?
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=dev_gpu_4_a100
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=single:1
#SBATCH --time=00:20:00
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export GPUS_PER_SOCKET=1
export GPUS_PER_NODE=4
module load lib/hdf5/1.14.4-gnu-13.3-openmpi-5.0 devel/cuda/12.4
entityExe="/home/hd/hd_hd/hd_ff296/CodeRepositUniCluster/entity-v1.0-A100/build/src/entity.xc"
srun --mpi=pmi2 ${entityExe} -input langmuir.toml
@Tissot11 could you try without OpenMP? I have previously had issues with that, since CUDA gets confused if multiple CPU threads are running. Regardless, you won't gain much from OpenMP anyway. Also, you might want to enable -D output=ON to get the output.
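If it still fails at startup, a quick sanity check (just a generic sketch, nothing Entity-specific) is to print what each rank actually sees, e.g. by adding something like this right above the srun line in your submit script:
# diagnostic only: per-rank CUDA_VISIBLE_DEVICES plus the GPUs the node exposes
srun bash -c 'echo "rank ${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"; nvidia-smi -L'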
Ok. So I used the following build command:
cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_AMPERE80=ON -D output=ON
I now run into an issue about the HDF5 root not being set. I tried
export HDF5_ROOT_DIR=$HDF5_HOME
But this doesn't help either. Any suggestion for this? I attach the build log and output
CMakeError.log CMakeOutput.log
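As a guess on my side: should I be setting HDF5_ROOT instead of HDF5_ROOT_DIR? As far as I understand, CMake's FindHDF5 looks at HDF5_ROOT, e.g. something like:
export HDF5_ROOT=${HDF5_HOME}   # assuming the hdf5 module sets HDF5_HOME, as in the export above
# or pass it to cmake directly:
cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_AMPERE80=ON -D output=ON -D HDF5_ROOT=${HDF5_HOME}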
I looked up the page on configuring Environment modules, but it is not clear to me whether it means I should build my own modules or use the basic modules provided by the admins.
That's odd, OpenMP has nothing to do with hdf5. Did you clean the build directory before recompiling? The error messages look very weird. For debugging purposes, try to minimize the dependencies (i.e., first remove the build directory and recompile; if that doesn't work, disable the output, etc.).
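concretely, something along these lines (flags copied from your earlier build; adjust as needed):
rm -rf build
cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_AMPERE80=ON -D output=ON
cmake --build build -j 8
# if it still complains about hdf5, retry without -D output=ON to take hdf5 out of the picture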
I always re-download entity before compiling. Without the production flag on, it compiles fine, but at runtime I see the same error.
Does it make any difference if entity was compiled on a node with a GPU present? At the moment, I have compiled it on the login node where no physical GPU is present.
It should not matter, as long as (a) you specified the correct Kokkos_ARCH flag, and (b) you use the exact same libraries on both nodes. I mean, the error message has nothing to do with the code; it seems like some configuration/submission error.
I'll try to make a minimal compilable code to test errors like this (also for the future).
Yeah, I also suspect the technical team should provide some tips on how they have configured everything on the cluster. The documentation is a bare minimum, with no examples offered for these GPU + MPI jobs. I am essentially trying whatever I can come up with.
I'm sorry, I still couldn't work out whether the environment modules page is meant for building your own modules or for using the existing ones on a cluster. In principle, the tool you provide is quite useful for understanding and avoiding these installation and runtime errors.
It turns out that I can compile and run Entity with GNU 13.3, OpenMPI 4.1, and CUDA 12.4; there seems to be a problem if I choose OpenMPI 5.0 with this combination. Compilation and linking work fine, and the test Langmuir job starts fine, but it then gets stuck immediately. I attach the info and log files of the run. Please have a look and tell me what the likely cause could be.
I could successfully run another code that also leverages GPUs on this cluster without any problems.
langmuir.log outEntity-23935811.txt langmuir.info.txt
Below is the content of the submit script
# Number of nodes to allocate
#SBATCH --nodes=2
#SBATCH --partition=gpu_8
#SBATCH --exclusive
# Number of MPI instances (ranks) to be executed per node
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --gpu-bind=single:1
# Maximum run time of job
#SBATCH --time=24:00:00
# Give job a reasonable name
#SBATCH --job-name=langmuirEntity
# File name for standard output (%j will be replaced by job id)
#SBATCH --output=outEntity-%j.txt
export OMP_PLACES=cores
export GPUS_PER_SOCKET=4
export GPUS_PER_NODE=8
module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1 devel/cuda/12.4
entityExe="/home/hd/hd_hd/hd_ff296/CodeRepositUniCluster/entity-v1.1-V100/build/src/entity.xc"
srun --mpi=pmix ${entityExe} -input langmuir.toml
I couldn't understand whether the environment modules page is meant for building your own modules or for using the existing ones on a cluster?
Yes, in theory on a well-maintained cluster you shouldn't have to compile anything yourself. In practice, oftentimes the MPI you use is not compiled with a version of GCC that is compatible with CUDA (because few people run multi-node GPU jobs). So I just try to give you the option to compile your own libraries and use them instead of relying on the cluster admins.
You can even go as far as compiling your own Cudatoolkit (see the section on conda) with a proper gcx (also downloaded through conda).
But there is only so far this can take you. If for whatever reason they have an outdated glibc, or they don't give you access to the UCX library, then you sort of have to rely on whatever modules they provide.
Regarding this last issue, a few comments:
(a) don't use OpenMP when compiling with MPI, and don't set the OMP_PLACES variable in the submit script.
(b) from the looks of it, you're using V100s? could you post the cmake command?
(c) regarding your main error: i think this is a problem with the way the cluster communication is configured. in fact, there are two warning signs that point to the issue: one coming from Open-MPI, the other from UCX. to try to debug the problem:
(c1) can you try to rerun with only 1 node (i.e., 8 GPUs on a single physical node, or better yet 4 GPUs on a single socket)?
(c2) what type of network connection does your cluster use? this issue highlights that there may be compatibility problems if MPI is not compiled properly (or env variables are not set correctly): https://github.com/open-mpi/ompi/issues/10436. potentially, (c1) should be able to at least rule out other issues.
(c3) have you run any multi-node CPU jobs (say, not with entity) with the same MPI and GCC on that cluster?
(c4) could you try compiling your own openmpi (see the sketch below)? just use the same gcc you used above; you won't need CUDA for that.
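for (c4), here is a rough sketch of what i mean (the version, prefix, and ucx path are placeholders -- adjust to your cluster):
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xf openmpi-4.1.6.tar.gz && cd openmpi-4.1.6
./configure CC=gcc CXX=g++ --prefix=$HOME/opt/openmpi --with-ucx=/path/to/ucx   # optionally add --with-cuda=<cuda path> for CUDA-aware MPI
make -j 8 && make install
export PATH=$HOME/opt/openmpi/bin:$PATH   # so cmake picks up mpicc/mpicxx from this build when configuring entity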
in the meantime, i'll try to make a minimal code example which should be able to test whether the problem has anything to do with entity itself (at this point, it's unlikely).
PS. General comment: this is just for the future; you're using 16 GPUs for just 14M particles (so roughly 1M particles per GPU). That's horribly inefficient. You can/should fit at least 100x more (i.e., increase the box resolution or ppc, or use fewer GPUs); otherwise GPU utilization will be very sub-optimal.
I did everything you suggested, and I can confirm that the other code works with both CPU and GPU (using CUDA) with the same modules. We have an IB HDR network interconnect.
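If it helps, I can also list which transports UCX sees on a compute node with something like this (assuming ucx_info from the MPI/UCX stack is on PATH; this is a generic check, not Entity-specific):
srun -N 1 -n 1 ucx_info -d | grep -E 'Transport|Device'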
It turns out that the problem is with the pgen=langmuir input file. I compiled entity with pgen=weibel, with and without the CUDA module (i.e., CPU-only), and with the GCC and OpenMPI modules. pgen=weibel ran absolutely fine in both the GPU and the CPU runs (see the attached file). Could you please try pgen=langmuir yourself?
Now, I need to visualize the results of this test weibel run. Afterwards, I will try the shock setup I'm actually interested in.
Thanks for your tip! I was trying to test entity with single-node and multi-node runs to see whether it works on our cluster. I'll now start looking into the shock setup and keep your suggestions about performance in mind. Please share any other tips you have regarding entity, and in general!
wow, thanks @Tissot11! this is actually very helpful. i'll have a look
Hi,
I seem to be able to configure and build fine. However, I get an error at link time at the end of the build (see the screenshot). What is the likely cause?
I used the following commands for building:
module load compiler/intel/2023.1.0 mpi/impi/2021.11 lib/hdf5/1.14.4-intel-2023.1.0 devel/cuda/12.4
cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_VOLTA70=ON -D Kokkos_ENABLE_OPENMP=ON
cmake --build build -j 8