ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Runtime error with Laser Ion acceleration test run #5077

Closed Tissot11 closed 3 weeks ago

Tissot11 commented 1 month ago

Hi,

I'm new to WarpX. I read some of the compilation-related issues, but I'm not sure they cover my problem. I was able to compile WarpX successfully on two machines using similar steps, e.g.

  1. module load lib/hdf5/1.12.1-intel-19.1.2-impi-2019.8 devel/cuda/11.4

  2. export CC=$(which icc)
     export CXX=$(which icpc)
     export FC=$(which ifort)
     export CUDACXX=$(which nvcc)
     export CUDAHOSTCXX=${CXX}
     export AMREX_CUDA_ARCH=7.0

  3. cmake -S . -B build -DWarpX_DIMS="1;2;3" -DWarpX_COMPUTE=CUDA

  4. cmake --build build -j 16

It builds fine and linking shows the appropriate paths. However, when I run the job using the batch script below, I get the same errors on both machines:

#SBATCH --nodes=1
#SBATCH --exclusive
# Number of MPI instances (ranks) to be executed per node
#SBATCH --ntasks-per-node=2
# Number of threads per MPI instance
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:2
##SBATCH --gpu-bind=single:1
# Maximum run time of job
#SBATCH --time=00:20:00
# Give job a reasonable name
#SBATCH --job-name=IonAcc-WarpX

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores

module load lib/hdf5/1.12.1-intel-19.1.2-impi-2019.8 devel/cuda/11.4

EXE="./warpx.2d"

srun ${EXE} inputs_2d

I attach the err and out files.

errWarpX-13067012.txt outWarpX-13067012.txt

If I turn off the GPU in the batch script (a purely CPU job), the job also fails, saying that this version of CUDA is not able to launch the process.

Could you please suggest what needs to be changed? Then I can ask the system administrators for help.

Tissot11 commented 1 month ago

I also tried (after seeing another thread)

cmake -S . -B build -DWarpX_DIMS="1;2;3" -DWarpX_COMPUTE=CUDA -DGPUS_PER_SOCKET=1 -DGPUS_PER_NODE=2

But I got the warning:

CMake Warning: Manually-specified variables were not used by the project:

    GPUS_PER_NODE
    GPUS_PER_SOCKET

Building WarpX went fine as before. However, after declaring

export GPUS_PER_SOCKET=1
export GPUS_PER_NODE=2

in the submit script and running, I get the same error as before with the benchmark simulation. Now I do not know what else to try.

RemiLehe commented 1 month ago

Thanks for raising this issue. Could you let us know which supercomputer you are running on? Is it part of the list here: https://warpx.readthedocs.io/en/latest/install/hpc.html#hpc-machines

If not, we could add a new page for your machine. In general, each machine is a bit unique and requires "massaging" of the installation instructions.

Tissot11 commented 1 month ago

I have access to bwUniCluster, Justus2 and Horeka in Germany. I had already looked at the HPC list and tried my best to adapt the compilation and runtime instructions. I suspect that it has something to do with the cluster's configuration, and I have asked technical support; however, they usually take a long time, and I think it would be great to have input from your side on what the likely cause is.

I also compiled WarpX with module load lib/hdf5/1.14.4-gnu-13.3-openmpi-5.0 devel/cuda/12.4 and tried the same laser-ion acceleration test run. This time I get different errors (see the attached err file). Could you please take a look at it?

errWarpX-23915461.txt outWarpX-23915461.txt

I noticed that one uses either -DWarpX_COMPUTE=CUDA or -DWarpX_COMPUTE=OMP. Does that mean that with CUDA enabled, one cannot use OpenMP threads per MPI process in WarpX?

ax3l commented 1 month ago

Thanks for the details.

Reaching out to local support is a great idea. You can describe WarpX to them as follows: WarpX is an MPI-enabled multi-GPU (or multi-CPU) code. WarpX is compiled to run either on CPUs (computing via OpenMP+MPI) or on GPUs (computing via, e.g., CUDA+MPI for Nvidia).

For GPU runs, WarpX runs one MPI process (task or rank) per CUDA GPU, each using a GPU exclusively with one host process. Local cluster support can help you write a batch submission script that does exactly that (we call this "pinning" a process to a GPU). We want to pin each MPI process on the CPU to the closest GPU to avoid extra latencies. E.g., if you have 4 GPUs on a node, you want to use exactly 4 MPI processes to control them.
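
As a rough illustration only, a minimal sketch of the relevant batch-script lines, assuming a hypothetical node with 4 GPUs (exact Slurm options and their support differ between clusters, so please double-check with your admins):

#SBATCH --nodes=1
# one MPI rank (task) per GPU
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
# pin each rank to its closest GPU, if your Slurm supports this option
##SBATCH --gpu-bind=closest

# srun starts all 4 MPI ranks together; each rank then uses one GPU
srun ./warpx.2d inputs_2d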

I see in your output files from the run that WarpX was built with CUDA and MPI support, but it was not started with srun/mpirun/mpiexec or an equivalent, which led to the startup of four independent processes. The batch script (please share details on it) needs to be updated to include such a prefix for the MPI startup.

I noticed that one uses either -DWarpX_COMPUTE=CUDA or -DWarpX_COMPUTE=OMP. Does that mean that with CUDA enabled, one cannot use OpenMP threads per MPI process in WarpX?

It works as follows: for compute-intensive parts of the code, we either "loop"/compute on the GPU or on the CPU. Currently, one builds an executable either for GPU or for CPU, controlled by the -DWarpX_COMPUTE=... option. One can still use threads (OpenMP or other) for auxiliary work when computing on GPUs, but we will not use OpenMP to move data down from the GPU, compute on it, and move it back (that is simply too slow). Typical GPU nodes have nearly all of their performance in the GPUs; their connected CPUs usually provide <<10% of the node's performance. One example where we optionally use threads on the CPU is I/O with ADIOS2 & Blosc2: https://arxiv.org/pdf/1706.00522

Of course, you can also run WarpX on pure CPU systems/laptops/clusters using -DWarpX_COMPUTE=OMP.
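
For reference, a minimal sketch of the two configurations (the build directory names here are arbitrary):

# GPU executable (CUDA + MPI)
cmake -S . -B build_gpu -DWarpX_DIMS="1;2;3" -DWarpX_COMPUTE=CUDA
cmake --build build_gpu -j 16

# CPU executable (OpenMP + MPI)
cmake -S . -B build_cpu -DWarpX_DIMS="1;2;3" -DWarpX_COMPUTE=OMP
cmake --build build_cpu -j 16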

Tissot11 commented 1 month ago

Thanks for the reply. I gave technical support links to the WarpX GitHub and documentation pages. They told me that they are looking into it. They might take a long time, because it appears that they are not used to PIC codes like WarpX.

I pasted the relevant content of the job submit script in the first post above. That submit script is complete, apart from my email address details etc. So do you mean that the srun command in the submit script above is not able to correctly launch these 4 MPI processes tied to 4 GPUs on this node?

My question about OMP threads has its origin in using only CPUs for launching jobs and filling a node fully. I was wondering if the CPU can still use OpenMP threads per MPI process and the MPI processes on a single node can exchange data with the GPU. I suppose this is possible with AMD APUs? I understand the one-MPI-process-per-GPU rule. Usually we have multi-core CPUs (e.g., 48 cores per node) but only 4 GPUs per node. So one can only launch 4 MPI processes on a single node, and I was wondering if the remaining CPU cores on this node could be used by 12 OpenMP threads per MPI process.

Thanks for sending the paper. I'll look into it, and tomorrow I'll also compile WarpX for CPUs only and try running a job.

Tissot11 commented 3 weeks ago

It turns out that there is a problem with OpenMPI 5.0 and CUDA 12.4. Using older versions, I could compile and run the ion acceleration test case successfully. I'll try to visualise the results to understand better how WarpX works. So you can close this ticket now.

However, before you close it, I just want to ask if WarpX also has a particle injector available that could be used to simulate astrophysical plasmas. I see that in one of your threads (#4581) you mention an injector and cathode sources. Are these features already available and just missing from the documentation?

n01r commented 3 weeks ago

Hi @Tissot11,

I am glad to hear you were able to run the example successfully! :tada: Let me address a few other points you made:

So one can only launch 4 MPI processes on a single node, and I was wondering if the remaining CPU cores on this node could be used by 12 OpenMP threads per MPI process.

@ax3l already gave a great answer above. Adding to that: GPUs are architectures with an inherently high degree of parallelism. Since we do not just offload work from CPU hosts to GPUs when GPUs are available, but fully compute everything on the GPUs (when compiled for them), all data lives on the GPU as well. This way we ensure that we maximize performance, as we avoid latencies from data transfer. However, data compression and writing via our I/O libraries can use otherwise idle CPU threads when WarpX is running on the GPU.
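
As a sketch only (assuming a hypothetical node with 4 GPUs and 48 cores; adjust to your cluster), a GPU job can still reserve CPU cores per rank so that such I/O threads have room to run:

# one MPI rank per GPU
#SBATCH --ntasks-per-node=4
# spare CPU cores per rank, usable e.g. by I/O threads
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./warpx.2d inputs_2d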

I just want to ask if WarpX also has a particle injector available that could be used to simulate astrophysical plasmas.

People are indeed using WarpX for astrophysical simulations. Do you need a specific injector for this, or do you just need to put a custom particle density profile into the simulation?
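
If it is just a custom profile, a minimal sketch of the relevant inputs-file lines could look like this (the species name, density, and profile below are placeholders, not a recommendation for your setup):

particles.species_names = electrons
electrons.species_type = electron
electrons.injection_style = NUniformPerCell
electrons.num_particles_per_cell_each_dim = 2 2
electrons.profile = parse_density_function
# n0 and L are user-defined constants; the profile here is just an example
electrons.density_function(x,y,z) = "n0*exp(-(z/L)**2)"
electrons.momentum_distribution_type = at_rest
my_constants.n0 = 1.e25
my_constants.L = 1.e-6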

Feel free to close this issue if your original issue here has been resolved. :slightly_smiling_face: You can open a new issue with respect to your astrophysical topic. This helps when other users are searching the WarpX issues for specific topics.

Tissot11 commented 3 weeks ago

Thanks for the fast answer! Indeed, I should start appreciating the power of GPUs. I am somehow still stuck in the CPU paradigm, always trying to fully occupy the CPUs in a node, when a few GPUs can outperform several CPUs, as explained by you and @ax3l.

For an injector, I need something like a cathode source that can continually inject particles (electrons and ions, a neutral plasma) into the simulation box from one or two sides of the box. This is needed in addition to initialising a plasma with a density profile in the simulation box. It simplifies simulating collisionless shocks in PIC simulations.

Since WarpX is a high-performance code, it is indeed suitable for astrophysical simulations, which require a lot more computing power than laser-plasma simulations. I saw the example of magnetic reconnection using WarpX (PICMI), which is great, but having particle injectors/cathode sources in WarpX could make it really versatile for shock-related simulations and also for other astrophysical scenarios.

Just one question, and then you can close the ticket. In the PICMI interface (used in the magnetic reconnection example), one can define grid sizes in terms of the electron skin depth etc. However, for WarpX simulations the domain has to be given in physical micrometers, which makes sense for laser-plasma interaction scenarios and is also quite user-friendly for everyone. I am a theoretician who tends to design simulations and interpret results in terms of these physical scales, e.g. skin depth, ion gyroradius, etc. Of course, I can assume a plasma density and calculate these lengths to define the domain size. But I was wondering if, as in the PICMI interface, it is also possible to define the simulation domain in terms of the electron skin depth, gyroradius, etc.?
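
To illustrate what I mean, here is a hypothetical sketch of inputs-file lines (I am assuming that the parser accepts my_constants expressions for the domain bounds; please correct me if that is wrong):

# assumed plasma density [1/m^3]
my_constants.n0  = 1.e25
# electron plasma frequency and skin depth from predefined physical constants
my_constants.wpe = sqrt(n0*q_e*q_e/(m_e*epsilon0))
my_constants.de  = clight/wpe

# 2D domain of 100 x 100 electron skin depths
geometry.prob_lo = -50.*de  -50.*de
geometry.prob_hi =  50.*de   50.*de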

n01r commented 3 weeks ago

Original issue resolved, moved remaining discussion over to #5131.