ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Running on a new cluster - Stellar at Princeton/PPPL #4752

Open · budjensen opened this issue 6 months ago

budjensen commented 6 months ago

I am looking to install and run WarpX on Princeton/PPPL's Stellar cluster (information HERE). I wrote about this in an earlier question on the discussions page (see below).

After looking into this more, I rebuilt WarpX with the following commands:

cmake -S . -B build -DWarpX_MPI=ON -DWarpX_DIMS="1"
cmake --build build -j 8

and ran a test:

./run_test.sh LaserAcceleration_1d_fluid

which failed with the following output (see warpx_test.txt for the full test log):

working on test: LaserAcceleration_1d_fluid
   re-making clean...
   building...

configuring LaserAcceleration_1d_fluid build...
   mkdir /tmp/ci-MsqcKmHhQb/warpx/builddir

building LaserAcceleration_1d_fluid...
   cmake --build /tmp/ci-MsqcKmHhQb/warpx/builddir -j 8 --
   Compilation time: 264.644 s
   run & test directory: /tmp/ci-MsqcKmHhQb/rt-WarpX/WarpX-tests/2024-03-06/LaserAcceleration_1d_fluid/
   copying files to run directory...
   path to input file: Examples/Physics_applications/laser_acceleration/inputs_1d_fluids
   running the test...
   mpiexec -n 2 ./LaserAcceleration_1d_fluid.ex inputs_1d_fluids  diag1.file_prefix=LaserAcceleration_1d_fluid_plt   warpx.do_dynamic_scheduling=0 warpx.serialize_initial_conditions=1 amrex.abort_on_unused_inputs=1 amrex.fpe_trap_invalid=1 amrex.fpe_trap_zero=1 amrex.fpe_trap_overflow=1 warpx.always_warn_immediately=1 warpx.abort_on_warning_threshold=low
   WARNING: Test stdout:

   WARNING: Test stderr:
Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(176):
MPID_Init(1430)......:
MPIR_pmi_init(162)...: PMI_Init returned 14

   Execution time: 0.397 s
   WARNING: unable to copy analysis image
   archiving the output...
   creating problem test report ...
   LaserAcceleration_1d_fluid FAILED

cleaning AMReX CMake directories...

cleaning WarpX CMake directories...

creating new test report...

reverting git branches/hashes

creating suite report...

Looking at the WarpX documentation, I suspect the MPI error has its root in the HPC environment. Would you be able to help me build an HPC profile for Stellar?
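To make the question concrete, here is a rough sketch of what I imagine a `stellar_warpx.profile` could contain, following the pattern of the machine profiles in the WarpX docs; the commented MPI module is a placeholder I still need to confirm with `module avail`:

```bash
# stellar_warpx.profile.example -- rough sketch only, module names still to be confirmed
module purge
module load anaconda3/2023.9          # what I currently load in my job scripts
# module load <system-mpi-module>     # placeholder: the cluster MPI to build and run against

# conda environment with the WarpX Python dependencies
conda activate warpx

# one OpenMP team per MPI rank, sized from the Slurm allocation
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```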


Discussed in https://github.com/ECP-WarpX/WarpX/discussions/4751

Originally posted by **budjensen** March 6, 2024

I have WarpX installed two different ways--from source and from conda--on the Stellar HPC at Princeton/PPPL (information on the cluster [HERE](https://researchcomputing.princeton.edu/systems/stellar)) and am just getting started running capacitive discharge simulations. I have the example Python input script and run it (within a Slurm script) as follows:

```
srun -N 1 -n 16 python PICMI_inputs_1d.py -n 1
```

As a primer, when I run from my own build, I see the following lines at startup:

```
Initializing AMReX (24.02-30-g2ecafcff4013)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 3
OMP initialized with 4 OMP threads
```

whereas running WarpX installed from the conda distribution displays:

```
Initializing AMReX (24.01)...
OMP initialized with 4 OMP threads
```

Is this expected behavior, that the conda distribution does not use MPI? (In contrast, my build from source had the `WarpX_MPI` option ON.)

--------------

Secondly, I submit a batch script that looks like:

```bash
#!/bin/bash
#SBATCH -J Warp_Turn
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00
#SBATCH --output warpx.%j.out

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# load conda environment
module purge
module load anaconda3/2023.9
conda activate warpx

srun -N 1 -n 16 python PICMI_inputs_1d.py -n 1
```

I have worked out that setting `OMP_NUM_THREADS` determines the number of threads initialized in the startup line `OMP initialized with 4 OMP threads`, but I have not had the number in the line `MPI initialized with 1 MPI processes` move beyond 1, even when submitting a job to more than one core and using the flag `-n`. Instead, when submitting a job with `--ntasks-per-node=N`, I find that the MPI/OMP startup lines are printed out _N_ times. This makes me think that MPI is not utilizing the full system, and is perhaps even running _N_ identical simulations in parallel. How can I alert WarpX that I have 16 MPI tasks allocated for my job?

As an additional clarification on a high level, is it correct that WarpX (via AMReX) uses MPI to distribute the simulation domain across processors and then uses OpenMP to accelerate each process?

Thank you!
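As a quick sanity check on the MPI side (independent of WarpX), I can verify whether the launcher and the MPI library inside the conda environment actually agree, using mpi4py (assuming it is installed in the environment); a minimal sketch:

```bash
# Every task should report the same size (4 here) and a distinct rank.
# If each task prints "rank 0 of 1", the MPI in the environment is not wired up to Slurm.
srun -N 1 -n 4 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(f'rank {c.Get_rank()} of {c.Get_size()}')"
```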
budjensen commented 6 months ago

I just rebuilt after modifying the CMake configuration (through the command `ccmake build`) to set `WarpX_PYTHON` to ON. The test `./run_test.sh LaserAcceleration_1d_fluid` is now passing. The output is attached here: warpx_test_rebuild.txt.

I am still getting the same behavior I described in my original post on the discussion board: `MPI initialized with 1 MPI processes` does not increase past 1 process even as more tasks are requested in my Slurm job script.
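One thing worth trying, in case the issue is a mismatch between Slurm's process-management interface and the MPI library inside the conda environment (the PMI_Init failure above points that way): check which PMI flavors this Slurm installation supports and launch with one explicitly. A sketch, assuming the flavors shown below are actually offered:

```bash
# list the PMI plugin types supported by this Slurm installation
srun --mpi=list

# then launch with an explicit flavor taken from that list, e.g.
srun --mpi=pmi2 -N 1 -n 16 python PICMI_inputs_1d.py
# or
srun --mpi=pmix -N 1 -n 16 python PICMI_inputs_1d.py
```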

budjensen commented 6 months ago

In talking with the Stellar admins, I learned that (as long as I only run on one node) I can change the last line in my batch script to use mpirun instead of srun, and MPI will be initialized properly.

To get it to scale beyond one node, I will need to use an MPI installed on the cluster. Here is a note from an admin:

WarpX instructions tell you to install mpich and mpi4py - which will not work with our slurm setup. You can try setting things up without installing mpich and mpi4py and then for mpi4py please follow our instructions:

https://researchcomputing.princeton.edu/support/knowledge-base/mpi4py

In short, is there a way to set up WarpX with openmpi instead of mpich?

In more detail, here is what I tried: I set up a conda environment using the command:

conda create -n warpx-openmpi -c conda-forge blaspp boost ccache cmake compilers git lapackpp "openpmd-api=*=mpi_openmpi*" python make numpy pandas scipy yt "fftw=*=mpi_openmpi*" pkg-config matplotlib mamba ninja pip virtualenv periodictable picmistandard

Here I simply changed `mpich*` to `openmpi*` relative to the command in the WarpX documentation, since I am hoping to use an Open MPI module (openmpi/gcc/4.1.2) available on the cluster.

After creating the environment, I uninstalled the mpi4py that conda had installed (using `conda remove --force mpi4py`), reinstalled it via pip (per the instructions in the link above), and then ran my batch script:

#!/bin/bash
#SBATCH -J Warp_Turn_mpi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --output warpx.%j.out
#SBATCH --mail-type=all
#SBATCH --mail-user=bjensen@pppl.gov

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# load conda environment
module purge
module load anaconda3/2024.2
module load openmpi/gcc/4.1.2
conda activate warpx-openmpi

srun python PICMI_inputs_1d.py

and got this error. Are there instructions for running with openmpi?
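For reference, the approach I understand from the Princeton mpi4py page and the WarpX build docs is roughly the following; the exact commands are my best guess and not yet verified on Stellar:

```bash
# rebuild mpi4py from source against the cluster's Open MPI module
module load anaconda3/2024.2 openmpi/gcc/4.1.2
conda activate warpx-openmpi
python -m pip uninstall -y mpi4py
MPICC=$(which mpicc) python -m pip install --no-cache-dir --no-binary=mpi4py mpi4py

# point the WarpX build at the same compiler wrappers so CMake's FindMPI picks up the system Open MPI
export CC=$(which mpicc) CXX=$(which mpicxx)
cmake -S . -B build -DWarpX_MPI=ON -DWarpX_DIMS="1" -DWarpX_PYTHON=ON
cmake --build build -j 8
# for the Python bindings, the docs describe a pip_install build target:
# cmake --build build --target pip_install -j 8
```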

budjensen commented 4 months ago

Hey @ax3l -- an update after I thought I had WarpX up and running. I created this profile for running on Stellar and installed dependencies via this bash script. I built a 1D Python version of the code.

The code compiles and runs, but the results are nonsensical (i.e., it won't pass the tests, and notably, when I try to initialize a uniform distribution of particles, the density at the first step is off by a factor of 8).

Is there anything I can send/do to help figure this out?

budjensen commented 4 months ago

@ax3l -- Here's an example of my problem. I run a simulation on the Stellar cluster with a script (PICMI_inputs.py), which sets the initial density to a uniform 2e16 m^-3. When I run the simulation, the density is initialized to:

[screenshot: plot of the density at initialization]

If the simulation is run for a few hundred steps, the plasma potential begins to rise to absurd values:

[screenshot: plot of the plasma potential after a few hundred steps]

The applied potential is only 50 V RF, so a 3000 V plasma potential doesn't make any sense. Have you seen anything like this before with WarpX? Do you have any ideas about where I should start looking for solutions?

For context, when I run this on my personal computer (or even on Stellar with WarpX installed via conda), the initial density is 2e16 m^-3 and the potential evolves as expected. I'd like to get a compiled version up and running on Stellar to make use of MPI.
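In case it helps with debugging, here is a rough sketch of how I could cross-check the initialized density directly from the first diagnostic output with yt (which is in my conda environment); the plotfile path and the field name are assumptions and would need to be adapted to my diagnostic settings:

```bash
python - <<'EOF'
import yt  # yt is part of the conda environment created above

# Placeholder path: should point at the first plotfile written by the field diagnostic.
ds = yt.load("diags/diag1000000")
print(ds.field_list)   # inspect which density-like fields were actually written

ad = ds.all_data()
# Once the right field tuple is known from the list above (e.g. a charge density "rho"),
# compare its extrema at step 0 against the requested 2e16 m^-3
# (dividing by the elementary charge first if the field is a charge density):
# print(ad[("boxlib", "rho")].min(), ad[("boxlib", "rho")].max())
EOF
```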

Thank you for any help!