ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

An initial attempt to run WarpX at Los Alamos #5139

Open EvanDodd opened 4 weeks ago

EvanDodd commented 4 weeks ago

I have recently tried to use WarpX at LANL on a machine called Rocinante for a small project. The code was set up using conda, and then I tried to run some of the examples. While the examples seem to run, I believe that they are running slowly. Specifically, the Capacitive Discharge example says it should run in 20 minutes, but my attempt takes about 2.5 hours. Two other examples run for 8 hours without finishing, but I didn't see a run time estimate for them.

Any help would be greatly appreciated. I have been using other codes, but am having difficulties scaling up my actual problem and wanted to try the dynamic load balancing and AMR that WarpX seems to have. It wouldn't be a fair comparison if I'm unable to get the code working properly.

Thank you, Evan Dodd

Examples run:
- test00, Capacitive Discharge: batch run time ~2.5 hours with 112 threads; completed but did not post-process
- test01, Langmuir Waves: interactive run time ~1 minute; completed, and PNGs were generated automatically
- test02, Ohm solver (electromagnetic modes): batch run time 8 hours (time limit); incomplete, 13440 of 60000 time steps
- test03, Ohm solver (magnetic reconnection): batch run time 8 hours (time limit); incomplete, 2500 of 50000 time steps
- test04, Gaussian Beam: interactive run time ~1 minute; completed but did not post-process

Cluster name: Rocinante
Processor: Intel Xeon Platinum 8479
OS: HPE Cray Shasta
CPU cores per node: 112 (two sockets)
Memory per node: 256 GB DDR
Interconnect: HPE/Cray Slingshot 11, 200 Gb/s

ax3l commented 3 weeks ago

Hi @EvanDodd, thank you for reaching out.

Glad you tried running WarpX on Rocinante!

Generally speaking, we do not recommend conda packages on HPC machines: conda ships its own compiler and MPI stack, which will likely not give the best performance on an HPC system, e.g., it may not use the right network stack and is not optimized for the system's high-end CPUs. Spack is a package manager that is much better tuned for HPC systems, and it is what we recommend on such machines. Alternatively, we write small warpx.profile and install_dependencies.sh scripts for HPC systems, using their modules and adding a few extra packages where modules are lacking. A good example is Perlmutter (NERSC), and we have a whole list of templates (raw scripts are here) to start a new system doc from (we are happy to help you in this issue if you can check and report what modules already exist, what the system recommends to use, etc.).
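If you go the Spack route, a minimal sketch could look like the following; the exact compiler/MPI choices and variants depend on what Rocinante provides, so treat this as a starting point rather than a recipe.

```bash
# Minimal sketch of a Spack-based WarpX install. "warpx" (and "py-warpx" for the
# Python/PICMI interface) are the actual Spack package names; everything else is
# a site-specific assumption to adapt to Rocinante.
git clone --depth=1 https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh

spack compiler find   # register the system compilers
spack external find   # reuse system-provided build tools where possible
spack install warpx   # CPU (OpenMP) build; add variants/specs as needed
spack load warpx
```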

For your examples, the runtimes are indeed far too slow; I cannot yet say what exactly causes this. With regard to runtime, for a dual-socket system we would usually recommend running at least two MPI ranks per node (one pinned to each socket) and proportionally lowering the number of OpenMP threads. In fact, modern CPUs like the one you describe have internal bus structures (e.g., NUMA domains) that are worth exploiting, so maybe even 4 MPI ranks (2 per socket) with 28 OpenMP threads each. Also note that HPC systems sometimes reserve a few cores for system tasks (check with your sysadmins); in that case one usually skips the first few cores so that system operations do not trigger interrupts that add extra latency. Pinning and the number of threads per MPI rank are controlled in your batch script via environment variables, as sketched below.
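As a concrete illustration (assuming the machine uses Slurm, which is common on HPE Cray systems; the executable and input file names below are placeholders), a batch script for one 112-core node with 4 MPI ranks and 28 OpenMP threads each could look roughly like this:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4     # 2 MPI ranks per socket on a dual-socket, 112-core node
#SBATCH --cpus-per-task=28      # OpenMP threads per MPI rank
#SBATCH --time=02:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores         # one OpenMP thread per physical core
export OMP_PROC_BIND=close      # keep a rank's threads on neighboring cores

# Pass --cpus-per-task explicitly (newer Slurm versions do not inherit it from sbatch)
# and bind each rank to its own set of cores.
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} --cpu-bind=cores \
    ./warpx.2d inputs           # placeholder executable and input file names
```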

The way forward that I would recommend is to document the system modules that are recommended for use and to compile WarpX from source for the system. What do you think?
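For reference, a from-source build typically has this shape (the module names below are purely illustrative placeholders; the CMake options are standard WarpX build options):

```bash
# Module names are placeholders; use whatever "module avail" / the Rocinante docs recommend.
module load PrgEnv-gnu cray-mpich cmake

git clone https://github.com/ECP-WarpX/WarpX.git
cd WarpX

# Dimensionality and the OpenMP compute backend are standard WarpX CMake options.
cmake -S . -B build -DWarpX_DIMS=2 -DWarpX_COMPUTE=OMP
cmake --build build -j 16
```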

EvanDodd commented 2 weeks ago

@ax3l, thanks for the reply. And sorry for the delay, I was out of town for a week and got back last night. At this point I'll probably dig back in on Thursday.

I'll start with Spack and see how it goes. My first attempt was to use git and try to compile from source, which didn't work. The HPC consultants suggested conda, and although it was simple to get started with, it seems problematic now. The HPC team actually seems to prefer conda and suggests its use if you are developing with mpi4py or PyTorch. This is a good data point to keep in mind.

I've dealt with the BUS structure a little bit, but didn't use 2 ranks per socket. Skipping the first few cores has been an issue, but it seems to vary from system to system. I will ask.

In the long run, compiling for Rocinante or Tycho is preferred. However, this is also a learning experience for me, so I'll dig into Spack a bit. Again, thank you for the reply.

Evan