ICTP / RegCM

ICTP Regional Climate Model
https://www.ictp.it/esp/about

RegCM first benchmarks ESiWACE2 #16

Open · goord opened this issue 2 years ago

goord commented 2 years ago

Hi, I am opening this issue to discuss my preliminary findings for the med22 benchmark case (164 x 288 grid points, 23 vertical levels). I have run a first benchmark on the Dutch national supercomputer Snellius using both the GNU and Intel compilers. The system consists of dual-socket nodes with 64 cores per CPU, and I observe that this case reaches its peak performance of about 11 simulated years per day on 4 nodes.

(Figure: scaling plot on the AMD nodes)
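For reference, a minimal Python sketch of what the reported throughput implies in wall-clock terms; the 11 SYPD figure is the one quoted above, the rest is plain unit arithmetic, and the function name is only illustrative.

```python
# Illustrative only: convert the reported throughput (~11 simulated years
# per day on 4 nodes) into wall-clock cost per simulated year.
SECONDS_PER_DAY = 86400.0

def wallclock_per_simulated_year(sypd: float) -> float:
    """Wall-clock seconds needed to simulate one model year at a given SYPD."""
    return SECONDS_PER_DAY / sypd

sypd = 11.0                       # reported peak throughput on 4 nodes
seconds = wallclock_per_simulated_year(sypd)
print(f"{sypd} SYPD -> {seconds / 3600:.1f} wall-clock hours per simulated year")
# ~2.2 hours per simulated year, i.e. a 30-year scenario in roughly 2.7 days
```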

@graziano-giuliani, is this in line with your own performance numbers, or am I missing crucial flags/options? By the way, the Intel MPI seems to do a slightly worse job here...

graziano-giuliani commented 2 years ago

@goord: The behavior is typical of this class of finite-difference stenciled atmospheric codes on distributed-memory machines. At some point the communication overhead exceeds the computational work: we increase the number of communications while reducing the data payload, and we decrease the amount of computation per core. The system is then mostly CPU-idle, waiting on the interconnect, and adding more processors can eventually even increase the total execution time. Our rule of thumb is that we "saturate" once the per-core computational patch gets smaller than 10x10, which in this case puts the maximum useful core count at roughly 16x28 = 448. The minimum "theoretical" computational patch size is 3x3, but as you see in the graph we saturate well before that limit, hence the rule of thumb.

Mind that this is a "small" domain: the 3 km ALPS grid we are using is 608x578x41, and we currently run it on 17 nodes with 48 cores each on Marconi Skylake. In that case the limit for us is instead the total node memory: lacking a (reliable) parallel filesystem, we must collect the output data on one of the nodes, and we are afflicted by that memory asymmetry; attempting to open a per-core view on the output kills the GPFS. With more CPUs you can solve bigger problems, but not the same problem faster, once you are bound by the system's MPI performance.

About gfortran/intel: I have not tested on AMD platforms since... well, say, Athlon XP times. We had users running Zen2 with PGI compilers somewhere in 2019 in Trento. I have zero experience with the newer Zen3 CPUs, but I would not be surprised if GNU optimization has a margin over ifort on recent non-Intel CPUs. The Intel MPI moreover must be tuned per platform (interconnect), and with newer cards it may be configured with conservative settings: some IB cards may get unstable if pushed too far.
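A minimal sketch of the 10x10 rule of thumb above, applied to the grid sizes mentioned in this thread (med22: 164x288, ALPS 3 km: 608x578). The helper name is illustrative, and for the ALPS run the practical limit is node memory as noted, not this count.

```python
# The domain is decomposed into per-core patches; scaling flattens once a
# patch shrinks below roughly 10x10 grid points (the rule of thumb above).
def max_useful_cores(nx: int, ny: int, min_patch: int = 10) -> int:
    """Upper bound on cores before per-core patches drop below min_patch."""
    return (nx // min_patch) * (ny // min_patch)

print("med22 :", max_useful_cores(164, 288))   # 16 x 28 = 448 cores
print("ALPS  :", max_useful_cores(608, 578))   # 60 x 57 = 3420 cores (memory-bound in practice)
```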

goord commented 2 years ago

Hi @graziano-giuliani, thanks for the feedback. If the med22 case is representative (in terms of physics and transport schemes etc.) of the higher-resolution test cases, then I propose we profile the application on 1 or 2 nodes to make sure we're not communication-bound.
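As a rough back-of-the-envelope check before profiling, a sketch of why smaller per-core patches become communication-bound: assuming a square patch and a one-point halo (both assumptions for illustration, not measured RegCM numbers), the ratio of halo points exchanged to interior points computed grows quickly as the patch shrinks.

```python
# Ratio of halo points exchanged per step to interior points computed,
# for a square n x n patch with a halo of width `halo` (illustrative values).
def halo_to_interior_ratio(n: int, halo: int = 1) -> float:
    """Halo points around an n x n patch divided by the n x n interior."""
    halo_points = (n + 2 * halo) ** 2 - n ** 2
    return halo_points / (n * n)

for n in (40, 20, 10, 5):
    print(f"{n:>2} x {n:<2} patch: halo/interior = {halo_to_interior_ratio(n):.2f}")
# 40x40 -> 0.10, 20x20 -> 0.21, 10x10 -> 0.44, 5x5 -> 0.96
```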