SPECFEM / specfem3d_globe

SPECFEM3D_GLOBE simulates global and regional (continental-scale) seismic wave propagation.

Using Specfem with heterogeneous machines #700

Open kpouget opened 3 years ago

kpouget commented 3 years ago

Hello Specfem developers,

I am playing with Specfem3D_Globe and getting it to run on OpenShift Kubernetes (see this video for a first demo/illustration).

I have 2 questions related to Specfem3D execution:

  1. Is there any optimization in the code related to the CPU micro-architecture? I'm not very familiar with such optimizations, but I understand that at compile time you can optimize the binary for the instruction set of one specific CPU or another. We would like to run benchmarks on a cluster with multiple micro-architectures and select the right binary (container) at launch time.
  2. When running Specfem with MPI, can we mix GPU=Cuda|OpenCL with GPU=No?

thanks,

Kevin

danielpeter commented 3 years ago

hi Kevin, great to see it set up through Kubernetes, will need to try that out soon :)

for 1, there are no specific instructions in the code. tailoring to a specific CPU architecture would happen through the compiler and corresponding flags, e.g., when running the ./configure script with specific --host or --target options.
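
as a rough sketch, building one binary per micro-architecture could look something like this (the FLAGS_CHECK variable and the -march values are illustrative, check `./configure --help` for the exact options your version supports):

```bash
# illustrative only: one build per CPU micro-architecture,
# each packaged into its own container image

# generic x86_64 build (runs everywhere, least tuned):
./configure FC=gfortran CC=gcc MPIFC=mpif90 \
    FLAGS_CHECK="-O3 -mtune=generic"
make clean all

# build tuned for e.g. Skylake-AVX512 nodes:
./configure FC=gfortran CC=gcc MPIFC=mpif90 \
    FLAGS_CHECK="-O3 -march=skylake-avx512"
make clean all
```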

for 2, setting the flag GPU_MODE = .true. or .false. runs the solver exclusively on either GPUs or CPUs. there is no setting to run a hybrid simulation with part of the MPI processes on CPU and others on GPU. for the globe version this would be a bit tricky, since one would need to change the partition sizes to balance the load, so it becomes a meshing challenge with the cubed-sphere mesher. it could however be done in the SPECFEM3D_Cartesian version by modifying the code a bit. I did this once with the Cartesian version, but found that there was little gain from adding CPU processes to the GPU ones. the GPUs were taking on pretty much all of the work, so the time-to-solution at the end was determined by how fast the GPUs were, and only a little was gained by adding CPU workers.
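
for reference, a minimal sketch of the Par_file switch in question (every MPI rank reads the same value, so the whole run is either GPU-only or CPU-only):

```
# DATA/Par_file excerpt (illustrative):
GPU_MODE                        = .true.    # set to .false. for a CPU-only run
```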

best wishes, daniel

kpouget commented 3 years ago

Hello Daniel,

> for 1, there are no specific instructions in the code. tailoring to a specific CPU architecture would happen through the compiler and corresponding flags, e.g., when running the ./configure script with specific --host or --target options.

ok, I see. Do you happen to know if Specfem is sensitive to such CPU variations?

> for 2, setting the flag GPU_MODE = .true. or .false. runs the solver exclusively on either GPUs or CPUs. there is no setting to run a hybrid simulation with part of the MPI processes on CPU and others on GPU. for the globe version this would be a bit tricky, since one would need to change the partition sizes to balance the load, so it becomes a meshing challenge with the cubed-sphere mesher. it could however be done in the SPECFEM3D_Cartesian version by modifying the code a bit. I did this once with the Cartesian version, but found that there was little gain from adding CPU processes to the GPU ones. the GPUs were taking on pretty much all of the work, so the time-to-solution at the end was determined by how fast the GPUs were, and only a little was gained by adding CPU workers.

ok, makes sense, thanks

we're currently benchmarking Specfem with classic bare-metal runs, varying mainly NEX_XI/NEX_ETA (16/32/64/128) and NPROC_XI/NPROC_ETA (running on 1/4/8/16 machines) with the default DATA problem. I wonder if other examples would be interesting to benchmark (I would like each run to take between 15 and 45 min, 1h30 at most).
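
for context, these are the Par_file parameters we vary for each benchmark point (illustrative values):

```
# DATA/Par_file excerpt -- parameters varied across benchmark points
# (total MPI ranks = NCHUNKS * NPROC_XI * NPROC_ETA):
NEX_XI                          = 128
NEX_ETA                         = 128
NPROC_XI                        = 4
NPROC_ETA                       = 4
```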

with MPI_NPROC=16 | MPI_SLOTS=4 | OMP_THREADS=2 | NEX=128 (i.e. 4 x 8-core machines), this took 1h33min.
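
for completeness, a sketch of how one of these runs is launched (Open MPI syntax; the hostfile name and mapping options are hypothetical and will differ per cluster):

```bash
# hypothetical launch of the 16-rank / 2-OpenMP-thread case above
export OMP_NUM_THREADS=2
mpirun -np 16 --hostfile hosts.txt --map-by ppr:4:node ./bin/xmeshfem3D   # mesher first
mpirun -np 16 --hostfile hosts.txt --map-by ppr:4:node ./bin/xspecfem3D   # then the solver
```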

kpouget commented 3 years ago

Hello Daniel,

FYI we published two blog posts about Specfem on OpenShift (kubernetes):

- https://www.openshift.com/blog/a-complete-guide-for-running-specfem-scientific-hpc-workload-on-red-hat-openshift
- https://www.openshift.com/blog/demonstrating-performance-capabilities-of-red-hat-openshift-for-running-scientific-hpc-workloads

it's not about "heterogeneous machines" (the title of this issue), but it is the continuation of what I mentioned above. no GPU at this stage either, but I'm currently working on that for testing purposes.

danielpeter commented 3 years ago

hi Kevin,

thanks for posting! let me add a corresponding entry in the manual.

it looks like SPECFEM - even more so than GROMACS - is showing very good performance results on such an OpenShift platform, with scaling at almost the same performance level as the bare-metal runs. this is probably due to the local communication and the overlapping of computation with communication, which help to mitigate the overhead of OpenShift's network performance.

anyway, i plan to see if we could drop the static compilation requirements of the package. your OpenShift setup might then become simpler as well - will let you know if this becomes an option.

many thanks, daniel