wjlei1990 opened 5 years ago
hi Wenjie,
for Titan, we see that the strong scaling falls off at about 500 MB per process. thus, with NEX = 256 we will have 4 GB per process for a 24-process simulation / 1 GB for 96 processes / ~300 MB for 384 processes. at that point we should see the communication kicking in. so, strong scaling can easily be investigated with just 3 simple simulations.
also, setting NEX = 256 will tell you what NPROC_XI values you can use: 1 / 2 / 4 / 8 / 16 / 32. on Summit, you might be able to run these benchmark simulations as low as NPROC_XI = 1.
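as a back-of-envelope check, the process counts and memory figures above can be sketched as follows (my own sketch, not from the SPECFEM3D_GLOBE scripts; it assumes the usual 6-chunk global mesh, so nprocs = 6 * NPROC_XI**2, and the ~4 GB per process at 24 processes quoted above):

```python
# Hypothetical sketch: GPU counts and rough memory per process for NEX = 256.
# Assumes a 6-chunk global mesh (nprocs = 6 * NPROC_XI**2) and a fixed total
# footprint of ~96 GB (i.e. 4 GB/process at 24 processes, as stated above).
TOTAL_MEM_GB = 4.0 * 24  # ~96 GB total for NEX = 256

for nproc_xi in (1, 2, 4, 8, 16, 32):
    nprocs = 6 * nproc_xi ** 2
    print(f"NPROC_XI = {nproc_xi:2d}: {nprocs:4d} processes, "
          f"~{TOTAL_MEM_GB / nprocs:.2f} GB per process")
```

on this crude model, the 24/96/384-process runs come out at ~4 GB / ~1 GB / ~0.25 GB per process, consistent with the figures above.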
for the model, there is not much difference between using a 1D or 3D model. important parameters affecting the performance would be tiso model or not, full tiso or not, full attenuation or not. the one chosen here was PREM, as it includes tiso. furthermore, the setup time by the mesher is shorter for a 1D model than a 3D model. therefore, the scaling won’t waste too much time for the meshing procedure which is not considered anyway in the scaling plots.
regarding the record length, these benchmark simulations all set the DO_BENCHMARK_SIM.. flag in constants.h to true. thus, the run is fixed to 300 time steps. that's fairly short, but since the flag also sets the initial wavefield to 1 everywhere (to avoid flush-to-zero issues), it won't blow up the simulation, and the scaling measurements so far have worked pretty well. you would want to plot “average run time per time step” anyway, so having more time steps would just use resources for hardly better results.
note that the reasoning here is to test code scaling. finding an optimal setup for the simulations in INCITE would involve additional parameters and tests.
also, note that for weak scaling, the script will choose different NEX values depending on the size of the simulation. for GLOBE, these simulations will all use a slightly different number of elements per slice (or process) and therefore the “load” will slightly change. for plotting weak scaling, this can be corrected by calculating the “average run time per time step PER ELEMENT”. if you then use a reference simulation with x elements per slice, you can easily plot weak scaling for an x-element simulation. as long as the memory per GPU is not falling below the critical value from above, this “correction“ works well and weak scaling should look almost perfect.
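the per-element correction described here can be sketched like this (a minimal illustration of the normalization; the function name and the example numbers are mine, not from the benchmark scripts):

```python
def corrected_time_per_step(time_per_step, elems_per_slice, ref_elems):
    """Rescale a measured time/step to an equivalent ref_elems-element slice:
    divide by the actual element count per slice, multiply by the reference."""
    return time_per_step / elems_per_slice * ref_elems

# hypothetical example: a run with 42552 elements per slice, rescaled to a
# 31968-element reference slice so it can be compared on one weak-scaling plot
t_corrected = corrected_time_per_step(3.06e-3, 42552, 31968)
print(f"corrected time/step: {t_corrected:.3e} s")
```

a run with more elements per slice is thus credited for its extra work, and runs of slightly different sizes become comparable on one curve.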
best wishes, daniel
Hi Daniel,
Thanks for the information. I just ran the strong scaling using your setup, with NEX = 256 for all cases. Here is the table of results:
NPROC_XI | nGPUs | Time Per Step (sec) |
---|---|---|
1 | 6 | Failed |
2 | 24 | 0.0121 |
4 | 96 | 0.00335 |
8 | 384 | 0.00117 |
16 | 1536 | 0.000661 |
32 | 6144 | 0.000620 |
From NPROC_XI = 8 to NPROC_XI = 16, the number of GPUs is 4 times larger while the run time is only reduced by a factor of 2. So I think we should stop before NPROC_XI = 16, because the communication kicks in as you mentioned before.
The NPROC_XI = 1 run failed while loading the mesh files in the solver. Do you know whether it blows the GPU memory or the CPU memory? If it is CPU, then we may try a different CPU layout. If it is GPU, I don't think there is much we can do.
I just checked that a GPU on Summit has 16 GB... so I think the NPROC_XI = 1 run will blow the GPU memory...
Below is the output_solver.txt file:
```
preparing mass matrices
preparing constants
preparing gravity arrays
preparing attenuation
attenuation period range min/max: 17 / 975 (s)
ATTENUATION_1D_WITH_3D_STORAGE : T
ATTENUATION_3D : F
preparing wavefields
allocating wavefields
initializing wavefields
preparing oceans arrays
number of global points on oceans = 1050625
preparing fields and constants on GPU devices
minimum memory requested : 15671.0608062744141 MB per process
loading rotation arrays
loading non-gravity/gravity arrays
loading attenuation
loading strain
loading MPI interfaces
loading oceans arrays
loading crust/mantle region
loading outer core region
loading inner core region
```
(then the solver quits...)
I just added the results for NPROC_XI = 32. It seems the computation really hit the fan :)
hi Wenjie,
great, that looks interesting. your runtime at 96 procs is consistent with the one I measured, so the scaling results seem to make sense. the 16 GB memory on the GPU is unfortunately just slightly below what we would need for the 6-procs run. obviously some GPU memory is reserved for the scheduler/GPU system processes, so not the whole 16 GB can be used :(
anyway, it's interesting to see that the strong-scaling efficiency drops much faster on these Volta V100 cards than on the Kepler K20x cards. we really need to feed these GPU beasts enough work, otherwise they get bored too quickly and stand around idle, waiting for the network to catch up.
based on your scaling, the parallel efficiency drops below 90% for nprocs > 96, that is, when the GPU memory loaded by the simulation is < 1 GB per process (for Titan, this 90% threshold was reached at about < 500 MB). so for this NEX simulation, we shouldn't run it on more than 96 GPUs, otherwise we start wasting too much of the GPU power. also, it would be interesting to see this strong scaling for a larger simulation, say NEX = 512, to confirm the 1 GB memory efficiency threshold.
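for reference, this efficiency can be computed straight from the NEX = 256 table above (a small sketch of my own; "efficiency" here is just time × nGPUs relative to the smallest run that fit in GPU memory):

```python
# Strong-scaling parallel efficiency for the NEX = 256 runs above,
# relative to the 24-GPU baseline: efficiency = (t_ref * n_ref) / (t * n).
runs = [(24, 0.0121), (96, 0.00335), (384, 0.00117),
        (1536, 0.000661), (6144, 0.000620)]
n_ref, t_ref = runs[0]
for n, t in runs:
    print(f"{n:5d} GPUs: parallel efficiency = {t_ref * n_ref / (t * n):.1%}")
```

this gives roughly 90% at 96 GPUs and ~65% at 384 GPUs, which is where the threshold above comes from.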
many thanks, daniel
Hi Daniel,
Thanks for the feedback. Let me test the NEX_XI = 512 case and let's see the numbers. If we stick with NEX_XI = 256, then we only have two data points, nGPUs = 24 and nGPUs = 96, right?
I also did the weak scaling, the result is listed in the table:
NEX_XI | nGPUs | GPU usage (%) | elems per slice | time/step (sec) | time/step per elem (sec) |
---|---|---|---|---|---|
288 | 216 | 7% | 31968 | 0.239 * 10^(-2) | 7.48 * 10^(-6) |
384 | 384 | 8% | 42552 | 0.306 * 10^(-2) | 7.20 * 10^(-6) |
480 | 600 | 8% | 42984 | 0.315 * 10^(-2) | 7.32 * 10^(-6) |
640 | 1536 | 8% | 37825 | 0.288 * 10^(-2) | 7.61 * 10^(-6) |
800 | 2400 | 9% | 47700 | 0.381 * 10^(-2) | 7.98 * 10^(-6) |
How do those numbers look? Am I picking the right numbers to make the measurements? The elems per slice value is the number I took from values_from_mesher.h, in the line:
! total elements per slice = 31968
The GPU usage is a bit low, though. The GPU memory usage varies from 1,162 MB (7%) to 1,456 MB (9%).
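The per-element times in the table can be turned into a weak-scaling efficiency (a sketch of my own, normalizing to the best per-element time in the set rather than any particular reference run):

```python
# Weak-scaling efficiency from the time/step-per-element column above
# (values in microseconds), normalized to the fastest per-element time.
per_elem_us = {216: 7.48, 384: 7.20, 600: 7.32, 1536: 7.61, 2400: 7.98}
best = min(per_elem_us.values())
for ngpus, t in sorted(per_elem_us.items()):
    print(f"{ngpus:4d} GPUs: weak-scaling efficiency = {best / t:.1%}")
```

On this normalization, all runs stay within roughly 90-100% efficiency.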
Hi all,
I ran the strong scaling benchmark for NEX = 512. It could not run on 24 GPUs (it gave a memory error).
NPROC_XI | nGPUs | Time Per Step (sec) |
---|---|---|
4 | 96 | 0.020275 |
8 | 384 | 0.005388 |
16 | 1536 | 0.001892 |
32 | 6144 | 0.001175 |
It looks like the code is more efficient when we run with NEX = 512.
Hi,
After fixing the compile error, I also ran the strong scaling benchmark for the NEX = 512 case.
The result is listed below:
NPROC_XI | nGPUs | Time per step (sec) |
---|---|---|
4 | 96 | 0.02019 |
8 | 384 | 0.005378 |
16 | 1536 | 0.001896 |
The result is very close to Ridvan's run. So I think these results are stable and reliable.
hi Wenjie and Ridvan,
thanks for the benchmark numbers, they look pretty consistent.
still, that is just a shift of the scaling curve. when I plot them, the weak-scaling efficiency stays between 90-100% (figures in the results/ folder). that's fair enough. it would likely be better with a higher NEX value (i.e., increasing the FACTOR parameter in the weak-scaling script to, say, 2); at least that is what I saw on Piz Daint. so, if we want to burn some more core-hours, we could do that for weak scaling in the future.
best wishes, daniel
Hi Daniel,
Thanks for putting all this together. I have some questions regarding the parameters used for benchmarking on Summit@ORNL.
For strong scaling, how do I pick the parameters, including:
From my previous experience, it is a bit tricky to pick the problem size and the number of GPUs, because using the fewest GPUs may blow the GPU memory.