wjlei1990 opened 5 years ago
hi Wenjie,
for Titan, we see that the strong scaling falls off at about 500 MB per process. thus, with NEX = 256 we will have 4 GB per process for a 24-process simulation / 1 GB for 96 processes / ~300 MB for 384 processes. at that point we should see the communication kicking in. so, strong scaling can easily be investigated with just 3 simple simulations.
also, setting NEX = 256 will tell you what NPROC_XI values you can use: 1 / 2 / 4 / 8 / 16 / 32. on Summit, you might be able to run these benchmark simulations as low as NPROC_XI = 1.
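as a back-of-envelope check, the process counts and memory figures above can be sketched as follows (my own sketch, not from the SPECFEM3D_GLOBE scripts; it assumes the usual 6-chunk global mesh, so nprocs = 6 * NPROC_XI**2, and the ~4 GB per process at 24 processes quoted above):

```python
# Hypothetical sketch: GPU counts and rough memory per process for NEX = 256.
# Assumes a 6-chunk global mesh (nprocs = 6 * NPROC_XI**2) and a fixed total
# footprint of ~96 GB (i.e. 4 GB/process at 24 processes, as stated above).
TOTAL_MEM_GB = 4.0 * 24  # ~96 GB total for NEX = 256

for nproc_xi in (1, 2, 4, 8, 16, 32):
    nprocs = 6 * nproc_xi ** 2
    print(f"NPROC_XI = {nproc_xi:2d}: {nprocs:4d} processes, "
          f"~{TOTAL_MEM_GB / nprocs:.2f} GB per process")
```

on this crude model, the 24/96/384-process runs come out at ~4 GB / ~1 GB / ~0.25 GB per process, consistent with the figures above.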
for the model, there is not much difference between using a 1D or 3D model. important parameters affecting the performance would be tiso model or not, full tiso or not, full attenuation or not. the one chosen here was PREM, as it includes tiso. furthermore, the setup time by the mesher is shorter for a 1D model than a 3D model. therefore, the scaling won’t waste too much time for the meshing procedure which is not considered anyway in the scaling plots.
regarding the record length, these benchmark simulations all set the DO_BENCHMARK_SIM.. flag in constants.h to true. thus, the run is fixed to 300 time steps. that's fairly short, but since the flag also sets the initial wavefield to 1 everywhere (to avoid flush-to-zero issues), it won't blow up the simulation, and the scaling measurements so far have worked pretty well. you would want to plot “average run time per time step” anyway, so having more time steps would just use resources for hardly better results.
note that the reasoning here is to test code scaling. finding an optimal setup for the simulations in INCITE would involve additional parameters and tests.
also, note that for weak scaling, the script will choose different NEX values depending on the size of the simulation. for GLOBE, these simulations will all use a slightly different number of elements per slice (or process) and therefore the “load” will slightly change. for plotting weak scaling, this can be corrected by calculating the “average run time per time step PER ELEMENT”. if you then use a reference simulation with x elements per slice, you can easily plot weak scaling for an x-element simulation. as long as the memory per GPU is not falling below the critical value from above, this “correction“ works well and weak scaling should look almost perfect.
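the per-element correction described here can be sketched like this (a minimal illustration of the normalization; the function name and the example numbers are mine, not from the benchmark scripts):

```python
def corrected_time_per_step(time_per_step, elems_per_slice, ref_elems):
    """Rescale a measured time/step to an equivalent ref_elems-element slice:
    divide by the actual element count per slice, multiply by the reference."""
    return time_per_step / elems_per_slice * ref_elems

# hypothetical example: a run with 42552 elements per slice, rescaled to a
# 31968-element reference slice so it can be compared on one weak-scaling plot
t_corrected = corrected_time_per_step(3.06e-3, 42552, 31968)
print(f"corrected time/step: {t_corrected:.3e} s")
```

a run with more elements per slice is thus credited for its extra work, and runs of slightly different sizes become comparable on one curve.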
best wishes, daniel
Hi Daniel,
Thanks for the information. I just ran the strong scaling using your setup, with NEX = 256 for all cases. Here is the table of results:
NPROC_XI | nGPUs | Time Per Step (sec) |
---|---|---|
1 | 6 | Failed |
2 | 24 | 0.0121 |
4 | 96 | 0.00335 |
8 | 384 | 0.00117 |
16 | 1536 | 0.000661 |
32 | 6144 | 0.000620 |
From NPROC_XI = 8 to NPROC_XI = 16, the number of GPUs is 4 times larger while the run time is only reduced by a factor of 2. So I think we should stop before NPROC_XI = 16, because the communication kicks in as you mentioned before.
The NPROC_XI = 1 run failed while loading the mesh files in the solver. Do you know whether it blows the GPU memory or the CPU memory? If it is CPU, then we may try a different CPU layout. If it is GPU, I don't think there is much we can do.
I just checked that a GPU on Summit has 16 GB... so I think the NPROC_XI = 1 run will blow the GPU memory...
Below is the output_solver.txt file:
```
preparing mass matrices
preparing constants
preparing gravity arrays
preparing attenuation
attenuation period range min/max: 17 / 975 (s)
ATTENUATION_1D_WITH_3D_STORAGE : T
ATTENUATION_3D : F
preparing wavefields
allocating wavefields
initializing wavefields
preparing oceans arrays
number of global points on oceans = 1050625
preparing fields and constants on GPU devices
minimum memory requested : 15671.0608062744141 MB per process
loading rotation arrays
loading non-gravity/gravity arrays
loading attenuation
loading strain
loading MPI interfaces
loading oceans arrays
loading crust/mantle region
loading outer core region
loading inner core region
```
(then the solver quits...)
I just added the results for NPROC_XI = 32. It seems the computation really hit the fan :)
hi Wenjie,
great, that looks interesting. your runtime at 96 procs is consistent with the one I measured, so the scaling results seem to make sense. the 16 GB memory on the GPU is unfortunately just slightly below what we would need for the 6-procs run. obviously some GPU memory is reserved for the scheduler/GPU system processes, so not the whole 16 GB can be used :(
anyway, it's interesting to see that the strong-scaling efficiency drops much faster on these Volta V100 cards than on the Kepler K20x cards. we really need to feed these GPU beasts enough work, otherwise they get bored too quickly and stand around idle, waiting for the network to catch up.
based on your scaling, the parallel efficiency drops below 90% for nprocs > 96, that is, when the GPU memory loaded by the simulation is < 1 GB per process (for Titan, this 90% threshold was reached at about < 500 MB). so for this NEX simulation, we shouldn't run it on more than 96 GPUs, otherwise we start wasting too much of the GPU power. also, it would be interesting to see this strong scaling for a larger simulation, say NEX = 512, to confirm the 1 GB memory efficiency threshold.
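for reference, this efficiency can be computed straight from the NEX = 256 table above (a small sketch of my own; "efficiency" here is just time × nGPUs relative to the smallest run that fit in GPU memory):

```python
# Strong-scaling parallel efficiency for the NEX = 256 runs above,
# relative to the 24-GPU baseline: efficiency = (t_ref * n_ref) / (t * n).
runs = [(24, 0.0121), (96, 0.00335), (384, 0.00117),
        (1536, 0.000661), (6144, 0.000620)]
n_ref, t_ref = runs[0]
for n, t in runs:
    print(f"{n:5d} GPUs: parallel efficiency = {t_ref * n_ref / (t * n):.1%}")
```

this gives roughly 90% at 96 GPUs and ~65% at 384 GPUs, which is where the threshold above comes from.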
many thanks, daniel
Hi Daniel,
Thanks for the feedback. Let me test the NEX_XI = 512 case and let's see the numbers. If we stick with NEX_XI = 256, then we only have two data points, nGPUs = 24 and nGPUs = 96, right?
I also did the weak scaling, the result is listed in the table:
NEX_XI | nGPUs | GPU usage (%) | elems per slice | time/step (sec) | time/step per elem (sec) |
---|---|---|---|---|---|
288 | 216 | 7% | 31968 | 0.239 * 10^(-2) | 7.48 * 10^(-6) |
384 | 384 | 8% | 42552 | 0.306 * 10^(-2) | 7.20 * 10^(-6) |
480 | 600 | 8% | 42984 | 0.315 * 10^(-2) | 7.32 * 10^(-6) |
640 | 1536 | 8% | 37825 | 0.288 * 10^(-2) | 7.61 * 10^(-6) |
800 | 2400 | 9% | 47700 | 0.381 * 10^(-2) | 7.98 * 10^(-6) |
How do those numbers look? Am I picking the right numbers to make the measurements? The elems per slice value is the number I took from values_from_mesher.h, in the line:
! total elements per slice = 31968
The GPU usage is a bit low, though. The GPU memory usage varies from 1,162 MB (7%) to 1,456 MB (9%).
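The per-element times in the table can be turned into a weak-scaling efficiency (a sketch of my own, normalizing to the best per-element time in the set rather than any particular reference run):

```python
# Weak-scaling efficiency from the time/step-per-element column above
# (values in microseconds), normalized to the fastest per-element time.
per_elem_us = {216: 7.48, 384: 7.20, 600: 7.32, 1536: 7.61, 2400: 7.98}
best = min(per_elem_us.values())
for ngpus, t in sorted(per_elem_us.items()):
    print(f"{ngpus:4d} GPUs: weak-scaling efficiency = {best / t:.1%}")
```

On this normalization, all runs stay within roughly 90-100% efficiency.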
Hi all,
I ran the strong scaling benchmark for NEX = 512. It could not run on 24 GPUs (it gave a memory error).
NPROC_XI | nGPUs | Time Per Step (sec) |
---|---|---|
4 | 96 | 0.020275 |
8 | 384 | 0.005388 |
16 | 1536 | 0.001892 |
32 | 6144 | 0.001175 |
It looks like the code is more efficient when we run with NEX = 512.
Hi,
After fixing the compile error, I also ran the strong scaling benchmark for the NEX = 512 case.
The result is listed below:
NPROC_XI | nGPUs | Time per step (sec) |
---|---|---|
4 | 96 | 0.02019 |
8 | 384 | 0.005378 |
16 | 1536 | 0.001896 |
The result is very close to Ridvan's run. So I think these results are stable and reliable.
hi Wenjie and Ridvan,
thanks for the benchmark numbers, they look pretty consistent.
still, that is just a shift of the scaling curve. when I plot them, the weak-scaling efficiency stays between 90-100% (figures in the results/ folder). that's fair enough. it would likely be better with a higher NEX value (i.e., increasing the FACTOR parameter in the weak-scaling script to, say, 2); at least that is what I saw on Piz Daint. so, if we want to burn some more core-hours, we could do that for weak scaling in the future.
best wishes, daniel
Hi Daniel,
Thanks for putting all this together. I have some questions regarding the parameters used for benchmarking on Summit@ORNL.
For strong scaling, how do I pick the parameters, including:
From my previous experience, it is a bit tricky to pick the problem size and the number of GPUs, because using the fewest GPUs may blow the GPU memory.