AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

SpeedUp evaluation #689

Closed pedro-ricardo closed 3 years ago

pedro-ricardo commented 4 years ago

Hello there, firstly congrats on this AMR framework you maintain. :+1:

I'm a senior programmer on an AMR code that has been in constant development for the last 14 years in Brazil. We have well-consolidated software to handle detailed CFD of compressible and incompressible flows with multiphase, combustion, fluid-structure interaction and particle dynamics.

Since its beginning the code has been meant to be an engineering tool focused on multi-physics. But lately we face the limit of what a team of engineers can handle when the issue is memory management and computational performance.

In the search of a better and more efficient framework to port our developments to, we found AMReX.

I would like to know how you measured the parallel efficiency of AMReX. I want to reproduce your tests on our cluster and compare against our software as well, to see whether we should proceed with moving our whole code base to AMReX.

Our primary concern is the ghost cells around patches/boxes... In our experience, the number of interpolations between levels, injections of values from other patches/boxes on the same level, and parallel communications needed to converge the MLMG are simply too many for the code to be massively parallel while still guaranteeing 2nd-order accuracy.

But ... hopefully that's just our lack of knowledge!

Thanks in advance!

asalmgren commented 4 years ago

Hi Pedro, and welcome to the AMReX github! We have a number of tests (in the Tutorials folder) that you can run and play with to measure the parallel efficiency. As I'm sure you know well, the parallel performance can be very sensitive to how you set up the study. AMReX can obviously do the operations you talk about. What we call MLMG is only one part of the AMReX framework -- to us that refers to the multi-level multigrid solver but we're not sure if that's what you mean here? We'd be happy to have you become AMReX users ... please let us know how we can help!

pedro-ricardo commented 4 years ago

Thank you for the quick response, and yes I was referring to multi-level multigrid

to us that refers to the multi-level multigrid solver but we're not sure if that's what you mean here

I'm already running tutorials, and my question about "how you measured the parallel efficiency" was about a standard configuration in which you already have expected results (like a SpeedUp curve).

I'm sorry if it looks obvious, but in our experience just naively adding more processes to a case setup may produce a misleading SpeedUp curve.

For instance, imagine a case setup like this (but with more volumes at the base level): (figure: mesh_speed). Then I start adding processes and measuring time.

The MLMG will skew the SpeedUp measurement once you start slicing the domain, since its convergence rate depends on how much the grid can be coarsened. By adding too many partitions, the base-level grid on each process shrinks and, along with it, the number of coarse virtual levels one can obtain.

This ends up with the MLMG spending more iterations to converge at some process counts than at others... which I think is undesirable.

I'm sorry if it wasn't clear the first time, but the whole point of asking for a standard test was to avoid having to think about this kind of thing. (^.^)

zingale commented 4 years ago

Hi Pedro, I think Ann's suggestion of the tutorials is the way to go if you want to isolate a solver. Regarding how you design a scaling problem, I think it is as much art as science. If you want to see some examples of scaling with problems that are dominated by the MLMG solver in AMReX, we've published some here:

  1. https://ui.adsabs.harvard.edu/abs/2019ApJ...887..212F

    That's our low Mach number astrophysics code. Each step has 3 multigrid solves -- two cell-centered and one nodal, all variable-coefficient elliptic solves.

  2. https://ui.adsabs.harvard.edu/abs/2019arXiv191012578Z

    That's our compressible astro code. It solves a Poisson equation for gravity, and scaling on GPUs on the Summit supercomputer is shown in that paper.

I believe the setup files we used for those runs are in the respective code github repos, but if not, they should be and we can dig them up.

pedro-ricardo commented 4 years ago

Thanks, I'll take a look at those setups. I'll close this now, but I may reopen the issue in case I have anything else to ask about.

pedro-ricardo commented 3 years ago

Hello there, I'm reopening this issue to talk about the parallel performance of the example in Tutorials/LinearSolvers/ABecLaplacian_F, which is close to what our old AMR code did.

While the library installation on our clusters is not ready, I'm testing this on a workstation with:

Setup

Results

| N_proc | Run Time [s] | SpeedUp |
| --- | --- | --- |
| 1 | 78.7007 | 1.0000 |
| 2 | 53.4083 | 1.4736 |
| 4 | 46.3440 | 1.6982 |

Obs.: All tests were run 5 times, and the values in the table are the averages.

As you can see (from that poor SpeedUp), I'm definitely doing something wrong, so... what am I missing?

pedro-ricardo commented 3 years ago
I've tried a different approach by enlarging the domain geometry 10x in one direction and using n_cell like 1280 128 128. I also changed the computer to our cluster. This type of problem seems to have a better speedup, but I think it is losing the ratio too soon.

| N_proc | Run Time [s] | SpeedUp |
| --- | --- | --- |
| 1 | 479.8511 | 1.0000 |
| 2 | 242.3733 | 1.9798 |
| 4 | 136.3180 | 3.5201 |
| 8 | 73.7377 | 6.5075 |
| 16 | 50.1514 | 9.5680 |
| 32 | 31.3239 | 15.3190 |
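One way to read a table like this is through parallel efficiency, SpeedUp divided by the process count; a minimal sketch in Python using the run times above:

```python
# Parallel efficiency from measured run times:
# speedup = T1/Tn, efficiency = speedup/n (1.0 would be perfect scaling).
run_times = {1: 479.8511, 2: 242.3733, 4: 136.3180,
             8: 73.7377, 16: 50.1514, 32: 31.3239}

t1 = run_times[1]
for n, tn in sorted(run_times.items()):
    speedup = t1 / tn
    efficiency = speedup / n
    print(f"{n:3d}  {speedup:8.4f}  {efficiency:6.3f}")
```

At 32 processes the efficiency has already dropped below 0.5, which quantifies the "losing the ratio too soon" impression.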
pedro-ricardo commented 3 years ago

So far the things I kept an eye on:

There are many things to keep an eye on when doing this. I can't say if what you did makes sense.

@drummerdoc What do you suggest?

drummerdoc commented 3 years ago

Sorry for the odd comment. It was an accidental email response meant for a collaborator with a similar name... coincidentally, we were discussing scaling studies of combustion calculations with PeleLM. Dang thing showed up here, so I deleted my post, but not fast enough! :)

drummerdoc commented 3 years ago

But... based on the picture you showed above, I'd double the physical size and the number of cells together, first in the X and then in the Y direction, from the base 128^3, so that the problem is doing almost the identical thing each time you double the number of processes. That is, 128x128x128 with N procs, then 256x128x128 with 2N procs, then 256x256x128 with 4N procs.
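The doubling pattern described above can be sketched programmatically (a hypothetical helper, not part of AMReX; it just enumerates the n_cell values for such a series):

```python
# Weak-scaling grid sizes: start from base^3 and double one direction
# (x, then y, then z, cyclically) each time the process count doubles.
def weak_scaling_grids(base=128, doublings=3):
    n_cell = [base, base, base]
    grids = [(1, tuple(n_cell))]   # (process count, n_cell) pairs
    procs = 1
    for i in range(doublings):
        n_cell[i % 3] *= 2         # double the next direction
        procs *= 2                 # ...together with the process count
        grids.append((procs, tuple(n_cell)))
    return grids

print(weak_scaling_grids())
# [(1, (128, 128, 128)), (2, (256, 128, 128)),
#  (4, (256, 256, 128)), (8, (256, 256, 256))]
```

The point of this construction is that the work per process stays roughly constant, so the run time should stay roughly flat as processes are added.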

pedro-ricardo commented 3 years ago

I see... OK, so the SpeedUp study you do enlarges the domain in the same proportion as the number of processes increases. This way the CPU time should stay constant, right?

The one I did was to start with a huge problem and keep adding processes to the same problem; this way the CPU time should decrease in the same proportion as the processes increase.

I'll make new tests the way you suggested, thanks.

drummerdoc commented 3 years ago

Ahh shoot... in reality, what you were doing is called "strong scaling". What I suggested is called "weak scaling". They say different things.

WeiqunZhang commented 3 years ago

You probably do not want to limit the number of coarsening levels. It actually makes it unfair for the large problems contrary to what one might think, because at the bottom of a v-cycle, a bicgstab solver is called to reduce the residual by 1e-4. Solving large problems with bicgstab is expensive.

For weak scaling problems, you should try to make the domain close to a cube, unless your target problem has a narrow pencil shape domain. The reason for that is more coarsening can be achieved and more coarsening is good for multigrid solvers.

You showed a few numbers. But what are those numbers? How did you measure them?


pedro-ricardo commented 3 years ago

I was told that speedup measurement with MG must use these settings. Otherwise, by leaving the coarsening unconstrained, you can't guarantee that your processes keep the same number of operations (needed to measure speedup correctly), since the single-process case could generate more virtual coarse levels than the same problem run with many processes.

I failed to mention that I also set the residual tolerances to zero. This way the bottom solver always exits not by convergence but after the same number of iterations... and this guarantees the number of operations stays the same, independent of my partitioning.

About the numbers: they are a simple measurement of execution time, both inside the program with mpi_wtime() and outside; each simulation was repeated 5 times and the results averaged. The SpeedUp column is the ratio between the one-process CPU time and the current configuration's CPU time (T1/Tn). This measures the decrease in time and, for perfect scaling, should be close to the number of processes.
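The procedure described above (average of repeated runs, then T1/Tn) can be sketched as follows; the timing values are hypothetical, for illustration only:

```python
# Average 5 repeated wall-time measurements per process count,
# then compute speedup as T1/Tn.
from statistics import mean

# Hypothetical raw timings [s], 5 repetitions per configuration.
timings = {
    1: [78.9, 78.5, 78.6, 78.8, 78.7],
    2: [53.5, 53.3, 53.4, 53.6, 53.2],
}

avg = {n: mean(ts) for n, ts in timings.items()}
speedup = {n: avg[1] / t for n, t in avg.items()}
print(avg, speedup)  # speedup[1] == 1.0 by construction
```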

pedro-ricardo commented 3 years ago

Oops... that might cause confusion. By "same number of operations" above, please understand it at the algorithm level. The same algorithm must be evaluated with different process counts. As I understand it, in strong scaling the only thing that may differ between tests is the number of cells per process.

pedro-ricardo commented 3 years ago

Then here it is... using both SpeedUp metrics, this is what I got running the AMReX Fortran example:

Strong Scaling (plotted along some Amdahl’s Law curves)

Weak Scaling (plotted along some Gustafson’s Law curves)
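For reference, the two model laws mentioned above can be written as one-line functions, where p is the assumed parallel fraction of the work:

```python
# Amdahl's law (strong scaling):    S(n) = 1 / ((1 - p) + p/n)
# Gustafson's law (weak scaling):   S(n) = (1 - p) + p * n
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(p, n):
    return (1.0 - p) + p * n

# With p = 0.97, strong scaling saturates while weak scaling stays near linear:
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(amdahl(0.97, n), 2), round(gustafson(0.97, n), 2))
```

Fitting measured speedups against such curves is what yields an estimated parallel fraction like the p = 97% mentioned later in this thread.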

maikel commented 3 years ago

I cannot really help out, but I would still like to understand the situation, and I have some questions:

I suppose you run those tests on a cluster now? Can you post some details, such as:

  • Is this purely CPU?
  • What is the hardware per compute node? (#sockets, #CPUs)
  • How many CPUs per MPI rank do you use?
  • Do you use OpenMP?
  • What are the patch sizes per rank?

As I understand your graphs, you plot against the number of MPI ranks. Is this the only dimension of parallelization that you use?

I'm asking because the posted references use a different scale.

Maikel

pedro-ricardo commented 3 years ago

Hello @maikel, I hope this info answers your questions

Is this purely CPU?

  • Yes, no GPU in there

What is the hardware per compute node? #sockets, #cpus

  • 2 sockets, each with 10 real cores

How many CPUs per MPI rank do you use?

  • 1 to 1

Do you use OpenMP?

  • No, just MPI

What are the patch sizes per Rank?

  • You want the number of cells, right? I chose huge numbers so that communication time doesn't overcome the internal computation
  • Strong SpeedUp: all cases with 101,580,800 total cells; the partitioning is done automatically, so I don't know exactly how many per rank. If it load-balances on cells, the 32-process case would still have 3,174,400 per rank.
  • Weak SpeedUp: the 1-process case has 10,158,080 cells, and it scales proportionally up to 32 processes with 325,058,560 cells.

Is this the only dimension of parallelization that you use?

  • Yes, no GPU, no OpenMP ... just MPI
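A quick sanity check of the cell counts quoted above (pure arithmetic, assuming an ideal cell-based load balance):

```python
# Per-rank and total cell counts for the two scaling studies.
strong_total = 101_580_800   # fixed total cells for the strong-scaling study
weak_base = 10_158_080       # 1-process cell count for the weak-scaling study

per_rank_32 = strong_total // 32   # strong scaling: cells per rank at 32 procs
weak_total_32 = weak_base * 32     # weak scaling: total cells at 32 procs

print(per_rank_32, weak_total_32)
# 3174400 325058560
```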
pedro-ricardo commented 3 years ago

Also, these were the main libs loaded and their versions:

| Libs | Versions |
| --- | --- |
| GCC | 10.2.0 |
| OpenMPI | 4.0.5 |
| Hdf5 | 1.12.0 |
| Lapack | 3.9.0 |
| Hypre | 2.19.0 |
| PETSc | 3.14.1 |
| AMReX | 21.01 |
maikel commented 3 years ago

Hey @pedro-ricardo,

thank you for clarifying

Hello @maikel, I hope this info answers your questions

Is this purely CPU?

  • Yes, no GPU in there

Ok. Is this a requirement?

What is the hardware per compute node? #sockets, #cpus

  • 2 sockets each with 10 real cores

How many CPU per MPI rank do you use?

  • 1 to 1

Do you use OpenMP?

  • No, just MPI

Do you have a specific reason to ban OpenMP? In my experience, it usually helps in 3D to mitigate the communication overhead of MPI by allowing more locality, or larger Boxes. I also find it odd to test 32 processes on the hardware you describe. Does MLMG require a power of two? Instead, I would test in multiples of nodes or sockets. shrug

What are the patch sizes per Rank?

  • You want the number of cells right? I chose huge numbers to avoid communication time to overcome internal calculations

    • Strong Speedup: All cases with 101,580,800 total cells, the partitioning is done automatically so I don't know exactly how many per Rank. If it performs a load balance based on cells, in the 32 processes case will still be 3,174,400 per rank.
    • Weak Speedup: 1 process case is 10,158,080 cells and it scales proportionally until 32 processes with 325,058,560 cells.

Do you count ghost cells or just inner cells? If you really want to keep the amount of computation constant, I would try to use the exact same BoxArray across the runs and only change the DistributionMapping. But I doubt that that is what the standard configuration does. What I can imagine is that your current average box size shrinks too fast as the number of processes grows.

You might also try to play with the blocking factor or maximal grid size.

If I am allowed to make a wild guess into the blue, I would say that the amount of work per process is still too small, at least for your strong-scaling test. Unfortunately, I don't know the characteristics of MLMG very well, and someone with more experience should clarify what to expect for CPU computations.

Is this the only dimension of parallelization that you use?

  • Yes, no GPU, no OpenMP ... just MPI

And whatever happens automatically and without much control for you.. ;-)

Maikel

maikel commented 3 years ago

Are those results even bad? Just reading it, p = 97% sounds pretty high to me. What does it mean?

pedro-ricardo commented 3 years ago

Thank you. I posted the results because I thought AMReX would have SpeedUp tests up its sleeve to back the massively-parallel claim, and could therefore point out whether those curves are wrong and give the expected numbers and the directions to achieve them.

About the mesh: usually in CFD there is a magic number of 40,000 cells per process, so you can guarantee more internal computation than communication... but this varies with lots of things, so I'm running over a million cells per process there to make sure.

I think that 101,580,800 is a very high number of cells for a CFD case, and that should be enough. Also, the results seem bad because, as you can see from where that curve is going, I won't gain basically anything above, say, 128 processes. And that doesn't sound massively parallel. But maybe I'm dreaming of impossible stuff. I'll look at some literature to set up comparisons properly.

The reason for the 32 processes is that I use powers of 2 to set up cases... This should make things easier for the partitioning algorithms, since I keep the mesh as powers of 2 as well (just a preference based on previous experience). I always have 10 cores available but use only 8 on each node; I always avoid using them all (again, just a preference).

There's no particular reason not to use OpenMP; I just thought MPI would be responsible for the scaling. But if mixing them is the way to go, I can give it a try.

I'll play a little bit with blocking_factor and max_grid_size .. I kept those at default.

maxpkatz commented 3 years ago

@pedro-ricardo by timing the entire execution you're also including the setup costs which may not be fully parallelized. What do the scaling curves look like when you're only counting the solve time (which is printed out for you)?

pedro-ricardo commented 3 years ago

@maxpkatz Good point! Thank you, I'll change the calculations to that time.

pedro-ricardo commented 3 years ago

OK, I've finished running those tests again, @maxpkatz (last time I had disabled the verbose options). So far I can see that the solver time (the output from MLMG: Timers:) is actually worse for measuring the SpeedUp than the whole program, because it is too short, and some calculations, like the RHS fill, scale very well but are left out of that time.

About the max_grid_size variable I tested it and it does impact the SpeedUp measurements quite a lot

But I will close this issue, since the results I found have nothing to be compared against to check right or wrong, yet. If someone finds different results on the topic, please post them and let's discuss.