Load balancing! - [merged] #513

Closed jmstone closed 2 months ago

In GitLab by @jmstone216 on Jan 28, 2023, 11:42

_Merges loadbalancing -> master_

Implements all the functionality needed for load balancing with AMR on multiple MPI ranks. Significant changes were required relative to the functions implemented in Athena++ because of the way we use 5D Views to store data over MeshBlockPacks. However, I think I've found a reasonably efficient way to move data within Views that both balances the load and preserves the Z-ordering of MeshBlock elements.
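
To illustrate what preserving Z-ordering means for the rank assignment, here is a minimal sketch. It is not the code in this MR: `AssignRanks` and its cost-vector argument are hypothetical names, and the greedy splitting rule is an assumption chosen for brevity. Given a per-MeshBlock cost listed in Z-order, it gives each rank one contiguous range of the list, so the ordering is unchanged across ranks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the function in load_balance.cpp).
// cost[i] is the cost of the i-th MeshBlock in Z-order. Returns the rank that
// owns each block; every rank owns one contiguous range of the Z-ordered list.
std::vector<int> AssignRanks(const std::vector<double> &cost, int nranks) {
  assert(nranks > 0);
  std::vector<int> rank_of(cost.size(), 0);

  double remaining = 0.0;  // cost not yet committed to a finished rank
  for (double c : cost) remaining += c;

  int rank = 0;
  double target = remaining / nranks;  // share each remaining rank should get
  double assigned = 0.0;               // cost given to the current rank so far

  for (std::size_t i = 0; i < cost.size(); ++i) {
    // Move on to the next rank once the current one has reached its share,
    // recomputing the share from what is left so later ranks stay balanced.
    if (assigned >= target && rank < nranks - 1) {
      remaining -= assigned;
      ++rank;
      target = remaining / (nranks - rank);
      assigned = 0.0;
    }
    rank_of[i] = rank;
    assigned += cost[i];
  }
  return rank_of;
}
```

Because the returned ranks never decrease along the list, any block that changes owner after refinement moves to a neighbouring range, which is what makes it possible to shift data within a View instead of reshuffling it.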

This has been tested with both Hydro and MHD (Newtonian) on up to 22 ranks. It compiles and runs on GPUs, but I have not tested load balancing thoroughly on multiple GPUs. For that I need a usage scenario. I think it is best to merge the load balancing infrastructure that exists now to master, start using it for, e.g., dynamical relativity problems, and debug whatever problems arise at that point.

I also have not measured performance and scaling except very crudely. I think a head-to-head comparison of AMR scaling and performance on multiple GPUs with Parthenon would be VERY informative, and should be pursued immediately.

One potential issue is memory usage. The current design requires additional send and receive buffers for load balancing, and it fixes the maximum number of MeshBlocks allowed per rank at run time. If the latter is exceeded, the code simply exits. Better memory management is probably warranted. But again, until AMR gets used heavily in real applications, there is no point in developing more features, or trying to debug issues that might not actually be problems in real applications.

In GitLab by @jmstone216 on Jan 28, 2023, 11:42

requested review from @pdmullen

In GitLab by @pdmullen on Jan 28, 2023, 13:20

added 1 commit

7f5a455f - CI failing due to OOM, try less used GPU

In GitLab by @pdmullen on Jan 28, 2023, 13:25

Commented on src/driver/driver.cpp line 412

:thumbsup:

In GitLab by @pdmullen on Jan 28, 2023, 13:50

Commented on src/mesh/load_balance.cpp line 247

  if (!(no_errors)) {
    // ...
  }

In GitLab by @pdmullen on Jan 28, 2023, 13:51

Commented on src/mesh/load_balance.cpp line 247

also below

In GitLab by @pdmullen on Jan 28, 2023, 13:53

Commented on src/mesh/load_balance.cpp line 136

Radiation?

In GitLab by @pdmullen on Jan 28, 2023, 14:21

added 1 commit

6d29185b - Update SMR input files to reflect design change

In GitLab by @pdmullen on Jan 28, 2023, 14:23

Looks great to me, @jmstone216. On my end, I checked that restart files are back and working now.

In the course of this MR, didn't you discover that MPI tagging places a hard limit on the number of MeshBlocks we can use? Did I miss that check somewhere?

In GitLab by @pdmullen on Jan 28, 2023, 14:23

approved this merge request

In GitLab by @jmstone216 on Jan 28, 2023, 15:17

You are right, I added a new macro to athena.hpp which limits the maximum number of MeshBlocks per rank to 2^14 = 16384. This ensures we have no problems with Intel MPI, which has the strictest limit on the number of tags of any MPI implementation of which I am aware. I suppose I should add a check in build_tree to make sure this limit is not exceeded. Agreed?
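
A check of that kind could look roughly like the sketch below. The names `kMaxMeshBlocksPerRank`, `WithinMeshBlockLimit`, and `CheckMeshBlockLimit` are hypothetical (the actual macro name in athena.hpp is not given here), and treating 16384 as an inclusive upper bound is an assumption.

```cpp
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for the macro in athena.hpp; the real name may differ.
constexpr int kMaxMeshBlocksPerRank = 1 << 14;  // 2^14 = 16384

// Returns true if the number of MeshBlocks on this rank respects the limit.
// Assumes the limit is inclusive.
bool WithinMeshBlockLimit(int nmb_this_rank) {
  return nmb_this_rank >= 0 && nmb_this_rank <= kMaxMeshBlocksPerRank;
}

// Exits with an error message if the limit is exceeded, e.g. when called
// after the tree has been built and the blocks assigned to ranks.
void CheckMeshBlockLimit(int nmb_this_rank) {
  if (!WithinMeshBlockLimit(nmb_this_rank)) {
    std::cerr << "### FATAL ERROR: " << nmb_this_rank
              << " MeshBlocks on this rank exceeds the maximum of "
              << kMaxMeshBlocksPerRank << std::endl;
    std::exit(EXIT_FAILURE);
  }
}
```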

In GitLab by @jmstone216 on Jan 28, 2023, 15:18

Thanks for all the other feedback. I'll make some updates based on your suggestions and merge in the next day or two.

In GitLab by @jmstone216 on Jan 28, 2023, 15:20

Commented on src/mesh/load_balance.cpp line 136

Not added yet. I didn't want to add anything I can't test thoroughly. As soon as we want to run an application with radiation and AMR, we can implement this and test. I could add a trap so the code quits if you try to run AMR+Radiation, if you think that is a good idea.

In GitLab by @jmstone216 on Jan 28, 2023, 15:21

Commented on src/mesh/load_balance.cpp line 247

Done.

In GitLab by @pdmullen on Jan 28, 2023, 15:28

Commented on src/mesh/load_balance.cpp line 136

I think that is a good idea. Might also need one for ion-neutral?

In GitLab by @pdmullen on Jan 28, 2023, 15:28

Agreed.

In GitLab by @jmstone216 on Jan 28, 2023, 15:34

Commented on src/mesh/load_balance.cpp line 136

OK, I'll add a trap for radiation. Ion-neutral should work the way I wrote it, since it is just Hydro+MHD, but it is untested.

In GitLab by @jmstone216 on Jan 28, 2023, 15:55