IAS-Astrophysics / athenak

Performance-portable version of the Athena++ astrophysical AMR-MHD code using Kokkos.
BSD 3-Clause "New" or "Revised" License
32 stars 20 forks source link

Load balancing! - [merged] #513

Closed jmstone closed 2 months ago

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 11:42

_Merges loadbalancing -> master

Implements all the functionality needed for load balancing with AMR on multiple MPI ranks. Significant changes were required from the functions implemented in Athena++ because of the way we use 5D Views to store data over MeshBlockPacks. However, I think I've found a way to make moving data withing Views to load balance and preserve the Z-ordering of MeshBlock elements that is reasonably efficient.

This has been tested with both Hydro and MHD (Newtonian) on up to 22 ranks. Compiles and runs on GPUs, but I have not tested load balancing thoroughly on multiple GPUs. For that I need a usage scenario. I think it is best to merge the load balancing infrastructure that exists now to master, and then to start using it for, e.g. dynamical relativity problems, and then debug whatever problems arise then.

I also have not measure performance and scaling except very crudely. I think a head-to-head comparison of AMR scaling and performance on multiple GPUs with Parthenon would be VERY informative, and should be pursued immediately.

One potential issue is memory usage. The current design requires adding additional send and receive buffers for load balancing, as well as fixing the maximum number of MeshBlocks allowed per rank at run time. If the latter is exceeded, the code simply exits. Better memory management and usage is probably warranted. But again, until AMR gets used heavily in real applications, then there is no point in developing more features, or trying to debug issues that might not actually be problems in real applications.

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 11:42

requested review from @pdmullen

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 13:20

added 1 commit

Compare with previous version

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 13:25

Commented on src/driver/driver.cpp line 412

:thumbsup:

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 13:50

Commented on src/mesh/load_balance.cpp line 247

  if (!(no_errors)) {
    // ...
  }
jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 13:51

Commented on src/mesh/load_balance.cpp line 247

also below

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 13:53

Commented on src/mesh/load_balance.cpp line 136

Radiation?

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 14:21

added 1 commit

Compare with previous version

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 14:23

Looks great to me, @jmstone216. On my end, I checked that restart files are back and working now.

In the course of this MR, didn't you discover that MPI tagging places a hard limit on the number of MeshBlocks we can use? Did I miss that check somewhere?

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 14:23

approved this merge request

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:17

You are right, I added a new macro to athena.hpp which limits the maximum number of MeshBlocks per rank to be 2^14 = 16384. This would ensure we would have no problems with Intel MPI, which has the strictest limits on the number of tags allowed in any MPI implementation of which I am aware. I suppose I should add a check in build_tree to make sure this limit is not exceeded. Agreed?

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:18

Thanks for all the other feedback. I'll make some updates based on your suggestions and merge in the next day or two.

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:20

Commented on src/mesh/load_balance.cpp line 136

Not added yet. I didn't want to add anything I can't test thoroughly. As soon as we want to run an application with radiation and AMR, we can implement this and test. I could add a trap so the code quits if you try to run AMR+Radiation, if you think that is a good idea.

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:21

Commented on src/mesh/load_balance.cpp line 247

Done.

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 15:28

Commented on src/mesh/load_balance.cpp line 136

I think that is a good idea. Might also need one for ion-neutral?

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 15:28

Agreed.

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:34

Commented on src/mesh/load_balance.cpp line 136

OK, I'll add a trap for radiation. Ion-neutral should work (but it is untested) the way I wrote it since it is just Hydro+MHD.

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:55

Commented on src/mesh/load_balance.cpp line 247

changed this line in version 4 of the diff

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 28, 2023, 15:55

added 2 commits

Compare with previous version

jmstone commented 1 year ago

In GitLab by @pdmullen on Jan 28, 2023, 16:39

LGTM :smile:

jmstone commented 1 year ago

In GitLab by @jmstone216 on Jan 29, 2023, 10:18

mentioned in commit fe2ec4036fc151a4a1db25a46fdeab0048d6ea6f