Closed jmstone closed 2 months ago
In GitLab by @jmstone216 on Jan 28, 2023, 11:42
requested review from @pdmullen
In GitLab by @pdmullen on Jan 28, 2023, 13:20
added 1 commit
In GitLab by @pdmullen on Jan 28, 2023, 13:25
Commented on src/driver/driver.cpp line 412
:thumbsup:
In GitLab by @pdmullen on Jan 28, 2023, 13:50
Commented on src/mesh/load_balance.cpp line 247
if (!(no_errors)) {
// ...
}
In GitLab by @pdmullen on Jan 28, 2023, 13:51
Commented on src/mesh/load_balance.cpp line 247
also below
In GitLab by @pdmullen on Jan 28, 2023, 13:53
Commented on src/mesh/load_balance.cpp line 136
Radiation?
In GitLab by @pdmullen on Jan 28, 2023, 14:21
added 1 commit
In GitLab by @pdmullen on Jan 28, 2023, 14:23
Looks great to me, @jmstone216. On my end, I checked that restart files are back and working now.
In the course of this MR, didn't you discover that MPI tagging places a hard limit on the number of MeshBlock
s we can use? Did I miss that check somewhere?
In GitLab by @pdmullen on Jan 28, 2023, 14:23
approved this merge request
In GitLab by @jmstone216 on Jan 28, 2023, 15:17
You are right, I added a new macro to athena.hpp which limits the maximum number of MeshBlocks per rank to be 2^14 = 16384. This would ensure we would have no problems with Intel MPI, which has the strictest limits on the number of tags allowed in any MPI implementation of which I am aware. I suppose I should add a check in build_tree to make sure this limit is not exceeded. Agreed?
In GitLab by @jmstone216 on Jan 28, 2023, 15:18
Thanks for all the other feedback. I'll make some updates based on your suggestions and merge in the next day or two.
In GitLab by @jmstone216 on Jan 28, 2023, 15:20
Commented on src/mesh/load_balance.cpp line 136
Not added yet. I didn't want to add anything I can't test thoroughly. As soon as we want to run an application with radiation and AMR, we can implement this and test. I could add a trap so the code quits if you try to run AMR+Radiation, if you think that is a good idea.
In GitLab by @jmstone216 on Jan 28, 2023, 15:21
Commented on src/mesh/load_balance.cpp line 247
Done.
In GitLab by @pdmullen on Jan 28, 2023, 15:28
Commented on src/mesh/load_balance.cpp line 136
I think that is a good idea. Might also need one for ion-neutral?
In GitLab by @pdmullen on Jan 28, 2023, 15:28
Agreed.
In GitLab by @jmstone216 on Jan 28, 2023, 15:34
Commented on src/mesh/load_balance.cpp line 136
OK, I'll add a trap for radiation. Ion-neutral should work (but it is untested) the way I wrote it since it is just Hydro+MHD.
In GitLab by @jmstone216 on Jan 28, 2023, 15:55
Commented on src/mesh/load_balance.cpp line 247
changed this line in version 4 of the diff
In GitLab by @jmstone216 on Jan 28, 2023, 15:55
added 2 commits
In GitLab by @pdmullen on Jan 28, 2023, 16:39
LGTM :smile:
In GitLab by @jmstone216 on Jan 29, 2023, 10:18
mentioned in commit fe2ec4036fc151a4a1db25a46fdeab0048d6ea6f
In GitLab by @jmstone216 on Jan 28, 2023, 11:42
_Merges loadbalancing -> master
Implements all the functionality needed for load balancing with AMR on multiple MPI ranks. Significant changes were required from the functions implemented in Athena++ because of the way we use 5D Views to store data over MeshBlockPacks. However, I think I've found a way to make moving data withing Views to load balance and preserve the Z-ordering of MeshBlock elements that is reasonably efficient.
This has been tested with both Hydro and MHD (Newtonian) on up to 22 ranks. Compiles and runs on GPUs, but I have not tested load balancing thoroughly on multiple GPUs. For that I need a usage scenario. I think it is best to merge the load balancing infrastructure that exists now to master, and then to start using it for, e.g. dynamical relativity problems, and then debug whatever problems arise then.
I also have not measure performance and scaling except very crudely. I think a head-to-head comparison of AMR scaling and performance on multiple GPUs with Parthenon would be VERY informative, and should be pursued immediately.
One potential issue is memory usage. The current design requires adding additional send and receive buffers for load balancing, as well as fixing the maximum number of MeshBlocks allowed per rank at run time. If the latter is exceeded, the code simply exits. Better memory management and usage is probably warranted. But again, until AMR gets used heavily in real applications, then there is no point in developing more features, or trying to debug issues that might not actually be problems in real applications.