Closed: jlippuner closed this issue 4 years ago
@tomidakn does this sound unreasonable to you? It doesn't sound that crazy to me, given the number of OpenMP threads and the memory bandwidth limitation.
It would be interesting to see how it performs with 24, 12, etc. threads. Also, one could compare with flat MPI parallelization on-node.
What about just storing a pointer to the MeshBlock in the NeighborBlock struct, instead of just a gid for which the MeshBlock needs to be looked up in this costly way?
Oh, I see: I read your description too quickly and thought that the time was spent copying the boundary buffers.
If that much time is spent simply locating the neighbors, it does seem to be a problem. Still would be good to test the performance of the other configurations.
With only OpenMP and no MPI parallelism:
24 OMP threads: 281 s total wall time; Mesh::FindMeshBlock takes 3701 CPU seconds = 3701 / 24 = 154 s wall time = 55% of total wall time.
12 OMP threads: 508 s total wall time; Mesh::FindMeshBlock takes 3667 CPU seconds = 3667 / 12 = 306 s wall time = 60% of total wall time.
Something seems weird here. Yes, FindMeshBlock() recursively traverses the tree on each call, and one could imagine storing its results (though these would need to be updated after every AMR adjustment). Still, it's not so much searching the tree as going straight to the desired node, so negligibly few operations should be performed.
How big is the memory footprint of this simulation? Maybe the problem is chasing pointers that are never in cache? Or might there be cache coherency locks that are getting in the way, when each thread wants to search the same part of the tree simultaneously and the machine doesn't realize the tree search is read-only?
I agree flat MPI would be very interesting.
@c-white You may be thinking of MeshBlockTree::FindMeshBlock, where a MeshBlockTree node is looked up given a LogicalLocation. Yes, that function is called recursively, stepping into leaf nodes, so it should be O(log N).
However, I'm talking about Mesh::FindMeshBlock, where a MeshBlock (not a MeshBlockTree) is looked up given a global index (GID). This function iterates through the entire linked list of MeshBlocks until it finds the one with the given GID; it is thus O(N) and not recursive. Because it is called multiple times for each MeshBlock, the whole thing ends up being an O(N^2) operation, which would explain why it takes so much time.
I think flat MPI would help in the sense of making N smaller (because the search is only over MeshBlocks on the current rank), but it would not change that it's an O(N^2) operation.
The memory footprint at the peak is about 3.4 GB with roughly 4k blocks.
Great analysis! It seems like we should cache the GID and a pointer to the MeshBlock in an array or some other STL container that enables O(1) or O(log N) lookup.
But keeping this data structure coherent is likely nontrivial in AMR as blocks are created and destroyed, so I will defer to @tomidakn on that.
This is my fault. When I wrote it I did not expect so many MeshBlocks per rank (partly because I usually use Flat-MPI), and currently it is just following the linked list from the top whenever it needs a neighbor MeshBlock on the same rank.
I think it is straightforward to fix. First, it is calling FindMeshBlock too often; if I store the pointer in the neighbor information, it will cut the cost of the search to ~1/8 in the case of second-order MHD. Also, MeshBlocks are currently stored in a linked list, but std::vector, std::map, or std::unordered_map could work better.
Because the cost of FindMeshBlock is O(number of MeshBlocks per rank), if the total number of MeshBlocks and the number of cores are the same, flat MPI should work better, as the number of MeshBlocks per rank is smaller.
And I should not have used the same names for those functions.
Currently I am terribly busy but I'll fix it. Thank you for your report.
Hi Kengo,
I actually put some thought into this when I was writing the Particles module. Please see Particles::LinkNeighbors() in particles.cpp; it may be useful for the same purpose. If in the end you work out a similar structure, I can ditch this function altogether.
Sincerely,
Chao-Chin
This is another motivation to consolidate all the partial custom implementations of singly- and doubly-linked list data structures under the C++ STL; it would be much easier to swap data structures when we encounter issues like this.
I think I will make something like what Chao-Chin implemented the default for the main part of the code. I guess that will improve the performance significantly, but I'd take this opportunity to redesign the structure. I hope, but cannot guarantee, that it will be my Xmas present.
@tomidakn should we close this, since it is presumably fixed in the private repo? We could send @jlippuner the new version so that the exact same test could be run, in order to confirm this.
Yes, this is now fixed in the development branch.
Sorry it took so long.
I'm doing some simple profiling of Athena++ to see which parts use how much of the wall time. I'm running a simple hydro-only, 3D blast problem with AMR on a single node (no MPI) with 48 OpenMP threads (there are 2 CPUs with 24 cores each). Here's my configure command:

And here's the input file:
The whole evolution takes about 217 seconds of wall time. I find that the SEND_HYD task takes by far the most time of all tasks (4772 seconds of CPU time, which is 58% of the time of all tasks, not counting any OpenMP idle time). Upon digging deeper into this task, I found that the call to Mesh::FindMeshBlock inside BoundaryVariable::CopyVariableBufferSameProcess in BoundaryVariable::SendBoundaryBuffers takes a total of 4372 seconds (this is only measuring the time of Mesh::FindMeshBlock invoked inside the SEND_HYD task, and not the time Mesh::FindMeshBlock takes when invoked in other tasks). Assuming no OpenMP idle time, the wall time for Mesh::FindMeshBlock is thus 4372 / 48 = 91 seconds, which is 42% of the total wall time.

Are you aware that Mesh::FindMeshBlock is so expensive, and might there be any way to make it more efficient?