PrincetonUniversity / athena-public-version

(MOVED) Athena++ GRMHD code and adaptive mesh refinement (AMR) framework. Active repository --->
https://github.com/PrincetonUniversity/athena
BSD 3-Clause "New" or "Revised" License

Mesh::FindMeshBlock uses 40% of wall time #32

Closed jlippuner closed 4 years ago

jlippuner commented 4 years ago

I'm doing some simple profiling of Athena++ to see which parts use how much of the wall time. I'm running a simple hydro-only, 3D blast problem with AMR on a single node (no MPI) with 48 OpenMP threads (there are 2 CPUs with 24 cores each). Here's my configure command:

./configure.py --prob blast -omp -hdf5 --include /usr/include/hdf5/serial/ --lib_path /usr/lib/x86_64-linux-gnu/hdf5/serial/

And here's the input file:

<comment>
problem   = spherical blast wave
reference = Gardiner. T.A. & Stone, J.M., JCP, 205, 509 (2005) (for MHD version of test)
configure = --prob=blast

<job>
problem_id = Blast      # problem ID: basename of output filenames

<output1>
file_type  = hst        # History data dump
dt         = 0.000001       # time increment between outputs

<output2>
file_type  = hdf5        # Binary data dump
variable   = prim       # variables to be output
dt         = 0.01       # time increment between outputs
x3_slice   = 0.0

<output3>
file_type  = hdf5        # Binary data dump
variable   = uov        # variables to be output
dt         = 0.01       # time increment between outputs
x3_slice   = 0.0

<time>
cfl_number = 0.3        # The Courant, Friedrichs, & Lewy (CFL) Number
nlim       = -1         # cycle limit
tlim       = 1.0        # time limit
integrator  = vl2       # time integration algorithm
xorder      = 2         # order of spatial reconstruction
ncycle_out  = 1         # interval for stdout summary info

<mesh>
nx1        = 32         # Number of zones in X1-direction
x1min      = -1       # minimum value of X1
x1max      = 1        # maximum value of X1
ix1_bc     = outflow   # inner-X1 boundary flag
ox1_bc     = outflow   # outer-X1 boundary flag

nx2        = 32        # Number of zones in X2-direction
x2min      = -1      # minimum value of X2
x2max      = 1       # maximum value of X2
ix2_bc     = outflow   # inner-X2 boundary flag
ox2_bc     = outflow   # outer-X2 boundary flag

nx3        = 32         # Number of zones in X3-direction
x3min      = -1       # minimum value of X3
x3max      = 1        # maximum value of X3
ix3_bc     = outflow   # inner-X3 boundary flag
ox3_bc     = outflow   # outer-X3 boundary flag

num_threads = 48

refinement     = adaptive
numlevel       = 4
derefine_count = 5

<meshblock>
nx1 = 8
nx2 = 8
nx3 = 8

<hydro>
gamma           = 1.4 # gamma = C_p/C_v
iso_sound_speed = 0.4082482905   # equivalent to sqrt(gamma*p/d) for p=0.1, d=1

<problem>
compute_error = false  # check whether blast is spherical at end
pamb          = 0.1    # ambient pressure
prat          = 100.   # Pressure ratio initially
radius        = 0.1    # Radius of the inner sphere
thr           = 3.0

The whole evolution takes about 217 seconds of wall time. I find that the SEND_HYD task takes by far the most time of all tasks (4772 seconds of CPU time, which is 58% of the total CPU time across all tasks, not counting any OpenMP idle time). Digging deeper into this task, I found that the call to Mesh::FindMeshBlock inside BoundaryVariable::CopyVariableBufferSameProcess in BoundaryVariable::SendBoundaryBuffers takes a total of 4372 seconds (measuring only the Mesh::FindMeshBlock calls made inside the SEND_HYD task, not those made by other tasks). Assuming no OpenMP idle time, the wall time for Mesh::FindMeshBlock is thus 4372 / 48 = 91 seconds, which is 42% of the total wall time.

Are you aware that Mesh::FindMeshBlock is so expensive and might there be any ways to make it more efficient?

felker commented 4 years ago

@tomidakn does this sound unreasonable to you? It doesn't sound that crazy to me, given the number of OpenMP threads and the memory bandwidth limitation.

It would be interesting to see how it performs with 24, 12, etc. threads. It would also be worth comparing against flat MPI parallelization on-node.

jlippuner commented 4 years ago

What about storing a pointer to the MeshBlock in the NeighborBlock struct, instead of just a gid from which the MeshBlock then has to be looked up in this costly way?
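
For illustration, a rough sketch of that idea; the struct and the pmb field shown here are hypothetical, not the actual NeighborBlock layout:

class MeshBlock;  // forward declaration; the real class lives in the mesh headers

// Hypothetical sketch of the suggestion above, not the actual NeighborBlock:
// resolve the same-process neighbor's MeshBlock pointer once, when the
// neighbor lists are (re)built, instead of searching by gid on every send.
struct NeighborBlockSketch {
  int gid = -1;               // neighbor's global block id
  int rank = -1;              // owning MPI rank
  MeshBlock *pmb = nullptr;   // cached pointer; valid only for same-rank
                              // neighbors, refreshed after every AMR regrid
};

SendBoundaryBuffers could then dereference the cached pointer directly for same-process neighbors instead of calling Mesh::FindMeshBlock.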

felker commented 4 years ago

Oh, I see; I read your description too quickly and thought the time was spent copying the boundary buffers.

If that much time is spent simply locating the neighbors, it does seem to be a problem. It would still be good to test the performance of the other configurations.

jlippuner commented 4 years ago

With only OpenMP and no MPI parallelism:

24 OMP threads: 281 s total wall time, Mesh::FindMeshBlock takes 3701 CPU seconds = 3701 / 24 = 154 s wall time = 55% of total wall time.

12 OMP threads: 508 s total wall time, Mesh::FindMeshBlock takes 3667 CPU seconds = 3667 / 12 = 306 s wall time = 60% of total wall time.

c-white commented 4 years ago

Something seems weird here. Yes, FindMeshBlock() is recursively going through the tree with each call, and one could imagine storing its results (but these would need to be updated after every AMR adjustment). Still, it's not searching the tree so much as going straight for the desired node, so there should be negligibly few operations performed.

How big is the memory footprint of this simulation? Maybe the problem is chasing pointers that are never in cache? Or might there be cache coherency locks that are getting in the way, when each thread wants to search the same part of the tree simultaneously and the machine doesn't realize the tree search is read-only?

I agree flat MPI would be very interesting.

jlippuner commented 4 years ago

@c-white You may be thinking of MeshBlockTree::FindMeshBlock, where a MeshBlockTree node is looked up given a LogicalLocation. Yes, that function recursively descends into leaf nodes, so it should be O(log N).

However, I'm talking about Mesh::FindMeshBlock where a MeshBlock (not MeshBlockTree) is looked up given a global index (GID). This function iterates through the entire linked list of MeshBlocks until it finds the MeshBlock with the given GID. It is thus O(N) and not called recursively. Because this is called multiple times for each MeshBlock, the whole thing ends up being an O(N^2) operation, which would explain why it takes so much time.
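
To make the scaling concrete, here is a minimal self-contained sketch of that linear-walk pattern; the struct and names are illustrative only, not the actual Athena++ classes:

struct BlockNode {
  int gid;          // global block id
  BlockNode *next;  // next MeshBlock owned by this rank (singly linked list)
};

// O(N) lookup: walk the whole per-rank list until the gid matches.
BlockNode* FindBlockByGid(BlockNode *head, int target_gid) {
  for (BlockNode *b = head; b != nullptr; b = b->next) {
    if (b->gid == target_gid) return b;
  }
  return nullptr;  // gid not owned by this rank
}

With this pattern invoked several times per MeshBlock per substep, the total work grows as O(N^2) in the number of local blocks.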

I think flat MPI would help in the sense of making N smaller (because the search is only over MeshBlocks on the current rank), but it would not change that it's an O(N^2) operation.

The memory footprint at the peak is about 3.4 GB with roughly 4k blocks.

felker commented 4 years ago

Great analysis; it seems like we should cache the GID-to-MeshBlock pointer mapping in an array or some other STL ADT that enables O(1) or O(log N) lookup.

But keeping this data structure coherent is likely nontrivial in AMR as blocks are created and destroyed, so I will defer to @tomidakn on that.
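
As a loose illustration of such a cache (the class and method names here are hypothetical; the rebuild after each regrid is exactly the coherence burden mentioned above):

#include <unordered_map>

class MeshBlock;  // forward declaration

// Hypothetical gid -> MeshBlock* cache with O(1) average lookup. It has to be
// rebuilt (or incrementally updated) whenever AMR creates or destroys blocks.
class BlockIndexSketch {
 public:
  void Clear() { map_.clear(); }
  void Insert(int gid, MeshBlock *pmb) { map_[gid] = pmb; }
  MeshBlock* Find(int gid) const {
    auto it = map_.find(gid);
    return (it == map_.end()) ? nullptr : it->second;
  }
 private:
  std::unordered_map<int, MeshBlock*> map_;
};

The rebuild would be a single O(N) pass over the local block list after each regrid, which should be cheap next to the repeated O(N) searches it replaces.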

tomidakn commented 4 years ago

This is my fault. When I wrote it I did not expect so many MeshBlocks per rank (partly because I usually use Flat-MPI), and currently it is just following the linked list from the top whenever it needs a neighbor MeshBlock on the same rank.

I think it is straightforward to fix. First, it is calling FindMeshBlock too often. If I store the pointer in the neighbor information, it will cut the cost of the search to ~1/8 in the case of second-order MHD. Also, MeshBlocks are currently stored in a linked list, but an STL container (std::vector, std::map, or std::unordered_map) would work better.

Because the cost of FindMeshBlock is O(number of MeshBlocks per rank), Flat-MPI should work better for the same total number of MeshBlocks and cores, since each rank then owns fewer MeshBlocks.
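
For illustration, one possible vector-based layout, sketched loosely under the assumption that each rank owns a contiguous GID range (the names are hypothetical and this is not the actual fix):

#include <utility>
#include <vector>

class MeshBlock;  // forward declaration

// Loose sketch (not the actual fix): keep this rank's MeshBlocks in a vector
// and look them up by local id. Assumes each rank owns a contiguous GID range
// starting at gid_start_; the vector is rebuilt after every regrid/rebalance.
class LocalBlockListSketch {
 public:
  LocalBlockListSketch(int gid_start, std::vector<MeshBlock*> blocks)
      : gid_start_(gid_start), blocks_(std::move(blocks)) {}
  MeshBlock* Find(int gid) const {
    int lid = gid - gid_start_;  // local id on this rank
    if (lid < 0 || lid >= static_cast<int>(blocks_.size())) return nullptr;
    return blocks_[lid];         // O(1), no list or tree walk
  }
 private:
  int gid_start_;                   // first GID owned by this rank
  std::vector<MeshBlock*> blocks_;  // one entry per local MeshBlock
};

A std::unordered_map keyed by gid would drop the contiguity assumption at the cost of hashing.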

And I should not have used the same names for those functions.

Currently I am terribly busy but I'll fix it. Thank you for your report.

chaochinyang commented 4 years ago

Hi Kengo,

I actually put some thought into this when I was writing the Particles module.

Please see Particles::LinkNeighbors() in particles.cpp. It may be useful for the same purpose. If you end up working out a similar structure, I can ditch this function altogether.


felker commented 4 years ago

This is another motivation to consolidate all the partial custom implementations of singly- and doubly-linked list data structures under the C++ STL; it would be much easier to swap data structures when we encounter issues like this.

tomidakn commented 4 years ago

I think I will make something like what Chao-Chin implemented the default for the main part of the code. I expect that will improve the performance significantly, but I'd like to take this opportunity to redesign the structure. I hope, but cannot guarantee, that it will be my Xmas present.

felker commented 4 years ago

@tomidakn should we close this, since it is presumably fixed in the private repo? We could send @jlippuner the new version so that the exact same test could be run, in order to confirm this.

tomidakn commented 4 years ago

Yes, this is now fixed in the development branch.

tomidakn commented 4 years ago

Sorry it took so long.