N-BodyShop / changa

UIUC/PPL version of ChaNGa
http://hpcc.astro.washington.edu/tools/changa.html
GNU General Public License v2.0

TreePiece replication #164

Open trquinn opened 7 months ago

trquinn commented 7 months ago

This is @harshithamenon's code to replicate TreePieces, which improves performance by distributing cache requests over several processors.

robertwissing commented 7 months ago

I ran some tests on the Lamb 80 million case. It works for some numbers of tree pieces (128, 1024, 4096) but breaks for others (960, 16384, 2**16).

Error:

Reason: Ok, before it handled this, but why do we have a null pointer in the tree?!?
[52] Stack Traceback:
[52:0] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f96 CmiAbortHelper(char const, char const, char const, int, int)
[52:1] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f37 CmiAbort
[52:2] ChaNGa.mpi.smp.icc.ompi.karolina 0x724abf TreePieceReplica::fillRequestNodeFromReplica(CkCacheRequestMsg)
[52:3] ChaNGa.mpi.smp.icc.ompi.karolina 0x81dc93 CkDeliverMessageFree
[52:4] ChaNGa.mpi.smp.icc.ompi.karolina 0x81052c
[52:5] ChaNGa.mpi.smp.icc.ompi.karolina 0x810dbc _processHandler(void, CkCoreState)
[52:6] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d115 CsdScheduleForever
[52:7] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d07e CsdScheduler
[52:8] ChaNGa.mpi.smp.icc.ompi.karolina 0x9bd4f6
[52:9] libpthread.so.0 0x2b7036ad2ea5
[52:10] libc.so.6 0x2b7038dd7b0d clone

trquinn commented 7 months ago

@robertwissing see if the recent commit fixes this problem.

robertwissing commented 7 months ago

That fixed the problem, but it seems to be slightly slower with tree replication than without. I have not run it on the merger case, but I ran it on a refined dwarf (the one from your benchmark 8X, so 400M particles). I ran it on 1024 and 8192 cores, and it was a bit slower at both core counts. I saw that idle time seems to go up in the tree replication run. Below are the stats for the tenth step:

Tree replication UCX 8N:

Orb3dLB_notopo stats: maxObjLoad 0.657383
Orb3dLB_notopo stats: minWall 32.175646 maxWall 32.445406 avgWall 32.277810 maxWall/avgWall 1.005192
Orb3dLB_notopo stats: minIdle 2.700702 maxIdle 4.229554 avgIdle 3.183818 minIdle/avgIdle 0.848259
Orb3dLB_notopo stats: minPred 27.573949 maxPred 29.088928 avgPred 28.712396 maxPred/avgPred 1.013114
Orb3dLB_notopo stats: minPiece 72.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.154998 maxBg 0.407093 avgBg 0.212711 maxBg/avgBg 1.913825
Orb3dLB_notopo stats: orb migrated 78619 refine migrated 0 objects took 0.610235 seconds.
Elapsed time: 391.025
Building trees ... took 0.184952 seconds.
Elapsed time: 393.017
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating gravity and SPH took 28.6747 seconds.

Regular UCX 8N:

Orb3dLB_notopo stats: maxObjLoad 0.633993
Orb3dLB_notopo stats: minWall 30.508254 maxWall 30.743554 avgWall 30.563651 maxWall/avgWall 1.005886
Orb3dLB_notopo stats: minIdle 1.446566 maxIdle 2.361893 avgIdle 1.852712 minIdle/avgIdle 0.780783
Orb3dLB_notopo stats: minPred 27.852560 maxPred 28.718140 avgPred 28.442074 maxPred/avgPred 1.009706
Orb3dLB_notopo stats: minPiece 70.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.045028 maxBg 0.256918 avgBg 0.093987 maxBg/avgBg 2.733552
Orb3dLB_notopo stats: orb migrated 78574 refine migrated 0 objects took 0.534952 seconds.
Elapsed time: 355.906
Building trees ... took 0.181137 seconds.
Elapsed time: 356.088
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating gravity and SPH took 28.4716 seconds.

trquinn commented 7 months ago

Looking at the load balancing data, this simulation does not seem to have a difficult time load balancing, so it's not clear that tree replication is needed. The key numbers are: maxPred/avgPred is very close to 1, indicating that the balancer expects to do a very good job; and the final "Calculating gravity" number is slightly less than maxPred, indicating that the load balancing was even better than predicted. I would test on a more clustered simulation where the load balancer is obviously struggling.
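For anyone who wants to run the same check on their own logs, here is a minimal Python sketch. It assumes the log contains the "Orb3dLB_notopo stats:" output in the form pasted above; `parse_lb_stats` and `balance_report` are hypothetical helper names, not part of ChaNGa, and the sample string is copied from the tree replication run.

```python
import re

def parse_lb_stats(log_text):
    """Collect min*/max*/avg* name/value pairs from the balancer stats output."""
    # Matches tokens such as 'maxPred 29.088928' or 'maxPred/avgPred 1.013114'.
    pattern = r"\b((?:min|max|avg)\w+(?:/\w+)?)\s+([0-9]+(?:\.[0-9]+)?)"
    return {name: float(value) for name, value in re.findall(pattern, log_text)}

def balance_report(stats, gravity_seconds):
    """Apply the two checks described above (hypothetical helper)."""
    return {
        # Close to 1 means the balancer predicts an even distribution of work.
        "predicted_imbalance": stats["maxPred"] / stats["avgPred"],
        # True when the measured gravity time came in under the worst prediction.
        "beat_prediction": gravity_seconds < stats["maxPred"],
    }

# Sample taken from the "Tree replication UCX 8N" output in this thread.
sample = (
    "Orb3dLB_notopo stats: minPred 27.573949 maxPred 29.088928 "
    "avgPred 28.712396 maxPred/avgPred 1.013114"
)
stats = parse_lb_stats(sample)
report = balance_report(stats, gravity_seconds=28.6747)
print(report)
```

The sketch deliberately does not pick a cutoff for "close to 1"; that judgment depends on the simulation.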

robertwissing commented 7 months ago

I tried to commit but got permission denied. The tree replication needs to be added to the tree build in starform.cpp as well:

// Need to build tree since we just did a drift.
buildTree(PHASE_FEEDBACK);

I ran the merger case, which is more clustered, and here I do get quite an improvement, as can be seen below (for 4096 CPUs).

I also ran this simulation with more tree pieces (42000 -> 160000) in an attempt to increase the minPiece number, but instead got minPiece: 0 in these runs. I am not sure why that is happening.

WITH TREE REPLICATION:

[Orb3dLB_notopo] sorting


Orb3dLB_notopo stats: maxObjLoad 0.749472
Orb3dLB_notopo stats: minWall 2.118554 maxWall 2.219700 avgWall 2.170132 maxWall/avgWall 1.022841
Orb3dLB_notopo stats: minIdle 1.149432 maxIdle 2.167409 avgIdle 1.427119 minIdle/avgIdle 0.805421
Orb3dLB_notopo stats: minPred 0.637064 maxPred 1.917029 avgPred 1.280334 maxPred/avgPred 1.497288
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 47.000000 avgPiece 10.937500 maxPiece/avgPiece 4.297143
Orb3dLB_notopo stats: minBg 0.047661 maxBg 0.308163 avgBg 0.197008 maxBg/avgBg 1.564221
Orb3dLB_notopo stats: orb migrated 32556 refine migrated 0 objects took 0.138386 seconds.
Elapsed time: 61.7747
Building trees ... took 0.164258 seconds.
Elapsed time: 62.1046
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating densities/divv ... took 1.099997 seconds.
Calculating pressure gradients ... took 0.302843 seconds.
Kick Close: Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.049003 seconds.
Calculating gravity and SPH took 2.03107 seconds.

REGULAR:

[Orb3dLB_notopo] sorting


Orb3dLB_notopo stats: maxObjLoad 0.762852
Orb3dLB_notopo stats: minWall 2.049248 maxWall 2.137798 avgWall 2.087568 maxWall/avgWall 1.024061
Orb3dLB_notopo stats: minIdle 1.138510 maxIdle 2.127362 avgIdle 1.377826 minIdle/avgIdle 0.826309
Orb3dLB_notopo stats: minPred 0.856364 maxPred 1.810501 avgPred 1.315091 maxPred/avgPred 1.376712
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 33.000000 avgPiece 10.937500 maxPiece/avgPiece 3.017143
Orb3dLB_notopo stats: minBg 0.006768 maxBg 0.219217 avgBg 0.116041 maxBg/avgBg 1.889133
Orb3dLB_notopo stats: orb migrated 34842 refine migrated 0 objects took 0.127405 seconds.
Elapsed time: 69.5427
Building trees ... took 0.218447 seconds.
Elapsed time: 69.7612
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating densities/divv ... took 2.152796 seconds.
Calculating pressure gradients ... took 0.310139 seconds.
Kick Close: Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.0361415 seconds.
Calculating gravity and SPH took 2.97707 seconds.
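
To put a number on the improvement, the per-phase timings of the two merger runs can be compared directly. The following is a rough Python sketch, assuming the "Calculating ... took N seconds" lines appear as in the two logs above; `phase_times` is a hypothetical helper name, not a ChaNGa tool, and the two sample strings are copied from those logs.

```python
import re

def phase_times(log_text):
    """Map each 'Calculating <phase>' label to the seconds reported for it."""
    # Phase names contain only letters, spaces and '/', so the
    # 'gravity (tree bucket, ...)' banner line is skipped.
    pattern = r"Calculating ([A-Za-z/ ]+?) (?:\.\.\. )?took ([0-9.]+) seconds"
    return {name: float(sec) for name, sec in re.findall(pattern, log_text)}

# Timing lines copied from the two 4096-CPU merger runs in this thread.
replicated_log = """\
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating densities/divv ... took 1.099997 seconds.
Calculating pressure gradients ... took 0.302843 seconds.
Calculating gravity and SPH took 2.03107 seconds.
"""
regular_log = """\
Calculating gravity (tree bucket, theta = 0.700000) ...
Calculating densities/divv ... took 2.152796 seconds.
Calculating pressure gradients ... took 0.310139 seconds.
Calculating gravity and SPH took 2.97707 seconds.
"""

replicated = phase_times(replicated_log)
regular = phase_times(regular_log)

# Speedup > 1 means the tree replication run was faster for that phase.
speedup = {phase: regular[phase] / replicated[phase] for phase in replicated}
for phase, factor in speedup.items():
    print(f"{phase}: {factor:.2f}x")
```

This compares a single step only, so it says nothing about how the ratio behaves over a full run.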

trquinn commented 7 months ago

Note: you can always do a pull request on a pull request. If you can point me to a branch on your fork, I can incorporate your changes.

trquinn commented 2 weeks ago

Robert reports another issue:

I had an issue with the tree replication code, though. When running with multi-timestepping I sometimes get this error:

------------- Processor 2664 Exiting: Called CmiAbort ------------
Reason: Why did we ask for this bucket with no particles?

It seems to happen more frequently when using more treepieces.