charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Investigate performance regression from 6.5.0 to 6.6.0 #370

Closed: ericjbohm closed this issue 10 years ago

ericjbohm commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/370


NAMD, OpenAtom, and AMR all have documented performance regressions in Charm 6.6 vs 6.5.

We need to figure out why this is so and redress the issue before 6.6 can be released.

nextRegression.txt

ericjbohm commented 5 years ago

Original date: 2013-12-05 22:46:40


Tried merging the envelope reduction commit (9c7bbe7085f5102f325e44566fdb8baff238c723) onto 6.5 and found that it performed slightly better than plain 6.5, for openatom on pami-bluegeneq-xlc.

Therefore, the envelope reduction patch is probably not at fault for the performance regression.

ericjbohm commented 5 years ago

Original date: 2013-12-05 23:46:39


Git bisect on BG/Q is a nightmare due to changes that had to accommodate floor changes. We need a way to reproduce the regression in a less volatile context.

PhilMiller commented 5 years ago

Original date: 2013-12-05 23:49:39


The AMR regression was observed on a single Mac OS node. Presumably, the same results would occur on a Linux workstation as well.

ericjbohm commented 5 years ago

Original date: 2013-12-05 23:54:08


Phil Miller wrote:

The AMR regression was observed on a single Mac OS node. Presumably, the same results would occur on a Linux workstation as well.

How do I compile and run the AMR test to reproduce the regression? Where is the source?

PhilMiller commented 5 years ago

Original date: 2013-12-06 00:01:37


git clone -b perfregression charmgit:users/alanger/amr

Should be able to just run make CHARMHOME=/path/to/built/charm in that directory and get a binary, advection.

PhilMiller commented 5 years ago

Original date: 2013-12-06 00:04:39


Wow, holy cow. I just tried the binaries I had been working with again, standalone, to give you command lines to work with, and got a consistent difference of 15%:

./advection.66 7 32 200

ericjbohm commented 5 years ago

Original date: 2013-12-06 19:15:52


The advection example works pretty well as a single-core test. I have it wrapped in a bisect script. Currently working around false positives due to bad commits that break Charm compilation (shakes fist at Jonathan).

ericjbohm commented 5 years ago

Original date: 2013-12-06 20:16:54


git bisect good f3bef8ba8336bd768b10c7c00cda1046271eb3e3

first bad commit: [358321ecfa1ef602e91415d6ff0de2bdb8dd239b] efficiency/clenaup: remove when entry in map when the list is empty

This commit shifts performance of advection on intellect to over 6s.

There appears to be another commit which pushes performance from ~5.4s to ~5.9s; currently running a finer-threshold bisect test to find that one.

ericjbohm commented 5 years ago

Original date: 2013-12-06 20:55:26


Using git bisect run to further isolate the other performance degradation is complicated by the morass of SDAG commits that entirely break compilation. The git bisect bad indications below don't distinguish between compilation failures and the performance regression. The fact that the Charm commit ID in the runtime output doesn't correspond to the git log commit IDs is not at all helpful.

git bisect log:

git bisect start
# good: [af08d99d4f86518121444410341a858594de9d95] Update changelog for release
git bisect good af08d99d4f86518121444410341a858594de9d95
# bad: [f3bef8ba8336bd768b10c7c00cda1046271eb3e3] ref and deref forallclosure
git bisect bad f3bef8ba8336bd768b10c7c00cda1046271eb3e3
# good: [ee4ebaa4ab8482d806863dd250f2725833ef9076] Revert "BGQ: modify charmrun for BGQ to accept exit status 1 as a successful exit as per"
git bisect good ee4ebaa4ab8482d806863dd250f2725833ef9076
# good: [f29a46dfa5f4c5d4e8f2f0ce6620ebead20cdd56] Cleaning up checkpoint code in message-logging.
git bisect good f29a46dfa5f4c5d4e8f2f0ce6620ebead20cdd56
# good: [730252d11de8158260a9413d5e06da0a35807f58] machine layer: checkout net from master
git bisect good 730252d11de8158260a9413d5e06da0a35807f58
# good: [dd3fd971f8a459d60fb3c7d43ff5c958262d1610] reference the marshaled message that the closure holds (inhibiting system deallocation)
git bisect good dd3fd971f8a459d60fb3c7d43ff5c958262d1610
# good: [ce991c3c63c9dbeb36d2073d92a4a2f6da51f03d] add in forgotten pup call for generated refnum field
git bisect good ce991c3c63c9dbeb36d2073d92a4a2f6da51f03d
# bad: [3e5003c678a3639084fdcd419b13f3479314acb0] reformatting and reindentation of file
git bisect bad 3e5003c678a3639084fdcd419b13f3479314acb0
# good: [98887335a1db058435ff922f34ff494152346a1e] charmxi: explicit conversion to appease xlC++
git bisect good 98887335a1db058435ff922f34ff494152346a1e
# bad: [15416286aeca7ffca274b05e1b9c33c629c6df65] fix problem with forward declarations
git bisect bad 15416286aeca7ffca274b05e1b9c33c629c6df65
# bad: [64ce490a9c03ec090a82f72e7cab48dd54186983] put PUP operators in the PUP namespace (standard convention)
git bisect bad 64ce490a9c03ec090a82f72e7cab48dd54186983
# good: [b9b95d0bc4ce7ade55ca60c360747095e1b512b0] use static_cast instead of reinterpret_cast
git bisect good b9b95d0bc4ce7ade55ca60c360747095e1b512b0
# first bad commit: [64ce490a9c03ec090a82f72e7cab48dd54186983] put PUP operators in the PUP namespace (standard convention)
ericjbohm commented 5 years ago

Original date: 2013-12-06 23:40:44


Modified the script to exit with status 125 (skip) on Charm or advection compilation failures. Sadly, the latter are so abundant that they occlude the first bad commit, leaving many candidate commits that overlap with the bugs that break advection compilation. Will try to capture the regression in a smaller openatom case. Any non-SDAG code would do for this purpose, but openatom is the only code we know has a regression and no SDAG code.

PhilMiller commented 5 years ago

Original date: 2013-12-06 23:46:59


I just tested 95a6e29dba09827ce153f4d52a8a05a120e59e20, right before the big SDAG overhaul, and nearly all of the AMR performance regression is gone: 3.6s vs 3.7s, or under 3%. So, Jonathan and I can look into the SDAG effects, and that should give you a reasonable spot to look for other applications affected.

PhilMiller commented 5 years ago

Original date: 2013-12-06 23:47:42


Ugh, I realize that there might be other changes that create regressions after that was merged. I'll do a test between the post-merge performance and the current mainline to determine that.

PhilMiller commented 5 years ago

Original date: 2013-12-06 23:56:37


OK, just tested - the performance of AMR just after the SDAG overhaul is almost identical to HEAD before today's fix. So, we have a lot of tuning to do in the new SDAG implementation.

PhilMiller commented 5 years ago

Original date: 2013-12-06 23:57:53


Sorry for more spam, but I should complete my thoughts before posting:

The rest of the regression is thus accounted for by changes before the SDAG overhaul, as I said in #14.

ericjbohm commented 5 years ago

Original date: 2013-12-07 21:22:25


Ruled out the envelope change too soon. OpenAtom has a 3% performance regression from v651 to v660 in single core execution of the water_32M_10Ry benchmark. Git bisect pins that down to:

first bad commit: [9c7bbe7085f5102f325e44566fdb8baff238c723] move type specific envelope data to the end of messages. This reduces the envelope size by avoiding the big union type.
ericjbohm commented 5 years ago

Original date: 2013-12-09 20:46:16


The regression can be replicated on charm built --with-production using tests/charm++/pingpong. Charm 6.6 performs consistently worse for all message types; it is at least 10% for 1D Array ping pong. I have attached two fairly average runs (with a high iteration count for better run-to-run consistency) from intellect.

Charm 6.6
./pgm 10 50000 +setcpuaffinity
Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.6.0-rc1-9-g9bcafc5
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Pingpong with payload: 10 iterations: 50000
Roundtrip time for 1D Arrays is 0.558581 us
Roundtrip time for 1D threaded Arrays is 4.495602 us
Roundtrip time for 2D Arrays is 0.578618 us
Roundtrip time for 3D Arrays is 0.565701 us
Roundtrip time for Fancy Arrays is 0.590420 us
Roundtrip time for Chares (reuse msgs) is 0.208721 us
Roundtrip time for Chares (new/del msgs) is 0.411181 us
Roundtrip time for threaded Chares (reuse) is 3.734522 us
Roundtrip time for Groups is 0.215597 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.255795 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.521202 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.751162 us
Roundtrip time for NodeGroups is 0.242405 us
Program finished.

Charm 6.5.1
./pgm 10 50000 +setcpuaffinity
Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.5.1-0-gaf08d99
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Pingpong with payload: 10 iterations: 50000
Roundtrip time for 1D Arrays is 0.490699 us
Roundtrip time for 1D threaded Arrays is 4.036040 us
Roundtrip time for 2D Arrays is 0.506463 us
Roundtrip time for 3D Arrays is 0.549083 us
Roundtrip time for Fancy Arrays is 0.540681 us
Roundtrip time for Chares (reuse msgs) is 0.193381 us
Roundtrip time for Chares (new/del msgs) is 0.286598 us
Roundtrip time for threaded Chares (reuse) is 3.332000 us
Roundtrip time for Groups is 0.194683 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.240316 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.352764 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.499687 us
Roundtrip time for NodeGroups is 0.205202 us
Program finished.
PhilMiller commented 5 years ago

Original date: 2013-12-09 20:55:03


Even for groups there seems to be a 10% difference. Since this shows up on 1 PE, gprof ought to be able to tell us something about what's going on - it's not like there should be any idle time to confuse matters.

ericjbohm commented 5 years ago

Original date: 2013-12-09 21:29:44


Phil Miller wrote:

Even for groups there seems to be a 10% difference. Since this shows up on 1 PE, gprof ought to be able to tell us something about what's going on - it's not like there should be any idle time to confuse matters.

I have gprof results, but they're somewhat confusing. Performance of the -pg -g binaries is fairly similar overall, as measured performance becomes dominated by instrumentation. CkLocMgr::deliver ends up with a higher overall proportion, but lower cumulative seconds and the same self seconds. gprof seems to be distorting performance too much to be useful here.

v6.6.0
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  7.63      0.10     0.10  4000012     0.00     0.00  CkLocMgr::deliver(CkMessage*, CkDeliver_t, int)
  6.87      0.19     0.09  4200061     0.00     0.00  _processHandler(void*, CkCoreState*)
  4.58      0.25     0.06  4210593     0.00     0.00  CmiStdoutFlush
  4.58      0.31     0.06  4200088     0.00     0.00  CsdNextMessage
  4.58      0.37     0.06  4200060     0.00     0.00  CqsEnqueueGeneral
  4.58      0.43     0.06  4000036     0.00     0.00  CkLocMgr::elementNrec(CkArrayIndex const&)
  3.82      0.48     0.05  2800030     0.00     0.00  _lookupGroupAndBufferIfNotThere(CkCoreState*, envelope*, _ckGroupID const&)
  3.82      0.53     0.05        2    25.00   102.87  CsdScheduleForever
  3.05      0.57     0.04  8400123     0.00     0.00  QdState::sendCount(int, int)
  3.05      0.61     0.04  2000006     0.00     0.00  CProxyElement_ArrayBase::ckSend(CkArrayMessage*, int, int) const
  2.29      0.64     0.03  3400041     0.00     0.00  CkDeliverMessageFree
  2.29      0.67     0.03   800018     0.00     0.00  IrrGroup::ckGetChareType() const
  2.29      0.70     0.03   799998     0.00     0.00  CthResume
  1.91      0.73     0.03  4000012     0.00     0.00  CkLocRec_local::type()
  1.91      0.75     0.03  2000006     0.00     0.00  CkArrayManagerDeliver

v6.5.1
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  9.45      0.19     0.19  5401727     0.00     0.00  _int_malloc
  6.72      0.33     0.14  4200061     0.00     0.00  _processHandler(void*, CkCoreState*)
  5.97      0.45     0.12  5400389     0.00     0.00  _int_free
  5.47      0.56     0.11  4000012     0.00     0.00  CkLocMgr::deliver(CkMessage*, CkDeliver_t, int)
  3.98      0.64     0.08  5000089     0.00     0.00  CsdNextMessage
  3.48      0.71     0.07  5400434     0.00     0.00  mm_free
  3.48      0.78     0.07  4000036     0.00     0.00  CkLocMgr::elementNrec(CkArrayIndex const&)
  3.48      0.85     0.07  2000006     0.00     0.00  CkArrayManagerDeliver
  2.99      0.91     0.06  8400123     0.00     0.00  QdState::sendCount(int, int)
  2.99      0.97     0.06  5013268     0.00     0.00  CmiStdoutFlush
  2.99      1.03     0.06        2    30.00   242.12  CsdScheduleForever
  2.49      1.08     0.05  5401720     0.00     0.00  mm_malloc
  2.49      1.13     0.05  5000058     0.00     0.00  CqsEnqueueGeneral
  2.49      1.18     0.05   800018     0.00     0.00  CkDelegateMgr::~CkDelegateMgr()
  1.99      1.22     0.04  2000014     0.00     0.00  CkLocRec_local::invokeEntry(CkMigratable*, void*, int, bool)
  1.99      1.26     0.04   800019     0.00     0.00  malloc_consolidate
ericjbohm commented 5 years ago

Original date: 2013-12-09 23:28:58


Based on reading the source of the envelope reduction diff, there appear to be two possible sources for the overhead:

  1. The extraSize((CkEnvelopeType)type) calls. This is a pretty simple case statement, but we could test a lookup-table solution.
  2. The extra alloc calls for the extra data segment. Where we used to have one alloc, we now have two.

I suspect the real issue is the latter. I don't see a way around that which preserves the extra-data approach, but there may be a way to include more cases in the single-allocation path. Ideally, the extra-data overhead would apply only to overhead-type features like fault tolerance, rather than to basically every Charm message we use in production, such as arrays.
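
To make option 1 concrete, here is a minimal sketch of the lookup-table idea. The enum and the sizes below are illustrative placeholders, not the actual Charm++ definitions (the 24- and 36-byte figures match the array-message extradata sizes reported later in this thread):

// Minimal sketch of option 1, with placeholder types and sizes; the real
// CkEnvelopeType enum and per-type extra sizes live in the Charm++ source.
enum CkEnvelopeTypeSketch { ForChareMsgT = 0, ForArrayEltMsgT, ArrayEltInitMsgT, NUM_ENV_TYPES };

// Current style: a switch evaluated on every message allocation.
static unsigned short extraSizeSwitch(CkEnvelopeTypeSketch t) {
  switch (t) {
    case ForArrayEltMsgT:  return 24;
    case ArrayEltInitMsgT: return 36;
    default:               return 0;
  }
}

// Proposed style: a table indexed by the type; a single load, no branches.
static const unsigned short extraSizeTable[NUM_ENV_TYPES] = { 0, 24, 36 };
static unsigned short extraSizeLookup(CkEnvelopeTypeSketch t) {
  return extraSizeTable[t];
}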

PhilMiller commented 5 years ago

Original date: 2013-12-09 23:51:53


This may be a completely bone-headed question, but what extra alloc call? I looked over the envelope-size reduction commit when it was made, and again just now, and I don't see any added allocation.

I am a bit confused about how the space in which extraData lives does get allocated, though.

-      register UInt tsize = sizeof(envelope)+ 
+      register UShort extrasize = extraSize((CkEnvelopeType)type);
+      register UInt tsize0 = sizeof(envelope)+ 
             CkMsgAlignLength(size)+
            sizeof(int)*CkPriobitsToInts(prio);
-      register envelope *env = (envelope *)CmiAlloc(tsize);
+      register UInt tsize = tsize0 + extrasize;
+      register envelope *env = (envelope *)CmiAlloc(tsize0);

I see the sizes of the main envelope, the message contents, and the priority bits being added, but then that length plus the extrasize is stored as the total length.

PhilMiller commented 5 years ago

Original date: 2013-12-09 23:54:23


Scratch the confusion in my previous comment. I see that CmiAlloc adds envMaxExtraSize to requests to ensure enough total space is available.

My question about the extra allocation call still stands.
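
A simplified sketch of that buffer arithmetic, with stand-in names rather than the actual Charm++ source:

#include <cstddef>

// Sketch: what gets requested from the allocator versus what gets recorded.
// CmiAlloc itself pads every request by envMaxExtraSize, so the extra-data
// region at the end always fits even though it is not part of the request.
//
//   [ envelope | user data (aligned) | priority ints | extra data ]
//   |<------------------ tsize0 ------------------->|
//   |<------------------------- tsize --------------------------->|
size_t allocRequest(size_t envSize, size_t alignedUserSize, size_t prioInts,
                    size_t extraSize, size_t *totalLenOut) {
  size_t tsize0 = envSize + alignedUserSize + prioInts * sizeof(int);
  *totalLenOut = tsize0 + extraSize; // stored as the message's total length
  return tsize0;                     // the size actually passed to CmiAlloc
}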

ericjbohm commented 5 years ago

Original date: 2013-12-10 00:02:00


Phil Miller wrote:

Scratch the confusion in my previous comment. I see that CmiAlloc adds envMaxExtraSize to requests to ensure enough total space is available.

My question about the extra allocation call still stands.

Sorry, thinko. I meant the extra memcpy calls in CkAllocBuffer (and also in CkCopyMsg). The actual number of allocation calls isn't that bad, but the additional load and store operations may be a problem.

PhilMiller commented 5 years ago

Original date: 2013-12-10 00:11:58


Those are extra function calls, but they should still touch less total data. Keep in mind that the envelope is smaller by the maximum number of bytes that the additional call will read and write. The calls themselves may be trouble, though.

ericjbohm commented 5 years ago

Original date: 2013-12-10 21:26:42


Replacing the sizing logic in extraSize with a return of envMaxExtraSize reduces the overhead in charm++ pingpong significantly (by about half), but not entirely. So I'm moving ahead with a lookup table scheme.

ericjbohm commented 5 years ago

Original date: 2013-12-17 00:15:49


This problem has been a bit of a tough nugget, but I've narrowed things down a little.

  1. Pingpong in Charm 6.6 spends 15% more time in _int_malloc than 6.5.1.
  2. Most of the performance issues with pingpong can be remedied by building netlrts 6.6 instead of net 6.5.1.

     I do not know why pingpong on a single core is better on netlrts, but it is.

  3. netlrts only helps openatom performance by a few percentage points relative to net 6.5.1.
  4. Replacing the case statement with a lookup table has negligible impact.
  5. Replacing the lookup with envMaxExtraSize helps pingpong slightly and has a barely measurable improvement on OpenAtom.
  6. Reducing convcore.c's envMaxExtraSize from 60 to 36 (the actual max in a normal build) has negligible impact.
nikhil-jain commented 5 years ago

Original date: 2014-01-03 05:21:09


Eric, any update on this? Did you get a chance to try the malloc-related changes, etc.?

PhilMiller commented 5 years ago

Original date: 2014-01-05 01:10:19


From a hacked up version of tests/util/check.C:

1: 16
2: 16
3: 16
4: 16
5: 8 (6)
6: 16
7: 16
8: 16
9: 4
10: 4
11: 0
12: 0
13: 0
14: 0
15: 16
16: 8 (6)
17: 36 (34)    # Array Element Init
18: 24 (22)    # Array Element message
19: 16 (10)
Info: converse header: 24 envelope: 48, padding: 1

Sizes with the extradata structs in a packed layout in parentheses where they differ.

Our notable performance loss came from the excess 12 bytes of the 36-byte extradata that array element initialization messages demand, putting (IIRC) those messages 8 bytes beyond a cache-line boundary. The difference between those and ordinary array element messages is the presence of 'listener data'.

One thing I'll note is that in general usage, we only use 2 ints (8 bytes) worth of that (one for broadcast #, one for reduction #), so we can save 4 bytes simply by lowering CK_ARRAYLISTENER_MAXLEN from 3 ints to 2.

As shown above, packing saves another 2 bytes.

Thus, we're left with a need to shave off another 2 bytes to get an overall 8 byte reduction. The obvious targets would be ifNotThere and hopCount, which occupy a byte each. The field ifNotThere only requires 2 bits (values 0, 1, 2) which means it can fit in the spare 2 bits of the bitfield in attribs. If we move hopCount into attribs as well, we displace the single byte of padding (as seen on net-linux and net-linux-x86_64, possibly different elsewhere), so no other messages get any larger.
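
As a sketch of that packing, with hypothetical member names standing in for the real envelope fields (the actual attribs layout may differ), and the byte accounting from above in the comments:

#include <cstdint>

// Hypothetical packed attribs; field names are illustrative. The savings:
//   4 bytes: CK_ARRAYLISTENER_MAXLEN lowered from 3 ints to 2
//   2 bytes: packed struct layout (36 -> 34, per the parenthesized sizes)
//   2 bytes: ifNotThere and hopCount folded in here, displacing the padding
// Total: 8 bytes, taking Array Element Init extradata from 36 to 28.
struct AttribsSketch {
  uint8_t msgIdx;
  uint8_t mtype;
  uint8_t queueing   : 4;
  uint8_t isPacked   : 1;
  uint8_t isUsed     : 1;
  uint8_t ifNotThere : 2;  // was a full byte; only holds values 0, 1, 2
  uint8_t hopCount;        // was a separate byte plus a byte of padding
};
static_assert(sizeof(AttribsSketch) == 4, "attribs fit in one 4-byte word");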

So, I've pushed the resulting code on branch pack, with the new output:

1: 16
2: 16
3: 16
4: 16
5: 6
6: 16
7: 16
8: 16
9: 4
10: 4
11: 0
12: 0
13: 0
14: 0
15: 16
16: 6
17: 28    # Array Element Init
18: 20    # Array Element message
19: 8
Info: converse header: 24 envelope: 48, sysmsg: 76

It passes make test on net-linux and net-linux-x86_64 but has not been tested elsewhere. If it improves on the microbenchmark results such that we no longer regress, then it will be worth testing elsewhere, reviewing, and potentially merging.

In the longer run, with the work on #108, we should be able to eliminate ForArrayEltMsg and ArrayEltInitMsg entirely, bringing the max envelope extra size down to 16 bytes.

nikhil-jain commented 5 years ago

Original date: 2014-01-05 04:26:07


Did you also run the regression tests (pingpong, etc.) on the pack branch to measure the effect, or should that be delegated (possibly to Eric, since he is assigned this task)?

PhilMiller commented 5 years ago

Original date: 2014-01-05 04:28:49


It would be preferable for Eric to do that test, since he's already worked through potential sources of noise or other procedural pitfalls that aren't documented here.

ericjbohm commented 5 years ago

Original date: 2014-01-07 19:05:26


I'm seeing that pingpong performance is slightly worse on the pack branch than on the current head, and both are still notably worse than 6.5.

6.5 net-linux-x86_64 built --with-production, run on intellect, result is median of 5 trials

./pgm 10 100000 +pemap 4 +setcpuaffinity
Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.5.1-1-gf3ff517
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 4
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Pingpong with payload: 10 iterations: 100000
Roundtrip time for 1D Arrays is 0.489650 us
Roundtrip time for 1D threaded Arrays is 4.056780 us
Roundtrip time for 2D Arrays is 0.507939 us
Roundtrip time for 3D Arrays is 0.554221 us
Roundtrip time for Fancy Arrays is 0.568111 us
Roundtrip time for Chares (reuse msgs) is 0.174420 us
Roundtrip time for Chares (new/del msgs) is 0.293381 us
Roundtrip time for threaded Chares (reuse) is 3.271320 us
Roundtrip time for Groups is 0.199962 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.240941 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.354757 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.518599 us
Roundtrip time for NodeGroups is 0.204158 us

Current head:
./pgm 10 100000 +pemap 4 +setcpuaffinity
Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.6.0-rc1-29-g7e18651
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 4
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Pingpong with payload: 10 iterations: 100000
Roundtrip time for 1D Arrays is 0.532172 us
Roundtrip time for 1D threaded Arrays is 4.643762 us
Roundtrip time for 2D Arrays is 0.581131 us
Roundtrip time for 3D Arrays is 0.577312 us
Roundtrip time for Fancy Arrays is 0.590029 us
Roundtrip time for Chares (reuse msgs) is 0.187180 us
Roundtrip time for Chares (new/del msgs) is 0.359881 us
Roundtrip time for threaded Chares (reuse) is 3.767910 us
Roundtrip time for Groups is 0.207100 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.256319 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.466237 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.735803 us
Roundtrip time for NodeGroups is 0.230637 us

pack:
./pgm 10 100000 +pemap 4 +setcpuaffinity
Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.6.0-rc1-22-g48693a5
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 4
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Pingpong with payload: 10 iterations: 100000
Roundtrip time for 1D Arrays is 0.554969 us
Roundtrip time for 1D threaded Arrays is 4.593840 us
Roundtrip time for 2D Arrays is 0.579851 us
Roundtrip time for 3D Arrays is 0.612371 us
Roundtrip time for Fancy Arrays is 0.584970 us
Roundtrip time for Chares (reuse msgs) is 0.198028 us
Roundtrip time for Chares (new/del msgs) is 0.355132 us
Roundtrip time for threaded Chares (reuse) is 3.783691 us
Roundtrip time for Groups is 0.211678 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.267043 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.470643 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.735240 us
Roundtrip time for NodeGroups is 0.233979 us

PhilMiller commented 5 years ago

Original date: 2014-01-07 22:21:38


Just retested the performance difference of AMR on my Mac after upgrading to 10.9. I'm seeing a 7.7% slowdown from 6.5.1 to current head. I think the Mac is generally more sensitive to whatever effect is hitting us. However, the pack branch there shows less than a 1% difference from head.

PhilMiller commented 5 years ago

Original date: 2014-01-09 20:36:12


On branch 'envelope' (Nikhil's modification to revert just the placement of the extra data, with my modification to allocate 36 bytes rather than 36 char*), AMR is running at parity with or better than 6.5.1 on Mac OS 10.9 with clang 3.4 (just released). I'll try to work down the additional changes and see if we get even faster.

ericjbohm commented 5 years ago

Original date: 2014-01-09 22:33:20


On branch envelope, 1darray pingpong for 1 byte is worse than 2d and 3d.

If you switch from ALIGN_DEFAULT to ALIGN8 in xi-symbol.C (lines 2139 and 2147), then 1darray pingpong performance improves to as good as, or better than, v651. However, 3darray gets worse. This change also brings the openatom water_32M_10Ry regression down from 1.43 s/step to 1.42 s/step (vs. v651 at 1.40 s/step). (All tests single core.)
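
For context, the alignment macros in play just round a length up to a power-of-two multiple; a minimal sketch (the actual definitions emitted by xi-symbol.C may differ):

#include <cstddef>

// Round a length up to the next multiple of 8 or 16 bytes.
#define ALIGN8(x)  (((size_t)(x) + 7)  & ~(size_t)7)
#define ALIGN16(x) (((size_t)(x) + 15) & ~(size_t)15)

// e.g. ALIGN8(36) == 40 while ALIGN16(36) == 48, so the choice shifts where
// each varsize field (and the payload behind the envelope) lands relative
// to cache-line boundaries, which is what moves the 1d vs 3d numbers here.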

nikhil-jain commented 5 years ago

Original date: 2014-01-10 17:07:23


Comparison:

pingpong with 1 Byte payload on branch envelope:

Intra-processor Pingpong..
./charmrun ./pgm +p1
Charmrun> started all node programs in 2.210 seconds.
Converse/Charm++ Commit ID: v6.6.0-rc1-31-g384f088
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Pingpong with payload: 1 iterations: 1000
Roundtrip time for 1D Arrays is 0.712872 us
Roundtrip time for 1D threaded Arrays is 3.551960 us
Roundtrip time for 2D Arrays is 0.771046 us
Roundtrip time for 3D Arrays is 0.797033 us
Roundtrip time for Fancy Arrays is 0.862837 us
Roundtrip time for Chares (reuse msgs) is 0.267029 us
Roundtrip time for Chares (new/del msgs) is 0.416040 us
Roundtrip time for threaded Chares (reuse) is 2.672195 us
Roundtrip time for Groups is 0.254154 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.323772 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.596046 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.844002 us
Roundtrip time for NodeGroups is 0.407696 us

Inter-processor Pingpong..
./charmrun ./pgm +p2
ControlSocket /tmp/nikhil@localhost:22 already exists, disabling multiplexing
Charmrun> started all node programs in 1.506 seconds.
Converse/Charm++ Commit ID: v6.6.0-rc1-31-g384f088
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
Pingpong with payload: 1 iterations: 1000
Roundtrip time for 1D Arrays is 27.834892 us
Roundtrip time for 1D threaded Arrays is 32.143831 us
Roundtrip time for 2D Arrays is 27.830124 us
Roundtrip time for 3D Arrays is 28.424025 us
Roundtrip time for Fancy Arrays is 28.324842 us
Roundtrip time for Chares (reuse msgs) is 25.724888 us
Roundtrip time for Chares (new/del msgs) is 26.219845 us
Roundtrip time for threaded Chares (reuse) is 30.842066 us
Roundtrip time for Groups is 25.584936 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 25.875092 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 26.792049 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 27.736902 us
Roundtrip time for NodeGroups is 26.469946 us

On 6.5.1:

Intra-processor Pingpong..
./charmrun ./pgm +p1
Charmrun> started all node programs in 1.482 seconds.
Converse/Charm++ Commit ID: v6.5.1-1-gf3ff517
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Pingpong with payload: 1 iterations: 1000
Roundtrip time for 1D Arrays is 0.670910 us
Roundtrip time for 1D threaded Arrays is 3.137112 us
Roundtrip time for 2D Arrays is 0.704050 us
Roundtrip time for 3D Arrays is 0.719070 us
Roundtrip time for Fancy Arrays is 0.782013 us
Roundtrip time for Chares (reuse msgs) is 0.236034 us
Roundtrip time for Chares (new/del msgs) is 0.390053 us
Roundtrip time for threaded Chares (reuse) is 2.351999 us
Roundtrip time for Groups is 0.253677 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 0.330448 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 0.500202 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 0.697613 us
Roundtrip time for NodeGroups is 0.393867 us

Inter-processor Pingpong..
./charmrun ./pgm +p2
ControlSocket /tmp/nikhil@localhost:22 already exists, disabling multiplexing
Charmrun> started all node programs in 1.509 seconds.
Converse/Charm++ Commit ID: v6.5.1-1-gf3ff517
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
Pingpong with payload: 1 iterations: 1000
Roundtrip time for 1D Arrays is 28.063059 us
Roundtrip time for 1D threaded Arrays is 31.307936 us
Roundtrip time for 2D Arrays is 27.806044 us
Roundtrip time for 3D Arrays is 27.679920 us
Roundtrip time for Fancy Arrays is 28.769016 us
Roundtrip time for Chares (reuse msgs) is 25.512218 us
Roundtrip time for Chares (new/del msgs) is 25.945902 us
Roundtrip time for threaded Chares (reuse) is 30.612946 us
Roundtrip time for Groups is 25.407076 us
Roundtrip time for Groups (1 KB pipe, no memcpy, no allocs) is 25.588989 us
Roundtrip time for Groups (1 KB pipe, no memcpy, w/ allocs) is 25.969982 us
Roundtrip time for Groups (1 KB pipe, w/ memcpy, w/ allocs) is 26.507139 us
Roundtrip time for NodeGroups is 25.780916 us

nikhil-jain commented 5 years ago

Original date: 2014-01-10 17:23:15


Result when pingpong is compiled with O3 (it makes more sense to compare these):

Column 1 is envelope, Column 2 is 6.5.1

Intra-process

Test                                        envelope    6.5.1
1D Arrays                                   0.408888    0.399113
1D threaded Arrays                          2.971888    2.675056
2D Arrays                                   0.439167    0.420094
3D Arrays                                   0.463963    0.434875
Fancy Arrays                                0.519991    0.506163
Chares (reuse msgs)                         0.185013    0.169039
Chares (new/del msgs)                       0.291824    0.262976
threaded Chares (reuse)                     2.337933    2.18606
Groups                                      0.169754    0.162125
Groups (1 KB pipe, no memcpy, no allocs)    0.216007    0.192165
Groups (1 KB pipe, no memcpy, w/ allocs)    0.420094    0.277996
Groups (1 KB pipe, w/ memcpy, w/ allocs)    0.579834    0.426292
NodeGroups                                  0.184059    0.192165

Inter-process

Test                                        envelope    6.5.1
1D Arrays                                   26.093006   25.099993
1D threaded Arrays                          31.443119   30.184031
2D Arrays                                   25.743961   25.185108
3D Arrays                                   25.627136   25.187016
Fancy Arrays                                25.781155   25.29192
Chares (reuse msgs)                         25.593042   24.99485
Chares (new/del msgs)                       25.995016   25.223017
threaded Chares (reuse)                     30.816793   29.862881
Groups                                      25.342941   24.659157
Groups (1 KB pipe, no memcpy, no allocs)    25.618076   24.836063
Groups (1 KB pipe, no memcpy, w/ allocs)    26.098013   24.880886
Groups (1 KB pipe, w/ memcpy, w/ allocs)    26.220083   25.220871
NodeGroups                                  25.472164   24.951935

ericjbohm commented 5 years ago

Original date: 2014-01-13 22:05:21


I have pushed a change to the envelope branch which makes ALIGN_DEFAULT default to ALIGN8. The align16 build option switches this to ALIGN16.

This resolves the performance regression for pingpong and mitigates the regression for openatom. I'm not entirely sold on align16 as the final name for this option, but it does the job. Use cases that need it can build with it; everyone else is unaffected.

Because the align code is inside the general xi macro for varsize messages, it would require some non-trivial refactoring to turn this into an entry-method tag. We would need to define a special trait for both the message and the entry point to handle the message and marshalled cases. I'm not convinced we have a use case for that level of discrimination, given that the ChaNGa group was happy with changing all 64-bit Charm builds to the new 16-byte-aligned scheme.
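
For the record, such a trait might look roughly like the following; this is purely a hypothetical sketch, not code from the envelope or pack branches:

#include <cstddef>

// Hypothetical: default every message type to 8-byte alignment and let a
// message (or its entry point) opt into 16 bytes via a specialization.
template <typename Msg>
struct MsgAlign { static const size_t value = 8; };

struct MyVarsizeMsg;  // some message that wants 16-byte-aligned fields
template <>
struct MsgAlign<MyVarsizeMsg> { static const size_t value = 16; };

template <typename Msg>
size_t alignFieldLen(size_t n) {
  const size_t a = MsgAlign<Msg>::value;
  return (n + a - 1) & ~(a - 1);
}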

pplimport commented 5 years ago

Original author: Yanhua Sun
Original date: 2014-01-16 15:38:43


This is interesting. That means most of the inter-node time is spent on the network or somewhere in the machine-layer code.

Nikhil, can you do the experiment on the same machine for converse pingpong?

Nikhil Jain wrote:

Result when pingpong is compiled with O3 (it makes more sense to compare these):

Column 1 is envelope, Column 2 is 6.5.1

Intra-process

Test                                        envelope    6.5.1
1D Arrays                                   0.408888    0.399113
1D threaded Arrays                          2.971888    2.675056
2D Arrays                                   0.439167    0.420094
3D Arrays                                   0.463963    0.434875
Fancy Arrays                                0.519991    0.506163
Chares (reuse msgs)                         0.185013    0.169039
Chares (new/del msgs)                       0.291824    0.262976
threaded Chares (reuse)                     2.337933    2.18606
Groups                                      0.169754    0.162125
Groups (1 KB pipe, no memcpy, no allocs)    0.216007    0.192165
Groups (1 KB pipe, no memcpy, w/ allocs)    0.420094    0.277996
Groups (1 KB pipe, w/ memcpy, w/ allocs)    0.579834    0.426292
NodeGroups                                  0.184059    0.192165

Inter-process

Test                                        envelope    6.5.1
1D Arrays                                   26.093006   25.099993
1D threaded Arrays                          31.443119   30.184031
2D Arrays                                   25.743961   25.185108
3D Arrays                                   25.627136   25.187016
Fancy Arrays                                25.781155   25.29192
Chares (reuse msgs)                         25.593042   24.99485
Chares (new/del msgs)                       25.995016   25.223017
threaded Chares (reuse)                     30.816793   29.862881
Groups                                      25.342941   24.659157
Groups (1 KB pipe, no memcpy, no allocs)    25.618076   24.836063
Groups (1 KB pipe, no memcpy, w/ allocs)    26.098013   24.880886
Groups (1 KB pipe, w/ memcpy, w/ allocs)    26.220083   25.220871
NodeGroups                                  25.472164   24.951935

ericjbohm commented 5 years ago

Original date: 2014-01-23 22:43:07


The pingpong and OpenAtom performance degradation has been resolved by reverting to the old contiguous-block envelope scheme.