
Design and Proposal: Chapel Aggregation Library #10386

Open LouisJenkinsCS opened 6 years ago

LouisJenkinsCS commented 6 years ago

@bradcray @mppf @marcinz

Aggregation

Aggregation reduces communication by grouping individual units of data into collections that can be processed in bulk. It is sometimes necessary for performance when data parallelism cannot be exploited, such as with the irregular data accesses found in histograms and graph algorithms. To address this, I propose a design for an aggregation library for Chapel, which may well be the first of its kind. The library should provide an easy-to-use interface as well as some performance guarantees.

Privatization

The aggregation library should strive to eliminate as much implicit communication as possible, and so it will be privatized. Privatization, in case it is not known, is the process by which copies of the original data structure are maintained on each locale, and any operation performed on a locale is forwarded to its respective privatized data structure. Another advantage of privatization, besides performance, is that each locale can operate on its own privatized instance independently of the others, in a locale-private manner.
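
For readers unfamiliar with the pattern, here is a conceptual sketch (hypothetical names; the real library uses Chapel's internal privatization hooks rather than an explicit array of instances):

class AggregatorImpl {
  var buffers : [LocaleSpace] Buffer(int);
}

// One copy per locale...
var perLocaleInstance : [LocaleSpace] AggregatorImpl;
coforall loc in Locales do on loc do
  perLocaleInstance[here.id] = new AggregatorImpl();

// ...so an operation performed on a locale touches only its own copy,
// avoiding implicit communication back to the original instance.
on Locales[numLocales - 1] {
  var mine = perLocaleInstance[here.id];
  // ... operate on locale-private state in 'mine' ...
}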

Destination Buffers

In each privatized instance, we maintain locale-private buffers into which we aggregate multiple units of data of type dataType. Each locale maintains one buffer per locale, like so:

var buffers : [LocaleSpace] Buffer(dataType);

When buffers are about to be flushed, the down-time (the time during which aggregate operations stall waiting on the buffer) must be minimized to ensure reasonable performance. To minimize it, we employ a buffer pool that recycles buffers and creates new ones on the fly; whenever a buffer needs to be flushed, the task flushing it swaps out the current buffer for a new one, and the old buffer is handled by a task on the locale it was destined for.

The buffers must be parallel-safe, as multiple tasks can be operating on the same buffer at any given time. Beyond that, appending to the buffer must be performed in a scalable manner, so we must design a scalable algorithm. Currently, I make use of Fetch-and-Add counters rather than Compare-and-Swap, which is presumably as scalable as you can get. Flushing the buffer manually works just fine, but automatic flushing is another problem.
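
As a rough illustration of the fetch-and-add approach, here is a sketch of the append path (assuming hypothetical fields claimed : atomic int and a fixed capacity cap; this is not the actual implementation):

// Tasks claim slots with a single fetch-add; the task that fills the
// final slot becomes responsible for triggering the swap-and-flush.
proc Buffer.tryAppend(elt : dataType) : bool {
  const idx = claimed.fetchAdd(1);
  if idx >= cap then return false; // full; caller swaps in a fresh buffer from the pool
  data[idx] = elt;
  return idx == cap - 1;           // last writer flushes
}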

Flushing the Buffer

When the buffer is full, we defer processing of the buffer to the user by returning it. That is, we return a Buffer(dataType) to the user whenever they aggregate data; the buffer returned is always for the locale they asked to aggregate data to. Flushing the buffer is handled by the last task that filled it, so the user knows that the data they attempted to aggregate is already inside the buffer they hold. Example usage can be seen below...

var buf = aggregator.aggregate(data, locid);
// If 'buf' is not nil, we are in charge of flushing buffer
if buf != nil {
  // Handle asynchronously
  begin on Locales[locid] {
    // 'buf' iteration will handle copying the entirety of the
    // aggregated data to the current locale first or in chunks
    // to be processed.
    forall b in buf {
      process(b);
    }
    // The 'buf' will maintain access to its buffer pool and submit
    // itself to it once it has been finished being used...
    buf.finished();
  }
}

The user can flush the buffers manually, but instead of forcing them to call flush on every single buffer for each locale, we offer an iterator that yields each buffer on the locale it is to be processed on. Example usage can be seen below...

forall buffer in aggregator.flush() {
  forall b in buffer {
    process(b);
  }
}

We can exploit Chapel's standalone iterators to make this extremely easy. Unfortunately, due to the lack of adequate support for first-class functions, flushing individual buffers at the time aggregate is called is not so easy.

API

// Note: This is a privatized class, but we're leaving the record-wrapping out of the API
class Aggregator {
  type dataType;
  var destinationBuffers : [LocaleSpace] Buffer(dataType);
  var bufferPool : BufferPool(dataType);
}
/* 
    Appends 'data' to the output buffer corresponding to Locales[locid]. Returns
    the buffer to be flushed if the current task was the last to fill it, otherwise 'nil'.
*/
proc Aggregator.aggregate(data : dataType, locid : int) : Buffer(dataType);
/*
   Yields a non-empty buffer to be processed on the current locale. 
*/
iter Aggregator.flush(param tag : iterKind) : Buffer(dataType) where tag == iterKind.standalone;

class Buffer {
  type dataType;
  var dom = {0..-1};
  var data : [dom] dataType;
  var bufferPool : BufferPool(dataType);
}
/*
  Adds 'this' back to its buffer pool. It is undefined behavior to use the buffer
  after invoking this method.
*/
proc Buffer.finished();
/*
  Iterates over all data in this buffer. Data is transferred to the current locale either
  as a whole or in chunks to be iterated over, eliding fine-grained communication.
*/
iter Buffer.these();

Open Questions

Should there be automatic flushing of the buffers?

Should we introduce a heuristic for automatic flushing of buffers? One suggestion by @marcinz is to use the 'rate of change' as a heuristic, where we flush a buffer based on how quickly it is being filled; for example, if the buffer is partially filled yet the rate of new additions slows to a halt, that should trigger an automatic flush, but if the buffer is filling up extremely quickly, it will not.
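
One way such a heuristic might look (purely a sketch; the fields and threshold are made up):

// Hypothetical 'rate of change' check, intended to be polled
// periodically, e.g. by a background task per locale.
proc Buffer.shouldAutoFlush(staleThreshold : real) : bool {
  use Time;
  const filled = claimed.read();                  // assumed atomic fill counter
  const idle = getCurrentTime() - lastAppendTime; // assumed timestamp of last append
  // Partially filled, but appends have stalled: flush now rather than wait.
  return filled > 0 && filled < cap && idle > staleThreshold;
}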

Should we allow users to limit the number of buffers being flushed?

One issue I can foresee is that if the buffers are too small, or if processing them takes too long, we may spawn too many tasks and end up out of memory or bottlenecked. This can possibly be avoided by throttling the number of buffers that can be created at once for each locale; once the limit has been exceeded, progress is only guaranteed when a buffer can be found to recycle.
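
A sketch of what such throttling might look like in the buffer pool (hypothetical fields and helper; not the actual implementation):

proc BufferPool.getBuffer() : Buffer(dataType) {
  // Allocate freely while under the per-locale limit...
  if numAllocated.fetchAdd(1) < maxBuffersPerLocale then
    return new Buffer(dataType);
  // ...otherwise undo the increment and wait for a recycled buffer,
  // tying progress to buffers being flushed and returned to the pool.
  numAllocated.sub(1);
  var buf : Buffer(dataType) = nil;
  while buf == nil do
    buf = recycled.tryTake(); // hypothetical non-blocking take from the recycle list
  return buf;
}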

Should the buffers expand in size?

Should the buffers expand in size based on how rapidly they are used and requested? Should we grow by a fixed multiple, such as 2 or 1.5? How should a heuristic for this be designed?

mppf commented 6 years ago

Could you mention communication in the 1st paragraph, "Aggregation", please?

Does aggregator.aggregate always accept 1 item? I expect to be able to pass many data elements to it but I don't see how that happens here. How does (a lot of) data get into the buffers? Could you show us how to write a complete simple example (e.g. something RA-like or something like the histogram benchmark shown in #9782)?

Also tagging @ronawho for connection to histogram, and @e-kayrakli in case he has feedback.

Unfortunately, due to lack of adequate support for first-class functions, flushing of individual buffers when calling aggregate is not so easy.

You should be able to use a function object:

record MyHandler {
  proc this(...) {
     ...
  }
}
The user passes a MyHandler instance, and the "flush" code just calls theHandler(someElement).

LouisJenkinsCS commented 6 years ago

Does aggregator.aggregate always accept 1 item? I expect to be able to pass many data elements to it but I don't see how that happens here.

You would have to pass the data in individually each time. Using the histogram as an example...

var aggregator = new Aggregator(int);
// Due to lack of remote-value forwarding, need to specify `in` intent since privatized
forall r in rindex with (in aggregator) {
  const loc = A.dist.idxToLocale(r);
  var buf = aggregator.aggregate(r, loc.id);
  if buf != nil {
    begin on loc {
      // Guaranteed that indices are local to current node...
      // Maybe could use some kind of optimized local access of array 
      for idx in buf do A[idx].add(1);
      buf.finished();
    }
  }
}
forall buf in aggregator.flush() {
  for idx in buf do A[idx].add(1);
}

You should be able to use a function object:

I have tried this approach, but besides a few bugs I hit while attempting it (I'll file them later for potential improvement), there was one major difficulty I kept coming across: having all privatized instances share the same function object. We wouldn't want every access of the handler to be a wide reference. However, just in case I am imagining it incorrectly, how would you create a function object to handle the histogram case? That is, what would you expect the user to write? How are remote invocations on function objects handled at present; would they incur additional communication?

LouisJenkinsCS commented 6 years ago

I should note that this is the API of a working and functional prototype that I have right now. Perhaps I should have focused on the ideal aggregation library instead of what I have currently implemented.

mppf commented 6 years ago

there was one major difficulty I kept coming across: having all privatized instances share the same function object

I think that's a problem to solve, and it's related to other issues/open questions. If it's a record, you might be able to use the in intent to copy it around. But it doesn't really help matters to say "I can't do this because first-class functions need work," because the first-class functions are actually converted into function objects in the implementation.
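
For instance, a sketch of that suggestion (hypothetical handler; the in intent gives each task its own copy of the record, so no task holds a wide reference to a shared handler):

record MyHandler {
  proc this(elt : int) {
    writeln("processing ", elt);
  }
}

var handler = new MyHandler();
forall i in 1..100 with (in handler) do
  handler(i);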

mppf commented 6 years ago

You would have to pass the data in individually each time. Using the histogram as an example...

Ah, I see now how this works. Can you write draft chpldoc comments for aggregate in your API summary in the PR description?

LouisJenkinsCS commented 6 years ago

But it doesn't really help matters to say "I can't do this because first class functions need work" because actually the first class functions are converted into function objects in the implementation.

I would think I would need a first-class function implementation able to do something like the following...

var aggregateFn = lambda(buf : Buffer(dataType)) : void {
  for idx in buf do A[idx].add(1);
};

Here I would maintain the benefits of a normal forall loop. That is, I would expect remote-value forwarding for the array, with everything else being passed by reference. If I recall correctly, you are not supposed to access variables in the outer scope from lambdas (when I did, I saw internal error: number of actuals does not match number of formals in this() [callDestructors.cpp:1033]), so you would need to find a way to pass A around without performing a by-value copy. My assumption is that to do this with a hand-rolled function object, you would need to write the following...

record HistogramAggregateHandler {
   // Privatized id and privatized instance of 'A'?
   var pid : int;
   var instance;
   // Allocate original function object
   proc init(A) {
      pid = A.pid;
      instance = A.instance;
   }
   // Copied when you copy the aggregate handler 'in'?
   proc init(other) {
      this.pid = other.pid;
      this.instance = other.instance;
   }
   // Indexes into privatized instance
   proc this(buf : Buffer(int)) {
      var arr = chpl_getPrivatizedCopy(instance.type, pid);
      for idx in buf do arr[idx].add(1);
   }
}

It isn't pretty (and I doubt the user would want to ever write a monstrosity like this), but do you think I'm on the right track here?

mppf commented 6 years ago

It isn't pretty (and I doubt the user would want to ever write a monstrosity like this), but do you think I'm on the right track here?

You can write a record inside of a function and methods in the record can refer to outer variables. It's likely that there are bugs, and design questions about this are similar to the first class function case, but at least the problems aren't unique to first class functions.

e.g.

proc test() {
  var x = 1;

  record AddX {
    proc init() { }
    proc this(arg:int) {
      return arg + x;
    }
  }

  var addx = new AddX();
  writeln(addx(2));
}

test();

LouisJenkinsCS commented 6 years ago

In my use cases, I require aggregation to work outside of the current function. That is, I have a privatized class (specifically, my hypergraph and work queue) set up the aggregator, which will be called later on demand, outside of the current scope. Even in a simplified example, accessing class instance fields causes the same error to occur, hence making this the only approach I can think of.

LouisJenkinsCS commented 6 years ago

In any event, maybe the focus should be shifted to the ideal... that is, what would you expect the final, working version to look like, current issues and bugs aside? I would like it to look like this (for Histogram):

var aggregatorFn = lambda(localIndices : Buffer(int)) : void {
   [idx in localIndices] A[idx].add(1);
};
var aggregator = new Aggregator(int, aggregatorFn);
forall r in rindex {
   aggregator.aggregate(r, A[r].locale.id);
}
aggregator.flush();

Would that be sufficient?

LouisJenkinsCS commented 6 years ago

Also, I ran the histogram benchmark at 8 nodes without network atomics, just remote processor atomics, and the current aggregation library is 4-5x faster, although it is in turn 4-5x slower than network atomics. However, since my aggregation library is meant to handle data that needs to be used in normal on statements, that seems adequate.

I doubt that any aggregation library will ever compare to network atomics, so 'non-blocking atomics' are likely the way to go for Histogram.

LouisJenkinsCS commented 6 years ago

Another question I'd like to pose: should there be an additional step where the user can combine and coalesce as much of the data as possible? I'm trying to think of an example usage for the Histogram benchmark...

// Combines multiple aggregate operations into a single operation when possible...
// In this case, we combine multiple increments at once.
var coalesceFn = lambda(locid : int(64), localIndices : Buffer(int)) : Buffer((int, int)) {
   var counters : [A.localSubdomain(Locales[locid])] atomic int(64);
   [idx in localIndices] counters[idx].add(1);
   var newBuf = new Buffer((int, int));
   // Append all non-zero increments to the new buffer, mapping index to count
   for (c, cIdx) in zip(counters, counters.domain) {
      if c.peek() != 0 then newBuf.append((cIdx, c.peek()));
   }
   return newBuf;
};
var aggregatorFn = lambda(localIndices : Buffer((int, int))) : void {
   [(idx, cnt) in localIndices] A[idx].add(cnt);
};
var aggregator = new Aggregator(int, aggregatorFn, coalesceFn);
forall r in rindex {
   aggregator.aggregate(r, A[r].locale);
}
aggregator.flush();

That should speed up remote execution atomics significantly if there is a lot of overlap, and it shows the utility of adding an additional coalescing step. Thoughts? (Or am I talking to the air right now?)

LouisJenkinsCS commented 6 years ago

Just posting more data, because why not. With the coalesceFn (or rather, in the current implementation I perform coalescing on the buffer of aggregated data directly), I see a 10-11x speedup over remote processor atomics, and am only 2-3x slower than naive network atomics.

bradcray commented 6 years ago

I haven't followed this thread in full, but am curious whether, for your aggregated solution, you're using processor atomics? And if so, do you have to, or could you serialize the processing of the aggregated work buffers so that they could use direct, rather than atomic, operations (and would this speed things up)?

LouisJenkinsCS commented 6 years ago

I haven't followed this thread in full, but am curious whether, for your aggregated solution, you're using processor atomics? And if so, do you have to, or could you serialize the processing of the aggregated work buffers so that they could use direct, rather than atomic, operations (and would this speed things up)?

For my aggregated solution we are using processor atomics, but we are handling/dispatching in bulk locally on the target locale. That is, I am using a begin on in which the asynchronous remote tasks perform increments on the parts of the array that are local to the target locale. Network atomics actually speed things up slightly, but not by more than 0.1-0.2 seconds. I'm thinking there is a lot of overlap here, with the table being significantly smaller than the number of updates.

For Histogram, I think you'd see even more speedup if you had destination buffers for each locale but had all increments go to the local buffer, and then reduced them all appropriately at the end. However, that is too implementation-specific for the aggregation library, and it's only applicable when the entire table can be duplicated on each locale. Note that the Histogram's default settings are used:

const N=2000000 * here.maxTaskPar; // number of updates
config const M=1000 * here.maxTaskPar * numLocales; // size of table


It turns out that performance is consistent whether the table is large or small.
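
For illustration, a rough sketch of the "all increments go local, reduce at the end" idea mentioned above (purely hypothetical; it assumes rindex is distributed, and a real version would place each inner table on its own locale):

var counts : [LocaleSpace] [0..#M] atomic int;
// Every update lands in the current locale's private table...
forall r in rindex do
  counts[here.id][r].add(1);
// ...and one reduction pass combines the tables at the end.
forall i in 0..#M do
  A[i].write(+ reduce [l in LocaleSpace] counts[l][i].read());
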
LouisJenkinsCS commented 6 years ago

After thinking it over, I believe it's better to ditch the aggregationHandler and coalescingHandler, shed the need for any and all first-class functions, and just stick with returning buffers. It's actually a lot cleaner to use buffers than to argue over the semantics of how and when some function gets called, clarify preconditions and postconditions of how data is handled, etc. We can give more control to the user this way...

  proc handleBuffer(buf : Buffer(int), loc : locale, array) {
    var cnt : [array.domain.localSubdomain(loc)] int(64);
    for idx in buf do cnt[idx] += 1;
    buf.finished();
    on loc {
      var localCnt = cnt;
      for (c, idx) in zip(localCnt, localCnt.domain) do if c > 0 then array[idx].add(c);
    }
  }
  var agg = new Aggregator(int);
  sync forall r in rindex {
    const loc = A[r].locale;
    if loc == here {
      A[r].add(1);
    } else {
      var buf = agg.aggregate(r, loc.id);
      if buf != nil then begin handleBuffer(buf, loc, A);
    }
  }
  forall buf in agg.flush() {
    handleBuffer(buf, here, A);
  }

While it may not be the prettiest, it does show the ideal I'm going for as an end result. The above shows histogram with both coalescing and aggregation, requiring no special library support, just good old-fashioned user ingenuity...

Performance results using default settings listed above...

Histogram (RA) Time: 144.119
Histogram (NA) Time: 4.99902
Histogram-Aggregated Time: 5.16833

Edit: I had forgotten to perform local operations directly and ended up aggregating them; with that fixed, performance is almost equivalent to network atomic RDMA! Edit 2: Revised/refactored more so that there is code reuse.

LouisJenkinsCS commented 6 years ago

I was wondering if an interface could be devised so that the user could create their own aggregation handlers, while still receiving the buffers directly if they choose not to...

record AggregationHandler {
   // Called to process full buffers
   proc this(buf : Buffer(?dataType), loc : locale) {
      // ...
   }
   // Called to replicate on each locale for privatization
   proc clone() : this.type {
      // ...
   }
}

Then the user can decide whether or not they wish to bother with an aggregation handler. Also, unlike before, the handler will not be called on the destination locale; instead it will be called on the source locale, and the user decides how to dispatch/process on the target/destination locale. I realize the user may or may not want to spawn an active message on the target locale; maybe they want to utilize network atomics? Maybe they want to coalesce the data before sending it?

So if the user specifies an aggregation handler, then flushing can be automated, and we gain more complex ways to handle this kind of thing, like automatic flushing.
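
Usage might then look something like this (a sketch of the proposed interface, not working code; it leans on the outer-variable access discussed above):

record HistogramHandler {
  // Called on the source locale with a full buffer; the user decides how
  // to dispatch: an active message, network atomics, coalescing, etc.
  proc this(buf : Buffer(int), loc : locale) {
    on loc do
      for idx in buf do A[idx].add(1);
  }
  proc clone() : this.type {
    return new HistogramHandler();
  }
}

var aggregator = new Aggregator(int, new HistogramHandler());
forall r in rindex do
  aggregator.aggregate(r, A[r].locale.id); // full buffers flushed automatically
aggregator.flush();                        // drain any partially-filled buffers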

LouisJenkinsCS commented 6 years ago

So after disabling network atomics, I saw a significant performance improvement for aggregation...

Histogram (Fast Execute On) Time: 209.441
Histogram-Aggregated Time: 1.86789

This actually beats normal network atomics. It's a good thing @ronawho stated that high-contention blocking atomics are significantly slower in issue #10551.

ronawho commented 6 years ago

It's a good thing @ronawho stated that high-contention blocking atomics are significantly slower in issue #10551

Just to be clear -- this is only for atomics that are predominantly used on a single node. For that case, highly contended network atomics are currently only 2-3x slower than processor atomics; it's serial or minimally contended network atomics that are significantly slower than processor atomics.

And if the atomics are actually modified by remote nodes, network atomics are way faster regardless of contention levels.

LouisJenkinsCS commented 6 years ago

[Graph: histogram times for RA, NA, and aggregated versions by locale count]

| # Locales | RA | NA | Aggregated |
|---|---|---|---|
| 1 | 0.31487 | 0.359008 | 0.407164 |
| 2 | 45.8369 | 4.86839 | 10.1115 |
| 4 | 71.1999 | 5.35409 | 5.984 |
| 8 | 122.32 | 6.91782 | 3.77877 |
| 16 | 151.091 | 5.53941 | 2.41778 |
| 32 | 165.3 | 6.8025 | 1.88858 |

Thanks to @ronawho: by using chpl__processorAtomicType() I have significantly sped things up. It's much faster and it even scales.

LouisJenkinsCS commented 6 years ago

And even better for GASNet!

[Graph: histogram times for fast execute_on vs. aggregated under GASNet]

| # Locales | Fast execute_on | Aggregated |
|---|---|---|
| 1 | 0.349567 | 0.417695 |
| 2 | 290.77 | 10.034 |
| 4 | 447.792 | 5.40902 |
| 8 | 530.266 | 3.52255 |

I know uGNI is supposed to be the better communication layer, but this library definitely shows massive improvement for those using non-Cray systems. We're looking at at least two orders of magnitude of speedup. A library like this would definitely be useful for Chapel users.

Edit:

The reason we only go up to 8 is because GASNet crashes on the Histogram loop...

forall r in rindex {
   A[r].add(1);
}
LouisJenkinsCS commented 6 years ago

@mppf I'm just gonna ask this: do you think we could get a library like this accepted as an actual 'package' rather than shipping it via Mason? I'm willing to put my CHGL (Chapel HyperGraph Library) on Mason since its utility is very limited, but a data aggregation and coalescing library could be super helpful, I'd think, especially for those running GASNet, but also for those running uGNI. Also tagging @bradcray for when you get back from vacation.

mppf commented 6 years ago

@LouisJenkinsCS - I'm open to adding it to the Chapel repo in modules/packages. I think it's reasonable to include it in the release; we'd just be putting it there b/c we're not yet sure if it has a stable API etc etc.

bradcray commented 6 years ago

I'm open to adding it to the Chapel repo in modules/packages. I think it's reasonable to include it in the release; we'd just be putting it there b/c we're not yet sure if it has a stable API etc etc.

[Tagging @ben-albrecht and @Spartee on this (as mason leads) just to keep them in the loop.]

As I've said in other contexts, I think we should be moving towards pushing all of modules/packages out into mason-controlled territory: to put more weight on mason, to get out of the business of reviewing PRs for package modules and taking on their implied maintenance for all time, and to help mason make the transition from being considered second-class in some way (as Louis suggests above). If we were already there, I'd say we should definitely do this for this library as well. Since we're not, I'm OK with putting it into the repo for now (assuming it's ready for that... I haven't had the chance to review this thread at all). But I'd also be very happy if this library instead chose to be the / a poster child for the new world of mason-managed packages.

mppf commented 6 years ago

Regarding mason-package-or-not, there are a bunch of questions besides the testing question in issue #10559. In particular, which mason packages are distributed with the release? Are some mason packages (e.g. those distributed with the release) managed similarly to Chapel master, with chapel-lang github repos, code reviews, the Apache 2 license, and contributor agreements? Are some considered "test code" that we update as we make language changes, in the same way I'd update modules/packages today? Are some subject to nightly testing?

All this makes me think we need an intermediate strategy, say, putting certain packages in chapel-lang/ repositories and including them in nightly testing & possibly in releases.

A related question is how to approach standard module stability. If something is to be distributed with the release, is it managed differently? E.g. the Sort module isn't considered API stable, so is in modules/packages rather than modules/standard. Should it turn into a mason package? Or should we use a different strategy to communicate standard module API stability?

Anyway, I think the question from @LouisJenkinsCS came from hoping to create something that can be included in Chapel releases, but I can't speak for him.

LouisJenkinsCS commented 6 years ago

I agree with @mppf here in that I would definitely like to take the intermediate short term solution and have it integrated as a package module or in an official Chapel repository.

I will say that in the future, say post-release, I wouldn't mind it being moved to a Mason package if it doesn't suffice as an actual standard package module, and I'd be willing to work with whoever to improve the Mason package manager enough that Mason becomes more stable and visible.

I'll also say that in the long run, I would definitely like to move whatever libraries I have written, such as my distributed data structures, to Mason, as that makes it easier to contribute and make changes (PRs can be exhausting to write compared to just committing upstream).

LouisJenkinsCS commented 6 years ago

Also, in terms of visibility, I was wondering if Chapel's website could have a specific section for Mason packages? Maybe it could contain a title, a brief summary, links to the repository's documentation and source code, and a way to install it... like the example below...

Distributed Data Structures

Built-in data structures are a necessity for any budding language, and in a language where distributed computing is at its core, data structures that can properly be maintained across clusters are desired. For my project, I have designed the core framework for a distributed data structures library and implemented two novel scalable distributed data structures, an ordered deque and an unordered multiset, that exceed a naive implementation by at least two orders of magnitude on a moderately sized cluster.

Repository: https://github.com/LouisJenkinsCS/Distributed-Data-Structures
Documentation: https://louisjenkinscs.github.io/Distributed-Data-Structures/

Install this package with mason:

mason install distributed-data-structures

Build and run package's examples:

mason build --example distributed-data-structures
mason run --example distributed-data-structures

Edit: This may also be very off-topic.

bradcray commented 6 years ago

w.r.t. Michael's module policy questions: @ben-albrecht and I chatted at length today about how we might manage modules in the near-term / distant future, and he's going to open some new issues to capture that discussion and focus on specific items outside of this sprawling issue. Similarly, I think Louis's latest comment should also get a new feature-request issue rather than being tacked on here (philosophically, I'd much rather have a dozen focused issues that can be commented on or closed independently than one that sprawls across a dozen subjects, making it time-consuming to digest and difficult to ever close).

ben-albrecht commented 6 years ago

See #10712 and #10713 for continued discussion on the topics of how modules should be adopted into the standard library and how we will facilitate the migration of package modules to mason packages, respectively.

LouisJenkinsCS commented 6 years ago

Edit: Made a mistake, this is weak scaling, not strong scaling, but performance does not differ much.

@mppf

You may find this interesting; I certainly do, and find it rather exciting, because it means we can improve CAL even more than it is right now. Currently, as a Chapel module, I am confident that there isn't much else we can do without integrating it into the runtime (which I believe is the next step). Here is the graph, with the number of locales on the X-axis, the number of threads per locale on the Y-axis, and operations per second on the Z-axis...

[3D surface graph: operations per second by locale count (X-axis) and threads per locale (Y-axis)]

We see some nice scaling from 1 to 32 locales and from 1 to 32 cores... but it isn't anywhere near as efficient as it could be. Here is the data for one locale...

| # Locales | Threads | Ops/sec |
|---|---|---|
| 1 | 1 | 2.33619e+07 |
| 1 | 2 | 6.34626e+06 |
| 1 | 4 | 1.11624e+07 |
| 1 | 8 | 9.9836e+06 |
| 1 | 16 | 9.38655e+06 |
| 1 | 32 | 7.69269e+06 |

[Graph: single-locale throughput by thread count]

See that? It doesn't strong-scale well due to contention on the two fetch-add counters, and you really can't do much better than two fetch-add counters. Conversely, the more locales you have, the less likely contention on those fetch-add counters becomes, so performance improves as expected.

Benchmark code:

coforall loc in Locales do on loc {
  coforall tid in 1..here.maxTaskPar {
    for ix in 1..numOperations / here.maxTaskPar / numLocales {
      var buf = aggregator.aggregate(ix, ix % numLocales);
      if buf != nil then buf.finished();
    }
  }
}

(Yes, I do take into account that numOperations / here.maxTaskPar / numLocales results in fewer overall operations than numOperations. In these runs I have numOperations = 1024 * 1024 * 1024.)

I believe that with thread-specific buffers the performance will improve significantly; see the sketch below. Any thoughts, @mppf?
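
For example, the destination buffers could gain a task dimension (a sketch; it assumes tasks can be assigned stable ids):

// Per-locale, per-task buffers: each task appends only to its own buffer,
// so the append path needs no shared fetch-add counters; a flush must
// then scan all of a destination's task buffers.
var buffers : [LocaleSpace] [0..#here.maxTaskPar] Buffer(dataType);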

Edit:

Real strong scaling graph at 1 locale...

[Graph: strong scaling at 1 locale]

LouisJenkinsCS commented 6 years ago

I believe a large part of the performance drop at one locale can be fixed once issue #10771 is addressed, as it clearly is false sharing. Right now at one locale it shows pure contention, even on two fetch-add counters. In the issue, I show that performance drops when you perform a relaxed add on two counters at the same time... and that padding gets compiled away somehow?