chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.8k stars 421 forks source link

GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables #6663

Open LouisJenkinsCS opened 7 years ago

LouisJenkinsCS commented 7 years ago

Global Atomic Object

(PR available: #6717 )...

Details

Below, I go over some details that are present in the current implementation

Pointer Compression

This is the common case, but certainly not a novel concept in and of itself nor to Chapel itself, however it is sufficient for most cases where the number of locales fit within 16 bits. It takes advantage of the fact that all known architectures and operating systems only use 48 bits of the virtual address space, meaning we are free to embed the 16 bits of the locale id in the upper 16 bits. This means for 65K locales, we get the benefit of using atomics for a single 64-bit atomic operation, making atomics extremely fast.

Descriptor Tables

This could be a new design pattern really, but the idea, while novel in its application, is simple in theory. We maintain a table of objects, one for each locale, of which we use the index and locale id as a descriptor (a pseudo-pointer). When we want to atomically operate using a class instance, we first hash the class instance (it's 64-bit address) and perform an insertIfNotPresent operation on the target locale, and then embed the locale id in the upper 32 bits, and the index into the descriptor table into the lower 32-bits. Compression is the most expensive part, as it involves doing the above every time (although see improvements for a huge improvement), but decompression is cheap in that it can easily directly index into the descriptor table.

Performance

I compare GlobalAtomicObject (using descriptor tables) performance against 'Sync Var', the all-mighty sync variable, which currently gets a handicap by just acquiring the lock, perform some arbitrary cheap operation, then immediately release it. This is the 'baseline' which we must beat or else the implementation is useless. The next is Atomic on a 64-bit value, which is the 'ceiling', and can be seen as how much performance we can get for normal pointer compression versus keeping tables like this. It has two benchmarks: 'single' which is a single atomic operation, and 'LLSC' which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like such...

var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));

There is no retrying, just a one-and-done atomic operation (which I count as one). In 'Global Atomic', we use similar atomic operations as 'Atomic' does, although a few things need to be considered. While for 'Single', it performs a write (because a read is not interesting since it is a basic lookup in a locale's table), for data in local memory, which shows the ideal. Now 'LLSC' can perform operations for data that is remote, meaning that each time, its possible that during compression of the data, we may need to jump to other locales. It shows a more average case, but not the worse case when data to compareExchangeStrong is both remote. Currently, 'LLSC' for GlobalAtomicObject demonstrates 3 atomic operations: the read which is a simple lookup, and compression of both variables passed to compareExchangeStrong which one is local, the other remote.

It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed, it is still manageable and better than a sync variable is. It should also be emphasized that the runtime of GlobalAtomicObject with pointer compression is equivalent to Atomic.

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.

Time

image

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly beats anything else, although in terms of atomic LL/SC, it does lag enough to demonstrate the power of global atomics. A single global atomic, while is magnitudes slower than a single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if we had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), of which a single Global Atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that placed contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.

Operations Per Second

image

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely as they shadowed over everything else to the point that the performance boost wouldn't be noticeable. As can be seen however, both a single global atomic operation, and even the LL/SC combo operation with a potential remote reference (meaning an added on statement) beats a normal sync variable. In every case, GlobalAtomicObject wins outright, but performance penalties from remote references are very apparent and should be avoided at all costs.

Usage

Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).

class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...

GlobalAtomicObject(C) can easily become atomic C. If the language were to be modified, it wouldn't be too surprising to see the above become the below (which of course lacks the excessive comments), which is an exemplary ideal that @mppf insisted on:

class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}

Where delete can call some hook which also removes it from the table if it has not yet been reclaimed. I believe this can become a reality with little effort.

Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

Pointer Compression + Descriptor Tables

It may be possible to combine normal pointer compression for the lower locales that fit within 16 bits. Combining both means we allow better performance on average, even for cases when they have more than 16 bits worth of locales, any locales with an id less than 2^16 get the power of atomics, while the ones greater than or equal to 2^16 get to use the much-slower-but-faster-than-sync-variables descriptors.

See next idea for how to differentiate.

Zone Compression

A theoretical model where we use a more dynamic type of compression which attempts to uses the lowest 32 bits for the virtual address, and divide the upper 32 bits into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by zone id, which is allocated when a new unique id is found. With optimizations such as caching the zone mappings for other locales (I.E: First time a unique upper 32 bits are found, check if we have it cached, if not found jump to the owning locale and add it if not already, and update what we have cached, etc.)

The benefit here is that we get to keep using 64-bit pointer compression, and the amount of unique zones that can be kept track of depend on the number of bits needed to keep track of the locale id.

"What happens when we run out of zones?" - We use the descriptor tables again. We can just search the unique zones for the target locale first, if the zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential maximum segments for descriptor tables, but it allows a much more dynamic compression strategy.

Caching Descriptors into Objects

This is a change that isn't theoretical but requires changes with the language itself. Majority of the overhead of descriptor tables is hashing the object, which requires jumping to the owning locales, but if the descriptors were cached then repeated accesses would be just about as fast as atomics. If it were cached, the benefits of recycling memory to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).

Concurrent-Safe Implementation

The table utilizes a simplistic skip list implementation on a segmented memory pool. The memory pool allows indexing by allowing an easy translation from a 32 bit index into a segment and segment data index. It utilizes a simple bitmap (BigInteger). The implementation of the memory pool allows wait-free reads (hence indexing is fine) but requires synchronization to recycle/reuse memory. The skip list used to manage memory also requires synchronization, and a lock-free/wait-free implementation requires proper memory management such as Hazard Pointers which requires thread-local or task-local storage. Hence, synchronization is needed and we use a simple local sync variable.

If someone managed to make everything concurrent, it could improve performance further by 5 - 10x at least.

Edit:

Removed Introduction to avoid further derailment of discussion.

bradcray commented 7 years ago

In any budding language, especially one that boasts to be a productive concurrent/parallel language, atomics on all types are a necessity.

This statement strikes me as being a naive oversimplification. Is an atomic distributed array a necessity for any language that supports distributed arrays? I don't think so, and am not even sure what that would mean.

LouisJenkinsCS commented 7 years ago

Is an atomic distributed array a necessity for any language that supports distributed arrays?

The emphasis would be on productivity. "What can I do if X were atomic compared to if it were not". If X is a distributed array, what do we gain? Each element in a distributed array can be updated atomically as is, so perhaps not. The distributed array could be made atomic by allowing concurrent access while resizing, which is a huge plus and would further enable productivity, sure. Would I say that this is a necessity to allowing the user to be more productive? Arguably. However, that's not the case here.

Since X is a class instance, the amount of productivity gained by making it atomic is limited to the stretch of the imagination. You can:

1) Implement more complex algorithms and data structures, like scalable wait-free algorithms 2) Allows any local algorithm to be converted into a distributed algorithm 3) Literally becomes a universal construct that allows us to make any algorithm become distributed (RCU like in the case of atomic distributed arrays, STM, garbage collection, what-have-you)

Is the above a necessity? Absolutely. However unlike atomic distributed arrays, I have presented an actual solution which works and scales.

Edit:

For more clarity: Atomics for Class Instances is a necessity, but let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance. If you wish to atomically update it, imagine the follow:

var globalData : atomic Obj;

// Read...
var current = globalData.read();

// Copy...
var modified = clone(current);
modified.field = data;

// Update...
globalData.compareAndSwap(current, modified);

You have potential RCU. Of course the next actual difficulty would be memory reclamation, but the point is that it would have been the same whether the algorithm was distributed or not.

Furthermore, if it sounded like I was bashing Chapel, I'm not, I wish to see it improve and as such I am offering suggestions to its problems.

bradcray commented 7 years ago

I think you're misinterpreting my comment. I don't have any objections to supporting atomic classes in Chapel, and have long believed that we should do so. My objection was with the opening blanket statement which I read as "any new language that has type X must necessarily have atomic type X." And I simply don't believe that to be true (and used "distributed array" as my example X). As a result of my reaction to the statement (+ time constraints), I rolled my eyes and stopped reading there rather than being intrigued and continuing on. Then eventually I came back to comment on my reaction.

To me, a more believable, less absolutist, and less objectionable statement would be something like "Any budding language, especially one that strives to be a productive concurrent/parallel language, should consider for each supported type T whether there would also be value in supporting an atomic variant of T." This statement seems far harder to argue with to me.

LouisJenkinsCS commented 7 years ago

Understood, it makes sense that it made a broad generalization, but I didn't expect it to have the kind of effect which would cause people to skip over it from the first sentence. I revised it so that it didn't generalize to any type, but to a specific type of data.

bradcray commented 7 years ago

Thanks!

To make sure I'm understanding the following:

let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance.

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

LouisJenkinsCS commented 7 years ago

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

Technically, it can yes. The semantics would change (requires RCU or similar updates), performance may drop significantly, but it is possible to emulate atomic T with atomic C_T with effort (even if it is a lot). The point I was trying to make was that it necessary in that it made such things possible (even if they weren't optimal), but that was before you informed me the issue was because my generalization (of which I may have done again).

mppf commented 7 years ago

@gbtitus and @ronawho are planning to look at this proposal. (and me too, but I've already looked at it enough to be comfortable).

LouisJenkinsCS commented 7 years ago

@mppf

I had another idea in terms of how it can be implemented into the language itself. What if you had one global table (not coupled to a GlobalAtomicObject, but initialized first thing at runtime) that was managed across all nodes. Management of the objects (adding, propagating changes across nodes, removing them when the object is disposed of with delete, etc) is up to you (and I'm sure you guys can think up much better solutions), but this allows us to use the whole 64 bits. Embedding the 64-bit descriptor number (index into the table) into the object is still a valid optimization here, and we allow 2^64 potential objects. The compiler can properly instrument read, write, etc.

Boom, you've got your global descriptor table. The low level details would be beyond me though.

mppf commented 7 years ago

@LouisJenkinsCS - I'm not sure what to interpret from your other idea other than that we could come up with a completely different solution from what you created, but I already knew that. But I'm probably missing something. What would be the advantage of having one global descriptor table?

gbtitus commented 7 years ago

@LouisJenkinsCS: For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context. Basically the need is for AMOs on global pointers which conceptually are larger than 64 bits (and we don't have 128-bit network AMOs), and the specific operations we need are just Load, Store, Swap, and Compare-and-Swap? Is that correct?

LouisJenkinsCS commented 7 years ago

@mppf

Allow me to contrast having one single global descriptor table to having one coupled to GlobalAtomicObject (or each atomic Obj).

1) It makes caching the descriptor possible in O(1) space, as it can be added directly to the object's header the first time it is used in any atomic operation (via checking if its cached descriptor is set). If you have a unique table coupled to each atomic Obj, then the cached descriptor is specific to each instance of atomic Obj. 2) There can be one table to rule them all, in that the table could hold 2^64 arbitrary wide_ptr_t data, and have it store any kind of object. Currently, you'd need one descriptor table per atomic object, and even if you had one table per runtime type, a lot of space will be wasted. This can be done in just one extra level of indirection. 3) atomic Obj can become just an arbitrary descriptor index.

Edit:

To backup whether this is more than enough: as you stated, an object must take up at least 32 bytes, so we're talking about 590.29581 exabytes worth of data before space becomes an issue.

LouisJenkinsCS commented 7 years ago

@gbtitus Yes, originally I had stated that in the removed introduction, and yes we only need those operations.

mppf commented 7 years ago

@LouisJenkinsCS - I think using a single global table is interesting as an improvement. Instead of asking us to solve all the problems - can you describe what operations should happen when to support such a thing. What would you need the compiler to do? What does the runtime do? I would expect as we integrate more with the compiler/runtime you'll need help to do so.

However I think we should view it as an improvement/optimization and try to get a 1st version merged. I.e. I think we should discuss the global version later and in a different issue. For the purposes of this issue, I think it would nice to simply say why it is harder (what are the challenges?).

gbtitus commented 7 years ago

I feel like I must be missing something obvious. One solution would seem to be to implement those AMOs by simply doing an on-stmt to the node where the AMO needs to be done and then doing it as a 128-bit processor AMO (either directly if there is CPU support, or indirectly using the x86 128-bit compare/exchange instruction CMPXCHG16B). In the current implementation this would take 2 network round trips per AMO, all things considered. What are the advantages of the proposed solutions over doing that?

LouisJenkinsCS commented 7 years ago

@gbtitus

The issue would be atomically updating them. What happens when you have another atomic operation begin while another is ongoing? (Misread what you meant by doing an on-statement, yes that would work if there was hardware support). In terms of the current implementation, what it does is allow us to have an immutable pseudo-reference to a wide pointer that we know won't change and can be represented by a normal 64-bit integer. I'm not sure how CMPXCHG16B is implemented, can you fill me in on the specifics? Does it do a double LL/SC that causes many operations to retry if they fail, or does it atomically operate on whole 128-bit data words? Regardless however, not all hardware have support for such operations and this would be a software implementation to fallback on.

LouisJenkinsCS commented 7 years ago

@mppf

With the current version, we can do the following...

1) Implement atomic C as new GlobalAtomicObject(C). 2) Add primitives such as __primitive("get cached descriptor", this, obj) and __primitive("set cached descriptor", this, obj), which can potentially map the current GlobalAtomicObject to the object's descriptor.

Currently, that's all that would be needed to get a nice efficient implementation going.

gbtitus commented 7 years ago

CMPXCHG16B is a 16-byte atomic compare-exchange. The LOCK prefix is needed in multicore situations. It's true that this operation is not necessarily present on all architectures -- it's not implemented on early AMD x86_64 and there may well be other processor architectures that don't have it at all. On those one would need to do something else.

However, I'd argue that the "something else" should not do any network ops. In other words, no matter what actual mechanism we use for the pointer updates, it should only involve local operations, because network communication is very costly. I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

That still leaves a lot of maneuvering room for implementation, though. For one thing, we might be able to go a bit further with straight pointer compression. If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. (I did say it was a blunt tool!) Speaking of which, in your comparison with sync var protection at the top, where (which node) were the sync vars involved, and was there one per global pointer, or just one for everything, or something else? Would it be possible for you to post your scaling comparison code(s)?

bradcray commented 7 years ago

For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context.

I also got confused by trying to read the evolving document -- started last week, couldn't find my place this week.

LouisJenkinsCS commented 7 years ago

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?

However, I'd argue that the "something else" should not do any network ops.

Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.

I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. The other operation is using network atomics to set the 64-bit atomic field, which is also unavoidable. A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).

If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.

Would it be possible for you to post your scaling comparison code(s)?

Benchmark.

gbtitus commented 7 years ago

I'm going to respond to the parts of this separately, since I suspect some will be dealt with and disappear rapidly while others lead on to further conversation ...

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?

No, it's an octoword compare-exchange, that is, it operates on an single aligned sequence of 16 bytes, not on two separate 8-byte quadwords.

mppf commented 7 years ago

@gbtitus - I think it'd be reasonable to do an on with a CMPXCHG16B if CPU support is available. I view the approach @LouisJenkinsCS is trying here as the fall-back - in case 128-bit processor atomics are not available. Additionally I can't predict if 64-bit compare-and-swap but extra work on creating/destroying a instance for the atomic object would be faster for a given application.

A separate concern is that some algorithms might want to use a generation number in the other 8 bytes, so CMPXCHG16B could protect against the ABA problem.

@LouisJenkinsCS - I'm pretty sure CMPXCHG16B only updates contiguous 16 bytes.

mppf commented 7 years ago

@gbtitus - I'm not so sure the descriptor table results in more communication. @LouisJenkinsCS - please correct me if I am misunderstanding the strategy here.

In particular a typical pattern with compare-and-swap for objects is something like this:

e.g. updating a single-linked list in a lock-free manner.

Since the creation of the object occurs on the same place doing the compare-and-swap, the descriptor table that is updated is the one that is local to that node. Updating the descriptor table gives a 64-bit identifier for the class instance, which is then CAS'd in.

Of course, if another node goes to use the 64-bit identifier (say by reading the atomic class instance and then dereferencing it), the (remote) descriptor table will need to be consulted to get the wide pointer corresponding to that identifier.

But, if a node is dereferencing a remote class instance, there will be communication anyway. (Unless some caching strategy is used, say --cache-remote, but that could also cache the remote descriptor table).

gbtitus commented 7 years ago

However, I'd argue that the "something else" should not do any network ops.

Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.

Local to the place where the memory is being updated. This does get confusing, because we have two notions of locality. One is the node where the pointer points, that is, the node part of the global pointer. The other is the node where the global pointer resides, that is, where it lies in memory. In this case I'm talking about the latter. (I got confused about this myself in another part of the conversation, which I'll address in a few minutes.) I'm saying that when we update a memory location containing a global pointer, it should be sufficient to just do an on-stmt to the locale where that memory location is and then change it using only local operations, no network communication. And I think you'd say that a Descriptor Table-based solution actually obeys this rule because even though the table access conceptually involves network communication, in practice by caching table entries we can avoid the network.

gbtitus commented 7 years ago

I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. ...

Got it, I hadn't snapped to this being the purpose of the caching.

... A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).

Yeah, an executeOn (the runtime moral equivalent of a Chapel on-stmt) can't even get close to a network AMO except in very specialized circumstances, and certainly can't ever match or beat it.

LouisJenkinsCS commented 7 years ago

One thing I'd need to cleared up here is this: Are we talking about the current implementation, or the ideal implementation?

In both cases... @mppf That is correct.

gbtitus commented 7 years ago

If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.

Sure, depending on the definition of "soon", heh. But keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO. If that's getting more common faster than large-scale systems are threatening to grow over 64k nodes (so straight pointer compression can be used), then it's the way to go. Plus it's not actually 64k-node systems we have to worry about, rather it's 64k-node distributed Chapel programs, and other scalability constraints may prevent that for a while anyway. So many engineering tradeoffs ...

gbtitus commented 7 years ago

I said:

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. ...

This is nonsense. I confused the node in the global pointer with the node on which the global pointer was being updated. In general if I'm updating a field containing a global pointer, the node numbers in the old and new values of that field are not the same and I can't do what I described. So, never mind this idea!

LouisJenkinsCS commented 7 years ago

keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO

Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like read16b, write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does a CMPXCHG16B, would it read what was there before CMPXCHG16B finished, after, or in between?

gbtitus commented 7 years ago

Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like read16b, write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does a CMPXCHG16B, would it read what was there before CMPXCHG16B finished, after, or in between?

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

LouisJenkinsCS commented 7 years ago

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

That definitely is something I would like to test. (Note I am assuming that it is implemented in the hardware in such a way that resembles a normal compare-and-swap on 8 bytes) That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those on statements would result in a massive performance loss. Even normally wait-free atomic operations like write and exchange just become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.

However, if the x86 LOCK prefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?

gbtitus commented 7 years ago

Would it be possible for you to post your scaling comparison code(s)?

Benchmark.

The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything. Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update. This would scale. I'd believe you if you argued that it wouldn't scale as well as solutions using network AMOs could. Nevertheless, when comparing to an alternative technique you don't like, it's always more convincing to show that alternative in the best possible light, not the worst. If you could do this and rerun, that would be great to see.

Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done. I think instead you'd need something more like this:

const x = atomicValue.read();
while !atomicValue.compareExchangeStrong(x, x + 1) do chpl_task_yield();

Basically, this runs until comparison succeeds and the exchange occurs. (The yield is there to avoid starvation, if some other task needs to run in order to create the situation that allows you to make progress.)

LouisJenkinsCS commented 7 years ago

The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything.

You are definitely right, the syncOp benchmark uses a very naive implementation, but I also gave it a handicap as well. I was going to compare and contrast the overhead of a read, write, exchange, and compareExchange for sync, but any performance difference would be negligible as 99% of the overhead is from acquiring a remote lock. Ignore what is going inside of the lock, just the fact that a lock is acquired to do something.

Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update.

The field is located on one locale, so in this case fine-grained locking won't work. As well, all examples here are rather simplistic in nature, but you do raise a valid point on performing an on statement on the node that owns it, which would definitely be better than remotely acquiring it, its just that the lock in this case is protecting one piece of data, and cannot be made any more granular than that.

Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done.

See the 'Performance' section in the original post. I chose to avoid adding contention for CAS retrying because I know that a CAS retry loop won't scale. In fact, I'd argue that a sync variable can scale better than a 'CAS Retry AMO` can due to the fact that the contention would cause a significant decline in performance as we add more tasks.

Clarification

In summary: I will rerun the benchmark with LL/SC being a retry-loop, and have syncOp use an on statement.

LouisJenkinsCS commented 7 years ago

@gbtitus

Actually, perhaps I could construct a better benchmark, one in which we fairly compare both sync and GlobalAtomicObject and leave normal atomic out in general (as it has to do with atomic operations using class instances). Would you agree to that as a better benchmark? I'm thinking of constructing a list... It goes like this...

Global Atomic Object List

var head : GlobalAtomicObject(node);
var tail : GlobalAtomicObject(node);
var ourNode = new node();
var tailNode = tail.exchange(ourNode);
if tailNode == nil {
   head.write(ourNode);
} else {
   tailNode.next = ourNode;
}

Sync Variable List

var lock$ : sync bool;
var tail : node;
var head : node;
lock$ = true;
var n = new node();
if tail == nil { 
   tail = n;
   head = n;
} else {
   tail.next = n;
   tail = n;
}
lock$ = false;

In this case its fair in that they do roughly the same (although the GlobalAtomicObject isn't guaranteed to have everything be visible immediately).

Results (Time:

NumLocales Global Atomic List Remote Sync List On-Stmt Sync List
1 0.111723 1.10026 N/A
4 13.1698 106.642 76.1085

Any more and it would take far too long to be worth it. In this case, GlobalAtomicObject wins, but that's because the algorithm for the list wait-free and doesn't care about when nodes become visible, but again its meant to demonstrate raw atomic operations here. Code here. I'd like to emphasize again that I only recommend usage of global atomics for things that are wait-free or bounded in terms of # of retrying. Of course, if need be one can employ the "fast-path slow-path methodology - slidepaper"

Edit:

The performance boost in GlobalAtomicObject is likely due to switching from using Chapel arrays to tuples.. which is actually fascinating!

gbtitus commented 7 years ago

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

That definitely is something I would like to test. ... That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those on statements would result in a massive performance loss.

Just a quick note in case there's been confusion: it wouldn't necessarily be required that we do full Chapel-level on-stmts to implement this stuff using CMPXCHG16B. It could be done with direct runtime support for a low-level equivalent ExecuteOn which would have much lower overheads (and wouldn't need a full-blown Chapel task on the remote side). So, I think it's fair to interpret any use of the term "on-stmt" in here as referring to some kind of remote execution but not necessarily a full Chapel on. That said, the Active Message handling in support of the remote execution would still be required, and that's the largest cost difference between a network AMO, and a processor AMO initiated remotely.

Even normally wait-free atomic operations like write and exchange just become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.

"Absolutely overwhelming" seems hyperbolic. Here's why. I believe that Read, Write, Exchange, and in most usage even Compare/Exchange itself (e.g. when used for pointer swinging) can be implemented using remote execution of a processor AMO CMPXCHG such that the CAS retries are done locally on the remote node, not over the network. Now, either the field we're updating is contended or it is not. If it is contended then even if we arrange (by a descriptor table, say) to only need 64-bit CMPXCHG we'd still better not do our CAS retries over the network because network refs are 1000x slower, recall. So if the update is contended we want to use remote execution and processor atomics even if we do have 16-byte network AMOs. If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs. Nevertheless, the node won't be "overwhelmed", not by an un-contended update. Un-contended updates can't overwhelm anything.

(Just to cover all the bases: if remote execution of processor atomics is 4x slower than network atomics then granted there is a range of rates for inbound un-contended updates such that network atomics can do them at full speed while remote execution and processor atomics will be throttled. But such update rates can't be reached without really serious workload imbalance, for example all the nodes pounding on one of them, and I'd argue that this is a special case we need not design for.)

However, if the x86 LOCK prefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?

You need the LOCK prefix for any x86 CMPXCHG atomic op instruction if you want it to be safe in a multicore setting. As far as I can tell the CMPXCHG16B instruction works exactly like every other x86 atomic op in this regard. There's nothing special about it.

LouisJenkinsCS commented 7 years ago

If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs.

You've convinced me that in most cases your approach to using hardware atomic operations would be the best general solution, but it has me wondering if the user should have an option to pragma that a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster. Just an idea.

CMPXCHG16B instruction works exactly like every other x86 atomic op

Gotcha, I definitely should have done some more research first. That further dissolves any doubt I have.

gbtitus commented 7 years ago

All in all I think at this point I might favor something based on compressed pointers, myself. Even when we have CMPXCHG16B the remote execution needed to use it is no better than 4x slower than network AMOs and in real-world use may be much worse than that. The compressed pointer solution wins out over the descriptor table solution for me due to avoiding (as I understand it) the need to do anything at all when a class instance is created or destroyed, and the complication of caching table entries. In fact, really all it does is change our wide pointer representation from this (as expressed in C):

struct {
  uint64_t node;
  uint64_t ptr;    // cast to type* to use
};

to this:

struct {
  unsigned int node : 16;
  unsigned int ptr  : 48;    // cast to type* to use
}

Granted that we're limited to 64k nodes with this technique, which would make it architecturally our tightest constraint. That is, it would be the place where we were closest to not being able to reflect the whole architecture. But we could apply it only to atomic class instances as opposed to all class instances if we wanted to and as I think I mentioned before, we're still probably some way off from being able to effectively scale a Chapel program to this size anyway.

LouisJenkinsCS commented 7 years ago

@gbtitus

Could we at least strike a compromise: compressed pointers built into the language, but GlobalAtomicObject available as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.

gbtitus commented 7 years ago

... wondering if the user should have an option to pragma that a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster.

Agreed -- I think the language could benefit from a notation that a particular atomic var (field or otherwise) is expected to be contended or not. This is useful information both for programmers maintaining the code later, and for the compiler/module/runtime which together arrange to implement the atomic var. I'm not particularly a fan of pragmas, myself. Maybe we could say contended atomic instead of plain atomic. :-)

mppf commented 7 years ago

I'm just catching up on this issue and I have a few things to say.

  1. I think the discussion conflates two issues: a. using a network AMO in a CAS-loop (at all, and in particular for remote AMOs) b. the descriptor table approach vs using compression/CMPXCHG16B
  1. I'm not as confident as Greg is at guessing which designs here will be better. In that veign, I'd like to see some performance comparison for a some plausible benchmarks:
    • list example as you have it
    • another plausible example that uses a CAS loop
      • one variant using an on statement around the CAS
      • one variant not using an on statement around the CAS

And then, is it possible to write the benchmarks in a way that makes a configurable amount of contention?

Then I'd like to see comparison of these approaches:

Then, we can gather evidence to suggest if Greg is right about which algorithms to prefer in high contention vs low contention.

Lastly, it might be that pointer compression + CMPXCHG16B is enough; we might be able to reasonably assume that a system will support one or the other.

mppf commented 7 years ago

Also, I think an important future step (not necessarily for this issue/PR) is to investigate how one can solve the ABA problem for lock/wait-free multilocale programming in Chapel.

gbtitus commented 7 years ago

I still have a few responses to write on this issue, but none of them are big and I expect we're winding down toward conclusion.

In the meantime, a much more fundamental question occurred to me last night long after I'd left work: do we really expect that people will write the kind of pointer-swinging code in Chapel that they do in, say, C? I'm not asking whether they'll want to write lock-free or wait-free algorithms; certainly they will. What I'm asking is: when they write those algorithms in Chapel will they do so in a C-like way by pointer-swinging class instance refs, or should we expect (and even encourage) them to do it in a more abstract way using other data structure representations?

gbtitus commented 7 years ago

Could we at least strike a compromise: compressed pointers built into the language, but GlobalAtomicObject available as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.

I'm not averse to this but I think it's more a question for some other reviewer (looking your way, @mppf). My expertise is in mostly lower in the software stack.

LouisJenkinsCS commented 7 years ago

@mppf

another plausible example that uses a CAS loop

I can do the next best thing. I'll try to globalize a current (well-known) data structure to test how they handle massive amounts of contention: Michael Scott's Lock-Free Queue. The queue's performance drops as you add more threads due to excessive contention placed on the head and tail, so at this point it comes down to 'How much'. I can do it based on runtime to test it. However, it should be noted again that we still have some issues with a lack-of-caching, but I can probably remedy this by allocating the queue nodes and caching their descriptors ahead of time. This could be very interesting... (I'll leave the sync variable out though).

LouisJenkinsCS commented 7 years ago

Forgot to ask: how would I test the cmpxchg16b thing from Chapel? Or would it have to be called from C?

gbtitus commented 7 years ago

@LouisJenkinsCS: You'd have to do it from C and even then, probably only gcc supports it and that not very well. I went looking at that stuff and it's sort of half-supported. There's an __int128 built-in type and the atomic compare-and-swap intrinsics can work on it, but you have to throw a compiler option (-mcx16) to enable that. So I'm not sure about the practicality of depending on it for production code right now.

LouisJenkinsCS commented 7 years ago

Hm... @mppf Any suggestions on how to proceed on testing this? Is there a way to have chapel build with -mcx16 flag added?

mppf commented 7 years ago

I thought cmpxchg16b would just amount to some assembly code snippet (in actual assembly or assembly-in-C). I wouldn't rely on GCC intrinsics for it at this stage. Actually it's used in qthreads and gasnet - see https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/qthread-src/src/threadqueues/nottingham_threadqueues.c#L163 and https://github.com/chapel-lang/chapel/blob/master/third-party/gasnet/gasnet-src/gasnet_atomic_bits.h#L904 - perhaps if we ask nicely it one of these could be pulled out into its own header file.

Anyway I do think the experimentation I described is worthwhile as we decide what needs to be directly supported in the language.

LouisJenkinsCS commented 7 years ago

Gotcha, I'll try to have it and the results by Friday

gbtitus commented 7 years ago

Here's one more little bit of possibly useful information. I've been estimating that remotely initiated processor atomics probably can't perform much better than 1/4 the speed of network atomics. It turns out this is really easy to test. On a Cray XC with comm=ugni, we can toggle the ra-atomics test between doing network AMOs and doing remotely initiated processor AMOs just by whether or not we have a hugepage module loaded. In order to use network AMOs we have to register memory with the NIC and in order to do that we have to put the heap on hugepages. No hugepages, no network AMOs. And when we do use remotely initiated processor AMOs it's through a fairly lightweight runtime-internal mechanism, not a Chapel on-stmt. So this is a pretty good analogy to the situations we've been discussing. (In this case the AMO is a XOR rather than a CMPXCHG, but I doubt that matters.) It turns out that network AMOs are 30x faster than remotely initiated processor AMOs for this test, not just 4x or 5x faster. Now granted my 4x was a floor and I haven't dug into why the difference is so much greater than I expected (I have more ideas than time) but I thought I should mention this result.

LouisJenkinsCS commented 7 years ago

Interesting... while it still doesn't leave me with much room for doubt that remote AMO is better for contention-heavy CAS-retry loop, it does give me hope for descriptor tables. I wonder though, does that 30x increase or decrease as you add more nodes?