LouisJenkinsCS opened 7 years ago
Re "Communication is not bad", I think your example was interesting but I'm not sure I'd describe it that way. I mean, you're saying that an unbounded amount of communication under contention kills performance. And it's probably worse to have unbounded number of compare-exchange when communication is involved than it is locally, because the latency on the operation is higher (so it might be more likely another operation changes the value).
My main question here is - what are these bit-maps of all memory you mentioned? Nothing about the descriptor table idea implies to me that we need bitmaps... but perhaps I have not thought through some issue.
The bitmaps are mainly to keep track of which descriptors are in use and which are not. The descriptors themselves do not refer to the actual address space, but to some index into an array on some locale. The bitmap provides a low-overhead (but essential) way of keeping track of which indices can be used/reused.
Bitmaps may not be needed if, instead of Chapel arrays, we use raw pointers to memory obtained via c_malloc (but this is a prototype after all). Currently the user must allocate the data first using new and then 'register' it (i.e., add it to the array), which returns the descriptor (i.e., the index the data was added to). Ideally, allocating large chunks of memory and implementing a basic memory manager would provide much better performance (but takes more time to implement).
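To illustrate the register-and-index flow described above, here is a minimal, non-concurrent sketch (the names and the boolean 'bitmap' are illustrative, not the prototype's actual API):
class Obj { var x: int; }

record DescriptorTable {
  var slots: [0..#1024] Obj;  // slot i holds the registered instance
  var inUse: [0..#1024] bool; // stand-in for the bitmap of used/free indices

  // 'register' an already-allocated instance; returns its descriptor (index)
  proc register(obj: Obj): int {
    for i in 0..#1024 {
      if !inUse[i] {
        inUse[i] = true;
        slots[i] = obj;
        return i;
      }
    }
    halt("descriptor table full");
    return -1;
  }

  // a read is just a lookup by index
  proc read(idx: int): Obj {
    return slots[idx];
  }
}

var table: DescriptorTable;
var c = new Obj(42);
const descr = table.register(c); // the user allocates, then registers
assert(table.read(descr) == c);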
I would expect a pattern like this:
class C { ... }
proc test() {
var a = new C(1);
var b = new C(2);
var x: atomic C;
var result:bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
delete a;
delete b;
}
I would imagine that would mean you would need to perform this register operation on the first atomic operation for that class.
Truth be told, under more ideal cases it would be better if a and b could be created by some other object and used as such...
class C { ... }
var gdt = new GDT(C);
proc test() {
// Maybe have parameters passed be used to reflectively call the constructor of C?
var (idxA, a) = gdt.create(1);
var (idxB, b) = gdt.create(2);
var x: atomic uint;
var result:bool;
x.write(idxA);
result = x.compareExchange(idxA, idxB);
assert(result);
assert(gdt.read(idxB) == b);
}
Currently it's ugly, but either the runtime or the compiler will need to keep track of the actual index of the object, or the user has to. I'm not sure if you can hash the actual pointer address to an index, but maybe something similar can be done to remove the need for the index and support just using the raw object. I can experiment with this as well.
Right, it's clear that would be more convenient to implement, but I'm not so sure I'd consider it a full solution to the problem. After all, we are trying to make a productive parallel language, so we want to minimize concerns for our users.
Anyway, as far as implementation goes, yeah, keep going & experiment with what it would take to get my example pattern to work. Thanks!
@mppf @e-kayrakli
Here is a document detailing how I plan to implement global atomics in a fashion that provides reasonable semantics (i.e., no need to carry descriptors, a wide-pointer compression strategy, global atomics, etc.)
I wish to start on it this weekend (afterward I need to satisfy Engin's requirements for the Queue), so by then I should know whether or not it works. As well, the FCHLock (Flat Combining Hierarchical Lock) determines whether I can make a global priority queue with reasonable performance, which can also be affected by this. If any advice or feedback could be given on it by Friday night, I'd appreciate it.
Also, @mppf, I believe it was mentioned before that Linux only allows a 47-bit virtual address space, but I can't 100% verify that. If that is the case, as a work-around I can always just shave off the top 17 bits for the locale id (which only allows 131K locales, but it could be an even more temporary but immediate solution, taking literally next to no effort to do). I just can't make assumptions without knowing how Chapel handles things currently. gbt1 mentioned that the network tops out at 128K, but I can see that increasing easily within a few years (in which case my solution won't work). However, if both are 'fine for now' solutions, then it's sufficient to just have the top 17 bits be the locale id and the lower 47 bits be the virtual address.
Edit:
Turns out, as seen here, that my solution is rather overly complicated. "Most operating systems and applications will not need such a large address space for the foreseeable future, so implementing such wide virtual addresses would simply increase the complexity and cost of address translation with no real benefit." Absolutely crazy... so I can actually just compress it as I see fit. Or rather, my solution isn't overly complicated, but it is one that is resistant to change.
I think it's important that we have a strategy for my above Chapel code example to work, even if it's not the first thing we implement.
I don't understand why you need a "Zone" strategy at all. Let's back up a bit and talk through the problems that really need to be solved here.
I think what I was imagining was: a) wide pointer compression into 64-bits b) a fall-back strategy that works differently (e.g. by referring to an allocated descriptor index in a table of descriptors).
See https://github.com/chapel-lang/chapel/blob/master/runtime/include/chpl-wide-ptr-fns.h for some existing code that does wide pointer compression, but that wouldn't work on very large systems. (I.e., I'm asserting that (a) is already implemented and you could steal/reuse that implementation.)
Anyway, for (b), could we propagate information about a mapping from index to node in a dynamic way, only when needed? For example, here is an "obvious" strategy for (b) - that is, a straw-man:
atomic obj can either store a compressed pointer or a descriptor (so you'd use the top or bottom bit to distinguish between these two; note that the bottom bit can be "stolen" from an address because all class instances are allocated and so will be aligned, usually 16-byte aligned in practice). For terminology, let's call the thing we actually store in an atomic class variable an "atomic instance descriptor".
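A sketch of that bit-stealing idea (function names are illustrative):
// a compressed pointer always has bit 0 clear, since class instances are
// allocated with (at least) 16-byte alignment; descriptors set the stolen bit
proc isDescriptor(word: uint): bool {
  return (word & 1) == 1;
}

proc makeDescriptor(idx: uint): uint {
  return (idx << 1) | 1; // tag a table index with the stolen bit
}

proc descriptorIndex(word: uint): uint {
  return word >> 1; // strip the tag to recover the index
}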
How would that work with my example?
class C { ... }
proc test() {
var a = new C(1); // works as it does now
var b = new C(2); // as it does now
var x: atomic C; // atomic C is 64 bits in a format we are inventing: "atomic instance descriptor".
var result:bool;
x.write(a);
result = x.compareExchange(a, b);
// compareExchange will:
// 1) Convert a and b to atomic instance descriptors (possibly by compression,
// possibly by searching in or updating some data structures)
// 2) Perform the 64-bit atomic compareExchange operation on the descriptors
assert(result);
assert(x.read() == b);
// read() will:
// 1) atomically read a 64-bit atomic instance descriptor
// 2) unpack the atomic instance descriptor into a 128-bit wide pointer
// (possibly by reading it directly, possibly with some data structure).
delete a;
// delete a could remove the atomic instance descriptor for a from some data structure
delete b;
// delete b could remove the atomic instance descriptor for b from some data structure
}
So, for the example to be efficient, we need reasonable ways to: i) go from the wide address of a class instance to an atomic instance descriptor, and ii) go from an atomic instance descriptor to an address.
We could conceivably store the atomic instance descriptor in the class instance, once it is computed. We're making the compiler/language after all, so we could add a 64-bit element to all class instances. That might solve (i). Or maybe there is a better solution.
For (ii), we could literally use a global array as I was describing above. It could even be block distributed... Or maybe the best solution here uses a limited number of atomic descriptors per locale, so that the format would be 32 bits of locale number/node id and then a 32-bit offset into that locale's individual table. That's not really that different from using a Block distributed array, but it might add some flexibility. In particular, you could imagine using a skip list (or something) instead of an array, to make inserting/freeing more efficient.
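For concreteness, that 32/32 format could be packed and unpacked like this (a sketch; names are illustrative):
// locale/node id in the high 32 bits, per-locale table index in the low 32
proc encodeDescriptor(node: uint, idx: uint): uint {
  return (node << 32) | (idx & 0xFFFFFFFF);
}

proc decodeDescriptor(descr: uint): (uint, uint) {
  return (descr >> 32, descr & 0xFFFFFFFF);
}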
Or, what if these arrays were per-pthread, so that there isn't even any locking overhead in managing the data structures? Whatever pthread is doing (i) would update its table.
Is it a significant limitation to only allow 2^32 atomic instance descriptors at a given time? I'm not sure.
Related ideas:
Even though these are related, I don't think it's a good idea to tie this effort to those other components, unless we really really need to.
I don't understand why you need a "Zone" strategy at all.
I believe I should reiterate the point of Zones. Zones make three assumptions in order to compress the most significant 32 bits of the addr portion of a wide pointer...
1) The Virtual Memory Manager allocates virtual addresses monotonically rather than randomly, decreasing the number of unique most significant 32 bits.
2) Chapel's memory manager reuses memory, further reducing the number of unique most significant 32 bits.
3) The number of unique most significant 32 bits does not exceed what the bits used to track them can represent.
Assuming all locales cache each other's Zone base offsets and their respective mappings, we can easily do this without intervention from the runtime or compiler. Note again, this was from the perspective of using GDT as a module, but in your example it can still work, and if anything it can do so in a much better way: instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.
To skip to the good part... We'll assume that a.addr and a.localeId refer to the address and locale id respectively...
x.write(a);
// write will:
// 1) Descriptor becomes (a.localeId << 40) | zoneId(a) << 32 | a.addr & 0xFFFFFFFF
// 2) This is written to some atomic uint field.
// Note: zoneId performs the 'magic' in terms of compression... that's certainly possible but a WIP
result = x.compareExchange(a, b);
// compareExchange will:
// 1) Convert a and b to atomic instance descriptors (possibly by compression,
// possibly by searching in or updating some data structures)
// 2) Perform the 64-bit atomic compareExchange operation on the descriptors
assert(result);
assert(x.read() == b);
// read() will:
// 1) atomically read a 64-bit atomic instance descriptor
// 2) Unpacks by doing the following:
// var descr = ...;
// var localeId = descr >> 40;
// var addr = (zone((descr >> 32) & 0xFF)) | descr & 0xFFFFFFFF;
// return ...; /* Handle conversion back to wide pointer with localeId and addr */
// Note: zone performs the 'magic' in terms of decompression... while it is a WIP, what it does
// is return a 64-bit base that the offset (lower 32 bits) is added to.
The above again makes a relatively dangerous assumption: only 256 unique most-significant-32-bit values are ever used. Truth be told, it could potentially be improved by changing how many bits are kept for the addr part and how many are used for the zoneId. I just needed to make this case apparent before it's dismissed, as I still think it could work (as you said, the language can be changed to provide certain guarantees).
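To make the zone mapping concrete, here is a minimal, unsynchronized sketch assuming at most 256 zones (illustrative only; the real version would need the caching and synchronization described above):
var zoneBases: [0..#256] uint; // zoneBases[z] holds one unique 'high 32 bits' value
var numZones = 0;

// intern the high 32 bits of an address, returning its zone id
proc zoneId(addr: uint): uint {
  const base = addr >> 32;
  for z in 0..#numZones do
    if zoneBases[z] == base then
      return z: uint;
  if numZones == 256 then
    halt("out of zones; would fall back to descriptor tables");
  zoneBases[numZones] = base;
  numZones += 1;
  return (numZones - 1): uint;
}

// recover the 64-bit base that the low 32-bit offset gets added to
proc zone(id: uint): uint {
  return zoneBases[id: int] << 32;
}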
a) wide pointer compression into 64-bits
It's good to hear that it's been done before; that definitely makes my job easier, and it also makes more sense that someone else on the development team thought of it first.
b) a fall-back strategy that works differently (e.g. by referring to an allocated descriptor index in a table of descriptors).
creating a descriptor simply adds a class instance pointer to a free slot in the array & returns its index
This is definitely viable if we do what you said...
We could conceivably store the atomic instance descriptor in the class instance, once it is computed.
This would be acceptable. If we had each type keep track of its descriptor, and each type have a specific GDT, it would also work out well. This allows the current approach of keeping a bitmap, and it also allows me to incrementally grow the GDT by allocating and chaining another array (twice the original size) and to keep reads of currently existing descriptors free of synchronization. This may be the easiest way, as well as an acceptable proof-of-concept/prototype.
using a skip list (or something) instead of an array, to make inserting/freeing more efficient.
That makes any remote operation (such as a read from another node) too expensive, as it too needs to traverse the nodes in the list to find the element.
Or, what if these arrays were per-pthread, so that there isn't even any locking overhead in managing the data structures?
That's an interesting one... would this also allow certain NUMA-friendly/aware optimizations? Perhaps you could have one for each NUMA node, if that's the case. However, having one per pthread, especially with hyperthreading and tons of processors per node and tons of nodes, I can foresee being an issue. Having too many GDT structures is a huge space overhead, but per NUMA node should be okay, since each has its own memory anyway, right?
Assuming instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.
I can't really make sense of that one...
using a skip list (or something) instead of an array, to make inserting/freeing more efficient.
That makes any remote operation (such as a read from another node) too expensive, as it too needs to traverse the nodes in the list to find the element.
Which operations need to access this data structure, and how do they access it? myAtomicC.read() goes from index to class instance wide pointer. That basically just amounts to accessing the i'th element of the array/whatever. Different data structures we could choose will represent different tradeoffs between the speed of the various operations.
I don't object to the zone idea exactly, I just think it needs to fit into the framework I was describing & should be evaluated among some of these other ideas.
Assuming instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.
I can't really make sense of that one...
That's my fault; I meant that atomic C becomes some other object that keeps track of both the GDT (Global Descriptor Table / global array of class instances) and the current descriptor set.
That basically just amounts to accessing the i'th element of the array/whatever.
Now that's another thing that's my fault. Truthfully, I've never implemented a skip list before, so when I thought 'list' I thought 'linked list', not that it could be array-based. I rescind what I said; a skip list seems to be the ticket then.
I don't object to the zone idea exactly, I just think it needs to fit into the framework I was describing & should be evaluated among some of these other ideas.
Gotcha. The thing is, when it comes to the runtime or compiler I have no idea what is and isn't possible. As well, I've never developed applications for HPC clusters (and this is my first time delving into distributed computing), so I don't have any input to give on whether something is enough or what is expected of a sample user program. However, to give what little input I can give now...
Or, we start with 2^32 max and expect to extend that in the future if the limit becomes a problem.
This is one thing I can agree with. I honestly can't say whether it's good enough even in the present, but that's all I can assume.
A big machine could allocate that many class instances. An allocated class instance is going to be 32 bytes at a minimum, in practice, as far as I understand. (16 byte aligned but storing >= 1 byte). So that's 128 GiB of storage for the classes themselves, which is a lot, but might even be possible on 1 node.
I honestly don't know, but my opinion would be that if a node exceeds the maximum threshold, it should just be a runtime error, and the user should try another work-around. Of course, a fallback for this could be to just use the hwlock solution if the user passes a certain flag, --atomics=hwlock or something? Again, I'm not even sure what the standard way of handling this in Chapel is.
One might investigate how many atomic instance descriptors would be used in a sample application using atomic class instances.
This could be doable for my project by just using it in the queue. If a data structure can't exceed that limit, then I doubt anything else can.
As well, to reiterate, the best thing I can potentially do is to implement it as an actual module, and then analyze how it can be integrated into the language itself. Currently a way to hash pointers to actual indices is needed, and the only way to prove a solution works is to build an early prototype of it (at least IMO). The solution I agree with is having a field in each class instance for the descriptor unique to an atomic variable (definitely the most straightforward). That's impossible from a module, but the 'zone' one might work. At least after the module is established/developed, it should make it 10x easier to add to the language itself and/or make other modifications to it.
Right, but when you're doing such a prototype implementation, it's important to know what a plausible API for the compiler to use will be.
I can think of an adaptive 4-phase strategy (a sketch of the phase selection follows the list)...
1) Normal compression of wide pointers (17 bits for locale, 47 bits for offset, based on the assumption that only 47 bits of address space are ever used), which allows 131K locales and the entire usable address space. This should be sufficient and causes no performance loss whatsoever, as barely any translation is needed. Note that I don't leave a bit for marking/tagging, but one can be added if you need to pack it into a 64-bit word; that would change the condition to NumLocales ≤ 2^16.
2) The Zone compression strategy. The more nodes you have, the fewer bits you have for zones (and if you run out of zones, we can adapt into the third phase). In this case, if you have exactly 2^18 NumLocales, you can reserve the rest of the bits for zones, making it more likely that you can continue using this strategy. As mentioned before, Zones just map the most significant 32 bits of an address to a zone id as they are encountered, with only minor performance loss for caching another locale's zone mappings.
3) The global descriptor table strategy. Any allocations that exceed the zones are added here instead. There is a significant penalty in terms of managing everything, but what else can you do?
4) Just fall back to using a sync variable or some other lock. No amount of compression will help in that case.
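As promised, a sketch of how the phase selection might look (the enum and the exact thresholds are illustrative assumptions):
enum Strategy { pointerCompression, zoneCompression, descriptorTable, lockFallback };

proc chooseStrategy(): Strategy {
  if numLocales <= (1 << 17) then
    return Strategy.pointerCompression; // phase 1: 17-bit locale, 47-bit offset
  else if numLocales <= (1 << 18) then
    return Strategy.zoneCompression;    // phase 2: remaining bits become zone ids
  else
    return Strategy.descriptorTable;    // phase 3; phase 4 is a per-object fallback
}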
Would you agree to something like this?
Edit:
Added a 4th phase
Okay, now I found out that I can't really create wide pointers from Chapel... I can't call any of the runtime functions that expect or return wide_ptr_t, because there is no equivalent type in Chapel (and the compiler throws an error to avoid type-punning, so using a record of equivalent size won't work either). As well, any change I make to the compiler would make my proposal at the end null and void, so I can't go that route at all... I guess, really, even compression won't work; I'll just have to skip straight to descriptor tables... So I'll try to implement concurrent skip lists, handle allocation in that manner, and see if I can hash objects... @e-kayrakli I can do that, right? Hash objects to a unique uint or something comparable?
Hash objects to a unique uint or something comparable?
I am not sure if I am understanding it right... but yes, you can cast references to uint or something similar using compiler primitives. And once you have it cast, then hash it as if it were just an integer and not a reference. If it is a wide reference (which I think is the actual question you are asking), I think you can have a hash function that takes both the locale id and the address as input. (There are wide_get_* primitives that can help you get the different fields of a wide pointer easily, and probably much faster than e.g. x.locale.id.)
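A sketch of such a hash function using those primitives (the mixing is arbitrary and just for illustration):
proc hashObject(obj: object): uint {
  var node = __primitive("_wide_get_node", obj);
  var addr = __primitive("_wide_get_addr", obj);
  var unode = node: uint;
  var uaddr = __primitive("cast", uint, addr);
  // combine node and address; any reasonable mixing function works
  return (unode << 48) ^ uaddr;
}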
Okay, now I found out that I can't really create Wide Pointers from Chapel... I can't call any of the runtime functions that expect or return wide_ptr_t because there is no equivalent type in Chapel
You should be able to call the functions in runtime/include/chpl-wide-ptr-fns.h, but you'll have to use the foreign function interface to do so. Including telling Chapel that wide_ptr_t is an extern type.
Additionally, there are compiler primitives to get the locale or address portion of a wide pointer: _wide_get_locale, _wide_get_node, _wide_get_addr. See, for example, test/multilocale/ferguson/get-addr-node.chpl. I would expect that these compiler primitives are all that you need for an initial attempt.
I don't understand why compiler adjustments would make your proposal useless.
I've done it! I have done as you asked, Michael: I created the descriptor table approach and have an MWE similar to the one you desired. I even embedded everything within a single GlobalAtomicObject which manages everything, and as a record it provides automatic memory management and cleans itself up nicely.
class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;
// Is Safe because we only use the pointer itself.
x._delete(a);
x._delete(b);
// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
I have also tested it in a multi-locale environment and can prove that it works (although the test doesn't have much coverage yet, aside from ensuring that the objects returned are consistent). So, in essence, I have proof of correctness for this proof of concept. Next up is performance, for which I need to measure each operation (compared to the sync equivalent as a baseline and atomic on 64-bit values as a ceiling). I'll make a follow-up post once I have such results.
I present the GlobalAtomicObject, the first of its kind to allow atomic operations on class instances (128-bit wide pointers)! GlobalAtomicObject currently suffers from poor optimization due to lack of time, but it sure is promising! I compare its performance against 'Sync Var', the all-mighty sync var, which currently gets a handicap by just acquiring the lock, performing some arbitrary cheap operation, then immediately releasing it. This is the 'baseline' which we must beat, or else the implementation is useless.
The next is Atomic on a 64-bit value, which is the 'ceiling': it can be seen as how much performance we can get from normal pointer compression versus keeping tables like this. There are two benchmarks: 'single', which is a single atomic operation, and 'LLSC', which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like this...
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use atomic operations similar to 'Atomic', although a few things need to be considered. 'Single' performs a write on data in local memory (a read is not interesting, since it is a basic lookup in a locale's table), which shows the ideal case. 'LLSC', on the other hand, can perform operations on remote data, meaning that each time, it is possible that during compression of the data we may need to jump to other locales. It shows a more average case, but not the worst case, in which both operands to compareExchangeStrong are remote. Currently, 'LLSC' for GlobalAtomicObject performs 3 atomic operations: the read, which is a simple lookup, and the compression of both variables passed to compareExchangeStrong, of which one is local and the other remote.
It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed, it is still manageable and better than a sync variable.
I present two graphs: one with respect to runtime, and another in terms of raw operations per second.
This shows the runtime of the program. As can be seen, raw atomic operations grossly beat everything else, although atomic LL/SC lags enough to demonstrate the power of global atomics: a single global atomic, while magnitudes slower than a single 64-bit atomic, is only 2x slower than the atomic LL/SC.
Imagine if I had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), which a single Global Atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that place contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to win.
As great as GlobalAtomicObject is in being novel and the first solution, it isn't the best solution. Any compression that fits within 64 bits without the overhead of extra communication would get the phenomenal performance of a single atomic operation.
This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they overshadowed everything else to the point that the other performance differences wouldn't be noticeable. As can be seen, however, both a single global atomic operation and even the LL/SC combo operation with a potential remote reference (meaning an added on statement) beat a normal sync variable. In every case, GlobalAtomicObject wins outright, but the performance penalties from remote references are very apparent and should be avoided at all costs.
Some improvements to the GlobalAtomicObject that can be made... I can't do much more than the first one (and even then not without some help); the rest would be too time-consuming. The groundwork has already been done, so I'm sure someone else can pick up on this (or maybe after the project is over I can come back and do it myself).
This is something I still want to do, but it'd be great if I could have a good example to work from in Chapel; that's all I'd need to be able to start and complete it. Currently it only uses descriptor tables, which use a skip list for allocation/deallocation and a segmented memory pool that recycles nodes (and is indexed directly during a read operation).
We use a sync variable to synchronize access locally, which is a huge bottleneck. If a wait-free or lock-free skip list were used, I'm certain that performance would improve by about 5 - 10x.
From using malloc over new for recycling, to caching the descriptor in the object itself, etc. Even implementing it in C alone would provide better performance, I'm sure, especially as part of the language/runtime itself.
Cool!
The Chapel type system doesn't normally allow ptr-int casting but we can work around that with some special functions. The below example should have everything you need to get started on the pointer compression part.
proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
return __primitive("cast", uint, ptr);
}
proc uint_to_c_void_ptr(u:uint)
{
return __primitive("cast", c_void_ptr, u);
}
proc test(obj:object)
{
var loc = __primitive("_wide_get_locale", obj);
var node = chpl_nodeFromLocaleID(loc);
var subloc = chpl_sublocFromLocaleID(loc);
var raddr = __primitive("_wide_get_addr", obj);
// TODO: reasonable checking for pointers/nodes
// too big to fit into this format.
// TODO: configurable format
// Let's "pack" the wide pointer
var uraddr = c_void_ptr_to_uint(raddr);
var unode = node:uint;
var packed:uint = 0;
packed = uraddr & 0x0000ffffffffffff;
packed |= unode << 48;
writef("node %xi raddr %016xi\n", unode, uraddr);
writef("packed to %016xi\n", packed);
// Let's "unpack" the wide pointer.
var got_uraddr = packed & 0x0000ffffffffffff;
var got_unode = packed >> 48;
var got_raddr = uint_to_c_void_ptr(got_uraddr);
writef("unpacked to node %xi raddr %016xi\n", got_unode, got_uraddr);
// check pointer value matches.
assert(raddr == got_raddr);
// check node matches
assert(obj.locale.id == got_unode);
}
class C {
var x:int;
}
for loc in Locales {
on loc {
var c = new C(10);
test(c);
}
}
For the other improvements, I'd expect the wait-free/lock-free version would be a big win but that implementing it in C vs Chapel won't matter much.
You might also need a way to go from a node and a local pointer to a wide pointer. I can probably help you with that part too but I'm hoping you can make progress without it for the moment.
I greatly appreciate the examples; they're definitely handy for handling compression and decompression. However, I really do need to be able to obtain the object, which means I do need to create the wide pointer. If there were a way to copy a dummy wide pointer by value, then modify the copied by-value wide pointer, that could work, so long as the compiler allows it. The issue is that it throws "TODO - don't like type-punning record/union", which is a huge roadblock, because this would work if it were allowed. I'm assuming that it throws a compiler warning because GCC will throw a compiler warning, and any compiler warning gets treated as a compiler error?
A simple function could be added to runtime/include/chpl-wide-ptr-fns.h to create a wide pointer given the type, localeId, and addr, right?
Never mind, I realized that this was it. I managed to side-step the compiler a bit by using c_ptrTo to get the address of a wide pointer created on the stack and copy into it directly. Dangerous, but the best I can do. Full MWE below:
extern type wide_ptr_t;
extern type c_nodeid_t;
extern proc chpl_return_wide_ptr_node(c_nodeid_t, c_void_ptr) : wide_ptr_t;
proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
return __primitive("cast", uint, ptr);
}
proc uint_to_c_void_ptr(u:uint)
{
return __primitive("cast", c_void_ptr, u);
}
class C {
var x:int;
}
proc test(obj:object)
{
var loc = __primitive("_wide_get_locale", obj);
var node = chpl_nodeFromLocaleID(loc);
var subloc = chpl_sublocFromLocaleID(loc);
var raddr = __primitive("_wide_get_addr", obj);
// TODO: reasonable checking for pointers/nodes
// too big to fit into this format.
// TODO: configurable format
// Let's "pack" the wide pointer
var uraddr = c_void_ptr_to_uint(raddr);
var unode = node:uint;
var packed:uint = 0;
packed = uraddr & 0x0000ffffffffffff;
packed |= unode << 48;
writef("node %xi raddr %016xi\n", unode, uraddr);
writef("packed to %016xi\n", packed);
// Let's "unpack" the wide pointer.
var got_uraddr = packed & 0x0000ffffffffffff;
var got_unode = packed >> 48;
var got_raddr = uint_to_c_void_ptr(got_uraddr);
writef("unpacked to node %xi raddr %016xi\n", got_unode, got_uraddr);
// check pointer value matches.
assert(raddr == got_raddr);
// check node matches
assert(obj.locale.id == got_unode);
// Start here for new stuff... we create the wide pointer from decompressed information
var wideptr = chpl_return_wide_ptr_node(got_unode, got_raddr);
// Note: We fool the compiler into creating a wide pointer by using an
// 'on here' type of statement. I indexed into 'Locales' because I don't want the
// compiler to 'optimize' it away...
var newObj : C;
on Locales[here.id] do newObj = new C();
// Warning: If this is *not* a wide pointer, it will overwrite part of the stack.
// If it is, it works. Dangerous, but only way to do so at the moment.
c_memcpy(c_ptrTo(newObj), c_ptrTo(wideptr), 16);
assert(obj == newObj);
}
for loc in Locales {
on loc {
var c = new C(10);
test(c);
}
}
@LouisJenkinsCS - looks like an OK workaround for now. At some point, I think that somebody (possibly me) should add the reverse to __primitive("_wide_get_addr", obj) -- namely, a primitive that creates a wide pointer of a particular type from a local void* and a locale.
a primitive that creates a wide pointer of a particular type from a local void* and a locale.
FWIW, I have implemented something like that for my work: https://github.com/e-kayrakli/chapel/blob/prefetch-master/compiler/codegen/expr.cpp#L5317
It can be used with a C pointer or a wide reference (I implemented support for narrow references as well, but it was useless for me, thus I commented it out). You can read lockOff in the code as the locale ID for the general case. I don't mind polishing it, if need be.
@mppf
It's been done. Not only does it perform the compression strategy for locale counts that fit within 16 bits, it also maintains manageable/respectable performance for locale counts that need 17 to 32 bits. It is as fast as atomics (in this case it was faster, but likely due to environmental differences such as scheduling, network traffic, etc.). I jumped straight to 32 locales because there's no point rerunning everything, as it's clear that sync is inferior (and it takes a long time to finish).
32 Locales

| Sync | Atomic Single | Atomic LLSC | Global Atomic Single | Global Atomic LLSC |
|---|---|---|---|---|
| 316.75 | 0.215319 | 6.78373 | 0.203364 | 6.90024 |
@LouisJenkinsCS - cool, let's work towards putting this in to a PR that can go onto master.
Sure, I can handle something like that. Should I do the following...
1) Clone Chapel (already done)
2) Create a branch called 'global-atomics'
3) Move GlobalAtomicObject.chpl into modules/standard
   a) Requires me to condense everything into a single file, but manageable
4) Include the test in 'test/global_atomics' with a '.good' file
5) Make a PR into master
If it doesn't have to be today, I'll be able to refactor it into something more presentable (currently it's more of something I coded in 2 days).
@mppf -- Should this be a standard module or an internal one?
Doesn't have to be today. You'll also need to socialize the basic idea (design, proposed interface, plans for compiler support that you won't implement but maybe I will).
I'd make it an internal module, given that users will eventually just use ': atomic MyClass' types to get to it.
edit: socialize the idea by creating an issue or PR message or email summarizing it. Send email to chapel-developers with a link or with the description.
@mppf @e-kayrakli
Below is a rough draft of the PR proposal I want to make tomorrow afternoon. Let me know what you think. The code still needs to be refactored tomorrow, but I'd think I need a good opener.
In any budding language, especially one that boasts of being a productive concurrent/parallel language, atomics on all types are a necessity. Currently, atomics are only supported up to 64-bit values, and because PGAS languages allow pointers into the address spaces of other nodes, a normal 64-bit pointer may no longer suffice. Wide pointers, which are 128-bit values containing both the node id and the address, are necessary but introduce a new problem: the majority of current hardware is restricted to 64-bit atomic operations, so how can we allow most platforms to atomically modify a 128-bit value?
In this proposal I suggest a solution that is not only viable but also scalable.
I introduce the GlobalAtomicObject, a module which hopefully can be integrated into the core of the language itself. It is a prototype, but it is extremely promising and hopefully can be expanded on by others. At best I believe it should be integrated into the language, and at the least included as a module.
Below, I go over some details that are present in the current implementation.
This is the common case, and certainly not a novel concept in and of itself nor to Chapel, but it is sufficient for most cases, where the number of locales fits within 16 bits. It takes advantage of the fact that all known architectures and operating systems only use 48 bits of the virtual address space, meaning we are free to embed the 16 bits of the locale id in the upper 16 bits. This means that for up to 65K locales, we get the benefit of a single 64-bit atomic operation, making atomics extremely fast.
This could be a new design pattern, really; the idea, while novel in its application, is simple in theory. We maintain a table of objects, one per locale, and use the index and locale id together as a descriptor (a pseudo-pointer). When we want to atomically operate on a class instance, we first hash the class instance (its 64-bit address) and perform an insertIfNotPresent operation on the target locale, then embed the locale id in the upper 32 bits and the index into the descriptor table in the lower 32 bits. Compression is the most expensive part, as it involves doing the above every time (although see the improvements section for a huge improvement), but decompression is cheap in that it can directly index into the descriptor table.
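A sketch of both directions (DescriptorTable here is a simplified, hypothetical stand-in for the skip-list-backed per-locale table; in particular, this insertIfNotPresent does not actually hash or deduplicate):
class C { var c: int; }

class DescriptorTable {
  var objs: [0..#1024] C;
  var count: atomic int;

  // the real version hashes obj's 64-bit address and returns the
  // existing slot if the object was already registered
  proc insertIfNotPresent(obj: C): uint {
    const idx = count.fetchAdd(1);
    objs[idx] = obj;
    return idx: uint;
  }

  proc lookup(idx: uint): C {
    return objs[idx: int];
  }
}

var tables: [LocaleSpace] DescriptorTable = [loc in LocaleSpace] new DescriptorTable();

// compression: hash/insert on the target locale, then pack (locale, index)
proc compress(obj: C): uint {
  const node = obj.locale.id: uint;
  const idx = tables[node: int].insertIfNotPresent(obj);
  return (node << 32) | idx;
}

// decompression: unpack and index directly into that locale's table
proc decompress(descr: uint): C {
  const node = (descr >> 32): int;
  const idx = descr & 0xFFFFFFFF;
  return tables[node].lookup(idx);
}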
I compare GlobalAtomicObject (using descriptor tables) against 'Sync Var', the all-mighty sync variable, which currently gets a handicap by just acquiring the lock, performing some arbitrary cheap operation, then immediately releasing it. This is the 'baseline' which we must beat, or else the implementation is useless.
The next is Atomic on a 64-bit value, which is the 'ceiling': it can be seen as how much performance we can get from normal pointer compression versus keeping tables like this. There are two benchmarks: 'single', which is a single atomic operation, and 'LLSC', which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like this...
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use atomic operations similar to 'Atomic', although a few things need to be considered. 'Single' performs a write on data in local memory (a read is not interesting, since it is a basic lookup in a locale's table), which shows the ideal case. 'LLSC', on the other hand, can perform operations on remote data, meaning that each time, it is possible that during compression of the data we may need to jump to other locales. It shows a more average case, but not the worst case, in which both operands to compareExchangeStrong are remote. Currently, 'LLSC' for GlobalAtomicObject performs 3 atomic operations: the read, which is a simple lookup, and the compression of both variables passed to compareExchangeStrong, of which one is local and the other remote.
It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed, it is still manageable and better than a sync variable. It should also be emphasized that the runtime of GlobalAtomicObject with pointer compression is equivalent to Atomic.
I present two graphs: one with respect to runtime, and another in terms of raw operations per second.
This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly beat everything else, although atomic LL/SC lags enough to demonstrate the power of global atomics: a single global atomic, while magnitudes slower than a single 64-bit atomic, is only 2x slower than the atomic LL/SC.
Imagine if we had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), of which a single Global Atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that placed contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.
This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they overshadowed everything else to the point that the other performance differences wouldn't be noticeable. As can be seen, however, both a single global atomic operation and even the LL/SC combo operation with a potential remote reference (meaning an added on statement) beat a normal sync variable. In every case, GlobalAtomicObject wins outright, but the performance penalties from remote references are very apparent and should be avoided at all costs.
Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).
class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;
// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);
// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
GlobalAtomicObject(C) can easily become atomic C. If the language were modified, it wouldn't be too surprising to see the above become the below (which of course lacks the excessive comments), which is the exemplary ideal that @mppf insisted on:
class C { ... }
proc test() {
var a = new C(1);
var b = new C(2);
var x: atomic C;
var result:bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
delete a;
delete b;
}
Where delete can call some hook that also removes the object from the table if it has not yet been reclaimed. I believe this can become a reality with little effort.
Below, I go over some potential improvements that can be made by others who are up to it.
It may be possible to combine normal pointer compression for the lower locales that fit within 16 bits. Combining both gives better performance on average: even when there are more than 16 bits' worth of locales, any locale with an id below 2^16 gets the full power of atomics, while those with ids of 2^16 and above use the much-slower-but-faster-than-sync-variables descriptors. See the next idea for how to differentiate.
A theoretical model where we use a more dynamic type of compression which attempts to use the lowest 32 bits for the virtual address and divide the upper 32 bits into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by a zone id, which is allocated when a new unique id is found, with optimizations such as caching the zone mappings of other locales (i.e., the first time a unique upper 32 bits is found, check whether we have it cached; if not, jump to the owning locale, add it if it isn't already there, and update what we have cached, etc.). The benefit here is that we get to keep using 64-bit pointer compression, and the number of unique zones that can be tracked depends on the number of bits needed to keep track of the locale id.
"What happens when we run out of zones?" - We use the descriptor tables again. We can just search the unique zones for the target locale first, if the zone isn't found and we can't create a new one, we just use descriptors.
"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential maximum segments for descriptor tables, but it allows a much more dynamic compression strategy.
This is a change that isn't theoretical but requires changes to the language itself. The majority of the overhead of descriptor tables is hashing the object, which requires jumping to the owning locale; if descriptors were cached, repeated accesses would be just about as fast as atomics. If they were cached, the benefits of recycling memory used in atomics would be tenfold (or hundredfold, as the benchmarks show).
The table utilizes a simplistic skip list implementation on a segmented memory pool. The memory pool allows indexing via an easy translation from a 32-bit index into a segment and a segment data index. It utilizes a simple bitmap (BigInteger). The implementation of the memory pool allows wait-free reads (hence indexing is fine) but requires synchronization to recycle/reuse memory. The skip list used to manage memory also requires synchronization, and a lock-free/wait-free implementation requires proper memory management such as hazard pointers, which require thread-local or task-local storage. Hence, synchronization is needed, and we use a simple local sync variable.
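The index translation itself is simple; a sketch with an illustrative segment size:
param segmentSize = 1024;

// a 32-bit index splits into (segment, offset within segment)
proc segmentOf(idx: uint): uint {
  return idx / segmentSize;
}

proc offsetOf(idx: uint): uint {
  return idx % segmentSize;
}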
If someone managed to make everything concurrent, it could improve performance further by 5 - 10x at least.
Edit:
Am fixing up for official issue.
You don't have to get feedback on a PR message before opening it. Your PR reviewer will ask you to update it if they think it's necessary. Note that the PR message becomes the merge commit message, so it goes into git history, more or less. So I think capturing any "gotchas" in the code, or your software design at a high level, is important. The "Details" part seems to do that well.
Since you already have it here: my initial thought is that this is a bit too long for a PR message. If you want to make it more concise, the "Introduction", "Performance" and "Usage" headers might go away, or could be summarized with a couple of sentences. Usage details can also be summarized in documentation in the code.
That being said, @mppf can override any comments I have here :)
Currently atomics are only supported for 64-bit values
up to 64-bit values
current hardware is restricted to 64-bit atomic operations, how do we atomically modify a 128-bit value?
some hardware supports 128-bit CAS; it's just that I don't think we want to require it.
I skimmed over the rest. My suggestion would be to make that into a GitHub issue and simply link to that issue from any PR message you make to do the implementation. It's advisable to ask for feedback on the issue before you actually open a PR.
Thanks!
Decided to go ahead and create the issue describing the new construct here, but the official PR is a ways off, hopefully by Friday at this point.
PR Officially up: https://github.com/chapel-lang/chapel/pull/6717
@e-kayrakli @mppf
This is of importance to both the project and to Chapel as a whole (or so I believe). I believe that this early prototype of the Global Descriptor Table (GDT), formerly referred to as GlobalAtomicObject, demonstrates the potential power of global atomics. There are a few key findings here that correct misconceptions I've had and been told since the beginning...
Communication Is NOT Bad
Communication, even if it is per-operation, is not bad, but it's not good either; the true bottleneck is contention. I've tested time and time again, and each time, an algorithm containing some contention-causing operation that may cause an unbounded number of failed operations, resulting in an unbounded number of communications, has failed horribly in terms of performance. For example, an operation like the following is bad...
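(A sketch of the kind of retry loop meant here, where a and b stand in for two remote atomic counters:)
var a, b: atomic int;

while true {
  var x = a.read(); // remote read
  var y = b.read(); // remote read
  // each failed CAS forces x and y to be read (remotely) again, so under
  // contention the number of communications is unbounded
  if a.compareExchangeStrong(x, x + y) then break;
}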
It will cause an unbounded number of communications. Reading x and y is relatively expensive, and the CAS operation not only causes remote contention, but the fact that both x and y need to be read again causes performance to drop. Now, algorithms which are guaranteed to succeed are guaranteed to scale very well; that is, only wait-free algorithms can see scalability. Code that uses fetchAdd and exchange works wonderfully, which gives me hope that global data structures with very loose guarantees are possible and useful.
Global Descriptor Table
Next is the new and novel idea of the GDT, the Global Descriptor Table. A 64-bit word is used in place of a 128-bit wide pointer, allowing us to take advantage of the benefits of network atomics. In essence, we encode the locale number in the upper 32 bits and the actual index in the lower 32 bits. Currently an array is used, but in reality (perhaps with runtime support, if not already available) it's possible that a descriptor could be used directly like a normal pointer. Currently there is a large amount of overhead in needing to keep a bitmap of usable memory, and it cannot be resized without synchronizing all accesses to it (as Chapel domains and arrays cannot be resized in a lock-free way).
Currently, it has been tested with simple exchange operations on class instances remotely across all locales, versus needing to acquire a sync variable to do the same. As stated above, a compareExchangeStrong kills performance, but that has nothing to do with the GDT, only with network atomic contention, so it's kept simple. It works and it scales. The below graph shows the time to complete the same number of operations (100,000 in this case). It shows the overall runtime as the average of 4 trials (discarding the first warm-up trial): while sync grows near-exponentially, GDT remains linear.
Now, to get a better look at it, here are the same results in operations per second.
Implications
I believe this is some very valuable information, which is why I include Michael as well. Not only is the implementation very immature (there are so many optimizations that can be made that there's no saying how much more this can scale), it also surpasses the only other way to perform atomics on global class instances. As well, this opens up some very unconventional means of concurrency, the first being the DSMLock (WIP), which is based on DSM-Synch (Distributed Shared Memory Combined Synchronization) from the publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). As I can confirm that this scales, I can possibly even make DSMLock scale in such a way that global lock-based algorithms can scale (not just wait-free ones). Extremely exciting!
Edit:
If this is extended into the runtime, I can imagine having the entire global address space chunked up into 4GB (2^32) zones, with 2^8 zones on each of 2^24 locales. With 256 4GB zones, that's 1TB of address space per locale, with 16M+ locales.
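For concreteness, the packing implied by those numbers might look like this (a sketch):
// 24 bits of locale id, 8 bits of zone id, 32 bits of offset within the zone
proc encodeZoned(localeId: uint, zoneId: uint, offset: uint): uint {
  return (localeId << 40) | (zoneId << 32) | (offset & 0xFFFFFFFF);
}
// 2^24 locales x 2^8 zones x 2^32 bytes fills the 64-bit word;
// 256 zones of 4GB each gives 1TB of addressable space per locale.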