LouisJenkinsCS / Distributed-Data-Structures

[GSoC] Distributed Data Structures - Collections Framework for Chapel language
https://summerofcode.withgoogle.com/archive/2017/projects/6530769430249472/
BSD 3-Clause "New" or "Revised" License

Global Atomic Operations on Multi-word Data #15

Open LouisJenkinsCS opened 7 years ago

LouisJenkinsCS commented 7 years ago

@e-kayrakli @mppf

This is of importance to both the project and to Chapel as a whole (or so I believe). I believe that this early prototype of the Global Descriptor Table (GDT), formerly referred to as GlobalAtomicObject, demonstrates the potential power of global atomics. There are a few key findings here that overturn misconceptions I've had, and been told, since the beginning...

Communication Is NOT Bad

Communication, even if it is per-operation, is not bad, but it's not good either. The true bottleneck is contention. I've tested time and time again, and each time, an algorithm containing some contention-causing operation, one which may cause an unbounded number of failed operations and hence an unbounded number of communications, has failed horribly in terms of performance. For example, an operation like this is bad...

while true {
  var _x = x.read();
  var _y = y.read();
  // Retry until the CAS succeeds; note the CAS is performed on the atomic x
  // itself, using the previously read value _x as the expected value.
  if _x < _y && x.compareExchangeStrong(_x, _x + 1) {
    break;
  }
}

This will cause an unbounded number of communications. Reading x and y is relatively expensive, and the CAS operation not only causes remote contention, but also forces both x and y to be read again on every failure, which causes performance to drop.

Now, algorithms which are guaranteed to succeed are also guaranteed to scale very well; that is, only wait-free algorithms see real scalability. Code that uses fetchAdd and exchange works wonderfully, which gives me hope that a global data structure with very loose guarantees is possible and useful.
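For contrast, here is a minimal sketch of the kind of operation that does scale: wait-free network atomics that succeed on the first attempt, so each operation costs exactly one communication.

var counter: atomic int;

// Every operation below succeeds on its first attempt: contention can add
// latency, but it never triggers retries or extra rounds of communication.
coforall loc in Locales do on loc {
  counter.fetchAdd(1);        // wait-free increment
  counter.exchange(here.id);  // wait-free swap
}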

Global Descriptor Table

Next is the new and novel idea of the GDT, the Global Descriptor Table. A 64-bit word is used in place of a 128-bit wide pointer, allowing us to take advantage of the benefits of network atomics. In essence, we encode the locale number in the upper 32 bits and the actual index in the lower 32 bits. Currently an array is used, but in reality (perhaps with runtime support, if not already available) it's possible that a descriptor could be used directly, like a normal pointer. Currently there is a large amount of overhead in needing to keep a bitmap of usable memory, and the table cannot be resized without synchronizing all accesses to it (as Chapel domains and arrays cannot be resized in a lock-free way).
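As a sketch of the descriptor encoding itself (hypothetical helper names, not the prototype's actual code):

// Pack (locale id, table index) into one 64-bit descriptor: the upper 32 bits
// hold the locale number, the lower 32 bits hold the index into its table.
inline proc encodeDescr(localeId : uint, idx : uint) : uint {
  return (localeId << 32) | (idx & 0xFFFFFFFF);
}

// Unpack a descriptor back into (locale id, table index).
inline proc decodeDescr(descr : uint) : (uint, uint) {
  return (descr >> 32, descr & 0xFFFFFFFF);
}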

Currently, it has been tested with simple exchange operations on class instances remotely across all locales, versus needing to acquire a sync variable to do the same. As stated above, a compareExchangeStrong kills performance, but that has nothing to do with the GDT, only with network atomic contention, so the benchmark is kept simple. It works and it scales. The below graph shows time to complete the same number of operations (100,000 in this case). It shows the overall runtime as the average of 4 trials (discarding the first warm-up trial): while sync grows at a near exponential rate, GDT remains linear.

[graph: runtime of 100,000 operations, sync vs. GDT]

Now, to get a better look at it, here are the same results in Operations per Second.

[graph: operations per second, sync vs. GDT]

Implications

I believe this is some very valuable information, which is why I include Michael as well. Not only is the implementation very immature (there are so many optimizations that can be made that there's no telling how much more this can scale), it also surpasses the only existing way to perform atomics on global class instances. As well, this opens up some very unconventional means of concurrency, the first being the DSMLock (WIP) that is based on DSM-Synch (Distributed Shared Memory Combined Synchronization) from the publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). As I can confirm that this scales, I can possibly even make DSMLock scale in such a way that global lock-based algorithms scale (not just wait-free ones). Extremely exciting!

Edit:

If this is extended for the runtime, I can imagine having the entire global address space chunked up into 4GB (2^32) zones, with 2^8 zones on each of 2^24 locales. With 256 zones of 4GB each, that's 1TB of address space per locale, across 16M+ locales.

mppf commented 7 years ago

Re "Communication is not bad", I think your example was interesting but I'm not sure I'd describe it that way. I mean, you're saying that an unbounded amount of communication under contention kills performance. And it's probably worse to have unbounded number of compare-exchange when communication is involved than it is locally, because the latency on the operation is higher (so it might be more likely another operation changes the value).

My main question here is - what are these bit-maps of all memory you mentioned? Nothing about the descriptor table idea implies to me that we need bitmaps... but perhaps I have not thought through some issue.

LouisJenkinsCS commented 7 years ago

My main question here is - what are these bit-maps of all memory you mentioned? Nothing about the descriptor table idea implies to me that we need bitmaps... but perhaps I have not thought through some issue.

The bitmaps are mainly to keep track of which descriptors are in use and which are not. The descriptors themselves do not refer to the actual address space, but to some index into an array on some locale. The bitmap provides a low-overhead (but essential) way of keeping track of which indices can be used/reused.

Bitmaps may not be needed if, instead of Chapel arrays, we use raw pointers to memory like c_malloc (but this is a prototype, after all). Currently the user must allocate the data first using new and then 'register' it (i.e., add it to the array), which will return the descriptor (i.e., the index the data was added to). Ideally, allocating large chunks of memory and implementing a basic memory manager would provide much better performance (but takes more time to implement).

LouisJenkinsCS commented 7 years ago

Code in question:

GDT and Bitmap.

Note that they are very early (but promising) implementations.

mppf commented 7 years ago

I would expect a pattern like this:

class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}

I would imagine that would mean you would need to perform this register operation on the first atomic operation for that class.

LouisJenkinsCS commented 7 years ago

Truth be told, under more ideal cases it would be better if a and b could be created by some other object and used as such...

class C { ... }
var gdt = new GDT(C);
proc test() {
   // Maybe have parameters passed be used to reflectively call the constructor of C?
   var (idxA, a) = gdt.create(1);
   var (idxB, b) = gdt.create(2);
   var x: atomic uint;
   var result:bool;
   x.write(idxA);
   result = x.compareExchange(idxA, idxB);
   assert(result);
   assert(gdt.read(idxB) == b);
}

Currently it's ugly, but either the runtime or compiler will need to keep track of the actual index of the object, or the user has to. I'm not sure if you can hash the actual pointer address to an index, but maybe something similar can be done to remove the need to use the index and support just using the raw object. I can experiment with this as well.

mppf commented 7 years ago

Right, it's clear that would be more convenient to implement, but I'm not so sure I'd consider it a full solution to the problem. After all, we are trying to make a productive parallel language, so we want to minimize concerns for our users.

Anyway, as far as implementing goes, yeah, keep going & experiment with what it would take to get my example pattern to work. Thanks!

LouisJenkinsCS commented 7 years ago

@mppf @e-kayrakli

Here is a document detailing how I plan to implement global atomics in a fashion that provides reasonable semantics (i.e., no need to carry descriptors, a wide-pointer compression strategy, global atomics, etc.)

I wish to start on it this weekend (afterwards I need to satisfy Engin's requirements for the Queue), so by then I should know whether it works or not. As well, the FCHLock (Flat Combining Hierarchical Lock) determines whether I can make a global priority queue with reasonable performance, which can also be affected by this. If any advice or feedback could be given on it by Friday night, I'd appreciate it.

LouisJenkinsCS commented 7 years ago

Also, @mppf, I believe it was mentioned before that Linux only allows a 47-bit virtual address space, but I can't 100% verify that. If that is the case, as a work-around I can always just shave off the top 17 bits for the locale id (which only allows 131K locales, but it could be an even more temporary but immediate solution, taking literally next to no effort to do). I just can't make assumptions without knowing how Chapel handles things currently. gbt1 mentioned that the network tops out at 128K locales, but I can see that increasing easily within a few years (which my solution won't work for). However, if both are 'fine for now' solutions, then it's sufficient to just have the top 17 bits be the locale id and the lower 47 bits be the virtual address.

Edit:

Turns out, as seen here, that my solution is rather overly complicated: "Most operating systems and applications will not need such a large address space for the foreseeable future, so implementing such wide virtual addresses would simply increase the complexity and cost of address translation with no real benefit." Absolutely crazy... so I can actually just compress it as I see fit. Or rather, my solution isn't overly complicated, but it is one that is resistant to change.

mppf commented 7 years ago

I think it's important that we have a strategy for my above Chapel code example to work, even if it's not the first thing we implement.

I don't understand why you need a "Zone" strategy at all. Let's back up a bit and talk through the problems that really need to be solved here.

I think what I was imagining was: a) wide pointer compression into 64-bits b) a fall-back strategy that works differently (e.g. by referring to an allocated descriptor index in a table of descriptors).

See https://github.com/chapel-lang/chapel/blob/master/runtime/include/chpl-wide-ptr-fns.h for some existing code that does wide pointer compression, but that wouldn't work on very large systems. (I.e. I'm asserting (a) is already implemented and you could steal/reuse that implementation. ).

Anyway, for (b), could we propagate information about a mapping from index to node in a dynamic way, only when needed? For example, here is an "obvious" strategy for (b) - that is, a straw-man:

For terminology, let's call the thing we actually store in an atomic class variable an "atomic instance descriptor".

How would that work with my example?

class C { ... }
proc test() {
   var a = new C(1);  // works as it does now
   var b = new C(2); // as it does now
   var x: atomic C; // atomic C is 64 bits in a format we are inventing: "atomic instance descriptor".
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   // compareExchange will:
   //  1) Convert a and b to atomic instance descriptors (possibly by compression,
   //      possibly by searching in or updating some data structures)
   //  2) Perform the 64-bit atomic compareExchange operation on the descriptors
   assert(result);
   assert(x.read() == b);
   // read() will:
   // 1) atomically read a 64-bit atomic instance descriptor
   // 2) unpack the atomic instance descriptor into a 128-bit wide pointer
   //      (possibly by reading it directly, possibly with some data structure).
   delete a;
   // delete a could remove the atomic instance descriptor for a from some data structure
   delete b;
   // delete b could remove the atomic instance descriptor for b from some data structure
}

So, for the example to be efficient, we need reasonable ways to do: i) go from a wide address of a class instance to an atomic instance descriptor, and ii) go from an atomic instance descriptor to an address.

We could conceivably store the atomic instance descriptor in the class instance, once it is computed. We're making the compiler/language after all, so we could add a 64-bit element to all class instances. That might solve (i). Or maybe there is a better solution.

For (ii), we could literally use a global array as I was describing above. It could even be block distributed... Or, maybe the best solution here uses a limited number of atomic descriptors per locale. So that the format would be 32-bits of locale number/node id, and then a 32-bit offset into that locale's individual table. That's not really that different from using a Block distributed array, but it might add some flexibility. In particular, you could imagine using a skip list (or something) instead of an array, to make inserting/freeing more efficient.

Or, what if these arrays were per-pthread, so that there isn't even any locking overhead in managing the data structures? Whatever pthread is doing (i) would update its table.

Is it a significant limitation to only allow 2^32 atomic instance descriptors at a given time? I'm not sure.

Related ideas:

Even though these are related, I don't think it's a good idea to tie this effort to those other components, unless we really really need to.

LouisJenkinsCS commented 7 years ago

I don't understand why you need a "Zone" strategy at all.

I believe I should reiterate the point of Zones. Zones make three assumptions in order to compress the most significant 32 bits of the addr portion of a wide pointer...

1) The Virtual Memory Manager allocates virtual addresses monotonically rather than randomly, decreasing the number of unique most significant 32 bits.
2) Chapel's memory manager reuses memory, further reducing the number of unique most significant 32 bits.
3) The number of unique most significant 32 bits does not exceed what the bits reserved for tracking them can represent.

Assuming all locales cache each other's zone base offsets and their respective mappings, then we can easily do this without intervention from the runtime or compiler. Note again, this was from the perspective of using the GDT as a module, but in your example it can still work, and if anything it can do so in a much better way. Assuming instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.

To skip to the good part... We'll assume that a.addr and a.localeId refer to the address and localeId respectively...

  x.write(a);
  // write will:
  //  1) Descriptor becomes (a.localeId << 40) | zoneId(a) << 32 | a.addr & 0xFFFFFFFF
  //  2) This is written to some atomic uint field.
  // Note: zoneId performs the 'magic' in terms of compression... that's certainly possible but a WIP
  result = x.compareExchange(a, b);
  // compareExchange will:
  //  1) Convert a and b to atomic instance descriptors (possibly by compression,
  //      possibly by searching in or updating some data structures)
  //  2) Perform the 64-bit atomic compareExchange operation on the descriptors
  assert(result);
  assert(x.read() == b);
  // read() will:
  // 1) atomically read a 64-bit atomic instance descriptor
  // 2) Unpacks by doing the following:
  //      var descr = ...;
  //      var localeId = descr >> 40;
  //      var addr = (zone((descr >> 32) & 0xFF)) | descr & 0xFFFFFFFF;
  //      return ...; /* Handle conversion back to wide pointer with localeId and addr */
  //    Note: zone performs the 'magic' in terms of decompression... while it is a WIP, what it does
  //          is return a 64-bit base that the offset (lower 32 bits) is added to.

The above again makes a relatively dangerous assumption: only 256 unique most-significant-32-bit patterns are used. Truth be told, it could potentially be improved by changing how many bits are kept for the addr part and how many are used for the zoneId. I just needed to make this case apparent before it's dismissed, as I still think it could work (as you said, the language can be changed to provide certain guarantees).

a) wide pointer compression into 64-bits

It's good to hear that it's been done before; that definitely makes my job easier, and it makes sense that someone else on the development team thought of it first.

b) a fall-back strategy that works differently (e.g. by referring to an allocated descriptor index in a table of descriptors).

creating a descriptor simply adds a class instance pointer to a free slot in the array & returns its index

This is definitely viable if we do what you said...

We could conceivably store the atomic instance descriptor in the class instance, once it is computed.

This would be acceptable. If we had each type keep track of its descriptor, and each type have a specific GDT, it would also work out well. This allows the current approach of keeping a bitmap, allows me to incrementally grow the GDT by allocating and chaining another array (twice the original size), and allows reads of currently existing descriptors to be free of synchronization (see the sketch below). This may be the easiest way, as well as an acceptable proof-of-concept/prototype.
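A sketch of what that growth-by-chaining could look like (hypothetical types, not the prototype's code): since entries never move once allocated, readers can walk the chain without synchronizing with concurrent growth.

class Segment {
  var size : int;
  var data : [0..#size] object;
  var next : Segment;  // next segment, allocated at twice this size when full
}

// Translate a stable global index into the owning segment and local offset.
// Existing entries are never relocated, so reads need no synchronization.
proc lookup(head : Segment, idx : int) : object {
  var seg = head;
  var base = 0;
  while idx >= base + seg.size {
    base += seg.size;
    seg = seg.next;
  }
  return seg.data[idx - base];
}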

using a skip list (or something) instead of an array, to make inserting/freeing more efficient.

That makes any remote operation (such as a read from another node) too expensive, as it too would need to traverse the nodes in the list to find the entry.

Or, what if these arrays were per-pthread, so that there isn't even any locking overhead in managing the data structures?

That's an interesting one... would this also allow certain NUMA-friendly/aware optimizations? Perhaps you could just have one for each NUMA node, if that's the case. However, having one per pthread, especially on hyperthreaded machines with tons of processors per node and tons of nodes, could be an issue: having too many GDT structures is a huge space overhead. Per NUMA node should be okay, though, since each has its own memory anyway, right?

mppf commented 7 years ago

Assuming instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.

I can't really make sense of that one...

using a skip list (or something) instead of an array, to make inserting/freeing more efficient.

Makes any remote operation (such as a read on another node) to become too expensive, as they too need to traverse the nodes in the list to find it.

Which operations need to access this data structure, and how do they access it? myAtomicC.read() goes from index to class instance wide pointer. That basically just amounts to accessing the i'th element of the array/whatever. Different data structures we could choose will represent different tradeoffs between the speed of the various operations.

I don't object to the zone idea exactly, I just think it needs to fit into the framework I was describing & should be evaluated among some of these other ideas.

LouisJenkinsCS commented 7 years ago

Assuming instead of just having the atomic C be an 'atomic instance descriptor', have it just be a GDT.

I can't really make sense of that one...

That's my fault, I mean that atomic C becomes some other object that keeps track of both the GDT (Global Descriptor Table/Global Array of class instances) and the current descriptor set.

That basically just amounts to accessing the i'th element of the array/whatever.

Now that's another thing that's my fault. Truthfully, I've never implemented a skip list before so when I thought 'list' I thought 'linked list', not that it could be array based. I rescind what I said, a skip list seems to be the ticket then.

I don't object to the zone idea exactly, I just think it needs to fit into the framework I was describing & should be evaluated among some of these other ideas.

Gotcha. The thing is, when it comes to the runtime or compiler, I have no idea what is and isn't possible. As well, I've never developed applications for HPC clusters (and this is my first time delving into distributed computing), so I don't have any input to give on whether something is enough or what is expected of a sample user program. However, to give what little input I can give now...

Or, we start with 2^32 max and expect to extend that in the future if the limit becomes a problem.

This is one thing I can agree with. I honestly can't say whether or not it's good enough even in the present, but that's all I can assume.

A big machine could allocate that many class instances. An allocated class instance is going to be 32 bytes at a minimum, in practice, as far as I understand. (16 byte aligned but storing >= 1 byte). So that's 128 GiB of storage for the classes themselves, which is a lot, but might even be possible on 1 node.

I honestly don't know, but my opinion on it would be that if a node exceeds what is set as the maximum threshold, it should just be a runtime error, and that the user should try another work-around. Of course, a fallback for this could be to just use the hwlock solution if the user passes a certain flag, --atomics=hwlock or something? Again, I'm not even sure what the standard way of handling this in Chapel is.

One might investigate how many atomic instance descriptors would be used in a sample application using atomic class instances.

This could be doable for my project by just using it in the queue. If a data structure can't exceed that limit, then I doubt anything else can.

LouisJenkinsCS commented 7 years ago

As well, to reiterate, the best thing I can potentially do is to implement it as an actual module, and then analyze how it can be integrated into the language itself. Currently, a way to hash pointers to actual indexes is needed, and the only way to prove a solution works is to build an early prototype for it (at least IMO). The solution I agree with is having fields in class instances for their descriptors, each unique to an atomic variable (definitely the most straightforward). That's impossible from a module, but the 'zone' approach might work. At least after the module is established/developed, it should make it 10x easier to add this to the language itself and/or make other modifications to it.

mppf commented 7 years ago

Right, but when you're doing such a prototype implementation, it's important to know what a plausible API for the compiler to use will be.

LouisJenkinsCS commented 7 years ago

I can think of an adaptive 4-phase strategy...

NumLocales ≤ 2^17

Normal compression of wide pointers (17 bits for the locale, 47 bits for the offset, based on the assumption that only 47 bits of virtual address are ever used), which allows 131K locales and the entire usable address space. This should be sufficient and causes no performance loss whatsoever, as barely any translation is needed (see the sketch after this list). Note that I don't leave a bit for marking/tagging, but one can be added; if you need to pack it into a 64-bit word, that changes the condition to NumLocales ≤ 2^16.

NumLocales < 2^32

We use the Zone compression strategy. The more nodes you have, the fewer bits you have for the zone (and if you exceed the zone bits, we can adapt into the third phase). In this case, if you have exactly 2^18 NumLocales, then you can reserve the rest for zones, meaning it's more likely that you can continue using this strategy. As mentioned before, zones just map the most significant 32 bits of an address to a zone id as they are encountered, which incurs only minor performance losses for caching another locale's zone mappings.

NumLocales = 2^32 or ZoneId bits exceeded...

We use the global descriptor table strategy. Any allocations that exceed the zone bits will instead be added here. There is a significant penalty in terms of managing everything, but what else can you do?

NumLocales > 2^32

Just fall back to using a sync variable or some other lock. No amount of compression will help in that case.
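To make the adaptive part concrete, here is a sketch of how the phase selection might be wired up (compressWidePtr and registerDescriptor are hypothetical helpers; only phases 1 and 3 are shown):

// Choose the encoding once, based on the size of the system.
proc encode(obj : object) : uint {
  if numLocales <= (1 << 17) {
    // Phase 1: plain compression -- 17 bits of locale id, 47 bits of address.
    return compressWidePtr(obj);
  } else {
    // Phase 3: fall back to allocating an entry in a descriptor table.
    return registerDescriptor(obj);
  }
}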

Would you agree to something like this?

Edit:

Added a 4th phase

LouisJenkinsCS commented 7 years ago

Okay, now I've found out that I can't really create wide pointers from Chapel... I can't call any of the runtime functions that expect or return wide_ptr_t because there is no equivalent type in Chapel (and the compiler throws an error to avoid type-punning, so using a record of equivalent size won't work either). As well, any change I make to the compiler would make my proposal at the end null and void, so I can't go that route at all... I guess, really, even compression won't work; I'll just have to skip straight to descriptor tables. So I'll try to implement concurrent skip lists, handle allocation in that manner, and just see if I can hash objects... @e-kayrakli I can do that, right? Hash objects to a unique uint or something comparable?

e-kayrakli commented 7 years ago

Hash objects to a unique uint or something comparable?

I am not sure if I am understanding it right... but yes, you can cast references to uint or something similar using compiler primitives. And once you have it cast, then hash it as if it is just an integer and not a reference. If it is a wide reference (which I think is the actual question you are asking), I think you can have a hash function that takes both the locale id and the address as input. (There are wide_get_* primitives that can help you get the different fields of a wide pointer easily, and probably much faster than e.g. x.locale.id.)
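For instance, a sketch of such a hash function, using the same primitives as Michael's example below (the mixing steps are a standard splitmix64-style finalizer, nothing Chapel-specific):

// Hash a wide reference by folding the node id into the address bits.
proc hashWideRef(obj : object) : uint {
  var node = chpl_nodeFromLocaleID(__primitive("_wide_get_locale", obj)) : uint;
  var addr = __primitive("cast", uint, __primitive("_wide_get_addr", obj));
  var h = addr ^ (node << 48);
  h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9;
  h = (h ^ (h >> 27)) * 0x94d049bb133111eb;
  return h ^ (h >> 31);
}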

mppf commented 7 years ago

Okay, now I found out that I can't really create Wide Pointers from Chapel... I can't call any of the runtime functions that expect or return wide_ptr_t because there is no equivalent type in Chapel

You should be able to call the functions in runtime/include/chpl-wide-ptr-fns.h, but you'll have to use the foreign function interface to do so. Including telling Chapel that wide_ptr_t is an extern type.

Additionally, there are compiler primitives to get the locale or address portion of a wide pointer. _wide_get_locale, _wide_get_node, _wide_get_addr. See for example

test/multilocale/ferguson/get-addr-node.chpl

I would expect that these compiler primitives are all that you need for an initial attempt.

I don't understand why compiler adjustments would make your proposal useless.

LouisJenkinsCS commented 7 years ago

I've done it! I have done as you asked, Michael: created the descriptor table approach, with an MWE similar to the one you desired. I even embedded everything within a single GlobalAtomicObject which manages everything; as a record it provides automatic memory management and cleans itself up nicely.

  class C { var c : int; }
  var a = new C(1);
  var b = new C(2);
  var x = new GlobalAtomicObject(C); // atomic C
  var result : bool;

  x.write(a);
  result = x.compareExchange(a, b);
  assert(result);
  assert(x.read() == b);

  // Note that you can call 'delete' on the object while having it still be present
  // in the descriptor table. This may be okay for most cases where memory is not
  // an issue since reused addresses will hash to the same node and that node will
  // already point to the valid object the user is attempting to insert. However if
  // memory is an issue, one can call '_delete' to ensure that the node itself is removed
  // and can be recycled for later use.
  delete a;
  delete b;

  // Is Safe because we only use the pointer itself.
  x._delete(a);
  x._delete(b);

  // As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
  // get deleted...

I have also tested it in a multi-locale environment, and can prove that it works (although the test doesn't have much coverage yet, aside from ensuring that the objects returned are consistent).

So in essence, I have proof of correctness for this proof of concept. Next up is performance, for which I need to measure each operation (compared to the sync equivalent as a baseline and atomics on 64-bit values as a ceiling). I'll make a follow-up post once I have such results.

LouisJenkinsCS commented 7 years ago

Global Atomic Object

I present the GlobalAtomicObject, the first of its kind to allow atomic operations on class instances (128-bit wide pointers)! GlobalAtomicObject currently suffers from poor optimization due to lack of time, but it sure is promising! I compare its performance against 'Sync Var', the all-mighty sync var, which currently gets a handicap by just acquiring the lock, performing some arbitrary cheap operation, then immediately releasing it. This is the 'baseline' which we must beat, or else the implementation is useless. The next is 'Atomic' on a 64-bit value, which is the 'ceiling': it shows how much performance we can get from normal pointer compression versus keeping tables like this. There are two benchmarks: 'Single', which is a single atomic operation, and 'LLSC', which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like this...

var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));

There is no retrying, just a one-and-done atomic operation (which I count as one). For 'Global Atomic', we use atomic operations similar to 'Atomic', although a few things need to be considered. For 'Single', it performs a write on data in local memory (a read is not interesting, since it is a basic lookup in a locale's table), which shows the ideal case. 'LLSC', on the other hand, can operate on remote data, meaning that each time, it is possible that during compression of the data we may need to jump to other locales. It shows a more average case, but not the worst case, where both arguments to compareExchangeStrong are remote. Currently, 'LLSC' for GlobalAtomicObject performs 3 atomic operations: the read, which is a simple lookup, and the compression of both variables passed to compareExchangeStrong, one of which is local and the other remote.

It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed number of nodes, it is still manageable and better than a sync variable.

Performance

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.

Time

[graph: runtime, Sync Var vs. Atomic vs. Global Atomic]

This shows the runtime of the program. As can be seen, raw atomic operations grossly beat everything else, although the atomic LL/SC lags enough to demonstrate the power of global atomics. A single global atomic, while magnitudes slower than a single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if I had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), which a single Global Atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that placed contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to win.

As great as GlobalAtomicObject is in being novel and the first solution, it isn't the best solution. Any compression scheme that fits within 64 bits without the overhead of extra communication would get the phenomenal performance of a single atomic operation.

Operations Per Second

[graph: operations per second, Sync Var vs. Global Atomic]

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they overshadowed everything else to the point that the other performance differences wouldn't be noticeable. As can be seen, however, both a single global atomic operation, and even the LL/SC combo operation with a potential remote reference (meaning an added on statement), beat a normal sync variable. In every case, GlobalAtomicObject wins outright, but the performance penalties from remote references are very apparent and should be avoided at all costs.

Improvements

Some improvements to the GlobalAtomicObject that can be made... I can't do much more than the first one (and even then, not without some help); the rest would be too time consuming. But the groundwork has already been done, so I'm sure someone else can pick up on this (or maybe after the project is over I can come back and do it myself).

Pointer Compression for NumLocales < 2^17

This is something I still want to do, but it'd be great if I could have a good example of how to do it from Chapel. That's all I'd need to be able to start and complete it. Currently it only uses descriptor tables, which use a skip list for allocation/deallocation and a segmented memory pool that recycles nodes (and is indexed directly during a read operation).

Wait-Free/Lock-Free Descriptor Table Implementation

We use a sync variable to synchronize access locally, which is a huge bottleneck. If a wait-free or lock-free skip list were to be used, I'm certain that performance would improve by about 5 - 10x.

Low-Level C/Runtime specific optimization

From using malloc over new for recycling, to caching the descriptor in the object itself, etc. Even implementing it in C alone would provide better performance, I'm sure, especially as part of the language/runtime itself.

mppf commented 7 years ago

Cool!

The Chapel type system doesn't normally allow ptr-int casting but we can work around that with some special functions. The below example should have everything you need to get started on the pointer compression part.

proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
  return __primitive("cast", uint, ptr);
}

proc uint_to_c_void_ptr(u:uint)
{
  return __primitive("cast", c_void_ptr, u);
}

proc test(obj:object)
{
  var loc = __primitive("_wide_get_locale", obj);
  var node = chpl_nodeFromLocaleID(loc);
  var subloc = chpl_sublocFromLocaleID(loc);
  var raddr = __primitive("_wide_get_addr", obj);

  // TODO: reasonable checking for pointers/nodes
  // too big to fit into this format.
  // TODO: configurable format

  // Let's "pack" the wide pointer
  var uraddr = c_void_ptr_to_uint(raddr);
  var unode = node:uint;
  var packed:uint = 0;
  packed = uraddr & 0x0000ffffffffffff;
  packed |= unode << 48;

  writef("node %xi              raddr %016xi\n", unode, uraddr);
  writef("packed to                 %016xi\n", packed);

  // Let's "unpack" the wide pointer.
  var got_uraddr = packed & 0x0000ffffffffffff;
  var got_unode = packed >> 48;
  var got_raddr = uint_to_c_void_ptr(got_uraddr);
  writef("unpacked to node %xi  raddr %016xi\n", got_unode, got_uraddr);

  // check pointer value matches.
  assert(raddr == got_raddr);
  // check node matches
  assert(obj.locale.id == got_unode);
}

class C {
  var x:int;
}

for loc in Locales {
  on loc {
    var c = new C(10);
    test(c);
  }
}

For the other improvements, I'd expect the wait-free/lock-free version would be a big win but that implementing it in C vs Chapel won't matter much.

mppf commented 7 years ago

You might also need a way to go from a node and a local pointer to a wide pointer. I can probably help you with that part too but I'm hoping you can make progress without it for the moment.

LouisJenkinsCS commented 7 years ago

I greatly appreciate the examples; definitely handy for handling compression and decompression. However, I really do need to be able to obtain the object, which means I do need to create the wide pointer. If there were a way to copy a dummy wide pointer by value, then modify the copied wide pointer, that could work, so long as the compiler allows it. The issue is that it throws "TODO - don't like type-punning record/union", which is a huge roadblock, because this would work if it were allowed. I'm assuming it throws an error because GCC will throw a compiler warning, and any compiler warning gets treated as a compiler error?

A simple function could be added to runtime/include/chpl-wide-ptr-fns.h to create a wide pointer given the type, localeId and addr right?

Nevermind, I realized that this was it. I managed to side-step the compiler a bit by using c_ptrTo to get the address of a wide pointer created on the stack and copying directly into it. Dangerous, but the best I can do. Full MWE below.

extern type wide_ptr_t;
extern type c_nodeid_t;
extern proc chpl_return_wide_ptr_node(c_nodeid_t, c_void_ptr) : wide_ptr_t;

proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
  return __primitive("cast", uint, ptr);
}

proc uint_to_c_void_ptr(u:uint)
{
  return __primitive("cast", c_void_ptr, u);
}

class C {
  var x:int;
}

proc test(obj:object)
{
  var loc = __primitive("_wide_get_locale", obj);
  var node = chpl_nodeFromLocaleID(loc);
  var subloc = chpl_sublocFromLocaleID(loc);
  var raddr = __primitive("_wide_get_addr", obj);

  // TODO: reasonable checking for pointers/nodes
  // too big to fit into this format.
  // TODO: configurable format

  // Let's "pack" the wide pointer
  var uraddr = c_void_ptr_to_uint(raddr);
  var unode = node:uint;
  var packed:uint = 0;
  packed = uraddr & 0x0000ffffffffffff;
  packed |= unode << 48;

  writef("node %xi              raddr %016xi\n", unode, uraddr);
  writef("packed to                 %016xi\n", packed);

  // Let's "unpack" the wide pointer.
  var got_uraddr = packed & 0x0000ffffffffffff;
  var got_unode = packed >> 48;
  var got_raddr = uint_to_c_void_ptr(got_uraddr);
  writef("unpacked to node %xi  raddr %016xi\n", got_unode, got_uraddr);

  // check pointer value matches.
  assert(raddr == got_raddr);
  // check node matches
  assert(obj.locale.id == got_unode);

  // Start here for new stuff... we create the wide pointer from decompressed information
  var wideptr = chpl_return_wide_ptr_node(got_unode, got_raddr);
  // Note: We fool the compiler into creating a wide pointer by using an
  // 'on here' type of statement. I indexed into 'Locales' because I don't want the
  // compiler to 'optimize' it away...
  var newObj : C;
  on Locales[here.id] do newObj = new C();
  // Warning: If this is *not* a wide pointer, it will overwrite part of the stack.
  // If it is, it works. Dangerous, but only way to do so at the moment.
  c_memcpy(c_ptrTo(newObj), c_ptrTo(wideptr), 16);
  assert(obj == newObj);
}

for loc in Locales {
  on loc {
    var c = new C(10);
    test(c);
  }
}

mppf commented 7 years ago

@LouisJenkinsCS - looks like an OK workaround for now. At some point, I think that somebody (possibly me) should add the reverse to __primitive("_wide_get_addr", obj) -- namely, a primitive that creates a wide pointer of a particular type from a local void* and a locale.

e-kayrakli commented 7 years ago

a primitive that creates a wide pointer of a particular type from a local void* and a locale.

FWIW, I have implemented something like that for my work: https://github.com/e-kayrakli/chapel/blob/prefetch-master/compiler/codegen/expr.cpp#L5317

It can be used with a C pointer or wide reference (I implemented support for narrow reference as well, but it was useless for me, thus I commented it out). You can read lockOff in the code as the locale ID for a general case. I don't mind polishing it, if need be.

LouisJenkinsCS commented 7 years ago

@mppf

It's been done. Not only does it perform the compression strategy for locales that fit within 16 bits, it also maintains manageable/respectable performance for locale counts that need between 17 and 32 bits. It is as fast as atomics (in this case it was faster, but that is likely due to environmental differences such as scheduling, network traffic, etc.). I jumped straight to 32 locales because there's no point rerunning everything: it's clear that sync is inferior (and it takes a long time to finish).

32 Locales

| Sync | Atomic Single | Atomic LLSC | Global Atomic Single | Global Atomic LLSC |
|--------|----------|---------|----------|---------|
| 316.75 | 0.215319 | 6.78373 | 0.203364 | 6.90024 |

mppf commented 7 years ago

@LouisJenkinsCS - cool, let's work towards putting this in to a PR that can go onto master.

LouisJenkinsCS commented 7 years ago

Sure, I can handle something like that. Should I do the following...

1) Clone Chapel (already done)
2) Create a branch called 'global-atomics'
3) Move GlobalAtomicObject.chpl into modules/standard
   a) Requires me to condense everything into a single file, but manageable
4) Include the test in 'test/global_atomics' with a '.good' file
5) Make a PR into master

If it doesn't have to be today, I'll be able to refactor it more into something that is presentable (currently it's more of something I coded in 2 days).

e-kayrakli commented 7 years ago

@mppf -- Should this be a standard module or an internal one?

mppf commented 7 years ago

Doesn't have to be today. You'll also need to socialize the basic idea (design, proposed interface, plans for compiler support that you won't implement but maybe I will). I'd make it an internal module, given that users will eventually just use : atomic MyClass types to get to it.

edit: socialize the idea by creating an issue or PR message or email summarizing it. Send email to chapel-developers with a link or with the description.

LouisJenkinsCS commented 7 years ago

@mppf @e-kayrakli

Below is a rough draft for the PR proposal I want to make tomorrow afternoon. Let me know what you think. The code still needs to be refactored tomorrow, but I need to have a good opener I'd think.

Global Atomic Object

Introduction

In any budding language, especially one that boasts of being a productive concurrent/parallel language, atomics on all types are a necessity. Currently atomics are only supported up to 64-bit values, and because PGAS languages allow pointers into the address spaces of other nodes, a normal 64-bit pointer may no longer suffice. The introduction of wide pointers, which are 128-bit values containing both the node id and the address, is necessary but introduces a new problem: the majority of current hardware is restricted to 64-bit atomic operations, so how can we allow most platforms to atomically modify a 128-bit value?

In this proposal I suggest a solution that is not only viable but also scalable. I introduce the GlobalAtomicObject, a module which I hope can be integrated into the core of the language itself. It is a prototype, but an extremely promising one, and hopefully it can be expanded on by others. At best, I believe it should be integrated into the language; at the least, included as a module.

Details

Below, I go over some details that are present in the current implementation

Pointer Compression

This is the common case. It is certainly not a novel concept in and of itself, nor to Chapel, but it is sufficient for most cases, where the number of locales fits within 16 bits. It takes advantage of the fact that all known architectures and operating systems only use 48 bits of the virtual address space, meaning we are free to embed the locale id in the upper 16 bits. This means that for up to 65K locales, each atomic operation on a class instance is a single 64-bit atomic operation, making atomics extremely fast.
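Condensed from the packing example earlier in this thread, the whole strategy is two bit operations each way (16-bit node id, 48-bit address):

inline proc compress(node : uint, addr : uint) : uint {
  return (node << 48) | (addr & 0x0000FFFFFFFFFFFF);
}

// Returns (node id, address).
inline proc decompress(descr : uint) : (uint, uint) {
  return (descr >> 48, descr & 0x0000FFFFFFFFFFFF);
}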

Descriptor Tables

This could be a new design pattern, really. The idea, while novel in its application, is simple in theory. We maintain a table of objects on each locale, of which we use the index and locale id as a descriptor (a pseudo-pointer). When we want to atomically operate on a class instance, we first hash the class instance (its 64-bit address) and perform an insertIfNotPresent operation on the target locale, then embed the locale id in the upper 32 bits and the index into the descriptor table in the lower 32 bits. Compression is the most expensive part, as it involves doing the above every time (although see Improvements for a huge potential gain), but decompression is cheap in that it can directly index into the descriptor table.
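A sketch of both directions (hypothetical names; 'table' stands for the per-locale descriptor tables described above):

// Compression: jump to the owning locale, insert-or-find the instance in its
// table, then pack (locale id, index) into a 64-bit descriptor.
proc compressViaTable(obj : object) : uint {
  const node = obj.locale.id : uint;
  var idx : uint;
  on obj do idx = table[here.id].insertIfNotPresent(obj);
  return (node << 32) | idx;
}

// Decompression: a direct index into the owning locale's table.
proc decompressViaTable(descr : uint) : object {
  return table[(descr >> 32) : int].lookup(descr & 0xFFFFFFFF);
}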

Performance

I compare GlobalAtomicObject (using descriptor tables) against 'Sync Var', the all-mighty sync variable, which currently gets a handicap by just acquiring the lock, performing some arbitrary cheap operation, then immediately releasing it. This is the 'baseline' which we must beat, or else the implementation is useless. The next is 'Atomic' on a 64-bit value, which is the 'ceiling': it shows how much performance we can get from normal pointer compression versus keeping tables like this. There are two benchmarks: 'Single', which is a single atomic operation, and 'LLSC', which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like this...

var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));

There is no retrying, just a one-and-done atomic operation (which I count as one). For 'Global Atomic', we use atomic operations similar to 'Atomic', although a few things need to be considered. For 'Single', it performs a write on data in local memory (a read is not interesting, since it is a basic lookup in a locale's table), which shows the ideal case. 'LLSC', on the other hand, can operate on remote data, meaning that each time, it is possible that during compression of the data we may need to jump to other locales. It shows a more average case, but not the worst case, where both arguments to compareExchangeStrong are remote. Currently, 'LLSC' for GlobalAtomicObject performs 3 atomic operations: the read, which is a simple lookup, and the compression of both variables passed to compareExchangeStrong, one of which is local and the other remote.

It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed number of nodes, it is still manageable and better than a sync variable. It should also be emphasized that the runtime of GlobalAtomicObject with pointer compression is equivalent to that of Atomic.

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.

Time

[graph: runtime of the benchmark]

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly beat everything else, although the atomic LL/SC lags enough to demonstrate the power of global atomics. A single global atomic, while magnitudes slower than a single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if we had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), which a single Global Atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that placed contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.

Operations Per Second

[graph: operations per second]

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they overshadowed everything else to the point that the other performance differences wouldn't be noticeable. As can be seen, however, both a single global atomic operation, and even the LL/SC combo operation with a potential remote reference (meaning an added on statement), beat a normal sync variable. In every case, GlobalAtomicObject wins outright, but the performance penalties from remote references are very apparent and should be avoided at all costs.

Usage

Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).

class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...

GlobalAtomicObject(C) can easily become atomic C. If the language were modified, it wouldn't be too surprising to see the above become the below (which of course lacks the excessive comments), an exemplary ideal that @mppf insisted on:

class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}

Where delete can call some hook which also removes it from the table if it has not yet been reclaimed. I believe this can become a reality with little effort.

Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

Pointer Compression + Descriptor Tables

It may be possible to combine normal pointer compression for the lower locales that fit within 16 bits with descriptor tables for the rest. Combining both means better performance on average, even when there are more than 16 bits worth of locales: any locale with an id less than 2^16 gets the full power of atomics, while the ones greater than or equal to 2^16 use the much-slower-but-faster-than-sync-variables descriptors.

See next idea for how to differentiate.

Zone Compression

A theoretical model where we use a more dynamic type of compression, which attempts to use the lowest 32 bits for the virtual address and divide the upper 32 bits into zone bits and locale bits. A 'zone' is a 64-bit base offset identified by a zone id, which is allocated when a new unique upper-32-bit pattern is found. This comes with optimizations such as caching the zone mappings of other locales (i.e., the first time a unique upper 32 bits is found, check if we have it cached; if not, jump to the owning locale, add it if it isn't already there, and update our cache, etc.).

The benefit here is that we get to keep using 64-bit pointer compression, and the number of unique zones that can be kept track of depends on the number of bits needed to keep track of the locale id.

"What happens when we run out of zones?" - We use the descriptor tables again. We can just search the unique zones for the target locale first, if the zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential maximum segments for descriptor tables, but it allows a much more dynamic compression strategy.

Caching Descriptors into Objects

This is a change that isn't theoretical, but it requires changes to the language itself. The majority of the overhead of descriptor tables is hashing the object, which requires jumping to the owning locale; if descriptors were cached, then repeated accesses would be just about as fast as atomics. If they were cached, the benefits of recycling memory to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).

Concurrent-Safe Implementation

The table utilizes a simplistic skip list implementation on top of a segmented memory pool. The memory pool allows indexing via an easy translation from a 32-bit index into a segment and an index into that segment's data (sketched below), and it utilizes a simple bitmap (BigInteger). The implementation of the memory pool allows wait-free reads (hence indexing is fine), but recycling/reusing memory requires synchronization. The skip list used to manage memory also requires synchronization, and a lock-free/wait-free implementation requires proper memory management such as Hazard Pointers, which requires thread-local or task-local storage. Hence, synchronization is needed, and we use a simple local sync variable.
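For example, the index translation might look like the following (the segment size is an assumption for illustration):

param segmentBits = 16;  // 2^16 entries per segment (assumed)

// Both translations are cheap bit operations, so reads remain wait-free.
inline proc segmentOf(idx : uint(32)) return idx >> segmentBits;
inline proc offsetOf(idx : uint(32)) return idx & (((1 : uint(32)) << segmentBits) - 1);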

If someone managed to make everything concurrent, it could improve performance further by 5 - 10x at least.

Edit:

Am fixing up for official issue.

e-kayrakli commented 7 years ago

You don't have to get feedback on a PR message before opening it. Your PR reviewer will ask you to update it if they think it's necessary. Note that the PR message becomes the merge commit message, so it goes into the git history, more or less. So I think capturing any "gotchas" in the code, or your software design at a high level, is important. The "Details" part seems to do that well.

Since you already have it here: my initial thought is that this is a bit too long for a PR message. If you want to make it more concise, the "Introduction", "Performance" and "Usage" headers might go away, or could be summarized in a couple of sentences. Usage details can also be summarized in documentation in the code.

That being said, @mppf can override any comments I have here :)

mppf commented 7 years ago

Currently atomics are only supported for 64-bit values

up to 64-bit values

current hardware is restricted to 64-bit atomic operations, how do we atomically modify a 128-bit value?

some hardware supports 128-bit CAS; it's just that I don't think we want to require it.

I skimmed over the rest. My suggestion would be to make that into a GitHub issue and simply link to that issue from any PR message you make to do the implementation. It's advisable to ask for feedback on the issue before you actually open a PR.

Thanks!

LouisJenkinsCS commented 7 years ago

Decided to go ahead and create the issue describing the new construct here, but the official PR is a ways off, hopefully up by Friday at this point.

LouisJenkinsCS commented 7 years ago

PR Officially up: https://github.com/chapel-lang/chapel/pull/6717