Loading of global epoch is questionably early in ck_epoch_begin

The epoch is loaded into a record's local epoch not-after the record is marked as active, so, for example, ck_epoch_poll may ignore an thread that is about to enter an epoch-protected section with an old epoch.

More concretely, imagine a sequentially consistent execution (so we can reason in terms of reorderings), which is more strict than the model concurrencykit is actually using.

thread A calls ck_epoch_begin, loads the global epoch G https://github.com/concurrencykit/ck/blob/3679c1fb8bc/include/ck_epoch.h#L96 and stops executing
thread B calls ck_epoch_poll several times:
- ck_epoch_scan is called https://github.com/concurrencykit/ck/blob/3679c1fb8bc/src/ck_epoch.c#L219
- ck_epoch_scan ignores thread A (since A's active flag hasn't been set yet) https://github.com/concurrencykit/ck/blob/3679c1fb8bc/src/ck_epoch.c#L247
- the global epoch is incremented https://github.com/concurrencykit/ck/blob/3679c1fb8bc/src/ck_epoch.c#L431
thread A is resumed, finally setting its active flag, but pinned to an epoch many generations old https://github.com/concurrencykit/ck/blob/3679c1fb8bc/include/ck_epoch.h#L109

This results in thread A and thread B executing concurrently while arbitrarily many generations apart, but a distance of 1 (or 2) generations is an assumption for correctness in the academic descriptions of the scheme that I have seen. I'm unfamiliar with the details of CK's variant, so this may be OK, but it seems to me that it needs more justification.

I believe moving the epoch manipulations to be after setting the active flag and fencing should defend against the failure mode described above, if it is indeed a problem.

(This possible problem was noticed and pointed out to me by @aturon.)

Hey @huonw and @aturon,

First off, thanks much for reviewing the ck_epoch code. However, I believe it is correct. It's fine to be many generations old as long as you don't actually physically destroy objects in this situation or tick the epoch (unless that snapshot is from an older grace period), which is the central concept. The academic implementations have readers tick over the epoch, where following this constraint to the letter matters since they might otherwise be deferring to stale limbo lists (this also implies significant serialization penalty on some write paths). All epoch ticks here occur outside of protected sections and on epoch snapshots with respect to the individual writers.

The point of the fence is ordering with respect to the active flag and not the epoch counter. The epoch counter will be either observably old or has passed a grace period (wrap-around). As for the limbo lists, remember we only garbage collect a limbo list as a result of a successful tick (from the calling thread or another) with respect to the callers observation of the global epoch. Also, the limbo lists are local to every caller of ck_epoch_call/ck_epoch_poll. I think ck_epoch_poll is way too conservative but I'm not confident in relaxing some of the constraints yet (primarily because once I have time to revisit it, there is a better and more-finely grained interface to implement poll).

Let's first tackle ck_epoch_begin ordering semantics. The Concurrency Kit implementation does not apply modulo for the epoch distance calculation and relies on global ordering of memory operations with respect to the epoch tick, this allows for much better amortization across writers trying to detect a grace period or tick (fewer write activity). The memory fence is really about ordering protected sections with respect to setting the active flag and nothing to do with the epoch counter. The epoch counter counter observed is either current, observably old or has passed a grace period.

For example, let's take the execution history you mention where we read an old epoch and stop executing before setting the active flag. Imagine now that it is some handful of generations old. This would be detectable unless the epoch was ticked UINT_MAX generations. For example, global epoch might be 12 but our snapshot was from 4. This doesn't break correctness, because protected section's load operation is either from epoch 4 or current (up to previous generation). Eitherway the epoch counter will not tick because the processor doing the epoch scan will also either see 4 or something current. What about overflow in the tick? If it was ticked UINT_MAX generations, we know that at least a grace period has passed. Since we have a full fence and global memory ordering, we also know that the hazardous references (contents of section between begin and end) would occur with respect to either the current global epoch or the previous, so the epoch invariant is maintained. Again, if that isn't the case, then the epoch counter would not tick to begin with.

Alright, what about ck_epoch_poll? Things get a little bit hairy here because we are applying the modulo semantics. There is a full memory fence up top. Any updates on the writer's side (remember, you have per-writer limbo lists, not global limbo lists) are visible. If all threads have active = 0, then we know that any "stale" epoch (arbitrarily many generations apart) is irrelevant because they haven't hit their fence yet with respect to the fence in ck_epoch_poll (similarly in ck_epoch_synchronize).

We know that any wrap-around for a position in the limbo list represents a grace period with respect to the contents of the limbo list. Given the epoch snapshot of the caller of ck_epoch_poll, we can apply this same invariant to the epoch that we incremented to (but not incremented from, unless it was mutated due to a successful tick which also indicates a grace period with respect to the caller and limbo list offset). This is all assuming that ck_epoch_call defers objects to the same epoch as ck_epoch_poll. What if that isn't the case with respect to readers? For example, I might defer to 2 as a writer but a reader is at "2" (limbo list position) from a long time ago. In this That doesn't matter either because we only collect a limbo list if we have observed an actual tick occur and we always only collect to the tick we mutated to (which implies at least e + 2).

But is there a reason not to move the update of the record past the fence and after active? If so, it would likely be some negligible performance reason (gain / loss depending on protected section).

I would like to note that I think is ck_epoch_poll is utter-shit as far as incremental reclamation semantics, it is best-effort, too conservative and more coarse-grained than I would like. There's definitely room for improvement here and any efforts to take a stab at this are welcome. I think simply adding more limbo lists might solve a lot of the issues along with reclaiming up to modulus e - 2. We use 4 here primarily for performance (modulus on fast path of call sucks when deleting objects in bulk).

Does this clarify things for you? I'll be around #rust-libs for a few in case you want to chat about it.

concurrencykit / ck

Loading of global epoch is questionably early in ck_epoch_begin #63