concurrencykit / ck

Concurrency primitives, safe memory reclamation mechanisms and non-blocking (including lock-free) data structures designed to aid in the research, design and implementation of high performance concurrent systems developed in C99+.
http://concurrencykit.org/
Other
2.35k stars 312 forks source link

Create an aba_t type with a family of ck_pr functions around it #170

Closed sbahra closed 1 day ago

sbahra commented 3 years ago

In several places in CK we use a pattern like this:

struct ck_stack {
    struct ck_stack_entry *head;
    char *generation CK_CC_PACKED;
} CK_CC_ALIASED;

We can instead encode this using something like:

struct ck_stack {
    ck_pr_aba_t *head;
};

Where ck_pr_aba_t is an ABA-safe counter. This vastly reduces boiler-plate for these data structures when it comes to portability and makes it easier to implement data structures for things like CHERI.

Data structures using this pattern:

ck_stack
ck_hp_fifo
ck_hp_stack
ck_fifo

If there is a better name, let's go for it.

sbahra commented 3 years ago

    struct ck_stack_entry *stack;

    stack = ck_pr_load_ptr(&target->head);
    entry->next = stack;
    ck_pr_fence_store();

    while (ck_pr_cas_ptr_value(&target->head, stack, entry, &stack) == false) {
        entry->next = stack;
        ck_pr_fence_store();
    }

ck_pr_aba_t *stack;
stack = ck_pr_ll(&target->head);
entry->next = stack;
ck_pr_fence_store();
while (ck_pr_cas_sc(&target->head, stack, entry, &stack) == false) {
    entry->next = stack;
    ck_pr_fence_store();
}

For architectures not supporting LLSC:
ck_pr_ll = {
    load(generation);
    fence_load();
    load(pointer);
}

cas_sc_value = {
    cas_ptr_2(...);
}

For architectures supporting LLSC:
ck_pr_ll = ll
cas_sc = sc + load

If we have time, adding native ll/sc to ARM, PPC, RiscV, etc...
If no time, add LL/SC to the common ck_pr.h (#ifdef CK_F_PR_LLSC) and to the Cheri port.
Update data structures using cas_2 for ABA-protection to use the new ck_pr_whatever_t.```
pkhuong commented 3 years ago

There's a lot of restrictions on HW LL/SC, and it's not obvious how to make sure that compiled code obeys them on all platforms. I think it would make more sense to use HW LL/SC to implement our software LL/SC the same way that we do on CAS_2 platforms.

sbahra commented 3 years ago

Thanks for the input. I shared that same concern. My opinion has changed.

For others following along: The motivation for this is to not rely on CAS-based speculation when we don't have to. LL/SC-based platforms lacking double-width historically did not benefit from having ABA-free operations otherwise. It is possible to implement ABA-safety using weak CAS with significant cost to fast path (a la https://pvk.ca/Blog/2015/01/13/lock-free-mutual-exclusion/) but we can't rely on the existing tagging system CAS2 uses. RDCSS seems reasonable but leaves a lot of performance on the floor for LL/SC platforms.

ISTM in practice for the relevant platforms today (looking at Power and ARM) that we can achieve the above interface reliably: 1) For the actual critical section, we can require ck_pr operations if necessary, which meets the requirements of LL/SC (the compiler barriers are needed). ck_pr may not even be necessary in practice. Branching should be fine. See https://github.com/Tr0py/lock-free-stack-arm-asm/blob/master/stack.s and https://developer.arm.com/documentation/ddi0487/latest/ 2) If there is a problem with the above approach, we can also expose instead a ck_pr family of functions for LL/SC-safe family of operations. Hopefully it will not come down to this (edit). 3) Good test coverage for this is reasonable.

@pkhuong Perhaps I'm missing something, but I don't see how we can implement the software LL/SC the same way as CAS2 if we're lacking double-width SC. Were you thinking of something along the lines of RDCSS and given the points above, is it worth it? And if all else fails, I wonder if RDCSS can just be one potential backend for ck_pr_aba_t.

pkhuong commented 3 years ago

My stance is that we can only reliably use HW LL/SC in (inline) assembly. As soon as regular compiler-generated code can go between the LL and related SC, things are liable to go wrong. If we can't find some software primitive that does what we want for LL/SC, then we should expose a couple data structure -specific helpers, implemented in assembly.

sbahra commented 3 years ago

That was what I was hoping to avoid in the past (and situation gets worse with CHERI). Will explore the path above (architectures are a lot more lax these days) and see what limitations we hit. It may be reasonable to implement the software LL/SC only for CHERI with the hardware extensions it provides. Worse comes to worst, falling back to RDCSS would work for the use-case.

@qwattash It'll be useful if we can review the CHERI port before this work is started (we can disable the structures requiring CAS2 initially and merge the port into master). In particular, it may be possible to mitigate the need for the LL/SC-based ck_pr_aba_t using CHERI's extensions.

pkhuong commented 3 years ago

ARM has double-word LL/SC with ldxp/stxp.

sbahra commented 3 years ago

Yes, implemented in https://github.com/concurrencykit/ck/blob/master/include/gcc/aarch64/ck_pr_llsc.h but in general interested in a solution that'll work nicely with CHERI's tagging, Power and Risc V.

qwattash commented 3 years ago

I think the issues with CHERI will be incrementally tackled. Currently what I have done is strictly related to what is used by the FreeBSD kernel, so many features are left unused, and I have just some quick fix in the generic ck_pr.h that replaces __sync_* with __atomic_* that our compiler for MIPS and RISC-V understands, this works around everything CheriBSD uses. That said it would be nice to have ck have full support for RISC-V (MIPS will soon be discontinued anyway). There are two sets of changes that I found so far in arch-independent code:

  1. ABA typing that would be nicer to hide behind a machine-dependent abstraction
  2. Sub-object bounds attributes to let the compiler know that it should not set tight bounds on the various *_ENTRY() structures (otherwise we get a capability that can not be used to access the container structure).

The RISC-V CHERI implementation has LR/SC and there are no extra restrictions that I'm aware of, although there are no load/store pair instructions (as in plain RISC-V). I expect it to differ from plain RISC-V only in minor assembly instructions and register naming. I started having a look at the ABA-safe pointer, I'll read more about the various things you have linked as I'm not entirely familiar with all the issues. I can probably try to isolate into a fork of CK the changes I have, so we can all have a look more easily.

sbahra commented 3 years ago

Thanks for the update @qwattash

sbahrafreebsd commented 3 years ago

@qwattash Let me know if you would like to catch up on this, happy to help or discuss.

qwattash commented 3 years ago

@sbahrafreebsd Sorry for the lack of progress, I have been busy in the past months working at other projects, a side-effect of this has been that I have a patch to support the experimental Morello extensions to aarch64. Will come back to this ASAP.

sbahra commented 3 years ago

Ok, thanks!