ARMmbed / core-util

DEPRECATED: Mbed 3 utilities library

Specialization of all atomic ops #58

Closed bremoran closed 8 years ago

bremoran commented 8 years ago

atomic_incr and atomic_decr could easily be specialized when LDREX and STREX are available. Is this a good idea?

Where LDREX and STREX are available, the current call path of atomic_incr is:

bool atomic_cas(uint32_t *ptr, uint32_t *expectedCurrentValue, uint32_t desiredValue)
{
    uint32_t currentValue = __LDREXW(ptr);
    if (currentValue != *expectedCurrentValue) {
        *expectedCurrentValue = currentValue;
        __CLREX();
        return false;
    }

    return !__STREXW(desiredValue, ptr);
}

template<typename T>
T atomic_incr(T *valuePtr, T delta)
{
    T oldValue = *valuePtr;
    while (true) {
        const T newValue = oldValue + delta;
        if (atomic_cas(valuePtr, &oldValue, newValue)) {
            return newValue;
        }
    }
}

An alternative implementation is:

uint32_t atomic_incr(uint32_t *valuePtr, uint32_t delta)
{
    uint32_t newValue;
    do {
        newValue = __LDREXW(valuePtr) + delta;
    } while (__STREXW(newValue, valuePtr));
    return newValue;
}

The latter appears to be a more compact, faster, and more readable implementation, but there may be other drawbacks.
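To make the comparison concrete on a desktop target, here is a portable C++11 analogue of the two shapes using std::atomic (illustrative names, not the core-util implementation; compare_exchange_weak plays the role of atomic_cas, and fetch_add is what the direct LDREX/STREX loop compiles down to):

```cpp
#include <atomic>
#include <cstdint>

// CAS-loop shape: mirrors the template atomic_incr built on atomic_cas.
inline uint32_t cas_loop_incr(std::atomic<uint32_t> *valuePtr, uint32_t delta) {
    uint32_t oldValue = valuePtr->load();
    while (true) {
        const uint32_t newValue = oldValue + delta;
        // On failure, compare_exchange_weak reloads oldValue (like atomic_cas
        // updating *expectedCurrentValue), so we simply retry.
        if (valuePtr->compare_exchange_weak(oldValue, newValue)) {
            return newValue;
        }
    }
}

// Direct shape: the compiler emits the LDREX/STREX (or equivalent) loop itself.
inline uint32_t direct_incr(std::atomic<uint32_t> *valuePtr, uint32_t delta) {
    return valuePtr->fetch_add(delta) + delta;
}
```

Both return the post-increment value, matching the signatures in the issue; the difference is only in how much retry machinery is written by hand.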

cc @rgrover @bogdanm @0xc0170 @niklas-arm @autopulated

rgrover commented 8 years ago

I believe the specialize-to-LDREX smartness should be limited to atomic_cas(); atomic_incr() and the rest should build upon atomic_cas() so that readers and users are encouraged to re-use that API.

From: Brendan Moran Sent: 18 January 2016 16:43 To: ARMmbed/core-util Subject: [core-util] Specialization of all atomic ops (#58)

ciarmcom commented 8 years ago

ARM Internal Ref: IOTSFW-1827

bremoran commented 8 years ago

I have created a simple test application so that we can look at the cost of each implementation:

uint32_t incr(uint32_t *v, uint32_t delta);
uint32_t atomic_incr(uint32_t *v, uint32_t delta);
uint32_t util_atomic_incr(uint32_t *v, uint32_t delta);

void app_start(int, char**) {
    uint32_t a = 0;
    uint32_t b = 0;
    uint32_t c = 0;
    incr(&a,1);
    atomic_incr(&b,1);
    util_atomic_incr(&c,1ul);
    printf("%lu\r\n", a);
    printf("%lu\r\n", b);
    printf("%lu\r\n", c);
}
// This implementation is not atomic; it's just a reference for the cost of the minimum non-atomic operation.
uint32_t incr(uint32_t * v, uint32_t delta) {
    return (*v += delta);
}
// This atomic_incr is not in the mbed::util namespace
uint32_t atomic_incr(uint32_t * v, uint32_t delta) {
    uint32_t newValue;
    do {
        newValue = __LDREXW(v) + delta;
    } while (__STREXW(newValue, v));
    return newValue;
}
// This wrapper forces an out-of-line instantiation of the atomic_incr template so its generated code can be inspected.
uint32_t util_atomic_incr(uint32_t *v, uint32_t delta) {
    return mbed::util::atomic_incr(v, delta);
}

Code Generated:

/**
 * Code generated in release mode:
 * uint32_t incr(uint32_t * v, uint32_t delta) {
 *     5314:       4603            mov     r3, r0
 *     5316:       6800            ldr     r0, [r0, #0]
 *     5318:       4408            add     r0, r1
 *     531a:       6018            str     r0, [r3, #0]
 *     531c:       4770            bx      lr
 */
/**
 * Code generated in release mode:
 * uint32_t atomic_incr(uint32_t * v, uint32_t delta) {
 *     531e:       4603            mov     r3, r0
 *     5320:       e853 0f00       ldrex   r0, [r3]
 *     5324:       4408            add     r0, r1
 *     5326:       e843 0200       strex   r2, r0, [r3]
 *     532a:       2a00            cmp     r2, #0
 *     532c:       d1f8            bne.n   5320 <atomic_incr(unsigned long*, unsigned long)+0x2>
 *     532e:       4770            bx      lr
 */
/**
 * Code generated in release mode
 * 0000531c <mbed::util::atomic_incr(unsigned long*, unsigned long)>:
 * 531c:       b573            push    {r0, r1, r4, r5, r6, lr}
 * 531e:       6803            ldr     r3, [r0, #0]
 * 5320:       9301            str     r3, [sp, #4]
 * 5322:       4605            mov     r5, r0
 * 5324:       460e            mov     r6, r1
 * 5326:       a902            add     r1, sp, #8
 * 5328:       4628            mov     r0, r5
 * 532a:       f851 4d04       ldr.w   r4, [r1, #-4]!
 * 532e:       4434            add     r4, r6
 * 5330:       4622            mov     r2, r4
 * 5332:       f000 f886       bl      5442 <bool mbed::util::atomic_cas<unsigned long>(unsigned long*, unsigned long*, unsigned long)>
 *
 *     Note: this section is inserted as an indented block to show the full call path
 *     00005442 <bool mbed::util::atomic_cas<unsigned long>(unsigned long*, unsigned long*, unsigned long)>:
 *     5442:       b510            push    {r4, lr}
 *     5444:       e850 3f00       ldrex   r3, [r0]
 *     5448:       680c            ldr     r4, [r1, #0]
 *     544a:       42a3            cmp     r3, r4
 *     544c:       d004            beq.n   5458 <bool mbed::util::atomic_cas<unsigned long>(unsigned long*, unsigned long*, unsigned long)+0x16>
 *     544e:       600b            str     r3, [r1, #0]
 *     5450:       f3bf 8f2f       clrex
 *     5454:       2000            movs    r0, #0
 *     5456:       bd10            pop     {r4, pc}
 *     5458:       e840 2300       strex   r3, r2, [r0]
 *     545c:       f1d3 0001       rsbs    r0, r3, #1
 *     5460:       bf38            it      cc
 *     5462:       2000            movcc   r0, #0
 *     5464:       bd10            pop     {r4, pc}
 *
 * 5336:       2800            cmp     r0, #0
 * 5338:       d0f5            beq.n   5326 <util_atomic_incr(unsigned long*, unsigned long)+0xa>
 * 533a:       4620            mov     r0, r4
 * 533c:       b002            add     sp, #8
 * 533e:       bd70            pop     {r4, r5, r6, pc}
 */
bogdanm commented 8 years ago

What Rohit said feels right as a general rule: have a single truly atomic primitive (atomic_cas in our case) and implement the rest of atomic operations on top of it. The difference in size between the two versions (above) is also quite suggestive. In the end, I don't believe this is really needed, but if you want to go ahead and implement it, I'm fine with that (but remember to do the same with atomic_decr :) )

bremoran commented 8 years ago

I agree that it's useful to have a single truly atomic primitive that always works. I also see a case for optimized versions of particular atomic primitives where they are useful. I understand that introducing optimized versions of more atomic primitives could encourage developers of new atomic operations to start by writing the optimized version, and that this could hurt the portability of their atomics. Could we add some documentation that explains how to implement portable atomic operations, so that we can still give the right guidance while also enabling optimized atomics?

I would be happy to use atomic_incr/decr as a case study for how to implement atomics in a portable way, that still permits optimization where needed.
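A sketch of what such a case study might look like (names and macro guard are illustrative, not core-util's): a portable generic version built on the single CAS primitive, with a clearly marked slot for a target-specific specialization that callers never need to know about.

```cpp
#include <atomic>
#include <cstdint>

// Portable generic version, built on the one truly atomic primitive
// (here std::atomic's CAS standing in for mbed::util::atomic_cas).
template <typename T>
T generic_atomic_incr(std::atomic<T> *valuePtr, T delta) {
    T oldValue = valuePtr->load();
    // compare_exchange_weak reloads oldValue on failure, so retry until it sticks.
    while (!valuePtr->compare_exchange_weak(oldValue, oldValue + delta)) {
    }
    return oldValue + delta;
}

#if defined(__arm__) || defined(__thumb__)
// On targets with LDREX/STREX, a specialized overload for uint32_t could be
// provided here (the __LDREXW/__STREXW do/while loop from this issue).
// Overload resolution then picks the fast path automatically; sketch only,
// since the intrinsics require CMSIS headers.
#endif
```

The documentation point would then be: always write the generic CAS-based version first, and add a specialization only where a measured or obvious hot path justifies it.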

I care more about atomic_incr and atomic_decr because they are used in sbrk(), krbs(), and the reference counting mechanism that's going into Function. The Function use case, in particular, has ramifications for scheduling callbacks: each call to minar::postCallback() will cause, at minimum, one call to atomic_incr, and each non-periodic invocation of a callback in minar will cause, at minimum, one call to atomic_decr. Considering these use cases, atomic_incr and atomic_decr could easily live in a high-frequency call path, depending on the application. While I would normally say that profiling is necessary before considering optimization, the optimization of these APIs is trivial and will take less engineering effort than the profiling.

niklarm commented 8 years ago

:+1: for adding specialized, fast atomic_incr/decr.

bogdanm commented 8 years ago

OK, I'm sold :) Specialize ALL the things!