dpdani / cereggii

Thread synchronization utilities for Python
Apache License 2.0

Performance increase when running with the GIL #12

Closed: dpdani closed this issue 8 months ago

dpdani commented 8 months ago

Some efficiency appears to be lost when running without the GIL, which the current implementation of AtomicInt does not address:

$ python counter.py
...
A counter using cereggii.AtomicInt and cereggii.AtomicIntHandle.
spam.counter.get()=0
spam.counter.get()=15000000
Took 2.40 seconds (16.58x faster).

$ PYTHONGIL=1 python counter.py
...
A counter using cereggii.AtomicInt and cereggii.AtomicIntHandle.
spam.counter.get()=0
spam.counter.get()=15000000
Took 0.59 seconds (2.11x faster).

The same example runs ~4x faster with the GIL enabled (0.59s vs 2.40s).

Should try to understand where this performance is lost and mitigate the issue.
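
For reference, a minimal sketch of what counter.py presumably does (this is a reconstruction, not the actual file: AtomicInt, AtomicIntHandle, and get() appear in this thread; get_handle() and in-place += on the handle are assumed API):

import threading

import cereggii

THREADS = 3
INCREMENTS = 5_000_000  # 3 threads x 5,000,000 = 15,000,000, matching the output above

class Spam:
    def __init__(self):
        self.counter = cereggii.AtomicInt(0)

spam = Spam()
print(f"{spam.counter.get()=}")

def incr():
    h = spam.counter.get_handle()  # per-thread handle (assumed API)
    for _ in range(INCREMENTS):
        h += 1

threads = [threading.Thread(target=incr) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{spam.counter.get()=}")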

dpdani commented 8 months ago

@colesbury do you have any hints about this difference?

colesbury commented 8 months ago

Probably contention on the reference counts (primarily) and on the counter. If you have an inherently serial process, adding more threads will just slow things down. You will get better performance with the coarser serialization from the GIL.

dpdani commented 8 months ago

Mm, but each thread should only contend on its own handle's reference count, so there shouldn't be contention there.

The only object whose reference count could be contended is the int 1, but I thought integers between -5 and 256 are "immortal" (src).
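
As a quick sanity check of the small-int cache (a sketch; exact reference-count values vary by CPython version and build):

import sys

a = int("256")  # built at runtime so the compiler can't share one constant
b = int("256")
print(a is b)   # True: CPython caches the ints in [-5, 256]

c = int("257")
d = int("257")
print(c is d)   # False: 257 falls outside the cache

# On builds with immortal objects (PEP 683), small ints report a huge
# sentinel reference count rather than a real one:
print(sys.getrefcount(1))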

colesbury commented 8 months ago

Ok, it sounds like reference count contention is not the issue, but you still have contention on the atomic integer itself. For example, say you have 3 threads:

T1: 5,000,000 increments
T2: 5,000,000 increments
T3: 5,000,000 increments

It'll be a lot faster if the threads run sequentially (T2 waits for T1, etc.) than if they run in parallel. The GIL doesn't force them to run sequentially, but it moves you closer to that case.

If the threads run in parallel, they will all compete to modify the same cache line. The more interleaving between threads/cores, the more time wasted on overhead.

dpdani commented 8 months ago

Yes, I can confirm. Even in plain C, a similar performance difference (~5x) can be seen:

$ /usr/bin/time -p ./casperf8uncontended
real 10.63
user 21.24
sys 0.00
$ /usr/bin/time -p ./casperf8contended
real 57.80
user 114.05
sys 0.00

// casperf8uncontended.c: each thread CASes a byte in its own cache line
#include <stdlib.h>
#include <pthread.h>

void *thread_routine(void *spam) {
    // loop 2^32 - 1 times, doing a 1-byte CAS each iteration
    for (unsigned long i = 0; i < 4294967295UL; i++) {
        // CAS(ptr, 0, 0): a no-op write, but still a full atomic RMW
        __sync_bool_compare_and_swap_1(spam, 0, 0);
    }

    return spam;
}

int main(void) {
    pthread_t thread1, thread2;
    char *spam = calloc(64 + 1, 1);  // zeroed, so the CAS succeeds; bytes 0 and 64
                                     // always sit in two distinct cache lines

    int t1 = pthread_create(&thread1, NULL, thread_routine, (void *) spam);
    int t2 = pthread_create(&thread2, NULL, thread_routine, (void *) (spam + 64));

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return t1 + t2;
}

// casperf8contended.c: both threads CAS the same byte
#include <stdlib.h>
#include <pthread.h>

void *thread_routine(void *spam) {
    // loop 2^32 - 1 times, doing a 1-byte CAS each iteration
    for (unsigned long i = 0; i < 4294967295UL; i++) {
        __sync_bool_compare_and_swap_1(spam, 0, 0);
    }

    return spam;
}

int main(void) {
    pthread_t thread1, thread2;
    char *spam = calloc(1, 1);  // a single zeroed byte (8 bits), shared by both threads

    int t1 = pthread_create(&thread1, NULL, thread_routine, (void *) spam);
    int t2 = pthread_create(&thread2, NULL, thread_routine, (void *) spam);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return t1 + t2;
}
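
The two programs can be built with something like the following (a hypothetical invocation; the exact flags used aren't recorded in this thread):

$ gcc -O2 -pthread casperf8uncontended.c -o casperf8uncontended
$ gcc -O2 -pthread casperf8contended.c -o casperf8contended

Both binaries execute an identical instruction stream; the ~5x difference comes entirely from the two cores bouncing ownership of the shared cache line back and forth.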