9fans / plan9port

Plan 9 from User Space
https://9fans.github.io/plan9port/

fmt: replace formatter rwlock with mutex #338

Closed by dancrossnyc 4 years ago

dancrossnyc commented 4 years ago

The use of reader/writer locks was actually a pessimization, as shown by the following benchmark and collected data:

Benchmark program:

-->CUT<--

#include <u.h>
#include <libc.h>
#include <thread.h>

void
thr(void *arg)
{
    Channel *c = (Channel *)arg;
    char buf[128];
    int i;

    for(i=0; i<1000000; i++)
        snprint(buf, sizeof(buf), "%s%04d/%02d/%02d%s%s%s%s",
            "hi", 2020, 1, 1, "happy", "new", "year", "!");
    sendul(c, 0);
    threadexits(0);
}

enum {
    NTHREADS = 16,
    STACK = 4096,
};

void
threadmain(int argc, char *argv[])
{
    (void)argc;
    (void)argv;
    Channel *chans[NTHREADS];
    int i;

    for(i=0; i<NTHREADS; i++){
        chans[i] = chancreate(sizeof(ulong), 1);
        threadcreate(thr, chans[i], STACK);
    }
    for(i=0; i<NTHREADS; i++){
        (void)recvul(chans[i]);
        chanfree(chans[i]);
    }
    threadexits(0);
}

-->CUT<--

===On Debian Linux 5.2.17/x86_64 (9c bench.c && 9l -o bench bench.o)=== (HP Z440, about a year old; Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

With RWLock:

Best time over six runs:

: spitfire; time ./bench

real    0m9.208s
user    0m0.000s
sys     0m0.003s
: spitfire;

Raw times: 9.209, 9.193, 9.454, 9.450, 9.436, 9.208 avg 9.325, med 9.323, std dev +/-.134

With Lock (simple mutex):

Best time over six runs:

: spitfire; time ./bench

real    0m6.802s
user    0m0.002s
sys     0m0.001s
: spitfire;

Raw times: 6.814, 6.808, 6.808, 6.802, 6.817, 6.802 avg 6.809, med 6.808, std dev +/-.006

===On macOS Mojave/x86_64 (9c bench.c && 9l -o bench bench.o)=== (2013 Mac Pro, about two years old; 3.5 GHz 6-Core Intel Xeon E5)

With RWLock:

Best time over six runs:

: tempest; time ./bench

real    0m11.753s
user    0m0.001s
sys     0m0.002s
: tempest;

Raw run times: 11.820, 11.865, 11.776, 11.766, 11.753, 11.852 avg 11.805, med 11.798, std dev +/-.047

With Lock:

Best time over six runs:

: tempest; time ./bench

real    0m8.818s
user    0m0.001s
sys     0m0.002s
: tempest;

Raw run times: 8.828, 8.825, 8.818, 8.883, 8.934, 8.863 avg 8.859, med 8.846, std dev +/-.045

===On FreeBSD 12.1/x86_64 (9c bench.c && 9l -o bench bench.o -lpthread)=== (VPS; Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (2295.80-MHz K8-class CPU))

With RWLock:

Best time over six runs:

: pingala; time ./bench
./bench  0.00s user 0.00s system 0% cpu 24.690 total
: pingala;

Raw run times: 25.693, 26.178, 24.866, 25.392, 25.690, 25.881 avg 25.617, med 25.692, std dev +/-.449

With Lock:

Best time over six runs:

: pingala; time ./bench
./bench  0.00s user 0.00s system 0% cpu 15.164 total
: pingala;

Raw run times: 15.688, 15.400, 16.123, 15.164, 15.969, 16.256 avg 15.767, med 15.829, std dev +/-.427

rsc commented 4 years ago

There's no parallelism in this benchmark, because the test is using threadcreate instead of proccreate.

dancrossnyc commented 4 years ago

Fair point, though the change in run time does suggest the type of lock has an effect.

Further, changing my benchmark to use proccreate instead of threadcreate yields similar results, with the mutex being substantially faster than the reader/writer lock on the platforms I tested (macOS, Linux, FreeBSD). Indeed, on Linux, Lock was twice as fast as RWLock. However, with the parallelism of proccreate, tests run substantially longer on all platforms.
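(Concretely, the only change needed in the benchmark above is the creation call; rsc's full listing below uses the same form.)

    for(i=0; i<NTHREADS; i++){
        chans[i] = chancreate(sizeof(ulong), 1);
        /* proccreate starts each worker in its own proc, so the
           snprint calls genuinely contend on the fmt lock */
        proccreate(thr, chans[i], STACK);
    }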

I'll update the commit message with new data.

rsc commented 4 years ago

My system is nothing like your system, apparently (MacBook Pro). For me, the QLock is 10X slower real time.

r$ cat x.c
#include <u.h>
#include <libc.h>
#include <thread.h>

void
thr(void *arg)
{
    Channel *c = (Channel *)arg;
    char buf[128];
    int i;

    for(i=0; i<10000; i++)
        snprint(buf, sizeof(buf), "%s%04d/%02d/%02d%s%s%s%s",
            "hi", 2020, 1, 1, "happy", "new", "year", "!");
    sendul(c, 0);
    threadexits(0);
}

enum {
    NTHREADS = 16,
    STACK = 4096,
};

void
threadmain(int argc, char *argv[])
{
    (void)argc;
    (void)argv;
    Channel *chans[NTHREADS];
    int i;

    for(i=0; i<NTHREADS; i++){
        chans[i] = chancreate(sizeof(ulong), 1);
        proccreate(thr, chans[i], STACK);
    }
    for(i=0; i<NTHREADS; i++){
        (void)recvul(chans[i]);
        chanfree(chans[i]);
    }
    threadexitsall(0);
}
r$ cat src/lib9/fmtlock2.c
#include <u.h>
#include <libc.h>

static RWLock fmtlock;

void
__fmtrlock(void)
{
    rlock(&fmtlock);
}

void
__fmtrunlock(void)
{
    runlock(&fmtlock);
}

void
__fmtwlock(void)
{
    wlock(&fmtlock);
}

void
__fmtwunlock(void)
{
    wunlock(&fmtlock);
}
r$ (cd src/lib9 && mk)
mk: 'default' is up to date
r$ 9c x.c && 9l x.o && 9 time ./a.out && 9 time ./a.out && 9 time ./a.out
0.50u 4.15s 0.40r    ./a.out
0.51u 4.35s 0.42r    ./a.out
0.50u 4.25s 0.41r    ./a.out
r$ 

OK, make change to QLock:

r$ git diff src/lib9
diff --git a/src/lib9/fmtlock2.c b/src/lib9/fmtlock2.c
index b755daa3..0dc81651 100644
--- a/src/lib9/fmtlock2.c
+++ b/src/lib9/fmtlock2.c
@@ -1,28 +1,28 @@
 #include <u.h>
 #include <libc.h>

-static RWLock fmtlock;
+static QLock fmtlock;

 void
 __fmtrlock(void)
 {
-   rlock(&fmtlock);
+   qlock(&fmtlock);
 }

 void
 __fmtrunlock(void)
 {
-   runlock(&fmtlock);
+   qunlock(&fmtlock);
 }

 void
 __fmtwlock(void)
 {
-   wlock(&fmtlock);
+   qlock(&fmtlock);
 }

 void
 __fmtwunlock(void)
 {
-   wunlock(&fmtlock);
+   qunlock(&fmtlock);
 }
r$ (cd src/lib9 && mk)
9c  fmtlock2.c
9ar rsc $PLAN9/lib/lib9.a fmtlock2.o
r$ 9c x.c && 9l x.o && 9 time ./a.out && 9 time ./a.out && 9 time ./a.out
1.98u 3.49s 4.32r    ./a.out
2.08u 3.74s 4.60r    ./a.out
2.14u 3.86s 4.73r    ./a.out
r$ 

dancrossnyc commented 4 years ago

I'm not using a QLock, I'm using a spinlock (Lock, sans Q). Am I wrong?
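For reference, the Lock variant being benchmarked is presumably rsc's fmtlock2.c above with Lock substituted for RWLock, roughly as follows (a sketch of the obvious substitution, not the verbatim patch):

#include <u.h>
#include <libc.h>

static Lock fmtlock;

void
__fmtrlock(void)
{
    lock(&fmtlock);
}

void
__fmtrunlock(void)
{
    unlock(&fmtlock);
}

void
__fmtwlock(void)
{
    lock(&fmtlock);
}

void
__fmtwunlock(void)
{
    unlock(&fmtlock);
}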

dancrossnyc commented 4 years ago

Hmm; reading the man page, maybe I am wrong. It says this:

          Although Locks are the more primitive lock, they have
          limitations; for example, they cannot synchronize between
          tasks in the same proc. Use QLocks instead.
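To illustrate the concern the man page is describing (a hypothetical sketch, not code from this thread): with cooperatively scheduled tasks in one proc, a Lock held across a blocking operation can wedge the whole proc, because the task trying to acquire it never hands control back to the holder.

#include <u.h>
#include <libc.h>
#include <thread.h>

static Lock l;
static Channel *c;

void
holder(void *v)
{
    USED(v);
    lock(&l);
    recvul(c);      /* blocks; the scheduler switches to waiter */
    unlock(&l);     /* never reached once waiter wedges the proc */
    threadexits(0);
}

void
waiter(void *v)
{
    USED(v);
    lock(&l);       /* spins (or blocks the whole proc); holder can
                       never run again to send on c and unlock l */
    sendul(c, 0);
    unlock(&l);
    threadexits(0);
}

void
threadmain(int argc, char *argv[])
{
    USED(argc);
    USED(argv);
    c = chancreate(sizeof(ulong), 0);
    threadcreate(holder, nil, 8192);
    threadcreate(waiter, nil, 8192);
    threadexits(0);
}

Whether this failure mode actually applies to plan9port's Lock implementation is exactly what the next comment questions.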

dancrossnyc commented 4 years ago

Hmm...and on re-reading the BUGS section, it's not clear to me that the above is actually an issue on p9p.

rsc commented 4 years ago

Please drop this. I had the QLock/Lock version installed after that test and got an acme deadlock out of it. I am going to get rid of any locking on the fast path again.

dancrossnyc commented 4 years ago

Eek. That sounds weird, but ok.

rsc commented 4 years ago

The problem is that the lock - any lock - changes calls to print from being signal-safe to being signal-unsafe. That path needs to be non-blocking.
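For context on what a non-blocking fast path can look like, here is a hypothetical sketch (C11 atomics for illustration; this is not the actual plan9port fix): publish an immutable formatter table through an atomic pointer, so readers do a single load with no lock, and writers install new verbs by copy-and-swap.

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

enum { MAXVERB = 64 };

typedef struct Fmttab Fmttab;
struct Fmttab {
    int nverb;
    int verb[MAXVERB];
    int (*fn[MAXVERB])(void*);  /* stand-in for the real handler type */
};

static _Atomic(Fmttab*) fmttab;     /* current, immutable table */

/* Reader fast path: one atomic load, no lock, safe from a signal handler. */
static Fmttab*
fmtread(void)
{
    return atomic_load_explicit(&fmttab, memory_order_acquire);
}

/* Writer slow path: copy the table, add the verb, publish the copy. */
static int
fmtinstallverb(int verb, int (*fn)(void*))
{
    Fmttab *old, *new;

    for(;;){
        old = atomic_load_explicit(&fmttab, memory_order_acquire);
        new = malloc(sizeof *new);
        if(new == NULL)
            return -1;
        if(old != NULL)
            *new = *old;
        else
            memset(new, 0, sizeof *new);
        if(new->nverb >= MAXVERB){
            free(new);
            return -1;
        }
        new->verb[new->nverb] = verb;
        new->fn[new->nverb] = fn;
        new->nverb++;
        if(atomic_compare_exchange_strong_explicit(&fmttab, &old, new,
                memory_order_release, memory_order_acquire))
            return 0;   /* published; old table is leaked in this sketch */
        free(new);      /* lost a race with another writer; retry */
    }
}

Readers never block, so a print from a signal or note handler cannot deadlock on formatter state; the cost moves to the rare install path, and a real version would also need to reclaim superseded tables safely.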