Open cjappl opened 1 year ago
Neat, thanks! Do the tests and benchmarks compile using the makefiles in those directories? (May need to replace g++
with clang++
.)
Compiled with no issue, running says
Note: SPSC queue not supported on this platform, discount its timings
Full output:
> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
|---------------- Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.41x
Raw remove | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add | 0.0029s | 0.0120s | 0.0000s | 0.0641s | 0.0029s | 0.0120s | 0.0000s | 0.0735s | 0.0029s | 0.0120s | 0.0000s | 0.0685s | 0.00x | 23.68x
Mostly remove | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.00x | 0.98x
Heavy concurrent | 0.0025s | 0.0143s | 0.0000s | 0.0399s | 0.0025s | 0.0148s | 0.0000s | 0.0428s | 0.0025s | 0.0145s | 0.0000s | 0.0417s | 0.00x | 16.61x
Random concurrent | 0.0061s | 0.0083s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.00x | 1.08x
Average ops/s:
ReaderWriterQueue: 371.80 million
BlockingReaderWriterCircularBuffer: 481.26 million
SPSC queue: 0.00 million
Folly queue: 513.61 million
I modified the code to allow me to run, but crash when running in -O3:
> lldb benchmarks
(lldb) target create "benchmarks"
Current executable set to '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64).
(lldb) run
Process 69685 launched: '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64)
Process 69685 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
79 {
80 if (load_consume(&tail_->next_))
81 {
-> 82 v = tail_->next_->value_;
83 store_release(&tail_, tail_->next_);
84 return true;
85 }
Target 0: (benchmarks) stopped.
warning: benchmarks was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
* frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
frame #1: 0x000000010000a518 benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) at bench.cpp:373:7 [opt]
frame #2: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] void details::ArgWrapper<void, void, void>::callCallback<double runBenchmark<spsc_queue<int> >(this=0x0000600000201290, callback=0x0000600000201280)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) const at simplethread.h:49:80 [opt]
frame #3: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(wrapper=0x0000600000201280) at simplethread.h:71:23 [opt]
frame #4: 0x000000010000b2f0 benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)(static_cast<SimpleThread::ThreadRef*>(fp0))) std::__1::__invoke<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&&) at type_traits:3918:1 [opt]
frame #5: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*, 2ul>(__t=<unavailable>, (null)=<unavailable>)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>&, std::__1::__tuple_indices<2ul>) at thread:287:5 [opt]
frame #6: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(__vp=0x0000600000201220) at thread:298:5 [opt]
frame #7: 0x00000001a46b826c libsystem_pthread.dylib`_pthread_start + 148
(lldb) (lldb)
The specific crashing frame more up close
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
79 {
80 if (load_consume(&tail_->next_))
81 {
-> 82 v = tail_->next_->value_;
83 store_release(&tail_, tail_->next_);
84 return true;
85 }
This does not happen when compiling the benchmarks with -O0. Let me see if I can poke around and find anything satisfying
Some output from ubsan and asan:
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in
ext/1024cores/spscqueue.h:82:29: runtime error: member access within null pointer of type 'spsc_queue<int>::node'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ext/1024cores/spscqueue.h:82:29 in
AddressSanitizer:DEADLYSIGNAL
=================================================================
==70167==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000008 (pc 0x0001006461c8 bp 0x00016f85ee80 sp 0x00016f85ee40 T501)
==70167==The signal is caused by a UNKNOWN memory access.
==70167==Hint: address points to the zero page.
#0 0x1006461c8 in spsc_queue<int>::try_dequeue(int&) spscqueue.h:82
#1 0x100647b1c in SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) simplethread.h:71
#2 0x10064fa14 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) thread:298
#3 0x1a46b8268 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64e+0x7268)
#4 0x1a46b3088 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x2088)
==70167==Register values:
x[0] = 0x000000016f85ed58 x[1] = 0x000000016f7d2d70 x[2] = 0x0000000000000000 x[3] = 0x0000000135f007a0
x[4] = 0x0000000063000000 x[5] = 0x0000000000000000 x[6] = 0x0000000000000000 x[7] = 0x0000000000000000
x[8] = 0x000000016f7d2cf0 x[9] = 0x0000000000000008 x[10] = 0x0000007000020000 x[11] = 0x000000002defa59e
x[12] = 0x0000000126816310 x[13] = 0x0000000000000000 x[14] = 0x0000000000000000 x[15] = 0x0000000000000000
x[16] = 0x0000000301413400 x[17] = 0x0000000100c4c0a8 x[18] = 0x0000000000000000 x[19] = 0x0000000102b02d70
x[20] = 0x0000000102b02d78 x[21] = 0x00000001006624b0 x[22] = 0x0000000100662490 x[23] = 0x0000007000020000
x[24] = 0x0000000000000000 x[25] = 0x0000000000124ddb x[26] = 0x00000000205605ae x[27] = 0x00000000205605af
x[28] = 0x0000000000000000 fp = 0x000000016f85ee80 lr = 0x00000001006463f8 sp = 0x000000016f85ee40
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV spscqueue.h:82 in spsc_queue<int>::try_dequeue(int&)
Thread T501 created by T0 here:
#0 0x100bf8c5c in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x38c5c)
#1 0x10064f46c in std::__1::thread::thread<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&, void>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&) thread:314
#2 0x10064eb64 in SimpleThread::startThread(void*, void (*)(void*)) simplethread.cpp:65
#3 0x1006479f4 in SimpleThread::SimpleThread<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) simplethread.h:115
#4 0x100635f94 in double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&) bench.cpp:371
#5 0x10062bc2c in main bench.cpp:124
#6 0x1006e1088 in start+0x204 (dyld:arm64e+0x5088)
==70167==ABORTING
Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.
asan is also known to give false positives with lock-free code.
Curious where those ubsan overflows are from though.
The ubsan overflows are pretty easy to fix, just changing:
unsigned long long total = 0;
on line 270 of the bench file fixed it up (long long may be overkill, but this silenced it)
Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.
Does this mean MPMC (concurrentqueue) is portable? Should I try and run some benchmarks on that? Happy to help in any way I can.
Ah I see, it's a dummy variable just to disable dead code removal. Yeah switching to any unsigned type should fix it.
"spsc" here in the benchmarks is not my code, but a different SPSC queue implementation from https://www.1024cores.net/home/lock-free-algorithms/queues/unbounded-spsc-queue just for comparison.
Would be worth getting the benchmarks to run in O3 (without spsc) to see the results.
Ah, got it! Sorry, I thought that was referring to the code in this repo (readerwriterqueue)
Here are the afformentioned results, now that we know SPSC is dead and we don't care about it:
> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
|---------------- Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | RWQ | BRWCB | SPSC | Folly | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.42x
Raw remove | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add | 0.0029s | 0.0119s | 0.0000s | 0.0373s | 0.0029s | 0.0120s | 0.0000s | 0.0594s | 0.0029s | 0.0120s | 0.0000s | 0.0487s | 0.00x | 16.86x
Mostly remove | 0.0027s | 0.0035s | 0.0000s | 0.0026s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.00x | 0.97x
Heavy concurrent | 0.0025s | 0.0151s | 0.0000s | 0.0415s | 0.0025s | 0.0153s | 0.0000s | 0.0432s | 0.0025s | 0.0152s | 0.0000s | 0.0427s | 0.00x | 17.09x
Random concurrent | 0.0061s | 0.0082s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0082s | 0.0000s | 0.0067s | 0.00x | 1.08x
Average ops/s:
ReaderWriterQueue: 363.73 million
BlockingReaderWriterCircularBuffer: 473.82 million
SPSC queue: 0.00 million
Folly queue: 509.72 million
That's O3?
Yep, this is no changes to the repo, just tip of main. the compile line being used is:
benchmarks$(EXT): bench.cpp ../readerwriterqueue.h ../readerwritercircularbuffer.h ../atomicops.h ext/1024cores/spscqueue.h ext/folly/ProducerConsumerQueue.h ../tests/common/simplethread.h ../tests/common/simplethread.cpp systemtime.h systemtime.cpp makefile
g++ -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks$(EXT) -pthread $(PLATFORM_OPTS)
(to answer your previous concern about clang++
macosx "helpfully" aliases g++
to clang++
)
> g++
clang: error: no input files
Nice. Thanks for the results. Would have to dig into the assembly to see why the base operations with Folly are twice as fast.
Sounds good. Please don't hesitate to reach out if I can assist. This project has helped me a ton so happy to contribute in any way I can.
Hi!
I saw
Note that it's only been tested on x86(-64); if someone has access to other processors I'd love to run some tests on anything that's not x86-based.
In the readme.
I have a mac m1 (arm) processor. If there is a suite of tests I can run for you I'd be thrilled to help.
Let me know