cameron314 / readerwriterqueue

A fast single-producer, single-consumer lock-free queue for C++
Other
3.7k stars 662 forks source link

Help with profiling OSX M1 #142

Open cjappl opened 1 year ago

cjappl commented 1 year ago

Hi!

I saw

Note that it's only been tested on x86(-64); if someone has access to other processors I'd love to run some tests on anything that's not x86-based.

In the readme.

I have a mac m1 (arm) processor. If there is a suite of tests I can run for you I'd be thrilled to help.

Let me know

cameron314 commented 1 year ago

Neat, thanks! Do the tests and benchmarks compile using the makefiles in those directories? (May need to replace g++ with clang++.)

cjappl commented 1 year ago

Compiled with no issue, running says

Note: SPSC queue not supported on this platform, discount its timings

Full output:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.41x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0120s | 0.0000s | 0.0641s | 0.0029s | 0.0120s | 0.0000s | 0.0735s | 0.0029s | 0.0120s | 0.0000s | 0.0685s | 0.00x | 23.68x
Mostly remove     | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.00x | 0.98x
Heavy concurrent  | 0.0025s | 0.0143s | 0.0000s | 0.0399s | 0.0025s | 0.0148s | 0.0000s | 0.0428s | 0.0025s | 0.0145s | 0.0000s | 0.0417s | 0.00x | 16.61x
Random concurrent | 0.0061s | 0.0083s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  371.80 million
    BlockingReaderWriterCircularBuffer: 481.26 million
    SPSC queue:                         0.00 million
    Folly queue:                        513.61 million

I modified the code to allow me to run, but crash when running in -O3:

Screen Shot 2023-02-27 at 9 49 37 AM
> lldb benchmarks
(lldb) target create "benchmarks"
Current executable set to '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64).
(lldb) run
Process 69685 launched: '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64)
Process 69685 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }
Target 0: (benchmarks) stopped.
warning: benchmarks was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
    frame #1: 0x000000010000a518 benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) at bench.cpp:373:7 [opt]
    frame #2: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] void details::ArgWrapper<void, void, void>::callCallback<double runBenchmark<spsc_queue<int> >(this=0x0000600000201290, callback=0x0000600000201280)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) const at simplethread.h:49:80 [opt]
    frame #3: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(wrapper=0x0000600000201280) at simplethread.h:71:23 [opt]
    frame #4: 0x000000010000b2f0 benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)(static_cast<SimpleThread::ThreadRef*>(fp0))) std::__1::__invoke<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&&) at type_traits:3918:1 [opt]
    frame #5: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*, 2ul>(__t=<unavailable>, (null)=<unavailable>)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>&, std::__1::__tuple_indices<2ul>) at thread:287:5 [opt]
    frame #6: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(__vp=0x0000600000201220) at thread:298:5 [opt]
    frame #7: 0x00000001a46b826c libsystem_pthread.dylib`_pthread_start + 148
(lldb) (lldb)

The specific crashing frame more up close

* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }

This does not happen when compiling the benchmarks with -O0. Let me see if I can poke around and find anything satisfying

cjappl commented 1 year ago

Some output from ubsan and asan:

bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
ext/1024cores/spscqueue.h:82:29: runtime error: member access within null pointer of type 'spsc_queue<int>::node'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ext/1024cores/spscqueue.h:82:29 in 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==70167==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000008 (pc 0x0001006461c8 bp 0x00016f85ee80 sp 0x00016f85ee40 T501)
==70167==The signal is caused by a UNKNOWN memory access.
==70167==Hint: address points to the zero page.
    #0 0x1006461c8 in spsc_queue<int>::try_dequeue(int&) spscqueue.h:82
    #1 0x100647b1c in SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) simplethread.h:71
    #2 0x10064fa14 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) thread:298
    #3 0x1a46b8268 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64e+0x7268)
    #4 0x1a46b3088 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x2088)

==70167==Register values:
 x[0] = 0x000000016f85ed58   x[1] = 0x000000016f7d2d70   x[2] = 0x0000000000000000   x[3] = 0x0000000135f007a0  
 x[4] = 0x0000000063000000   x[5] = 0x0000000000000000   x[6] = 0x0000000000000000   x[7] = 0x0000000000000000  
 x[8] = 0x000000016f7d2cf0   x[9] = 0x0000000000000008  x[10] = 0x0000007000020000  x[11] = 0x000000002defa59e  
x[12] = 0x0000000126816310  x[13] = 0x0000000000000000  x[14] = 0x0000000000000000  x[15] = 0x0000000000000000  
x[16] = 0x0000000301413400  x[17] = 0x0000000100c4c0a8  x[18] = 0x0000000000000000  x[19] = 0x0000000102b02d70  
x[20] = 0x0000000102b02d78  x[21] = 0x00000001006624b0  x[22] = 0x0000000100662490  x[23] = 0x0000007000020000  
x[24] = 0x0000000000000000  x[25] = 0x0000000000124ddb  x[26] = 0x00000000205605ae  x[27] = 0x00000000205605af  
x[28] = 0x0000000000000000     fp = 0x000000016f85ee80     lr = 0x00000001006463f8     sp = 0x000000016f85ee40  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV spscqueue.h:82 in spsc_queue<int>::try_dequeue(int&)
Thread T501 created by T0 here:
    #0 0x100bf8c5c in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x38c5c)
    #1 0x10064f46c in std::__1::thread::thread<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&, void>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&) thread:314
    #2 0x10064eb64 in SimpleThread::startThread(void*, void (*)(void*)) simplethread.cpp:65
    #3 0x1006479f4 in SimpleThread::SimpleThread<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) simplethread.h:115
    #4 0x100635f94 in double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&) bench.cpp:371
    #5 0x10062bc2c in main bench.cpp:124
    #6 0x1006e1088 in start+0x204 (dyld:arm64e+0x5088)

==70167==ABORTING
cameron314 commented 1 year ago

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

cameron314 commented 1 year ago

asan is also known to give false positives with lock-free code.

cameron314 commented 1 year ago

Curious where those ubsan overflows are from though.

cjappl commented 1 year ago

The ubsan overflows are pretty easy to fix, just changing:

        unsigned long long total = 0;

on line 270 of the bench file fixed it up (long long may be overkill, but this silenced it)

cjappl commented 1 year ago

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

Does this mean MPMC (concurrentqueue) is portable? Should I try and run some benchmarks on that? Happy to help in any way I can.

cameron314 commented 1 year ago

Ah I see, it's a dummy variable just to disable dead code removal. Yeah switching to any unsigned type should fix it.

cameron314 commented 1 year ago

"spsc" here in the benchmarks is not my code, but a different SPSC queue implementation from https://www.1024cores.net/home/lock-free-algorithms/queues/unbounded-spsc-queue just for comparison.

cameron314 commented 1 year ago

Would be worth getting the benchmarks to run in O3 (without spsc) to see the results.

cjappl commented 1 year ago

Ah, got it! Sorry, I thought that was referring to the code in this repo (readerwriterqueue)

Here are the afformentioned results, now that we know SPSC is dead and we don't care about it:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.42x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0119s | 0.0000s | 0.0373s | 0.0029s | 0.0120s | 0.0000s | 0.0594s | 0.0029s | 0.0120s | 0.0000s | 0.0487s | 0.00x | 16.86x
Mostly remove     | 0.0027s | 0.0035s | 0.0000s | 0.0026s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.00x | 0.97x
Heavy concurrent  | 0.0025s | 0.0151s | 0.0000s | 0.0415s | 0.0025s | 0.0153s | 0.0000s | 0.0432s | 0.0025s | 0.0152s | 0.0000s | 0.0427s | 0.00x | 17.09x
Random concurrent | 0.0061s | 0.0082s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0082s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  363.73 million
    BlockingReaderWriterCircularBuffer: 473.82 million
    SPSC queue:                         0.00 million
    Folly queue:                        509.72 million
cameron314 commented 1 year ago

That's O3?

cjappl commented 1 year ago

Yep, this is no changes to the repo, just tip of main. the compile line being used is:

benchmarks$(EXT): bench.cpp ../readerwriterqueue.h ../readerwritercircularbuffer.h ../atomicops.h ext/1024cores/spscqueue.h ext/folly/ProducerConsumerQueue.h ../tests/common/simplethread.h ../tests/common/simplethread.cpp systemtime.h systemtime.cpp makefile
    g++ -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks$(EXT) -pthread $(PLATFORM_OPTS)
cjappl commented 1 year ago

(to answer your previous concern about clang++ macosx "helpfully" aliases g++ to clang++)

> g++
clang: error: no input files
cameron314 commented 1 year ago

Nice. Thanks for the results. Would have to dig into the assembly to see why the base operations with Folly are twice as fast.

cjappl commented 1 year ago

Sounds good. Please don't hesitate to reach out if I can assist. This project has helped me a ton so happy to contribute in any way I can.