craflin / LockFreeQueue

A lock-free multi-producer multi-consumer ring buffer FIFO queue.
Apache License 2.0

Invalid memory barriers on Google Pixel 3XL #4

Open tyekx opened 8 months ago

tyekx commented 8 months ago

Hello!

Reporting this issue publicly has been on my backlog for ages, so here it is.

A while back I noticed a bunch of crash dumps on Android where the call stack pointed to LockFreeQueueCpp11. I wrote a test that exercised the queue in a concurrent setting. It was a single-producer, single-consumer test that started by syncing the threads up with a busy wait. The pushes and pops were done in a loop to retry on failure. The producer pushes N integers, which the consumer then pops. Then I check the count and order to verify the 'queue-ness'.
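A test of the shape described above might look roughly like the following sketch. The names `StandInQueue` and `spsc_count_and_order_ok` are hypothetical, and the mutex-based stand-in exists only so the sketch compiles on its own; the assumption is that the queue under test offers `push` and `pop` calls that return `false` when the queue is full or empty.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in so the sketch is self-contained. Swap in the queue
// under test; only bool-returning push/pop are assumed.
class StandInQueue {
public:
    explicit StandInQueue(std::size_t capacity) : capacity_(capacity) {}
    bool push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;  // full
        items_.push(value);
        return true;
    }
    bool pop(int& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;  // empty
        value = items_.front();
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<int> items_;
    std::size_t capacity_;
};

// Pushes 0..n-1 from one thread, pops from another, then checks that the
// consumer saw exactly n values in the order they were pushed.
template <typename Queue>
bool spsc_count_and_order_ok(Queue& queue, int n) {
    std::atomic<int> ready{0};
    std::atomic<bool> done{false};
    std::vector<int> received;

    // Busy-wait until both threads have arrived, so they start together.
    auto sync = [&ready] {
        ready.fetch_add(1, std::memory_order_acq_rel);
        while (ready.load(std::memory_order_acquire) < 2) { /* spin */ }
    };

    std::thread producer([&] {
        sync();
        for (int i = 0; i < n; ++i)
            while (!queue.push(i)) std::this_thread::yield();  // retry on full
        done.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        sync();
        for (;;) {
            // Read the flag before popping: if the producer had already
            // finished and the pop still fails, nothing is left to consume.
            bool was_done = done.load(std::memory_order_acquire);
            int value = -1;
            if (queue.pop(value)) received.push_back(value);
            else if (was_done) break;
            else std::this_thread::yield();  // retry on empty
        }
    });
    producer.join();
    consumer.join();

    if (received.size() != static_cast<std::size_t>(n)) return false;  // count
    for (int i = 0; i < n; ++i)
        if (received[static_cast<std::size_t>(i)] != i) return false;  // order
    return true;
}
```

Calling `spsc_count_and_order_ok(queue, N)` with a small queue capacity and a large `N` forces many full and empty retries, which is where ordering bugs are most likely to show up. A failure of this kind can be intermittent, so running the check repeatedly on the target hardware is probably worthwhile.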

On this test, the Google Pixel 3XL acted up and failed: the counts did not add up. Everything else the unit test ran on was fine: Windows, Mac, iPhones, other Android devices. The only explanation I could think of back then is that (to my knowledge) on x86 all loads have acquire semantics and all stores have release semantics, so memory orders that are too weak go unnoticed there. But honestly most other ARM devices were fine as well, so maybe the assumption just does not hold on that specific SoC. And sure enough, once I tightened the memory barriers in push and pop, the test passed. I went for acq_rel on the CAS operations, and acquire/release otherwise. I think the CAS can be weaker than acq_rel, so feel free to investigate.
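To make the "acq_rel on the CAS, acquire/release otherwise" scheme concrete, here is a minimal sketch of a bounded ring-buffer queue with per-slot sequence numbers, in the style of Dmitry Vyukov's bounded MPMC queue. It is not the code from this repository: the class name `BoundedQueueSketch`, the `int` payload and the power-of-two capacity are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, NOT this library's implementation.
class BoundedQueueSketch {
public:
    // Assumption of this sketch: capacity is a power of two.
    explicit BoundedQueueSketch(std::size_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        for (std::size_t i = 0; i < capacity; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }
    BoundedQueueSketch(const BoundedQueueSketch&) = delete;
    BoundedQueueSketch& operator=(const BoundedQueueSketch&) = delete;

    bool push(int value) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            // Acquire: pairs with the consumer's release store that frees the slot.
            std::size_t seq = slot.seq.load(std::memory_order_acquire);
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // acq_rel as in the report; a weaker order may be sufficient.
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                    slot.value = value;
                    // Release: publishes the payload to the consumer.
                    slot.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
                // CAS failed: pos now holds the current tail, so retry.
            } else if (diff < 0) {
                return false;  // queue is full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    bool pop(int& value) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            // Acquire: pairs with the producer's release store above.
            std::size_t seq = slot.seq.load(std::memory_order_acquire);
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                    value = slot.value;
                    // Release: hands the slot back to producers for the next lap.
                    slot.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue is empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq;
        int value;
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
    std::atomic<std::size_t> tail_{0};
    std::atomic<std::size_t> head_{0};
};
```

In this particular design the payload is handed over through the acquire load and release store on `seq`, so the CAS on `head_`/`tail_` only claims a position. That may be why a weaker order on the CAS can be enough, but whether the same reasoning carries over to the queue in this repository would need to be checked against its actual code.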

Now, there are many devices using the same CPU (Qualcomm SDM845), but I don't think I personally tested any other devices with that hardware.

I just wanted to let you and the readers know that your mileage may vary on ARM, so make sure to have a test for it 👍