cameron314 / concurrentqueue

A fast multi-producer, multi-consumer lock-free concurrent queue for C++11

ConcurrentQueue ~150x slower on Windows #372

Closed alex-lt-kong closed 11 months ago

alex-lt-kong commented 11 months ago

I tested the minimal reproducible example on different computers using different compilers (gcc, clang, cl), and the results are pretty consistent--the Linux binary is faster than the Windows binary by roughly 150 times. Is this expected? How can we achieve similar performance on Windows?

The MRE code

#include "concurrentqueue.h"

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using namespace std;

moodycamel::ConcurrentQueue<int64_t> queue;

void ProduceItems(const size_t iter)
{
    for (size_t i = 0; i < iter; ++i)
    {
        auto now = chrono::system_clock::now();
        int64_t epoch_us = chrono::duration_cast<chrono::microseconds>(now.time_since_epoch()).count();
        queue.enqueue(epoch_us);
        this_thread::sleep_for(chrono::milliseconds(1)); // Simulate some work
    }
}

void ConsumeItems(const size_t iter)
{
    size_t count = 0;
    double sum = 0.0;
    int64_t then;
    while (count < iter)
    {
        if (queue.try_dequeue(then))
        {
            auto now = chrono::system_clock::now();
            int64_t epoch_us = chrono::duration_cast<chrono::microseconds>(now.time_since_epoch()).count();
            sum += epoch_us - then;
            ++count;
        } else {
            this_thread::sleep_for(chrono::microseconds(10));
        }
    }
    cout << "avg. time elapsed (us): " << sum / iter << endl;
}

int main()
{
    size_t iter = 12345;
    thread consumer(ConsumeItems, iter * 3);
    thread producer1(ProduceItems, iter);
    thread producer2(ProduceItems, iter);
    thread producer3(ProduceItems, iter);

    producer1.join();
    producer2.join();
    producer3.join();
    consumer.join();

    cout << "All tasks completed." << endl;
    return 0;
}

Test results

```bash
$ clang++ --version
Debian clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
$ clang++ test.cpp -o test-cl.out -O3
$ ./test-cl.out
avg. time elapsed (us): 42.5577
All tasks completed.
```


* Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz - Windows Host

```bash
PS > cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.36.32535 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.
PS > cl .\test.cpp /Ox
PS > .\test.exe
avg. time elapsed (us): 7468.5
All tasks completed.

PS > clang++ --version
clang version 13.0.0 (https://github.com/llvm/llvm-project.git d7b669b3a30345cfcdb2fde2af6f48aa4b94845d)
Target: x86_64-w64-windows-gnu
Thread model: posix
PS > clang++ .\test.cpp -o .\test.exe -O3
PS > .\test.exe
avg. time elapsed (us): 7874.67
All tasks completed.

```

```bash
$ clang++ --version
Debian clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
$ clang++ test.cpp -o test-cl.out -O3
$ ./test-cl.out
avg. time elapsed (us): 48.0926
All tasks completed.
```

cameron314 commented 11 months ago

this_thread::sleep_for(chrono::microseconds(10)) will sleep for significantly longer than 10 us.

alex-lt-kong commented 11 months ago

Oh really? The purpose is that I want the program to be "sort-of" busy-waiting without using up all CPU resources. If this_thread::sleep_for(chrono::microseconds(10)) doesn't work on Windows, how should I achieve something similar?

cameron314 commented 11 months ago

Spin wait.

Note also that try_dequeue will not treat producers equally, so the min/max/avg spread will be very different. All the elements from the first producer will be dequeued first, then all from the second, then all from the third.

Finally, don't use system_clock for timing, use steady_clock or high_resolution_clock.

alex-lt-kong commented 11 months ago

By Spin wait you mean I just keep looping like this?:

while (count < iter)
{
    if (queue.try_dequeue(then))
    {
        auto now = chrono::system_clock::now();
        int64_t epoch_us = chrono::duration_cast<chrono::microseconds>(now.time_since_epoch()).count();
        sum += epoch_us - then;
        ++count;
    }
}

cameron314 commented 11 months ago

You could also do that, but if you want to wait, then wait, just not with a sleep:

auto deadline = std::chrono::steady_clock::now() + std::chrono::microseconds(10);
while (std::chrono::steady_clock::now() < deadline);

Sleep offers no guarantees on precision.

cameron314 commented 11 months ago

I don't think it affects the code here, but also bear in mind that thread creation on Windows is at least an order of magnitude slower than on Linux.

alex-lt-kong commented 11 months ago

Thanks for the information.

One more thing that concerns me: "Note also that try_dequeue will not treat producers equally". If my threads are all long-running, say they keep running for days, will this inequality still persist? Is there a way to make producers "equal" without impacting performance? In my use case, strict equality is not needed; I just want the latency across different producers to be "similar".

cameron314 commented 11 months ago

Short answer is no, this queue does not guarantee any sort of fairness.

However, if you use a consumer token, it will be somewhat more fair (rotating between producers at fixed intervals).

alex-lt-kong commented 11 months ago

Let me ask this: usually a queue is a FIFO data structure; does this still apply to an MPMC queue like ConcurrentQueue? When you say try_dequeue does not treat producers equally, do you mean:

1. elements are still dequeued in the overall order they were enqueued, just drained unevenly across producers, or
2. elements from one producer are dequeued before elements from another, regardless of when they were enqueued?

(The 2nd scenario means the FIFO assumption does not hold anymore--or to put it differently, we have a per-producer FIFO, but not an overall FIFO.)

cameron314 commented 11 months ago

or to put it differently, we have a per-producer FIFO, but not an overall FIFO

Exactly -- please see the README. If you need global ordering, this is not the right queue.

alex-lt-kong commented 11 months ago

Thanks for all the help!