cameron314 / concurrentqueue

A fast multi-producer, multi-consumer lock-free concurrent queue for C++11

Endless loop in ImplicitProducer* get_or_add_implicit_producer() #288

Closed · oviano closed this 1 year ago

oviano commented 2 years ago

I've been using this nice queue for a while now; one of my usages is as part of a log file writing system. I have a blocking concurrent queue where a thread waits for items before writing them to a text file.
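
For context, a minimal sketch of that setup (assuming the companion `blockingconcurrentqueue.h` header; the names `g_logQueue`, `logWriterThread`, and `logLine` are illustrative, not from the original post):

```cpp
#include <atomic>
#include <chrono>
#include <fstream>
#include <string>
#include "blockingconcurrentqueue.h"

moodycamel::BlockingConcurrentQueue<std::string> g_logQueue;
std::atomic<bool> g_running{true};

// Consumer: one thread blocks until log lines arrive, then writes them out.
void logWriterThread(const char* path) {
    std::ofstream out(path);
    std::string line;
    while (g_running.load(std::memory_order_relaxed)) {
        // Wait for an item, with a timeout so shutdown is noticed promptly.
        if (g_logQueue.wait_dequeue_timed(line, std::chrono::milliseconds(100)))
            out << line << '\n';
    }
    // Drain whatever is left once shutdown has been requested.
    while (g_logQueue.try_dequeue(line))
        out << line << '\n';
}

// Producers: any thread, including short-lived ones, can enqueue.
void logLine(std::string msg) { g_logQueue.enqueue(std::move(msg)); }
```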

I'm getting a fairly rare situation where half a dozen or so "producer" threads all need to write at the same time, typically during a shutdown sequence, and they all seem to be stuck going round and round in this function.

Not sure how to debug this!

cameron314 commented 2 years ago

Can you identify in which loop they are stuck exactly? Can you reproduce in isolation?

oviano commented 2 years ago

```cpp
assert(mainHash != nullptr);  // silence clang-tidy and MSVC warnings (hash cannot be null)
for (auto hash = mainHash; hash != nullptr; hash = hash->prev) {
    // Look for the id in this hash
    auto index = hashedId;
    while (true) {      // Not an infinite loop because at least one slot is free in the hash table
        index &= hash->capacity - 1;
        // ...
```

Yes, it was basically looping around the outer `while (true)` shown above, with `index` going up each iteration and then wrapping back to 1 again.

I haven't yet attempted to reproduce it in isolation; I will try that tomorrow. I've been chasing a bug in my project where it was entering a stuck/high-CPU state; I had left three instances running in the VS debugger and literally only caught it this evening for the first time.

oviano commented 2 years ago

Managed to trap it again; it loops endlessly because the condition `probedKey == id` is never met for it to return, and neither is `probedKey == details::invalid_thread_id` for it to break.

`details::invalid_thread_id` is defined as zero; this is on Windows.

The contents of `hash->entries[0...31]` are mostly entries with `key = 0xFFFFFFFFU`, plus a few valid entries, but none where `key == id`.
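
To make that failure mode concrete, here is a distilled, single-threaded version of the probe loop (illustrative only, not the library's exact code):

```cpp
#include <cstddef>
#include <cstdint>

// On Windows, invalid_thread_id == 0 marks a never-used (empty) slot and,
// per the 0xFFFFFFFFU values observed above, invalid_thread_id2 marks a
// tombstone left behind by an exited thread.
const std::uint32_t invalid_thread_id  = 0;
const std::uint32_t invalid_thread_id2 = 0xFFFFFFFFu;

int find(const std::uint32_t* keys, std::size_t capacity,
         std::uint32_t id, std::size_t hashedId) {
    std::size_t index = hashedId;
    while (true) {               // relies on at least one empty slot existing
        index &= capacity - 1;   // capacity is a power of two, so this wraps
        std::uint32_t probedKey = keys[index];
        if (probedKey == id)
            return (int)index;   // found this thread's entry
        if (probedKey == invalid_thread_id)
            return -1;           // hit a never-used slot: id is not present
        // Tombstones and non-matching ids are skipped over, so if no slot
        // holds invalid_thread_id, the loop can never terminate.
        ++index;
    }
}
```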

cameron314 commented 2 years ago

Thanks, I'll have a look soon.

oviano commented 2 years ago

In that last case there was actually only one thread looping in this manner. There were no other threads referencing the queue apart from the consumer, which was sitting waiting to dequeue.

I'm not too sure how it works, but it seems that either the thread never had a producer hash entry created, or something (another thread?) invalidated it when it shouldn't have.

oviano commented 2 years ago

Is it conceivable that the thread-exit handler code is clearing down the hash entry after a new thread has started and is already reusing the same thread id/hash?

This seems to happen when threads are stopped and new threads immediately created.

I will disable the thread-local stuff and see if the problem automagically goes away…

oviano commented 2 years ago

Apologies for the running commentary. I added some tracing to log the thread ID when a thread is subscribed to the thread-exit listener, and when the exit handler is called.

What I am finding is that the threads stuck in this loop were seemingly never given a producer.

What seems to have happened is that all the threads that were stopped earlier have left their producers in the reusable (`invalid_thread_id2`) state, so there is never an available empty slot (`invalid_thread_id`) to allow the loop to break.

Does that make sense as an explanation?

cameron314 commented 2 years ago

Thanks for looking into this. That makes sense, and is looking increasingly like a bug! I'll take a look at lunch.

oviano commented 2 years ago

It's possibly because I'm using it in a way that is less conventional, i.e. as part of a logging system, where it's normal for threads that come and go during execution to want to write logging items to the queue.

Maybe that's different from the envisaged use of long-running threads in a more stable producer -> consumer model.

cameron314 commented 2 years ago

Nah, it seems to be a legitimate bug. I'll try to work out a fix, but in the meantime you can comment out the define of `MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED` in the header.
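
(For reference, that workaround amounts to a one-line edit in concurrentqueue.h; the surrounding `#if` guards are elided here:)

```cpp
// Workaround: force the fallback (non-thread_local) implicit producer
// lookup by commenting out this define where it appears in concurrentqueue.h:
//#define MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED
```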

Thanks for finding this!

cameron314 commented 2 years ago

The implicit producer hash table should always have a load factor of no more than 75%. The infinite loop you see indicates it reached 100%.

`invalid_thread_id` is used to indicate an empty slot, and `invalid_thread_id2` is used as a tombstone for a removed value (and may later be overwritten, like an empty slot). The count is not decremented when a tombstone is written, so there should always be at least one truly free slot in the hash, preventing that infinite loop.

However, the count was also not being incremented when a free (not reused!) slot was taken for a reused producer, which ultimately resulted in the count not reflecting the actual hash entries, allowing the table to fill to 100% capacity in pathological cases. This should now be fixed.

I also fixed an (unlikely) race during insertion (and replacement with the tombstone during deletion) where stale keys could be read, causing the loop to iterate past a usable value (during insertion) or overwrite a valid value (during deletion). This race could lead to a similar violation of hash invariants.
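
A condensed sketch of that invariant and the fix (illustrative only; `ImplicitHash` and `tryInsert` are not the library's names, and for simplicity this sketch does not reuse tombstoned slots):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// `count` tracks every slot that is no longer empty -- live entries AND
// tombstones -- so the capacity check below guarantees at least one truly
// empty (invalid_thread_id == 0) slot always survives, which is exactly
// what terminates the probe loop.
struct ImplicitHash {
    std::atomic<std::uint32_t>* keys;   // 0 (invalid_thread_id) == empty
    std::size_t capacity;               // power of two
    std::atomic<std::size_t> count{0};
};

bool tryInsert(ImplicitHash& h, std::uint32_t id, std::size_t hashedId) {
    // Refuse to push occupancy past ~75%; the caller must grow the table.
    if (h.count.load(std::memory_order_relaxed) + 1 > (h.capacity * 3) / 4)
        return false;

    std::size_t index = hashedId;
    while (true) {
        index &= h.capacity - 1;
        std::uint32_t expected = 0;     // invalid_thread_id: an empty slot
        if (h.keys[index].compare_exchange_strong(expected, id)) {
            // The bug: when an *empty* slot was claimed for a *reused*
            // producer, this increment was skipped, so `count` undercounted
            // and the table could silently fill to 100%.
            h.count.fetch_add(1, std::memory_order_relaxed);
            return true;
        }
        ++index;                        // occupied or tombstoned: probe on
    }
}
```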

cameron314 commented 2 years ago

Let me know if this resolves the issue you're seeing. Thanks for your help investigating this (I cannot reproduce locally).

oviano commented 2 years ago

Thanks. I just had a go at reproducing the issue again with the new code and I wasn't able to reproduce it, so it looks like it's fixed.

I will report back again if I get any more issues!

oviano commented 2 years ago

So a different issue but perhaps related to this area/your changes:

[Screenshot 2022-03-12 at 15:01:13]

cameron314 commented 2 years ago

Hmm. Which VS version? Can you reproduce in isolation?

oviano commented 2 years ago

VS2022.

I will try and make it happen again.

Do you have tests that create and delete threads randomly and use this queue? My code isn't really doing this on a grand scale. It's a video streaming server with just one or two viewers, so someone connects and it fires up a bunch of threads for various purposes (capturing, decoding, sending the stream, etc.), then when the connection ends the threads exit.

Unless it’s exceedingly rare and I was very (un)lucky I would think it ought to appear in such a test scenario.

cameron314 commented 2 years ago

Ah, I think I found the issue. Upon destruction, the implicit producer attempts to deregister from the TLS-based registry, but it's usually running on a different thread, so it's not actually deregistered, leaving freed memory in the linked list, which later causes a crash.
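
An illustrative sketch of that hazard (not the library's code; `ExitNode`, `g_listHead`, and `t_myNode` are hypothetical names):

```cpp
#include <atomic>

// Exit-listener nodes live in a global linked list, but each node is
// located for removal via a thread_local pointer.
struct ExitNode {
    ExitNode* next = nullptr;
    void (*callback)(void*) = nullptr;
    void* userData = nullptr;
};

std::atomic<ExitNode*> g_listHead{nullptr};
thread_local ExitNode* t_myNode = nullptr;  // this thread's own node

void deregisterCurrentThread() {
    // Only the registering thread's node is visible here. If the queue
    // (and with it the implicit producer) is destroyed on a *different*
    // thread, t_myNode is null on that thread, this becomes a no-op, and
    // the original node stays linked in g_listHead even after its memory
    // is freed -- a later traversal then touches freed memory and crashes.
    ExitNode* node = t_myNode;
    if (node == nullptr)
        return;
    // ... unlink `node` from g_listHead under a lock ...
    t_myNode = nullptr;
}
```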

oviano commented 2 years ago

Right - I was actually trying to figure out if it was something along those lines but hadn’t proved it yet.

Then I tried reproducing it again to get more hints but couldn’t get it happening again.

Certainly in my use case the thread that creates the queue is often different from the one that is the producer.

PS: looking at the history of this, it seems you've had quite a fight with TLS over the years…

cameron314 commented 2 years ago

Yeah, it seems this code was less well tested than I remembered :-/ I only recently re-enabled `MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED` by default, which is why you're hitting these issues now. So, thanks for being the inadvertent guinea pig!

I'll build out a fix for this issue soon.

oviano commented 2 years ago

> Yeah, it seems this code was less well tested than I remembered :-/ I only recently re-enabled `MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED` by default, which is why you're hitting these issues now. So, thanks for being the inadvertent guinea pig!
>
> I'll build out a fix for this issue soon.

It's no problem, I'm pleased to help.

It's a really nice implementation, and there aren't too many cross-platform queues like this, so being able to drop it into places where I was previously using `std::mutex` + `std::condition_variable` + `std::queue`/`std::list`/etc. is really useful.
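
For contrast, the classic hand-rolled pattern being replaced looks roughly like this (standard library only):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Every producer and the consumer serialize on a single mutex, which is
// what a lock-free queue avoids.
std::mutex m;
std::condition_variable cv;
std::queue<std::string> q;

void produce(std::string item) {
    {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(item));
    }
    cv.notify_one();
}

std::string consume() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !q.empty(); });  // blocks until an item exists
    std::string item = std::move(q.front());
    q.pop();
    return item;
}
```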

cameron314 commented 2 years ago

Should be fixed. Let me know if you come across any other issues.

rob-stewart commented 1 year ago

Can you merge these changes into master? I've encountered similar symptoms and would prefer to use a release.

cameron314 commented 1 year ago

> Can you merge these changes into master? I've encountered similar symptoms and would prefer to use a release.

They are already in master. I've created a new release (v1.0.4). Closing this issue for now, as no further problems have been reported.