fraunhoferhhi / vvdec

VVdeC, the Fraunhofer Versatile Video Decoder
https://www.hhi.fraunhofer.de/en/departments/vca/technologies-and-solutions/h266-vvc.html
BSD 3-Clause Clear License

multithreading stuck when the number of threads is too high #119

Closed whhwhhwhh closed 8 months ago

whhwhhwhh commented 2 years ago

As mentioned in issue #117 (https://github.com/fraunhoferhhi/vvdec/issues/117), if the number of threads is too high, such as 100, the decoder gets stuck. When a thread gets stuck, taskIt is null (taskIt = findNextTask( threadId, nextTaskIt );), which might be a bug. Please check it.

adamjw24 commented 2 years ago

The issue has a workaround in the current master branch (but not yet in v1.6.0). We will look into a proper fix at some point.
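Until a proper fix lands, one possible user-side mitigation is to cap the decoder's thread count explicitly rather than relying on the automatic default. The sketch below assumes the public C API from vvdec/vvdec.h with a vvdecParams::threads field and the vvdec_params_default / vvdec_decoder_open entry points; the exact field names should be checked against the header of the release in use.

```cpp
// Hedged sketch: open a decoder with an explicit thread cap instead of letting
// the library derive a count from the machine's core count.
#include <vvdec/vvdec.h>

vvdecDecoder* openDecoderWithThreadCap( int maxThreads )
{
  vvdecParams params;
  vvdec_params_default( &params );  // start from the library defaults
  params.threads = maxThreads;      // cap the worker thread count explicitly
  return vvdec_decoder_open( &params );
}
```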

elkx1 commented 1 year ago

I am also encountering this issue with the 2.0.0 release, which has the workaround (maximum thread count limited to 64). I am seeing it on a 32-core (64-thread) machine with an AMD EPYC 7543 CPU.

It looks like one thread is stuck waiting in

m_allThreadsWaitingCV.wait( lock, [=] { return !m_allThreadWaitingMoreWork; } );

which means the thread pool has been paused.

I don't know this code well, but a quick scan suggests this is possibly what's happening:

The thread pool can be paused after the barrier task for the final packet in the stream has arrived, and then it will never wake up, because PoolPause::unpauseIfPaused won't be called again unless there is another call to vvdec_decode with more data (which might not happen).

This can happen if the final barrier task arrives in between the call to findNextTask and m_poolPause.pauseIfAllOtherThreadsWaiting();
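To make the suspected interleaving concrete, here is a small self-contained sketch of the pattern. The names are modeled on the snippet above, but the structure is simplified and hypothetical, not the actual vvdec thread pool: a worker finds no task, the final barrier task of the stream completes in the gap before the worker pauses, and the worker then waits on the condition variable with nothing left that will ever call the unpause path.

```cpp
// Simplified, hypothetical illustration of the suspected lost-wakeup race;
// names mirror the snippet above, but this is not the actual vvdec code.
#include <condition_variable>
#include <mutex>
#include <optional>

struct PoolSketch
{
  std::mutex              m_lock;
  std::condition_variable m_allThreadsWaitingCV;
  bool                    m_allThreadWaitingMoreWork = false;

  std::optional<int> findNextTask() { return std::nullopt; }  // placeholder: no runnable task
  void               runTask( int ) {}                        // placeholder

  // Worker side, running in each pool thread.
  void workerLoop()
  {
    for( ;; )
    {
      std::optional<int> task = findNextTask();      // (1) sees no runnable task

      // (2) If the final barrier task of the stream completes exactly here,
      //     nothing will call unpauseIfPaused() again, because no further
      //     vvdec_decode() call with more data is coming.

      if( !task )
      {
        std::unique_lock<std::mutex> lock( m_lock );
        m_allThreadWaitingMoreWork = true;           // (3) pause the pool
        // (4) parks forever: the event that should clear the flag has already passed
        m_allThreadsWaitingCV.wait( lock, [this] { return !m_allThreadWaitingMoreWork; } );
      }
      else
      {
        runTask( *task );
      }
    }
  }

  // Producer side: in this scenario only reached if the application feeds more data.
  void unpauseIfPaused()
  {
    std::lock_guard<std::mutex> lock( m_lock );
    m_allThreadWaitingMoreWork = false;
    m_allThreadsWaitingCV.notify_all();
  }
};
```

The usual remedies for this kind of race are either to re-check the end-of-stream condition under the same lock before waiting, or to have the final barrier task itself trigger the unpause; whether either fits the vvdec design is for the maintainers to judge.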

adamjw24 commented 1 year ago

Interesting, thanks for the analysis.

Now, for my understanding: with which number of threads did you encounter the issue? And does the problem happen at the end of decoding before flushing, or after flushing? (But certainly after the last packet has been passed to the decoder?)

K-os commented 8 months ago

This issue seems to be fixed with the latest reworking of the thread pool implementation.

Please reopen the issue if the problem persists.