flaccidware / webm

Automatically exported from code.google.com/p/webm

busy loop in thread_sleep(0) causes excessive CPU consumption, regression from 0e78efad0be73d293880d1b71053c0d70a50a080 #979

Status: Open

GoogleCodeExporter commented 9 years ago
What is the expected behavior? What do you see instead?
When using a large number of simultaneous VP8 codecs, with many decoders and one encoder, CPU usage is very high. The configuration is one encoder with token_parts set to 3 and threads set to match the number of CPUs on the machine.

What version are you using? On what operating system?
Latest master (067fc49996c4fb1f7f0a6dddaf4e74a8561350e0) on Debian x86_64.

Can you reproduce using the vpxdec or vpxenc tools? What command line are
you using?
No, the command-line tools are not affected, as they are not embedded in a threaded app, so the result cannot be duplicated there.

Please provide any additional information below.

The attached patch was just a quick way to run a test, but it vastly improved results. The busy loops in the code calling thread_sleep(0), even with the asm pause hints, were heavily straining the system.

The difference with the patch applied was significant, allowing many more concurrent transcoders to operate in the threaded app.

I suggest some kind of condition-variable signal, or use of the existing semaphores, to replace the places where thread_sleep() is currently used to poll an int value. I have not studied the code enough to feel comfortable making that patch myself, or I would have.
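
For illustration, the pattern being described looks roughly like this; the names are simplified stand-ins, not the actual libvpx source:

#include <sched.h>

#define thread_sleep(nms) sched_yield()   /* current master behavior */

static volatile int last_row = -1;        /* advanced by a producer thread */

static void wait_for_row(int row_needed) {
    while (last_row < row_needed) {
        /* libvpx also issues an x86 "pause" hint here; that calms the
           pipeline but never blocks, so the loop stays runnable and
           consumes a full core while it waits */
        thread_sleep(0);
    }
}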

Using the patch in a 36-user video conference, the difference is staggering and everything functions well. Without the patch, the 12-core machine (24 logical cores with hyperthreading) is maxed out on every CPU; with the patch, load is evenly distributed at around 20% per CPU with resources to spare.

I noticed that before 0e78efad0be73d293880d1b71053c0d70a50a080 the macro used usleep(nms*1000), so the call sites were effectively calling usleep(0). Had the old revision changed the call sites to thread_sleep(1) instead of switching the macro to sched_yield(), it would probably have been a better solution.
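
Reconstructing the macro history from that description (the _old/_new names here are mine, for side-by-side comparison, not libvpx's):

#include <sched.h>
#include <unistd.h>

/* before 0e78efad: call sites passing 0 made this usleep(0) */
#define thread_sleep_old(nms) usleep((nms) * 1000)

/* after 0e78efad (current master): a pure yield, never blocking */
#define thread_sleep_new(nms) sched_yield()

/* the suggestion above: keep the old macro but call thread_sleep(1),
   so each poll blocks briefly instead of spinning */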

However, sleeping is clearly not the optimal solution for this either. The best approach would be to block on a condition variable and have the worker threads issue a cond_signal() to wake the waiter every time the real condition is met.

Either way, the code as it stands creates a roughly 200% penalty in CPU usage. It is more likely the decoder busy loop than the encoder one, since I am using many more decoders than encoders, but the global change is what delivered working results, and I noticed two busy loops in the encoder and one in the decoder, so they all probably play a role.
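
A minimal sketch of that condition-variable approach, assuming a shared row counter; the type and function names here are hypothetical, not libvpx's:

#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             last_row;   /* replaces the polled int */
} row_sync;

/* consumer: blocks in the kernel instead of spinning on thread_sleep(0) */
static void wait_for_row(row_sync *s, int row) {
    pthread_mutex_lock(&s->mutex);
    while (s->last_row < row)
        pthread_cond_wait(&s->cond, &s->mutex);
    pthread_mutex_unlock(&s->mutex);
}

/* producer: wakes any waiters each time it finishes a row */
static void publish_row(row_sync *s, int row) {
    pthread_mutex_lock(&s->mutex);
    s->last_row = row;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}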

Original issue reported on code.google.com by anthony....@gmail.com on 19 Mar 2015 at 11:44


GoogleCodeExporter commented 9 years ago
I updated the patch to work on Mac and Windows by restoring the commented-out nanosleep() usage and changing all the thread_sleep() calls to 1 instead of 0.

This still serves only as a demonstration of what has been proven in testing, on at least Mac and Linux, to improve performance. It would still be better to use condition variables or something else.
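
The POSIX side of the described change, as a sketch (the exact macro body appears in the diff quoted later in this thread); note that tv_nsec = 1000*nms makes the argument microseconds rather than milliseconds, so thread_sleep(1) requests a 1us sleep that the kernel will round up:

#include <time.h>

#define thread_sleep(nms) \
    do { struct timespec ts; ts.tv_sec = 0; ts.tv_nsec = 1000 * (nms); \
         nanosleep(&ts, NULL); } while (0)

/* ...and all call sites changed from thread_sleep(0) to thread_sleep(1) */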

Original comment by anthony....@gmail.com on 24 Mar 2015 at 8:18


GoogleCodeExporter commented 9 years ago
The VP8 multi-threaded decoder was slow while running on Linux; commit 0e78efad0be73d293880d1b71053c0d70a50a080 was a fix for that issue.

The VP8 multi-threaded codec won't work well for your use case (more decoders than the number of cores) because of the busy-waiting synchronization. Can you turn off multi-threading (-t 1) for all decoders, and maybe use -t 2 for the encoder?
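
For an app embedding libvpx (rather than the command-line tools), those settings would look something like this sketch, with error handling trimmed to return codes:

#include "vpx/vpx_decoder.h"
#include "vpx/vpx_encoder.h"
#include "vpx/vp8cx.h"
#include "vpx/vp8dx.h"

static int configure(vpx_codec_ctx_t *dec, vpx_codec_ctx_t *enc) {
    vpx_codec_dec_cfg_t dec_cfg = {0};
    vpx_codec_enc_cfg_t enc_cfg;

    dec_cfg.threads = 1;    /* -t 1: single-threaded decode, no busy-wait sync */
    if (vpx_codec_dec_init(dec, vpx_codec_vp8_dx(), &dec_cfg, 0)) return -1;

    if (vpx_codec_enc_config_default(vpx_codec_vp8_cx(), &enc_cfg, 0)) return -1;
    enc_cfg.g_threads = 2;  /* -t 2 for the single encoder */
    if (vpx_codec_enc_init(enc, vpx_codec_vp8_cx(), &enc_cfg, 0)) return -1;
    return 0;
}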

Original comment by yunqingw...@google.com on 2 Apr 2015 at 10:43


GoogleCodeExporter commented 9 years ago
As I mentioned, the combination of the vpx_sleep patch I attached and the multi-threaded decoder gives the best performance for many decoders in the same multithreaded process. I provided a rather detailed description of that already. The unbounded busy-wait synchronization is part of the problem, and it is not visible with a single decoder in one process.

Original comment by anthony....@gmail.com on 2 Apr 2015 at 10:51

GoogleCodeExporter commented 9 years ago
Tested the following on an Ubuntu desktop by running a single VP8 decoder:

-#define thread_sleep(nms) sched_yield();/* {struct timespec ts;ts.tv_sec=0; ts.tv_nsec = 1000*nms;nanosleep(&ts, NULL);} */
+#define thread_sleep(nms) {struct timespec ts;ts.tv_sec=0; ts.tv_nsec = 1000*nms;nanosleep(&ts, NULL);}

The decoder speed went down from 600+ fps to 400 fps. I understand that using sleep() consumes less CPU than using sched_yield(), but the performance regression is far too large.

In your use case, setting the decoders' number of threads to 1 may help, since no synchronization is involved then. But if that doesn't help, you can always patch your build.

Original comment by yunqingw...@google.com on 10 Jun 2015 at 5:48

GoogleCodeExporter commented 9 years ago
As stated before, the patch is clearly not the right solution. The right solution is to use some sort of broadcast on a condition variable tied to the real condition being waited on, instead of a busy loop, but we need your expertise to know where those triggers should actually go.

Original comment by m...@jerris.com on 10 Jun 2015 at 5:54

GoogleCodeExporter commented 9 years ago
I already mentioned that the patch was not the ideal solution. So rather than improve it with a semaphore or condition variable, you choose to close it as won't-fix? With a single thread, the number of concurrent decoders suffers tremendously. Do you not have any consideration for a single process using the lib 30-100 times concurrently?

Clearly the threading abstraction/portability layer is very thin and not really robust. Patching our own build is troublesome, as we are trying to use the system packages shipped in distros.

Original comment by anthony....@gmail.com on 10 Jun 2015 at 5:56

GoogleCodeExporter commented 9 years ago
I'm not sure I understand.

Why can't each decode of a video run in single-threaded mode on its own thread?

Why do you need each of the decoders running multithreaded when you have so many of them?

Jim

Original comment by jimbankoski@google.com on 10 Jun 2015 at 7:06

GoogleCodeExporter commented 9 years ago
Well, when we are on a 24-core machine with, say, 10 decoders, the multithreaded decoder spreads the load over all of the CPUs. Ideally the goal is to not max out any single CPU.

In load testing, using one thread allowed fewer concurrent participants than running with the patch and threading on; threading on without the patch was horribly worse.

Is it terribly hard to employ some sort of controlled blocking in that busy loop? I am new to the code, but when I originally reported this issue it looked like one thread was waiting for an int to change from -1 to some other number. Can't that just be replaced with waiting on a semaphore or condition variable, signaled whenever that value changes? Or is the goal to avoid any more threading abstraction?

Original comment by anthony....@gmail.com on 10 Jun 2015 at 7:17

GoogleCodeExporter commented 9 years ago
I'm sorry you are having trouble getting that to work. Are you trying to run 10 concurrent 1080p decodes, or something like that?

Does what VP9 is doing work better for you? Could you check? I think we've got slightly different threading primitives there.

Original comment by jimbankoski@google.com on 10 Jun 2015 at 10:46

GoogleCodeExporter commented 9 years ago
It's various combinations of resolutions, mostly 720p or 1080p.

The threading situation is better in VP9, yes, per my other open issue about stack size. Either way, the busy loop in question is still a problem unless something is done that neither sleeps nor spinlocks. Did the comment I made about waiting for a value to change make sense? The spinlock might make one full-speed decode work, but for many simultaneous realtime decoders the spins eat all the spare CPU time and cause slowdowns.

We are decoding many concurrent streams in realtime over RTP, not from files.

Original comment by an...@freeswitch.org on 10 Jun 2015 at 11:48

GoogleCodeExporter commented 9 years ago
Can this issue please be re-opened so we can explore a condition-variable or semaphore based solution that would improve performance in all scenarios?

Original comment by m...@jerris.com on 11 Jun 2015 at 7:27

GoogleCodeExporter commented 9 years ago
Okay, the issue is open again.

Unfortunately, I can't promise we'll get to this any time soon, as it's potentially a rather big change affecting a lot of code and people.

If you are willing to explore moving the semaphores we used in VP9 back into VP8, I guarantee we will test and review your code, and we will land it if it works better for WebRTC in general.
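
As a rough illustration of that direction, a POSIX-semaphore version of the row synchronization might look like the following; this is a sketch, not the actual VP9 code, and unnamed semaphores are unavailable on macOS, so a portable version would need the named variant or another primitive:

#include <semaphore.h>

static sem_t row_ready;   /* one count per completed row */

static void sync_init(void) {
    sem_init(&row_ready, 0, 0);      /* starts empty: no rows done yet */
}

static void publish_row(void) {
    sem_post(&row_ready);            /* producer: one post per finished row */
}

static void wait_for_row(void) {
    sem_wait(&row_ready);            /* consumer blocks in the kernel
                                        instead of spinning */
}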

Original comment by jimbankoski@google.com on 12 Jun 2015 at 12:55