busy loop in thread_sleep(0) causes excessive cpu consumption regression from 0e78efad0be73d293880d1b71053c0d70a50a080

GoogleCodeExporter commented 9 years ago

What is the expected behavior? What do you see instead?
When running a large number of simultaneous VP8 codecs (many decoders and one 
encoder) in the same process, CPU usage is very high.  The encoder is 
configured with token_parts set to 3 and threads set to match the number of 
CPUs on the machine.

What version are you using? On what operating system?
latest master 067fc49996c4fb1f7f0a6dddaf4e74a8561350e0 on debian x86_64

Can you reproduce using the vpxdec or vpxenc tools? What command line are
you using?
No, the command line tool is not affected, as it is not embedded in a threaded 
app, so the result cannot be reproduced there.

Please provide any additional information below.

The attached patch was just a quick way to run a test, but it vastly improved 
results.
The busy loops in the code calling thread_sleep(0), even with the asm wait 
instructions, were heavily straining the system.

The difference with the patch applied was significant: many more concurrent 
transcoders could operate in the threaded app.

I suggest using a condition variable, or the existing semaphores, to signal the 
places where thread_sleep() is currently used to poll an int value.  I have not 
studied the code enough to feel comfortable making the patch myself, or I would 
have.
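
A minimal sketch of the semaphore idea, not taken from libvpx: the waiter blocks in sem_wait() instead of polling an int in a thread_sleep(0) loop. The names progress_t, post_progress and wait_for_progress are hypothetical, and the sketch assumes unnamed POSIX semaphores are available (as on Linux).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical stand-in for a progress value that a worker would
 * otherwise poll in a busy loop. */
typedef struct {
  sem_t ready; /* posted once the value below has been written */
  int value;   /* payload published by the producer */
} progress_t;

/* Producer side: write the value first, then wake the waiter. */
static void *post_progress(void *arg) {
  progress_t *p = (progress_t *)arg;
  p->value = 42;
  sem_post(&p->ready);
  return NULL;
}

/* Consumer side: blocks in the kernel until the producer posts,
 * so no CPU time is spent while waiting. */
int wait_for_progress(void) {
  progress_t p;
  pthread_t t;
  int v;

  p.value = 0;
  sem_init(&p.ready, 0, 0);
  pthread_create(&t, NULL, post_progress, &p);

  /* Retry only if the wait was interrupted by a signal. */
  while (sem_wait(&p.ready) == -1 && errno == EINTR) {
  }
  v = p.value;

  pthread_join(t, NULL);
  sem_destroy(&p.ready);
  return v;
}
```

The post happens after the write, so the waiter never observes the value before it is published.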

With the patch, in a 36-user video conference, the difference is staggering and 
everything functions well.  Without the patch the 12-core machine (24 logical 
cores with hyperthreading) is maxed out on all CPUs; with the patch the load 
was evenly distributed at about 20%, with resources to spare.

I noticed that before 0e78efad0be73d293880d1b71053c0d70a50a080 the code used 
usleep(nms*1000), so thread_sleep(0) was effectively calling usleep(0).  Had 
that revision changed the calls to thread_sleep(1) instead of switching to 
sched_yield(), it would probably have been a better solution.
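
To illustrate the difference between yielding and actually sleeping, here is a rough sketch. sleep_ms and time_sleep_ns are hypothetical helpers, not libvpx's thread_sleep; the sketch assumes a POSIX system with nanosleep() and clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>
#include <time.h>

/* Hypothetical helper: 0 means "yield only", any other value really
 * sleeps for at least `ms` milliseconds. */
static void sleep_ms(unsigned ms) {
  struct timespec ts;

  if (ms == 0) {
    /* Returns at once when no other thread is runnable, so a loop
     * around this call keeps the core busy. */
    sched_yield();
    return;
  }
  ts.tv_sec = ms / 1000;
  ts.tv_nsec = (long)(ms % 1000) * 1000000L;
  /* On EINTR nanosleep stores the remaining time in ts; retry with it. */
  while (nanosleep(&ts, &ts) == -1 && errno == EINTR) {
  }
}

/* Nanoseconds of wall-clock time spent inside one sleep_ms(ms) call. */
long long time_sleep_ns(unsigned ms) {
  struct timespec a, b;

  clock_gettime(CLOCK_MONOTONIC, &a);
  sleep_ms(ms);
  clock_gettime(CLOCK_MONOTONIC, &b);
  return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL +
         (long long)(b.tv_nsec - a.tv_nsec);
}
```

Comparing time_sleep_ns(0) with time_sleep_ns(1) on an otherwise idle machine shows why a polling loop built on the zero case behaves like a spin loop, while the one-millisecond case gives the core back to the scheduler.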

However, sleeping is clearly not the optimal solution here.  The best thing 
would be to block on a condition variable and have the other threads send a 
cond_signal() to wake the waiter every time the awaited condition is met.
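
The condition-variable approach could look roughly like the sketch below. It is an illustration only: row_sync_t and its functions are hypothetical names, not libvpx code, and it assumes a single waiter per row_sync_t so that pthread_cond_signal() is sufficient.

```c
#include <pthread.h>

/* Hypothetical replacement for an int that one thread updates and
 * another polls: the waiter blocks until the value is large enough. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t changed;
  int col; /* last column completed by the producer, -1 = none yet */
} row_sync_t;

void row_sync_init(row_sync_t *s) {
  pthread_mutex_init(&s->lock, NULL);
  pthread_cond_init(&s->changed, NULL);
  s->col = -1;
}

/* Producer: record progress and wake the (single) waiter. */
void row_sync_publish(row_sync_t *s, int col) {
  pthread_mutex_lock(&s->lock);
  s->col = col;
  pthread_cond_signal(&s->changed);
  pthread_mutex_unlock(&s->lock);
}

/* Consumer: sleep until at least min_col is done; returns the column seen.
 * The while loop guards against spurious wake-ups. */
int row_sync_wait(row_sync_t *s, int min_col) {
  int seen;
  pthread_mutex_lock(&s->lock);
  while (s->col < min_col) pthread_cond_wait(&s->changed, &s->lock);
  seen = s->col;
  pthread_mutex_unlock(&s->lock);
  return seen;
}

/* Demo producer: publishes columns 0..9 in order. */
static void *publish_rows(void *arg) {
  int c;
  for (c = 0; c < 10; ++c) row_sync_publish((row_sync_t *)arg, c);
  return NULL;
}

/* Demo: wait for the final column while another thread produces it. */
int run_row_sync_demo(void) {
  row_sync_t s;
  pthread_t t;
  int seen;

  row_sync_init(&s);
  pthread_create(&t, NULL, publish_rows, &s);
  seen = row_sync_wait(&s, 9);
  pthread_join(t, NULL);
  pthread_cond_destroy(&s.changed);
  pthread_mutex_destroy(&s.lock);
  return seen;
}
```

With several waiters on different thresholds, pthread_cond_broadcast() would be the safer call. The trade-off against spinning is the cost of a lock and a possible wake-up on every publish.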

Either way, the code as it stands creates a 200% penalty in CPU usage.  The 
decoder busy loop is more likely the culprit than the encoder ones, since I am 
using many more decoders than encoders.  However, it was the global change that 
delivered working results, and I noticed 2 busy loops in the encoder and 1 in 
the decoder, so they all probably play a role.

Original issue reported on code.google.com by anthony....@gmail.com on 19 Mar 2015 at 11:44

Attachments:

test.diff

GoogleCodeExporter commented 9 years ago

Original comment by ya...@google.com on 20 Mar 2015 at 12:45

GoogleCodeExporter commented 9 years ago

I updated the patch to work on Mac and Windows by restoring the commented-out 
use of nanosleep and changing all the thread_sleep calls to pass 1 instead of 0.

This still serves only as a demonstration of a change that testing, on at least 
Mac and Linux, has shown to improve performance.  It would still be better to 
use condition variables or something else.

Original comment by anthony....@gmail.com on 24 Mar 2015 at 8:18

Attachments:

vpx.sleep.diff

GoogleCodeExporter commented 9 years ago

Original comment by ya...@google.com on 2 Apr 2015 at 9:41

Changed state: Assigned

GoogleCodeExporter commented 9 years ago

The VP8 multi-threaded decoder was slow when running on Linux; commit 
0e78efad0be73d293880d1b71053c0d70a50a080 was a fix for that issue.

The VP8 multi-threaded codec won't work well for your use case (more decoders 
than cores) because of the busy-waiting synchronization. Can you turn off 
multi-threading (-t 1) for all decoders, and maybe use -t 2 for the encoder?

Original comment by yunqingw...@google.com on 2 Apr 2015 at 10:43

GoogleCodeExporter commented 9 years ago

Original comment by yunqingw...@google.com on 2 Apr 2015 at 10:44

GoogleCodeExporter commented 9 years ago

As I mentioned, the combination of the vpx_sleep patch I attached and the 
multi-threaded decoder gives the best performance for many decoders in the same 
multithreaded process.  I already provided a rather detailed description of 
that.  The unbounded busy sync is part of the problem and is not visible with 
a single decoder in one process.

Original comment by anthony....@gmail.com on 2 Apr 2015 at 10:51
