Chen-tao / webm

Automatically exported from code.google.com/p/webm

vp9 encoder is slow #553

GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
vp9 has gone through a pass of optimization, so opening this bug to report back on
https://code.google.com/p/webm/issues/detail?id=527

Re #2 bug 527 - show fps.
That's working now, thanks.  fps and ETA are shown during compression.
Pass 2/2 frame 6008/5983 22912842B 3538581 ms 1.70 fps [ETA 27:43:45] 43.882 42.836 45.961 49.254    4137FFFFFF

On a short video (bear), the encoding times of 3 versions of the encoder are 
compared:
jan24   317054.21
feb23   313116.16
feb23x  113901.76
The last one is v1.2.0-1710-gbf0570a experimental branch.  2.74x faster.
feb23x vp8 encodes in 4206.39 with 1 thread.  27.07x faster.
feb23x vp8 encodes in 2307.61 with 6 threads. 49.35x faster.

Original issue reported on code.google.com by fbarch...@google.com on 25 Feb 2013 at 5:35

GoogleCodeExporter commented 8 years ago
Experimental is 3.297 times faster than Master.
44.980 times slower than vp8.

Version  VP9 Time VP8 Time
jan14      72,628 3,958
jan20     311,193 1,989
jan24     338,425 1,939
jan26     315,218 1,809
feb23     313,918 2,170
feb23x    113,970 2,015
mar1      316,140 2,151
mar1x      96,773 2,183
mar07     317,223 1,998
mar07x     94,093 2,188
mar09     319,236 1,974
mar09x     98,094 2,266
mar11     317,829 1,981
mar11x     96,393 2,143
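
The two ratios quoted at the top of this comment can be reproduced from the table. A minimal sketch, assuming they were taken from the mar11 rows (the comment does not say which rows were used, but those rows give matching numbers); the variable and function names are only illustrative:

```python
# Timings copied from the mar11 / mar11x rows of the table above
# (units as reported by the timing tool; only the ratios matter here).
vp9_master = 317_829        # mar11, VP9
vp9_experimental = 96_393   # mar11x, VP9
vp8_experimental = 2_143    # mar11x, VP8


def ratio(slower: float, faster: float) -> float:
    """Return how many times longer `slower` takes than `faster`."""
    return slower / faster


print(f"experimental vs master: {ratio(vp9_master, vp9_experimental):.3f}x faster")
print(f"VP9 vs VP8:             {ratio(vp9_experimental, vp8_experimental):.3f}x slower")
```

The same helper can be applied to any other pair of rows to see how the gap moved between builds.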

Using the following script:
call :timeone vpxenc_jan14.exe   vp9
call :timeone vpxenc_jan20.exe   vp9
call :timeone vpxenc_jan24.exe   vp9
call :timeone vpxenc_jan26.exe   vp9
call :timeone vpxenc_feb23.exe   vp9
call :timeone vpxencx_feb23.exe  vp9
call :timeone vpxenc_mar1.exe    vp9
call :timeone vpxencx_mar1.exe   vp9
call :timeone vpxenc_mar07.exe   vp9
call :timeone vpxencx_mar07.exe  vp9
call :timeone vpxenc_mar09.exe   vp9
call :timeone vpxencx_mar09.exe  vp9
call :timeone vpxenc_mar11.exe   vp9
call :timeone vpxencx_mar11.exe  vp9

goto :eof

:timeone

timex %1 -w 640 -h 360 --fps=30000/1001 --target-bitrate=400 ^
  bear.640x360_30Hz_P420.yuv -o bear0.vp9.webm -p 2 --codec=%2 --good ^
  --cpu-used=0 --lag-in-frames=25 --min-q=0 --max-q=63 --end-usage=vbr ^
  --auto-alt-ref=1 --kf-max-dist=9999 --kf-min-dist=0 --drop-frame=0 ^
  --static-thresh=0 --bias-pct=50 --minsection-pct=0 --maxsection-pct=2000 ^
  --arnr-maxframes=7 --arnr-strength=5 --arnr-type=3 --sharpness=0 ^
  --undershoot-pct=100 -v --psnr -t 4
goto :eof

Original comment by fbarch...@chromium.org on 12 Mar 2013 at 3:39

GoogleCodeExporter commented 8 years ago
VP9 bitstream is not yet finalized 

Original comment by albe...@google.com on 14 Mar 2013 at 10:27

GoogleCodeExporter commented 8 years ago
Those working on improving vp9 quality (before it's final) would benefit from 
faster iteration.  Making a change and testing whether it helps quality 
typically takes hours, if not days, per test run.

The improvement made so far (3.2x) has made a huge difference.

I'd suggest enabling threads - that shouldn't affect the bitstream, and would 
give an order of magnitude performance difference.

Original comment by fbarch...@chromium.org on 15 Mar 2013 at 3:00

GoogleCodeExporter commented 8 years ago
One more interesting thing: on my Core i5 (4 cores) vp9 uses only one core (KDE 
system monitor shows 25% CPU load) even when I use the -t 3 option.
My command line:
vpxenc $HOME/video.y4m -o "${g%.*}.webm" \
  --i420 --passes=2 --pass=2 --fpf=pass.log -t 3 \
  --good --cpu-used=0 --target-bitrate=1200 --auto-alt-ref=1 \
  -v --codec=vp9 --end-usage=vbr --minsection-pct=5 \
  --maxsection-pct=800 --lag-in-frames=16 --cpu-used=0 \
  --kf-min-dist=0 --kf-max-dist=360 \
  --static-thresh=0 --min-q=0 --max-q=60 & mplayer -benchmark -nofs -noframedrop -vo yuv4mpeg:file=/home/ilya/video.y4m -ass -vf harddup -nosound "$g"

Original comment by yast...@gmail.com on 18 May 2013 at 7:31

GoogleCodeExporter commented 8 years ago
Some progress:

Date    ms/f
jan14   966
jan20   3,803
jan24   3,625
jan26   4,002
feb23   4,016
mar1    3,962
mar07   3,936
mar09   4,121
mar11   4,082
mar14   3,984
mar23   4,030
apr26   1,294
may03   1,293
may12   1,288
jun01   1,298
jun12   12,844
jun14   9,601
jul31   2,118
aug03   2,044
aug18   2,078
aug24   1,950

VP8 is 36 ms/f on the same machine/file.  54x faster.
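
For comparison with the fps figures that vpxenc prints elsewhere in this thread, ms/f converts to fps as 1000 / (ms/f). A small sketch, assuming the 54x figure compares the latest row (aug24) against VP8; the names are illustrative only:

```python
def ms_per_frame_to_fps(ms_per_frame: float) -> float:
    """Convert milliseconds per frame into frames per second."""
    return 1000.0 / ms_per_frame


vp9_ms_per_frame = 1950  # aug24 row of the table above
vp8_ms_per_frame = 36    # VP8 figure quoted above

slowdown = vp9_ms_per_frame / vp8_ms_per_frame

print(f"VP9: {ms_per_frame_to_fps(vp9_ms_per_frame):.2f} fps")
print(f"VP8: {ms_per_frame_to_fps(vp8_ms_per_frame):.2f} fps")
print(f"VP8 is about {slowdown:.0f}x faster")
```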

Original comment by fbarch...@google.com on 25 Aug 2013 at 8:33

GoogleCodeExporter commented 8 years ago
Using Aug 29 version
On bear movie

VP8
Pass 2/2 frame   82/86    132565B   12933b/f  387607b/s 1310022 us (62.59 fps) 573F
Stream 0 PSNR (Overall/Avg/Y/U/V) 37.551 37.836 36.661 41.026 43.575
TIMEX 1582.00 ms (1.58 seconds)

VP9 cpu used=1
Pass 2/2 frame   82/82    129313B   12615b/f  378098b/s   14284 ms (5.74 fps)
Stream 0 PSNR (Overall/Avg/Y/U/V) 39.840 39.970 38.879 42.694 44.946
TIMEX 14463.00 ms (14.46 seconds)

cpu used = 0 109714.00 ms
cpu used = 1 15086.00 ms
cpu used = 2 4854.00 ms
cpu used = 3 3351.00 ms
cpu used = 4 2776.00 ms

Long movie
Pass 2/2 frame 4228/4203 15216752B 1743055 ms 2.43 fps [ETA 20:09:35] 49.546 
48.730 51.543 52.201    1080F

Re #4 'interesting thing' - that's why I suggested enabling threads in #3.  It's 
still not enabled, and would be an easy change for a large win.
One idea would be a thread per tile for encoding.  Tiles allow full parallelism 
and just need the bitstream writes serialized.
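
The thread-per-tile idea could be structured roughly as below. This is only a sketch of the shape of the idea, not libvpx code: `encode_tile` is a hypothetical placeholder for the real per-tile work, and the length-prefixed output layout is invented for illustration. Python threads also would not give a real speedup for CPU-bound work; the point is just that tiles are encoded concurrently while a single thread writes them out in order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def encode_tile(tile: bytes) -> bytes:
    """Hypothetical stand-in for per-tile encoding (not real compression)."""
    return bytes(b ^ 0xFF for b in tile)


def encode_frame(tiles: List[bytes], threads: int = 4) -> bytes:
    # Tiles are assumed independent, so each one can be handed to a worker.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        encoded = list(pool.map(encode_tile, tiles))  # map() keeps tile order

    # Serialized step: one thread appends the tiles to the output in order.
    out = bytearray()
    for chunk in encoded:
        out += len(chunk).to_bytes(4, "big")  # illustrative length prefix
        out += chunk
    return bytes(out)


frame = encode_frame([b"\x00\x01", b"\x02"], threads=2)
print(frame.hex())
```

Because the write step is ordered and single-threaded, the output is the same regardless of the thread count, which is the property that would keep the bitstream unaffected.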

Original comment by fbarch...@google.com on 1 Sep 2013 at 5:33

GoogleCodeExporter commented 8 years ago
Maybe my command line is wrong, but --cpu-used=1 (or 2 or 3) and threads=4 
still doesn't work. vpxenc still uses one thread.

Original comment by yast...@gmail.com on 16 Sep 2013 at 4:38

GoogleCodeExporter commented 8 years ago
Re #7  VP9 does not support threads.

Long videos still take quite a while to encode:

cpu used=0
Pass 2/2 frame  298/273  4029642B 2922957 ms 9808.58 ms/f [ETA 527:47:26] 42.323 41.067 46.737 47.617    8769F

cpu used=1
Pass 2/2 frame 10625/10600 76571391B 6799631 ms 1.56 fps [ETA 30:01:31] 44.487 
43.733 45.009 48.984     117F

Original comment by fbarch...@google.com on 17 Sep 2013 at 8:18

GoogleCodeExporter commented 8 years ago
cpu used=0 remains a little too slow to use in practice.
After 1 day, estimate is 382 hrs = 15.91 days.
Pass 2/2 frame 10335/10310 73179543B 84133871 ms 7.37 fpm [ETA 382:46:04] 
43.911 43.087 44.596 48.986   17701F

Of the 32 videos in testmatrix, 6 take more than a day
brian -  Pass 2/2 frame 29924/29899 54374479B 85703619 ms 20.95 fpm [ETA 
122:58:01] 42.184 41.281 44.06
garden - Pass 2/2 frame 1447/1422 58168937B 84758018 ms 1.02 fpm [ETA 21:37:40] 
44.362 43.732 45.578 46.436
dance -  Pass 2/2 frame 4892/4867 39493166B 85999862 ms 3.41 fpm [ETA 19:09:55] 
35.564 34.479 39.807 38.670      45F
snow -   Pass 2/2 frame 3355/3330 22930878B 84537280 ms 2.38 fpm [ETA 19:21:31] 
25.81
red -    Pass 2/2 frame 2488/2463 22099691B 84668105 ms 1.76 fpm [ETA 13:04:25] 
42.609 41.213 49.847 47.933    1053F
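
The rate and ETA figures in these progress lines can be checked with a short script. A sketch, assuming the displayed rate is the first frame counter divided by the elapsed time (that matches the numbers in the lines above) and that the line layout is as pasted here; the regular expression and function name are illustrative, not part of vpxenc:

```python
import re

# Matches "frame <done>/<n> <bytes>B <elapsed> ms ... [ETA h:mm:ss]".
PROGRESS = re.compile(
    r"frame\s+(?P<done>\d+)/\d+\s+\d+B\s+(?P<ms>\d+) ms"
    r".*?\[ETA\s+(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\]",
    re.S,
)


def summarize(line: str):
    """Return (frames per minute, ETA in days) for one progress line."""
    m = PROGRESS.search(line)
    if m is None:
        raise ValueError("not a recognized progress line")
    frames = int(m["done"])
    elapsed_ms = int(m["ms"])
    fpm = frames / elapsed_ms * 60_000
    eta_hours = int(m["h"]) + int(m["m"]) / 60 + int(m["s"]) / 3600
    return fpm, eta_hours / 24


line = ("Pass 2/2 frame 10335/10310 73179543B 84133871 ms "
        "7.37 fpm [ETA 382:46:04]")
fpm, eta_days = summarize(line)
print(f"{fpm:.2f} fpm, ETA about {eta_days:.1f} days")
```

Running it over each line of a test-matrix log would make it easy to flag which videos are projected to take more than a day.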

Original comment by fbarch...@google.com on 29 Sep 2013 at 12:49

GoogleCodeExporter commented 8 years ago

Original comment by fgalli...@google.com on 16 Jan 2015 at 11:53