GaseousStates / webp

Automatically exported from code.google.com/p/webp

cwebp is slow #54

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
cwebp takes 24 s to encode bryce_big.jpg at the default quality (75).

What steps will reproduce the problem?

1. time cwebp -q 75 bryce_big.jpg -o bryce_big.webp
Saving file 'bryce_big.webp'
4461356 bytes Y-U-V-All-PSNR 36.85 42.21 43.98   38.12 dB
block count:  intra4: 85305
              intra16: 8925  (-> 9.47%)
              skipped block: 1411 (1.50%)
bytes used:  header:            504  (0.0%)
             mode-partition: 513053  (11.5%)
 Residuals bytes  |segment 1|segment 2|segment 3|segment 4|  total
  intra4-coeffs:  | 2513602 |  843679 |  261585 |   50478 | 3669344  (82.2%)
 intra16-coeffs:  |    2007 |   15958 |   27884 |   15066 |   60915  (1.4%)
  chroma coeffs:  |  130769 |   51063 |   27937 |    7744 |  217513  (4.9%)
    macroblocks:  |      53%|      26%|      14%|       5%|   94230
      quantizer:  |      30 |      24 |      17 |      15 |
   filter level:  |       4 |       0 |       0 |       0 |
------------------+---------+---------+---------+---------+-----------------
 segments total:  | 2646378 |  910700 |  317406 |   73288 | 3947772  (88.5%)
23.880u 0.170s 0:24.20 99.3%    0+0k 0+8720io 0pf+0w

What is the expected output?
Encode should be about the same speed as other image codecs.
For this image, about 1-2 seconds.

What do you see instead?
23 seconds.

Original issue reported on code.google.com by fbarch...@google.com on 21 Mar 2011 at 11:41

GoogleCodeExporter commented 9 years ago
On my Windows box it's 35 seconds (2.26 GHz i7 with 8 cores)

timex cwebp -q 75 bryce_big.jpg -o bryce_big.webp
Saving file 'bryce_big.webp'
3863278 bytes Y-U-V-All-PSNR 35.65 41.29 43.00   36.94 dB
block count:  intra4: 82498
              intra16: 11732  (-> 12.45%)
              skipped block: 1847 (1.96%)
bytes used:  header:            465  (0.0%)
             mode-partition: 486570  (12.6%)
 Residuals bytes  |segment 1|segment 2|segment 3|segment 4|  total
  intra4-coeffs:  | 2220147 |  702549 |  177943 |   34137 | 3134776  (81.1%)
 intra16-coeffs:  |    2596 |   20719 |   36753 |   16113 |   76181  (2.0%)
  chroma coeffs:  |  102408 |   39885 |   17624 |    5343 |  165260  (4.3%)
    macroblocks:  |      51%|      27%|      15%|       5%|   94230
      quantizer:  |      35 |      31 |      26 |      20 |
   filter level:  |       5 |       3 |       0 |       0 |
------------------+---------+---------+---------+---------+-----------------
 segments total:  | 2325151 |  763153 |  232320 |   55593 | 3376217  (87.4%)
timex 35099.78ms

Original comment by fbarch...@google.com on 22 Mar 2011 at 12:16

GoogleCodeExporter commented 9 years ago
SSE2 code is on its way.

Original comment by s...@google.com on 24 Mar 2011 at 11:40

GoogleCodeExporter commented 9 years ago
Using libwebp-0.1.2-windows.zip 

c:\work>timex cwebp -q 75 bryce_big.jpg -o bryce_big.webp
Saving file 'bryce_big.webp'
4475680 bytes Y-U-V-All-PSNR 36.88 42.24 43.96   38.14 dB
block count:  intra4: 85349
              intra16: 8881  (-> 9.42%)
              skipped block: 1412 (1.50%)
bytes used:  header:            508  (0.0%)
             mode-partition: 514138  (11.5%)
 Residuals bytes  |segment 1|segment 2|segment 3|segment 4|  total
  intra4-coeffs:  | 2474357 |  879919 |  275473 |   50643 | 3680392  (82.2%)
 intra16-coeffs:  |    1802 |   15845 |   29432 |   14254 |   61333  (1.4%)
  chroma coeffs:  |  129097 |   53220 |   29389 |    7575 |  219281  (4.9%)
    macroblocks:  |      51%|      27%|      15%|       5%|   94230
      quantizer:  |      30 |      24 |      17 |      15 |
   filter level:  |       4 |       0 |       0 |       0 |
------------------+---------+---------+---------+---------+-----------------
 segments total:  | 2605256 |  948984 |  334294 |   72472 | 3961006  (88.5%)
timex 34820.08ms

Nice to see quality improved more than 1 dB!  But alas, performance did not.

I have 3 requests:
1. Port the SSE2 code to Visual C. I suggest inline assembly or yasm, because intrinsics are buggy and produce less efficient code.
2. Give us a 64-bit build. I think in the previous test (comment #1), cwebp on Linux was 64-bit, but still had no SSE2 and achieved 23 seconds.
3. Multicore encode. The easiest way (i.e. a few hours' work) is OpenMP. Refer to the psnr.cc source for build instructions, include, and pragma.
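
For readers unfamiliar with the OpenMP suggestion in request 3, here is a minimal sketch of the idea in C. It is not libwebp code: ProcessRow and ProcessImage are hypothetical names, and the per-row work is a stand-in. It also assumes rows can be processed independently, which a real VP8 encoder does not get for free, since intra prediction depends on neighboring macroblocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-row work: sum of squared sample values in one row.
 * A stand-in for whatever the encoder would do per macroblock row. */
static uint64_t ProcessRow(const uint8_t* row, size_t width) {
  uint64_t acc = 0;
  size_t x;
  for (x = 0; x < width; ++x) {
    acc += (uint64_t)row[x] * row[x];
  }
  return acc;
}

/* Hypothetical driver: rows are handed out to threads by OpenMP.
 * Without OpenMP enabled the pragma is ignored and the loop runs serially,
 * producing the same result. */
uint64_t ProcessImage(const uint8_t* pixels, int width, int height) {
  uint64_t total = 0;
  int y;
#pragma omp parallel for reduction(+:total) schedule(dynamic)
  for (y = 0; y < height; ++y) {
    total += ProcessRow(pixels + (size_t)y * (size_t)width, (size_t)width);
  }
  return total;
}
```

Building with gcc -fopenmp or cl.exe /openmp enables the pragma. The reduction clause gives each thread its own partial total, so the result does not depend on the thread count.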

Original comment by fbarch...@google.com on 2 Apr 2011 at 5:10

GoogleCodeExporter commented 9 years ago
 https://review.webmproject.org/2200 should be a good start.

Original comment by s...@google.com on 22 Apr 2011 at 10:40

GoogleCodeExporter commented 9 years ago
C         30470.07 ms  38.14 dB
SSE2      18328.60 ms  38.14 dB
SSE2 x64  16782.91 ms  38.12 dB
Time to read input: 0.829s
Time to encode picture: 15.778s
A nice start on performance.  Curious that the PSNR dropped a notch (38.14 to 38.12 dB) on the 64-bit build.
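
As a quick check on the figures above, the speedup over the plain C build is the ratio of the two times: about 1.66x for SSE2 and about 1.82x for SSE2 x64. A small sketch of that arithmetic follows; the Timing struct and function names are illustrative only, not part of cwebp.

```c
#include <stdio.h>

/* One measured build: label and wall time in milliseconds (from the table above). */
typedef struct {
  const char* label;
  double ms;
} Timing;

/* Ratio of baseline time to new time; a value above 1 means the new build is faster. */
double Speedup(double baseline_ms, double new_ms) {
  return baseline_ms / new_ms;
}

/* Prints each build's speedup relative to the first entry. */
void PrintSpeedups(const Timing* t, int count) {
  int i;
  for (i = 0; i < count; ++i) {
    printf("%-10s %10.2f ms  %.2fx\n", t[i].label, t[i].ms,
           Speedup(t[0].ms, t[i].ms));
  }
}
```

Calling PrintSpeedups with the three rows (30470.07, 18328.60, 16782.91) reproduces those ratios. Even the fastest build is still well short of the 1-2 second target stated in the original report.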

Here are the patches I applied
diff -wurp -N orig/Makefile.vc libwebp/Makefile.vc
--- orig/Makefile.vc    2011-04-23 09:32:33.121516900 -0700
+++ libwebp/Makefile.vc 2011-04-23 08:57:11.231349100 -0700
@@ -24,13 +24,13 @@ ARCH  = x86
 MT         = mt.exe
 CCNODBG    = cl.exe /nologo /O2 /DNDEBUG
 CCDEBUG    = cl.exe /nologo /Od /Gm /Zi /D_DEBUG /RTC1
-CFLAGS     = /Isrc /nologo /W3 /EHsc /DWIN32 /FD /c /GS /D_CRT_SECURE_NO_WARNINGS
-LDFLAGS    = /LARGEADDRESSAWARE /MANIFEST /NXCOMPAT /SAFESEH /DYNAMICBASE
+CFLAGS     = /Isrc /nologo  /arch:SSE2 /D__SSE2__ /W3 /EHsc /DWIN32 /FD /c /GS /D_CRT_SECURE_NO_WARNINGS
+LDFLAGS    = /DEBUG /LARGEADDRESSAWARE /MANIFEST /NXCOMPAT /DYNAMICBASE
 CFLAGSLIB  = /DLIBWEBP_STATICLIB
 LNKDLL     = link.exe /DLL
 LNKLIB     = link.exe /lib
 LNKEXE     = link.exe
-LFLAGS     = /nologo /machine:$(ARCH)
+LFLAGS     = /nologo
 CFLAGS     = $(CFLAGS)

 CFGSET     = FALSE
@@ -126,6 +126,7 @@ X_OBJS= \
        $(DIROBJ)\enc\config.obj \
        $(DIROBJ)\enc\cost.obj \
        $(DIROBJ)\enc\dsp.obj \
+       $(DIROBJ)\enc\dsp_sse2.obj \
        $(DIROBJ)\enc\frame.obj \
        $(DIROBJ)\enc\filter.obj \
        $(DIROBJ)\enc\iterator.obj \
diff -wurp -N orig/src/enc/dsp.c libwebp/src/enc/dsp.c
--- orig/src/enc/dsp.c  2011-04-23 09:32:33.228527600 -0700
+++ libwebp/src/enc/dsp.c       2011-04-23 08:27:28.576101400 -0700
@@ -700,17 +700,10 @@ void VP8EncDspInit(void) {
   VP8Copy16x16 = Copy16x16;

   // If defined, use CPUInfo() to overwrite some pointers with faster versions.
-  if (VP8GetCPUInfo) {
-    if (VP8GetCPUInfo(kSSE2)) {
 #if defined(__SSE2__)
       VP8EncDspInitSSE2();
 #endif
     }
-    if (VP8GetCPUInfo(kSSE3)) {
-      // later we'll plug some SSE3 variant here
-    }
-  }
-}

 #if defined(__cplusplus) || defined(c_plusplus)
 }    // extern "C"

Original comment by fbarch...@google.com on 23 Apr 2011 at 4:39
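
The dsp.c hunk above touches a common pattern: a function pointer defaults to a plain C routine and is overwritten at init time when an SSE2 version is both compiled in and usable at run time. Below is a minimal sketch of that pattern, assuming a 16-byte sum of absolute differences as the routine. Sad16, Sad16_C, Sad16_SSE2 and DspInit are hypothetical names, not libwebp's, and the caller supplies the CPU check result rather than this code detecting it.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

typedef uint32_t (*SadFunc)(const uint8_t* a, const uint8_t* b);

/* Plain C fallback: sum of absolute differences over 16 bytes. */
static uint32_t Sad16_C(const uint8_t* a, const uint8_t* b) {
  uint32_t sum = 0;
  int i;
  for (i = 0; i < 16; ++i) {
    sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i]) : (uint32_t)(b[i] - a[i]);
  }
  return sum;
}

#if defined(__SSE2__)
/* SSE2 version: _mm_sad_epu8 yields two partial sums, one per 64-bit half. */
static uint32_t Sad16_SSE2(const uint8_t* a, const uint8_t* b) {
  const __m128i va = _mm_loadu_si128((const __m128i*)a);
  const __m128i vb = _mm_loadu_si128((const __m128i*)b);
  const __m128i sad = _mm_sad_epu8(va, vb);
  return (uint32_t)_mm_cvtsi128_si32(sad) +
         (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}
#endif

/* Callers go through this pointer; it defaults to the C version. */
SadFunc Sad16 = Sad16_C;

/* cpu_has_sse2 is the result of a runtime CPU check done by the caller. */
void DspInit(int cpu_has_sse2) {
  Sad16 = Sad16_C;
#if defined(__SSE2__)
  if (cpu_has_sse2) Sad16 = Sad16_SSE2;
#else
  (void)cpu_has_sse2;  /* SSE2 code not compiled in */
#endif
}
```

MSVC does not define __SSE2__ on its own, which is why the Makefile.vc change above passes /D__SSE2__ explicitly. Keeping the runtime check in addition to the compile-time guard lets the same binary still run on a CPU without SSE2.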

GoogleCodeExporter commented 9 years ago
Submitted a patch for SSE2 on Windows in issue 70. I'm seeing a 2x performance improvement (alas, libjpeg-turbo is still 50 times faster in my tests).

Original comment by thecybershadow on 3 May 2011 at 9:58

GoogleCodeExporter commented 9 years ago
Down to 5 seconds now on my Mac laptop, and that includes the multiple passes done to improve on the partition #0 constraints:

=======

Dimension: 11158 x 2156
Output:    4724432 bytes Y-U-V-All-PSNR 36.98 42.39 44.02   38.24 dB
block count:  intra4: 88992
              intra16: 5238  (-> 5.56%)
              skipped block: 762 (0.81%)
bytes used:  header:            536  (0.0%)
             mode-partition: 506333  (10.7%)
 Residuals bytes  |segment 1|segment 2|segment 3|segment 4|  total
    macroblocks:  |      15%|      30%|      34%|      19%|   94230
      quantizer:  |      36 |      29 |      23 |      15 |
   filter level:  |      11 |       6 |       5 |       7 |

real    0m5.081s
user    0m4.944s
sys 0m0.128s

Original comment by pascal.m...@gmail.com on 5 Jun 2014 at 4:56