Open GoogleCodeExporter opened 9 years ago
On my Windows box it's 35 seconds (2.26 GHz i7 with 8 cores)
timex cwebp -q 75 bryce_big.jpg -o bryce_big.webp
Saving file 'bryce_big.webp'
3863278 bytes Y-U-V-All-PSNR 35.65 41.29 43.00 36.94 dB
block count: intra4: 82498
intra16: 11732 (-> 12.45%)
skipped block: 1847 (1.96%)
bytes used: header: 465 (0.0%)
mode-partition: 486570 (12.6%)
Residuals bytes |segment 1|segment 2|segment 3|segment 4| total
intra4-coeffs: | 2220147 | 702549 | 177943 | 34137 | 3134776 (81.1%)
intra16-coeffs: | 2596 | 20719 | 36753 | 16113 | 76181 (2.0%)
chroma coeffs: | 102408 | 39885 | 17624 | 5343 | 165260 (4.3%)
macroblocks: | 51%| 27%| 15%| 5%| 94230
quantizer: | 35 | 31 | 26 | 20 |
filter level: | 5 | 3 | 0 | 0 |
------------------+---------+---------+---------+---------+-----------------
segments total: | 2325151 | 763153 | 232320 | 55593 | 3376217 (87.4%)
timex 35099.78ms
Original comment by fbarch...@google.com
on 22 Mar 2011 at 12:16
sse2 code is on its way
Original comment by s...@google.com
on 24 Mar 2011 at 11:40
Using libwebp-0.1.2-windows.zip
c:\work>timex cwebp -q 75 bryce_big.jpg -o bryce_big.webp
Saving file 'bryce_big.webp'
4475680 bytes Y-U-V-All-PSNR 36.88 42.24 43.96 38.14 dB
block count: intra4: 85349
intra16: 8881 (-> 9.42%)
skipped block: 1412 (1.50%)
bytes used: header: 508 (0.0%)
mode-partition: 514138 (11.5%)
Residuals bytes |segment 1|segment 2|segment 3|segment 4| total
intra4-coeffs: | 2474357 | 879919 | 275473 | 50643 | 3680392 (82.2%)
intra16-coeffs: | 1802 | 15845 | 29432 | 14254 | 61333 (1.4%)
chroma coeffs: | 129097 | 53220 | 29389 | 7575 | 219281 (4.9%)
macroblocks: | 51%| 27%| 15%| 5%| 94230
quantizer: | 30 | 24 | 17 | 15 |
filter level: | 4 | 0 | 0 | 0 |
------------------+---------+---------+---------+---------+-----------------
segments total: | 2605256 | 948984 | 334294 | 72472 | 3961006 (88.5%)
timex 34820.08ms
Nice to see quality improved more than 1 dB! But alas, performance did not.
I have 3 requests:
1. Port the SSE2 code to Visual C. I suggest inline assembly or yasm, because
intrinsics are buggy and produce less efficient code.
2. Give us a 64-bit build. I think in the previous test (comment #1), cwebp on
Linux was 64-bit, but still had no SSE2 and achieved 23 seconds.
3. Multicore encode. The easiest way (i.e. a few hours' work) is OpenMP. Refer
to psnr.cc source for build instructions, include, and pragma.
Original comment by fbarch...@google.com
on 2 Apr 2011 at 5:10
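For readers unfamiliar with the OpenMP route suggested in request 3, here is a minimal sketch of the pragma-based approach. It is not taken from psnr.cc or libwebp: `process_row` and `sum_squares` are hypothetical names, and the sketch assumes each row can be processed independently, which may not hold for an encoder whose intra prediction depends on neighbouring blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-row work (sum of squared samples); not a libwebp function. */
static uint64_t process_row(const uint8_t* row, size_t width) {
  uint64_t sum = 0;
  size_t x;
  for (x = 0; x < width; ++x) sum += (uint64_t)row[x] * row[x];
  return sum;
}

/* Accumulates process_row() over all rows. With OpenMP enabled the rows are
 * split across threads and the partial sums are combined by the reduction
 * clause; without it the pragma is ignored and the loop runs serially. */
uint64_t sum_squares(const uint8_t* pixels, int width, int height, int stride) {
  uint64_t total = 0;
  int y;  /* signed loop index, as older OpenMP versions require */
  assert(pixels != NULL && width >= 0 && height >= 0 && stride >= width);
#pragma omp parallel for reduction(+:total)
  for (y = 0; y < height; ++y) {
    total += process_row(pixels + (size_t)y * stride, (size_t)width);
  }
  return total;
}
```

OpenMP is switched on at compile time, with `/openmp` for Visual C or `-fopenmp` for GCC. The result is the same either way, so the serial build remains a usable fallback.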
https://review.webmproject.org/2200 should be a good start.
Original comment by s...@google.com
on 22 Apr 2011 at 10:40
C 30470.07 ms 38.14 dB
SSE2 18328.60 ms 38.14 dB
SSE2 x64 16782.91 ms 38.12 dB
Time to read input: 0.829s
Time to encode picture: 15.778s
A nice start on performance. Curious that the dB dropped a notch on 64-bit.
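To put the timings above in relative terms, the speedup is simply the baseline time divided by the new time. A small sketch (the `speedup` helper is illustrative, not part of any tool used in this thread):

```c
#include <assert.h>

/* Ratio of baseline time to candidate time; values above 1.0 mean faster. */
double speedup(double baseline_ms, double candidate_ms) {
  assert(baseline_ms > 0.0 && candidate_ms > 0.0);
  return baseline_ms / candidate_ms;
}
```

With the figures reported here, `speedup(30470.07, 18328.60)` is about 1.66 for the 32-bit SSE2 build and `speedup(30470.07, 16782.91)` is about 1.82 for the x64 SSE2 build.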
Here are the patches I applied:
diff -wurp -N orig/Makefile.vc libwebp/Makefile.vc
--- orig/Makefile.vc 2011-04-23 09:32:33.121516900 -0700
+++ libwebp/Makefile.vc 2011-04-23 08:57:11.231349100 -0700
@@ -24,13 +24,13 @@ ARCH = x86
MT = mt.exe
CCNODBG = cl.exe /nologo /O2 /DNDEBUG
CCDEBUG = cl.exe /nologo /Od /Gm /Zi /D_DEBUG /RTC1
-CFLAGS = /Isrc /nologo /W3 /EHsc /DWIN32 /FD /c /GS /D_CRT_SECURE_NO_WARNINGS
-LDFLAGS = /LARGEADDRESSAWARE /MANIFEST /NXCOMPAT /SAFESEH /DYNAMICBASE
+CFLAGS = /Isrc /nologo /arch:SSE2 /D__SSE2__ /W3 /EHsc /DWIN32 /FD /c /GS /D_CRT_SECURE_NO_WARNINGS
+LDFLAGS = /DEBUG /LARGEADDRESSAWARE /MANIFEST /NXCOMPAT /DYNAMICBASE
CFLAGSLIB = /DLIBWEBP_STATICLIB
LNKDLL = link.exe /DLL
LNKLIB = link.exe /lib
LNKEXE = link.exe
-LFLAGS = /nologo /machine:$(ARCH)
+LFLAGS = /nologo
CFLAGS = $(CFLAGS)
CFGSET = FALSE
@@ -126,6 +126,7 @@ X_OBJS= \
$(DIROBJ)\enc\config.obj \
$(DIROBJ)\enc\cost.obj \
$(DIROBJ)\enc\dsp.obj \
+ $(DIROBJ)\enc\dsp_sse2.obj \
$(DIROBJ)\enc\frame.obj \
$(DIROBJ)\enc\filter.obj \
$(DIROBJ)\enc\iterator.obj \
diff -wurp -N orig/src/enc/dsp.c libwebp/src/enc/dsp.c
--- orig/src/enc/dsp.c 2011-04-23 09:32:33.228527600 -0700
+++ libwebp/src/enc/dsp.c 2011-04-23 08:27:28.576101400 -0700
@@ -700,17 +700,10 @@ void VP8EncDspInit(void) {
VP8Copy16x16 = Copy16x16;
// If defined, use CPUInfo() to overwrite some pointers with faster versions.
- if (VP8GetCPUInfo) {
- if (VP8GetCPUInfo(kSSE2)) {
#if defined(__SSE2__)
VP8EncDspInitSSE2();
#endif
}
- if (VP8GetCPUInfo(kSSE3)) {
- // later we'll plug some SSE3 variant here
- }
- }
-}
#if defined(__cplusplus) || defined(c_plusplus)
} // extern "C"
Original comment by fbarch...@google.com
on 23 Apr 2011 at 4:39
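The dsp.c patch above drops the runtime CPU check and relies only on the `__SSE2__` compile-time guard (which Visual C does not define by itself, hence the `/D__SSE2__` flag in the Makefile change). As a rough illustration of how the two guards can coexist, here is a sketch of function-pointer dispatch. All `demo_` names are hypothetical and this is not libwebp's code; the "fast" routine is a plain unrolled loop standing in for a SIMD variant.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { kDemoSSE2 } DemoCpuFeature;              /* hypothetical */
typedef int (*DemoCpuInfoFunc)(DemoCpuFeature feature);  /* runtime probe */
typedef long (*DemoSumFunc)(const int* v, size_t n);

/* Portable reference implementation. */
long demo_sum_c(const int* v, size_t n) {
  long s = 0;
  size_t i;
  for (i = 0; i < n; ++i) s += v[i];
  return s;
}

/* Stand-in for an optimized variant: 4-way unrolled, same result. */
long demo_sum_fast(const int* v, size_t n) {
  long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
  }
  for (; i < n; ++i) s0 += v[i];
  return s0 + s1 + s2 + s3;
}

/* Example probes, so the selection logic can be exercised without real CPUID. */
int demo_probe_yes(DemoCpuFeature f) { (void)f; return 1; }
int demo_probe_no(DemoCpuFeature f) { (void)f; return 0; }

/* Picks the fast variant only if it was compiled in AND the probe exists
 * AND the probe reports the feature; otherwise falls back to plain C. */
DemoSumFunc demo_select_sum(DemoCpuInfoFunc cpu_info) {
  DemoSumFunc chosen = demo_sum_c;
#if defined(__SSE2__)
  if (cpu_info != NULL && cpu_info(kDemoSSE2)) chosen = demo_sum_fast;
#else
  (void)cpu_info;
#endif
  assert(chosen != NULL);
  return chosen;
}
```

Keeping the runtime probe means a binary built with SSE2 support can still fall back to the C path on a CPU that lacks it, at the cost of one extra check during initialization.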
Submitted a patch for SSE2 on Windows in issue 70. I'm seeing a 2x performance
improvement (alas, libjpeg-turbo is still 50 times faster by my tests).
Original comment by thecybershadow
on 3 May 2011 at 9:58
Down to 5 seconds now on my Mac laptop, and that includes the multiple
passes made to improve on the partition #0 constraints:
=======
Dimension: 11158 x 2156
Output: 4724432 bytes Y-U-V-All-PSNR 36.98 42.39 44.02 38.24 dB
block count: intra4: 88992
intra16: 5238 (-> 5.56%)
skipped block: 762 (0.81%)
bytes used: header: 536 (0.0%)
mode-partition: 506333 (10.7%)
Residuals bytes |segment 1|segment 2|segment 3|segment 4| total
macroblocks: | 15%| 30%| 34%| 19%| 94230
quantizer: | 36 | 29 | 23 | 15 |
filter level: | 11 | 6 | 5 | 7 |
real 0m5.081s
user 0m4.944s
sys 0m0.128s
Original comment by pascal.m...@gmail.com
on 5 Jun 2014 at 4:56
Original issue reported on code.google.com by
fbarch...@google.com
on 21 Mar 2011 at 11:41