Sanmayce opened this issue 8 years ago
Thanks for the thorough analysis! gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead! I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers).
I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.
>Thanks for the thorough analysis!
My pleasure. A thorough analysis is yet to come; I have only two old laptops at my disposal, so it will take some 2-3 months to give a much better picture of textual [de]compression. I intend to test ~400 files, ranging from 25 bytes up to 1400MB...
>gcc does seem to have a problem generating tight code for LZSSE, although I'm surprised Intel is that far ahead!
GCC is good, but in my experience Intel distributes registers more effectively. When even a single register is short and values spill, stack accesses hurt speed, and since you utilize nearly the whole squad of XMM registers at once it is even more so - the accumulated penalties matter! I think your etude should not be presented with just any compiler/options; its essence gets lost ... in the translation.
>I'll have to have a more thorough look at the assembly you generated on the weekend to see what is causing the slowdown (although I suspect you are right and the Intel code is doing a much better job with the registers).
My wish is for textual-compression fans to have one fully operational console tool (even in its simplest form, i.e. file-to-file), just to have one PARAGON performer and feel what the speed religion is all about.
>I'd say that the slow result with the small file alice29 could be due to timer resolution or it could be due to some fixed overhead/system overhead. One thing I've found is that if the OS hasn't allocated the memory pages for the buffer yet (just mapped them), then the initial page fault to allocate memory can be quite a large amount of overhead on those small files.
Here I admit that, despite the "triviality" of such measuring, I cannot sense what is going on; maybe I will ask Hamid or some guys at Intel's forum for help. Anyone?!
Hi again. I'm so sorry for my first draft - a stupid bug in the stats is now fixed. All in all, I was swayed by the rush to upload it; now I have had time to finish the draft. The package has been re-uploaded, with a working Nakamichi & LZSSE2 executable pair.
Also, I believe you were right about that 'malloc' whose pages were not in reality fully committed; I moved it well before the benchmarking, and lo, it reports okay. Tonight I will run some 10 more files, gradually rising up to ~50MB, just to make the quick view a bit deeper.
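In case it helps others, here is a minimal sketch (my illustration, not the actual Nakamichi source) of paying the first-touch page faults Conor describes outside the timed region, by touching every page of the buffer before starting the clock:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t size = (size_t)32 << 20;      /* e.g. a 32 MB target buffer */
    char *buf = malloc(size);
    if (buf == NULL) return 1;

    /* Touch every page once: the OS commits the physical pages here,
       not inside the timed region (4096 is the usual x86 page size).
       A memset over the whole buffer works too. */
    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 0;

    clock_t start = clock();
    /* ... the timed decompression into buf would go here ... */
    clock_t end = clock();
    printf("%.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    free(buf);
    return 0;
}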
Another dumb mistake of mine was the statement about 100x faster compression; in fact it is 1000x, if not more.
On my laptop with a Core 2 Q9550s @2.83GHz, alice29.txt is decompressed much faster; I believe the report is okay (given that I had many tasks in the tray):
LZSSE2: RAM-to-RAM performance: 640 MB/s.
Also, I added a few more strong compressors to the 'bundle'; the full log for 'alice29.txt':
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>dir alice29.txt/b>DIRLIST
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>type COMPRESS_all.bat
rem dir *.*/b/a-d>DIRLIST
FOR /F %%G IN (DIRLIST) DO CALL Bundle_of_15_compressors.bat %%G
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>COMPRESS_all.bat
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>rem dir *.*/b/a-d>DIRLIST
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>FOR /F %G IN (DIRLIST) DO CALL Bundle_of_15_compressors.bat %G
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>CALL Bundle_of_15_compressors.bat alice29.txt
Performers:
- LZ4 for Windows 32-bits v1.4, by Yann Collet (Sep 17 2013).
- 7-Zip (A) 9.20, Copyright (c) 1999-2010 Igor Pavlov, 2010-11-18.
- bsc, Block Sorting Compressor, Version 3.1.0. Copyright (c) 2009-2012 Ilya Grebnov, 8 July 2012.
- lzturbo 1.2 Copyright (c) 2007-2014 Hamid Buzidi, Aug 11 2014.
- zpaq v7.05 journaling archiver, compiled Apr 17 2015, http://mattmahoney.net/zpaq
- CABARC, Microsoft (R) Cabinet Tool - Version 5.1.2600.0, Copyright (c) Microsoft Corporation.
- Compress, version: (N)compress 4.2.4.4, compiled: Fri, Aug 23, 2013 11:56:09. Authors: Peter Jannesen, Dave Mack, Spencer W. Thomas, Jim McKie, Steve Davies, Ken Turkowski, James A. Woods, Joe Orost.
- zstd command line interface 64-bits v0.5.1, by Yann Collet
- xz (XZ Utils) 5.2.1, liblzma 5.2.1, XZ Utils home page: http://tukaani.org/xz/
- brotli, Feb-10-2016 source
- Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
- LZSSE2(FASTEST TEXTUAL decompressor), Copyright (c) 2016, Conor Stokes
- GLZA v0.4.1, Copyright 2014-2016 Kennon Conrad
- LZ5 command line interface 64-bit v1.4 by Y.Collet and P.Skibinski (Feb 11 2016)
- PPMd (Fast PPMII compressor for textual data) var.I rev.2, Dmitry Shkarin
...
Directory of D:\TEXTUAL_MADNESS\_The_Usual_Suspects
02/23/2016 11:25 PM 152,089 alice29.txt
04/03/2016 04:27 PM 48,575 alice29.txt.128MB.7z
04/03/2016 04:27 PM 63,013 alice29.txt.256MB.lzturbo12-19.lzt
04/03/2016 04:27 PM 62,436 alice29.txt.256MB.lzturbo12-29.lzt
04/03/2016 04:27 PM 50,668 alice29.txt.256MB.lzturbo12-39.lzt
04/03/2016 04:28 PM 42,210 alice29.txt.glze
04/03/2016 04:27 PM 46,685 alice29.txt.L11_W24.brotli
04/03/2016 04:28 PM 61,012 alice29.txt.l15_256MB.lz5
04/03/2016 04:28 PM 56,526 alice29.txt.L17.LZSSE2
04/03/2016 04:27 PM 49,710 alice29.txt.L21.zst
04/03/2016 04:27 PM 63,705 alice29.txt.L9.lz4
04/03/2016 04:27 PM 51,707 alice29.txt.L9.zip
04/03/2016 04:27 PM 49,538 alice29.txt.LZX21.cab
04/03/2016 04:28 PM 38,682 alice29.txt.m256.o16.ppmd
04/03/2016 04:27 PM 152,918 alice29.txt.method08.zpaq
04/03/2016 04:27 PM 59,915 alice29.txt.method28.zpaq
04/03/2016 04:27 PM 37,602 alice29.txt.method58.zpaq
04/03/2016 04:27 PM 54,278 alice29.txt.MSZIP.cab
04/03/2016 04:27 PM 40,998 alice29.txt.ST6Block256.bsc
04/03/2016 04:28 PM 73,235 alice29.txt.Tengu-Tsuyo.Nakamichi
04/03/2016 04:27 PM 48,528 alice29.txt.xz
04/03/2016 04:27 PM 62,247 alice29.txt.Z
D:\TEXTUAL_MADNESS\_The_Usual_Suspects>
I had time to run only 4 testfiles on an i5-2430M @3GHz, DDR3 @666MHz:
D:\test>dir
Volume in drive D is COMETA_V1
Volume Serial Number is 2E2F-7737
Directory of D:\test
04/03/2016 02:51 PM <DIR> .
04/03/2016 02:51 PM <DIR> ..
02/23/2016 12:25 PM 152,089 alice29.txt
04/03/2016 02:51 PM 56,526 alice29.txt.L17.LZSSE2
04/03/2016 02:51 PM 73,235 alice29.txt.Nakamichi
03/27/2016 05:24 AM 6,225,580 Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt
02/23/2016 12:25 PM 10,192,446 dickens
04/03/2016 04:56 AM 146,432 Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe
02/23/2016 12:25 PM 12,030,464 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar
02/23/2016 12:25 PM 3,265,536 University_of_Canterbury_The_Calgary_Corpus.tar
8 File(s) 32,142,308 bytes
2 Dir(s) 13,557,866,496 bytes free
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 0 MB ...
Allocating Target-Buffer 32 MB ...
Allocating Verification-Buffer 0 MB ...
Compressing 152,089 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 31
NumberOf(Tiny)Matches[Tiny]Window (4): 161
NumberOf(Short)Matches[Tiny]Window (8): 98
NumberOf(Medium)Matches[Tiny]Window (12): 21
RAM-to-RAM performance: 23 KB/s.
Compressed to 73,235 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x1366,78ee
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xa374,22ff
Decompressing 73,235 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 152,089 bytes ...
LZSSE2: Compressed to 56,526 bytes.
LZSSE2: RAM-to-RAM performance: 8736 KB/s.
LZSSE2: Decompressing 56,526 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 18560 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe University_of_Canterbury_The_Calgary_Corpus.tar
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 3 MB ...
Allocating Target-Buffer 35 MB ...
Allocating Verification-Buffer 3 MB ...
Compressing 3,265,536 bytes ...
/; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 2966
NumberOf(Tiny)Matches[Tiny]Window (4): 5396
NumberOf(Short)Matches[Tiny]Window (8): 3633
NumberOf(Medium)Matches[Tiny]Window (12): 27429
RAM-to-RAM performance: 5 KB/s.
Compressed to 1,333,349 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xe6d1,9b78
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x2a73,4318
Decompressing 1,333,349 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1024 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 3,265,536 bytes ...
LZSSE2: Compressed to 1,142,536 bytes.
LZSSE2: RAM-to-RAM performance: 463 KB/s.
LZSSE2: Decompressing 1,142,536 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 2048 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 5 MB ...
Allocating Target-Buffer 37 MB ...
Allocating Verification-Buffer 5 MB ...
Compressing 6,225,580 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 57
NumberOf(Tiny)Matches[Tiny]Window (4): 5321
NumberOf(Short)Matches[Tiny]Window (8): 586
NumberOf(Medium)Matches[Tiny]Window (12): 18
RAM-to-RAM performance: 5 KB/s.
Compressed to 2,577,921 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x0736,474e
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x4bb3,2a5b
Decompressing 2,577,921 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1024 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 6,225,580 bytes ...
LZSSE2: Compressed to 2,388,411 bytes.
LZSSE2: RAM-to-RAM performance: 1504 KB/s.
LZSSE2: Decompressing 2,388,411 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 1920 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>Nakamichi_Tengu-Tsuyo_XMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe dickens
Nakamichi 'Tengu-Tsuyo', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 9 MB ...
Allocating Target-Buffer 41 MB ...
Allocating Verification-Buffer 9 MB ...
Compressing 10,192,446 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 128
NumberOf(Tiny)Matches[Tiny]Window (4): 6259
NumberOf(Short)Matches[Tiny]Window (8): 712
NumberOf(Medium)Matches[Tiny]Window (12): 89
RAM-to-RAM performance: 5 KB/s.
Compressed to 3,964,930 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x056f,7c86
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xb7f0,b59f
Decompressing 3,964,930 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks match) OK.
LZSSE2: Compressing with LZSSE2 (level 17) 10,192,446 bytes ...
LZSSE2: Compressed to 3,872,373 bytes.
LZSSE2: RAM-to-RAM performance: 6017 KB/s.
LZSSE2: Decompressing 3,872,373 bytes (being the compressed stream) ...
LZSSE2: RAM-to-RAM performance: 1920 MB/s.
LZSSE2: Verification (input and output sizes match) OK.
LZSSE2: Verification (input and output blocks match) OK.
D:\test>
Comparing with TurboBench I see no discrepancy:
1142536 2048 MB/s lzsse2 17 !outside TurboBench!
1143886 35.0 6.18 1894.16 lzsse2 16 University_of_Canterbury_The_Calgary_Corpus.tar.tbb
2388411 1920 MB/s lzsse2 17 !outside TurboBench!
2391442 38.4 5.45 1961.43 lzsse2 16 Complete_Works_of_Seneca_-_Lucius_Annaeus_Seneca.txt.tbb
3872373 1920 MB/s lzsse2 17 !outside TurboBench!
3872652 38.0 5.43 1799.20 lzsse2 16 dickens.tbb
Except for the 'small file':
56526 18560 MB/s lzsse2 17 !outside TurboBench!
56530 37.2 5.64 1810.58 lzsse2 16 alice29.txt.tbb
Food for thought: 640 MB/s vs 18560 MB/s (the latter on a clean Windows 7 with no tasks in the tray). The measurements differ, yet I think they are correct now - or not?
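For scale (my arithmetic, not from the logs above): one pass over alice29's 152,089 bytes at 18,560 MB/s takes 152,089 / (18,560 x 1,048,576) ≈ 8 microseconds, far below the millisecond-or-worse tick of clock() on Windows, so a single-shot measurement of such a small file is dominated by timer granularity and fixed overhead, exactly as Conor suggested.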
Hi, once more I wanted to share what awesomeness in action looks like.
Benchmarking 'TDELCC' a.k.a. The-Definitive-English-Language-Compression-Corpus, a smashdown, https://github.com/Sanmayce/Nakamichi
Another iteration of Sanmayce's decompression showdown 'FULG', revision 4, all performers are included in the package:
128t_opaque_GS.png: https://drive.google.com/file/d/1wPVbSSArPFd7_JyoOMS52Sx-HHsYQK9k/view?usp=sharing
Fulg-Textual_[De]Compression_Showdown_v4.tar.gz: https://drive.google.com/file/d/1F3yxgHrLNlrAM5Uc3pgFhtyTzkvS6sSR/view?usp=sharing
Satanichi_smashdown.pdf: https://drive.google.com/file/d/1uqGoWbn0WYM1l__wnGqmO61RegU521_v/view?usp=sharing
It is always good to get the picture of how the latest compressors fare in the TEXTUAL realm. The name of the game is: applying maximum compression strength while aiming at maximum decompression ... speed, heh-heh.
Included compressors:
RAR v.7.00beta3 by Alexander Roshal, Russia;
BR, Brotli v.1.1.0 by Jyrki Alakuijala, Finland;
ZPAQ v.7.15 by Matt Mahoney, America;
GZ, 7zip's GZ v.23.01 by Igor Pavlov, Russia;
BZ2, 7zip's BZ2 v.23.01 by Igor Pavlov, Russia;
7Z, 7zip's 7Z v.23.01 by Igor Pavlov, Russia;
ZSTD v.1.5.5 by Yann Collet aka Cyan, France;
BSC v.3.3.3 by Ilya Grebnov aka Gribok, Russia;
LZSSE by Conor Stokes, Australia;
Satanichi, Sanmayce's texttoy, Bulgaria;
BriefLZ v.1.3.0 by Joergen Ibsen, Denmark.
@IlyaGrebnov @jibsen
Compression command lines:
/bin/time -v ./brotli_1.1.0 -q 11 --large_window=30 "$1"
/bin/time -v ./rarlinux-x64-700b3 a -m5 -md2g "$1".rar "$1"
/bin/time -v ./7zzs a "$1".7z -mx9 -myx9 -m0=LZMA2:d1536m "$1"
/bin/time -v ./7zzs a -tbzip2 -mx=9 "$1.bz2" "$1"
/bin/time -v ./7zzs a -tgzip -mx=9 "$1.gz" "$1"
/bin/time -v ./BSC_3.3.3_AVX2_CLANG_17.0.4_dynamic.elf e "$1" "$1.bsc" -p -b2047 -m0 -e2
/bin/time -v ./zstd-v1.5.5 --ultra -22 --long=31 --zstd=wlog=31,clog=30,hlog=30,slog=26 "$1" -o "$1.zst"
/bin/time -v ./LZSSE_avx2_CLANG.elf -2 -l17 "$1" "$1.lzsse2"
/bin/time -v ./BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf --optimal -b3g "$1" "$1.blz"
/bin/time -v ./zpaq715_sse4.1.elf add "$1.zpaq" "$1" -method 511 -threads 4
/bin/time -v ./"Satanichi_Nakamichi_Vanilla_LITE_DD-128AES_CLANG(17.0.6)_64bit.elf" "$1" "$1.Nakamichi" 20 111000 i
Decompression command lines:
perf stat -d ./brotli_1.1.0 -d -k "$1".br
perf stat -d ./rarlinux-x64-700b3 x "$1".rar
perf stat -d ./7zzs e "$1.7z"
perf stat -d ./7zzs e "$1.bz2"
perf stat -d ./7zzs e "$1.gz"
perf stat -d ./BSC_3.3.3_AVX2_CLANG_17.0.4_dynamic.elf d "$1.bsc" "$1"
perf stat -d ./zstd-v1.5.5 -f --priority=rt -d --long=31 "$1.zst"
perf stat -d ./LZSSE_avx2_CLANG.elf -d "$1.lzsse2" "$1"
perf stat -d ./BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf -d -b3g "$1.blz" "$1"
perf stat -d ./zpaq715_sse4.1.elf x "$1.zpaq" -threads 4
perf stat -d ./"Satanichi_Nakamichi_Vanilla_LITE_DD-128AES_CLANG(17.0.6)_64bit.elf" $1.Nakamichi>$1.NKMCH
Testmachinette: Laptop 'Dzvertcheto' Thinkpad L490, Intel i7-8565U (8) @ 4.600GHz, 64GB, Linux Fedora 39
Testdatafile: SUPRAPIG_Delphi_Classics_Complete_Works_of_128_authors.tar (1,576,788,480 bytes), sha1: 8326b48e3a315f4f656013629226c319fefd483e
+-----------------------------+-----------------+-----------[ Sorted by Walltime ]-+------------------+---------------------+
| Compressor | Compressed size | Walltime / Usertime / Systemtime | Memory footprint | CPU utilization |
+---------------[ SuperFAST ]-+-----------------+----------------------------------+------------------+---------------------+
| BSC_3.3.3_AVX2_CLANG_17.0.4 | 304,827,632 | 1:10.1 / 514.9 / 6.6 | 7,710,336 KB | 743% |
+--------------------[ FAST ]-+-----------------+----------------------------------+------------------+---------------------+
| LZSSE_avx2_CLANG | 572,282,023 | 3:04.8 / 183.3 / 1.1 | 3,331,200 KB | 99% |
| rarlinux-x64-700b3 | 399,313,787 | 3:24.9 / 1388.2 / 3.3 | 7,658,240 KB | 678% |
+------------------[ Normal ]-+-----------------+----------------------------------+------------------+---------------------+
| 7zzs_23.01's bz2 | 414,301,737 | 8:03.3 / 3766.3 / 0.9 | 77,824 KB | 779% |
| 7zzs_23.01's gz | 544,531,970 | 20:09.7 / 1207.4 / 0.3 | 5,376 KB | 99% |
| 7zzs_23.01's 7z | 366,878,089 | 24:45.1 / 1810.0 / 7.2 | 15,963,904 KB | 122% |
| zstd-v1.5.5 | 374,058,071 | 31:29.5 / 1883.6 / 3.3 | 10,314,220 KB | 99% |
+--------------------[ SLOW ]-+-----------------+----------------------------------+------------------+---------------------+
| BriefLZ_1.3.0_CLANG_17.0.4 | 476,307,190 | 49:20.2 / 2945.3 / 10.0 | 32,803,328 KB | 99% |
| zpaq715_sse4.1 | 289,466,679 | 1:04:35 / 3860.0 / 9.0 | 19,023,136 KB | 99% |
| brotli_1.1.0 | 370,294,709 | 1:07:48 / 4057.3 / 4.02 | 9,974,512 KB | 99% |
+---------------[ UltraSLOW ]-+-----------------+----------------------------------+------------------+---------------------+
| Satanichi_CLANG_17.0.6 | 474,713,658 | 209,045 / 54,455 / 19,461 | 64+GB | 0.359 CPUs utilized |
+-----------------------------+-----------------+----------------------------------+------------------+---------------------+
Note01a: Nakamichi thrashes the virtual RAM - the ~61GB of usable RAM minus the Source-Buffer + Target-Buffer (2 + 3 GB) minus the ~67GB of index structures leaves it ~11 gigabytes short of 64GB - as seen by the 6h systemtime.
Note01b: Satanichi monstrously devours physical RAM, like 3TB, in order to flex its muscles.
! RAM needed to house the B-trees (relative to the file being ripped): 44N = 66,224MB; RAM needed to build the B-trees IN ONE PASS: (Target-Buffer = 2,503 MB) x 64 passes = 160,192MB !
So, compression time would be drastically reduced if 230 GB were available. In case all indexes fit in RAM, the encoding speed is 100 KB/s.
Testmachinette: Laptop 'Dzvertcheto' Thinkpad L490, Intel i7-8565U (8) @ 4.600GHz, 64GB, Linux Fedora 39
Testdatafile: SUPRAPIG_Delphi_Classics_Complete_Works_of_128_authors.tar (1,576,788,480 bytes), sha1: 8326b48e3a315f4f656013629226c319fefd483e
+-----------------------------+-----------------+-----------[ Sorted by Walltime ]-+-----------------------+--------------------+----------------------------------+
| Decompressor | Compressed size | Walltime / Usertime / Systemtime | CPU utilization | Instructions | LLC-loads / LLC-load-misses |
+---------------[ UltraFAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| LZSSE_avx2_CLANG | 572,282,023 | 0.8 / 0.3 / 0.4 | 1.000 CPUs utilized | 5,276,911,595 | 563,939 / 151,203 |
| LZSSE_avx2_GCC | 572,282,023 | 0.8 / 0.3 / 0.4 | 0.999 CPUs utilized | 5,316,121,126 | 545,723 / 140,118 |
+---------------[ SuperFAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| Satanichi_GCC_13.2.1 | 474,713,658 | 2.7 / 1.9 / 0.8 | 1.000 CPUs utilized | 4,211,049,650 | 177,744,185 / 57,211,272 |
| Satanichi_CLANG_17.0.6 | 474,713,658 | 2.7 / 1.9 / 0.8 | 1.000 CPUs utilized | 4,243,632,674 | 179,001,727 / 57,137,600 |
| zstd-v1.5.5 | 374,058,071 | 2.9 / 2.5 / 0.8 | 1.175 CPUs utilized | 19,913,312,819 | 49,593,392 / 6,507,280 |
+--------------------[ FAST ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| brotli_1.1.0 | 370,294,709 | 6.1 / 5.2 / 0.8 | 1.000 CPUs utilized | 17,065,934,394 | 171,518,735 / 82,851,994 |
| rarlinux-x64-700b3 | 399,313,787 | 6.7 / 9.3 / 0.9 | 1.531 CPUs utilized | 37,354,158,966 | 166,230,623 / 85,640,325 |
| BriefLZ_1.3.0_CLANG_17.0.4 | 476,307,190 | 6.9 / 6.0 / 0.8 | 1.000 CPUs utilized | 27,125,792,763 | 88,295,646 / 31,016,221 |
| BriefLZ_1.3.0_GCC_13.2.1 | 476,307,190 | 8.1 / 7.2 / 0.8 | 1.000 CPUs utilized | 31,513,004,141 | 90,967,111 / 32,762,390 |
| 7zzs_23.01's gz | 544,531,970 | 8.8 / 8.4 / 0.3 | 1.000 CPUs utilized | 60,531,034,012 | 1,131,330 / 129,222 |
| 7zzs_23.01's 7z | 366,878,089 | 14.5 / 13.5 / 0.8 | 1.000 CPUs utilized | 76,506,480,464 | 143,437,881 / 68,732,482 |
| 7zzs_23.01's bz2 | 414,301,737 | 19.2 / 28.3 / 0.4 | 1.509 CPUs utilized | 132,876,974,414 | 1,340,889,710 / 11,315,495 |
| BSC_3.3.3_AVX2_CLANG_17.0.4 | 304,827,632 | 29.8 / 213.1 / 4.1 | 7.347 CPUs utilized | 604,969,912,535 | 2,348,629,362 / 1,233,981,644 |
+--------------------[ SLOW ]-+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
| zpaq715_sse4.1 | 289,466,679 | 4031.6 / 4000.1 / 9.6 | 1.000 CPUs utilized | 24,939,199,778,486 | 136,354,757,447 / 28,877,270,011 |
+-----------------------------+-----------------+----------------------------------+-----------------------+--------------------+----------------------------------+
Note01: The Walltime includes LOAD-DECOMPRESS-DUMP times, that is, external-RAM -> internal-RAM -> external-RAM.
Note02: The decompression is done on a RamDisk of size 32GB; both the compressed and the decompressed files reside on it.
Note03: A comparison was made: each decompressed file was compared with the original.
Note04a: The last column is quite informative latencywise; the Last-Level-Cache miss count indicates how much physical RAM (and the cache hierarchy) stalls the CPU.
Note04b: For instance, 177,744,185 / 57,211,272 = 3.1, i.e. roughly every third attempt to load from the Last-Level-Cache misses; it says that with a bigger L3 (the i7-8565U has 8 MB), Nakamichi's main bottleneck would have less impact.
Note05: Decompression times are the fastest of three runs, with an enforced 7-second sleep in between in order to cool off.
Note06: Another useful measure is DIPB, which stands for Decompression-Instructions-Per-Byte; since Nakamichi is simplistic and uses no entropy stage, it has the lowest DIPB: 4,211,049,650 / 1,576,788,480 = 2.67.
Note07: The whole Read-Decompress-Write trio is done on a RAM disk, created as follows:
sudo mkdir /tmp/ramdisk
sudo chmod 777 /tmp/ramdisk
sudo mount -t tmpfs -o size=32G myramdisk /tmp/ramdisk
#sudo umount /tmp/ramdisk/
Note08: Joergen's BriefLZ was compiled with these lines:
gcc -O3 -fomit-frame-pointer -fstrict-aliasing -o BriefLZ_1.3.0_GCC_13.2.1_64bit.elf -I../include blzpack.c parg.c ../src/brieflz.c ../src/depack.c ../src/depacks.c -D_N_HIGH_PRIORITY
clang -O3 -fomit-frame-pointer -fstrict-aliasing -o BriefLZ_1.3.0_CLANG_17.0.4_64bit.elf -I../include blzpack.c parg.c ../src/brieflz.c ../src/depack.c ../src/depacks.c -D_N_HIGH_PRIORITY
Bottomlines:
Obviously, Whiskey Lake rocks, being only a 25W part.
Oh, I wanted to include Fabrice Bellard's superthrasher NNCP ... some night.
2023-Dec-30, Kaze (sanmayce@sanmayce.com)
Hi Conor, many thanks for the undreamt-of performance of your LZSSE2, simply the FASTEST decompressor!
This is not an issue but feedback; I wish this site had a [Feedback] section as well. My wish was to have LZSSE2 in the form of a console tool too, so I attempted to make one, though not as it ought to be done (your etude - your tool, that's the right combination). However, I wanted level 17 in my textual comparisons, so I just embedded LZSSE2 into my fastest (old, 1MB sliding window) Nakamichi; the result:
LZSSE2 excels at:
Overall, significantly better everywhere - LZSSE2 is superior to Tengu, hands down. In my benchmarks with Hamid's TurboBench (from Feb 21), LZSSE2 level 16 decompresses 2x faster than Nakamichi 'Goldenboy'! With Haswell and above I expect 3x, even 4x. For more tests (console dumps), you may see my compression logs/notes (far from finished) at: www.sanmayce.com/Downloads/The-Last-Stand_booklet.pdf
Also, in the www.sanmayce.com/Downloads/TEXTUAL_MADNESS.zip package I made one .bat file running 12 compressors against a given file, thus giving a quick look at where one ranks:
Performers:
Level 17 gives excellent tightness and incredible decompression speed (i5-2430M @3GHz, DDR3 @666MHz):
The SSE4.1 and AVX .cod files are included (Assembly, that is) - do you see the register utilization/distribution you intended? In the AVX code I see 4466 lines for the LZSSE2_Decompress procedure, while the SSE4.1 one amounts to 4819; how does this translate into speed, say, on Haswell? On the i5-2430M @3GHz, DDR3 @666MHz I see no speed difference at all:
And to mix the Level 17 results with TurboBench's:
Very strange: decompression speed differs a lot between Hamid's bench and mine (my trial count is 64) - with 'dickens', Intel 15.0 is 2x faster than GCC 5.3.0, or am I wrong?! Also, I have no clue why with 'alice29' my bench gives a miserable 53 MB/s whereas TurboBench reports 1810.58 MB/s?! That's why I told you that my knowledge is inferior; I failed to offer a reliable bench. Maybe I will replace clock() with:
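Something along these lines - the exact snippet I had in mind is not reproduced here, but the usual higher-resolution replacement on Windows is QueryPerformanceCounter, with many repetitions so that even alice29-sized inputs accumulate measurable work. A minimal sketch:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);    /* counter ticks per second */

    QueryPerformanceCounter(&t0);
    /* ... repeat the decompression call many times here, then divide
       the elapsed time by the number of repetitions ... */
    QueryPerformanceCounter(&t1);

    printf("%.6f s\n",
           (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    return 0;
}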
I still don't understand the decompression code, even partially; yet at first glance the code generated by Intel 15.0 is tight and makes full use of the registers, no?!
I cannot say "keep up the fantastic work" since you have already made good.
Best, Sanmayce