alef78 / lzoma

experimental data compression algorithm
GNU General Public License v2.0

Mismatched in vs out lengths #5

Closed Sanmayce closed 8 years ago

Sanmayce commented 8 years ago

Hi alef, I wanted to quickly compare your "binary" tight compressor with my "textual" semi-tight one, so here is some feedback:

C:\Program Files\mingw-w64\x86_64-5.1.0-posix-seh-rt_v4-rev0>echo off
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\>d:

D:\>cd D:\LZOMA__\lzoma-master

D:\LZOMA__\lzoma-master>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA__\lzoma-master

01/15/2016  06:35 AM    <DIR>          .
01/15/2016  06:35 AM    <DIR>          ..
01/15/2016  06:35 AM               246 .gitignore
01/15/2016  06:35 AM    <DIR>          ari
01/15/2016  06:35 AM             2,309 bpe.h
01/15/2016  06:35 AM    <DIR>          bytes
01/15/2016  06:35 AM            47,238 divsufsort.c
01/15/2016  06:35 AM             1,766 divsufsort.h
01/15/2016  06:35 AM               880 e8.h
01/15/2016  06:35 AM            18,047 LICENSE
01/15/2016  06:35 AM               311 lzoma.h
01/15/2016  06:35 AM               364 Makefile
01/15/2016  06:35 AM            26,351 pack.c
01/15/2016  06:35 AM             2,214 readme.txt
01/15/2016  06:35 AM             3,487 unpack.c
01/15/2016  06:35 AM             3,064 unpack_lzoma.S
01/15/2016  06:35 AM    <DIR>          x86
              12 File(s)        106,277 bytes
               5 Dir(s)  85,857,648,640 bytes free

D:\LZOMA__\lzoma-master>gcc -O2 -pipe pack.c divsufsort.c -o pack
pack.c: In function 'main':
pack.c:937:3: warning: implicit declaration of function 'close' [-Wimplicit-function-declaration]
   close(ifd);
   ^

D:\LZOMA__\lzoma-master>gcc -Os -fomit-frame-pointer -std=c99 -Os -pipe unpack.c -o unpack
unpack.c:12:0: warning: "O_BINARY" redefined
 #define O_BINARY 0
 ^
In file included from unpack.c:6:0:
C:/Program Files/mingw-w64/x86_64-5.1.0-posix-seh-rt_v4-rev0/mingw64/x86_64-w64-mingw32/include/fcntl.h:44:0: note: this is the location of the previous definition
 #define O_BINARY _O_BINARY
 ^

D:\LZOMA__\lzoma-master>dir *.exe
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA__\lzoma-master

01/15/2016  04:57 PM           102,475 pack.exe
01/15/2016  04:57 PM            51,050 unpack.exe
               2 File(s)        153,525 bytes
               0 Dir(s)  85,857,488,896 bytes free

D:\LZOMA__\lzoma-master>cd..

D:\LZOMA__>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA__

01/15/2016  05:00 PM    <DIR>          .
01/15/2016  05:00 PM    <DIR>          ..
01/15/2016  04:48 PM        13,713,275 Complete_Works_of_Fyodor_Dostoyevsky.txt
10/12/2015  09:00 AM         4,544,039 Complete_Works_of_Fyodor_Dostoyevsky.txt.Nakamichi
01/15/2016  04:48 PM        10,192,446 dickens
10/11/2015  04:14 PM         3,722,075 dickens.Nakamichi
01/15/2016  04:48 PM           806,312 Dorogha_nikuda_-_Alieksandr_Grin_(Russian).txt
01/15/2016  04:35 PM           236,153 Dorogha_nikuda_-_Alieksandr_Grin_(Russian).txt.Nakamichi
01/15/2016  05:00 PM       100,000,000 enwik8
10/21/2015  11:47 PM        33,445,192 enwik8.Nakamichi
01/15/2016  04:48 PM           570,901 Gulyakovskiyi_E._Dolgiyi_Voshod_Na_Yenne.html
01/15/2016  04:30 PM           281,248 Gulyakovskiyi_E._Dolgiyi_Voshod_Na_Yenne.html.Nakamichi
01/15/2016  04:48 PM         5,245,293 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt
10/11/2015  06:02 PM         1,929,859 Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt.Nakamichi
01/15/2016  04:48 PM           721,645 Legends_of_the_Fire_Spirits_[Robert_Lebling].txt
01/15/2016  04:39 PM           331,458 Legends_of_the_Fire_Spirits_[Robert_Lebling].txt.Nakamichi
01/15/2016  04:57 PM    <DIR>          lzoma-master
01/15/2016  04:55 PM            51,370 lzoma-master_2016-Jan-15_16h55m.zip
01/01/2002  04:41 AM           102,912 Nakamichi_Kintaro_Intel_15.0_32bit.exe
01/15/2016  05:00 PM        12,030,464 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar
10/14/2015  05:18 PM         4,328,336 New_York_Times_Bestsellers_-_August_2015_-_20_ebooks.tar.Nakamichi
01/15/2016  05:00 PM        14,613,183 The_Book_of_The_Thousand_Nights_and_a_Night.txt
10/12/2015  03:26 AM         5,228,912 The_Book_of_The_Thousand_Nights_and_a_Night.txt.Nakamichi
01/15/2016  04:49 PM           698,072 The_Death_Ship_-_B.Traven.pdf.txt
01/15/2016  04:43 PM           294,608 The_Death_Ship_-_B.Traven.pdf.txt.Nakamichi
01/15/2016  04:49 PM            92,096 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt
10/11/2015  02:12 PM            43,944 The_Little_Prince_-_Antoine_de_Saint-Exupery.epub.txt.Nakamichi
01/15/2016  04:49 PM         7,137,280 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar
10/11/2015  05:16 PM         2,438,374 The_Project_Gutenberg_12_Fairy_Books_by_Andrew_Lang.tar.Nakamichi
01/15/2016  04:49 PM         4,445,260 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt
10/11/2015  11:15 AM         1,420,630 The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.Nakamichi
01/15/2016  04:50 PM         2,091,543 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt
01/15/2016  04:14 PM           751,822 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.Nakamichi
01/15/2016  04:50 PM         3,265,536 University_of_Canterbury_The_Calgary_Corpus.tar
10/11/2015  02:10 PM         1,307,498 University_of_Canterbury_The_Calgary_Corpus.tar.Nakamichi
              32 File(s)    236,081,736 bytes
               3 Dir(s)  85,621,383,168 bytes free

D:\LZOMA__>lzoma-master\pack.exe -9 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma
got 2091543 bytes, packing The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt into The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma...
stats noe8 2064343 e8 2064343
reverted e8
init done.
4095 left
res=4944198
res bytes=618025
out bytes=618024
closing files let=21327 lz=205920 olz=5750
bits lzlit=232997 let=170616 olz=18192 match=3487236 len=1035120

D:\LZOMA__>lzoma-master\unpack.exe The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.unpack

D:\LZOMA__>dir the_se*
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA__

01/15/2016  04:50 PM         2,091,543 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt
01/15/2016  05:04 PM           618,033 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma
01/15/2016  04:14 PM           751,822 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.Nakamichi
01/15/2016  05:05 PM         2,092,720 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.unpack
               4 File(s)      5,554,118 bytes
               0 Dir(s)  85,618,671,616 bytes free

D:\LZOMA__>lzoma-master\pack.exe -9 University_of_Canterbury_The_Calgary_Corpus.tar University_of_Canterbury_The_Calgary_Corpus.tar.lzoma
got 3265536 bytes, packing University_of_Canterbury_The_Calgary_Corpus.tar into University_of_Canterbury_The_Calgary_Corpus.tar.lzoma...
stats noe8 3080968 e8 3078993
reverted e8
init done.
4095 left
res=7683585
res bytes=960449
out bytes=960321
closing files let=116512 lz=294056 olz=45897
bits lzlit=456465 let=932096 olz=85456 match=4751958 len=1456575

D:\LZOMA__>lzoma-master\unpack.exe University_of_Canterbury_The_Calgary_Corpus.tar.lzoma University_of_Canterbury_The_Calgary_Corpus.tar.unpack

D:\LZOMA__>dir uni*
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA__

01/15/2016  05:07 PM         3,265,536 University_of_Canterbury_The_Calgary_Corpus.tar
01/15/2016  05:10 PM           960,330 University_of_Canterbury_The_Calgary_Corpus.tar.lzoma
10/11/2015  02:10 PM         1,307,498 University_of_Canterbury_The_Calgary_Corpus.tar.Nakamichi
01/15/2016  05:13 PM         3,329,597 University_of_Canterbury_The_Calgary_Corpus.tar.unpack
               4 File(s)      8,862,961 bytes
               0 Dir(s)  85,614,379,008 bytes free

D:\LZOMA__>

Is the problem in pack or unpack? I hope you will fix it, since your approach is so tight and promising.

Tried two Russian texts as well:

01/15/2016  04:48 PM        13,713,275 Complete_Works_of_Fyodor_Dostoyevsky.txt
01/15/2016  06:03 PM         3,928,262 Complete_Works_of_Fyodor_Dostoyevsky.txt.lzoma
10/12/2015  09:00 AM         4,544,039 Complete_Works_of_Fyodor_Dostoyevsky.txt.Nakamichi

01/15/2016  04:48 PM           806,312 Dorogha_nikuda_-_Alieksandr_Grin_(Russian).txt
01/15/2016  05:23 PM           202,650 Dorogha_nikuda_-_Alieksandr_Grin_(Russian).txt.lzoma
01/15/2016  04:35 PM           236,153 Dorogha_nikuda_-_Alieksandr_Grin_(Russian).txt.Nakamichi

01/15/2016  04:48 PM           570,901 Gulyakovskiyi_E._Dolgiyi_Voshod_Na_Yenne.html
01/15/2016  05:24 PM           214,803 Gulyakovskiyi_E._Dolgiyi_Voshod_Na_Yenne.html.lzoma
01/15/2016  04:30 PM           281,248 Gulyakovskiyi_E._Dolgiyi_Voshod_Na_Yenne.html.Nakamichi

01/15/2016  04:48 PM           721,645 Legends_of_the_Fire_Spirits_[Robert_Lebling].txt
01/15/2016  06:06 PM           256,580 Legends_of_the_Fire_Spirits_[Robert_Lebling].txt.lzoma
01/15/2016  04:39 PM           331,458 Legends_of_the_Fire_Spirits_[Robert_Lebling].txt.Nakamichi

Oh, and could you give (in the comments) the best options/values for textual data?

alef78 commented 8 years ago

Thanks for reporting that. I have only tested under Linux and have not tried Windows yet. It seems that the unpacked files were written in text mode instead of binary mode. The probable cause is #define O_BINARY 0; I think my latest commit fixes that, but I have not tested it yet.
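The text-mode diagnosis above fits the numbers in the report: the first unpacked file came out 1,177 bytes larger than the original (2,092,720 vs 2,091,543), which is what one would expect if newline bytes were expanded on write. A minimal sketch of the usual guard follows, assuming the fix is simply to define O_BINARY only when the platform headers do not. The helpers write_binary and file_size are hypothetical names for illustration and are not taken from unpack.c.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#if defined(_WIN32)
#include <io.h>      /* open, write, close on Windows toolchains */
#else
#include <unistd.h>  /* write, close on POSIX systems */
#endif

/* Fall back to 0 only where the platform has no O_BINARY (e.g. Linux).
   An unconditional "#define O_BINARY 0" would override the real flag
   on Windows and leave the file in text mode. */
#ifndef O_BINARY
#define O_BINARY 0
#endif

/* Hypothetical helper: write len bytes to path with no newline
   translation. Returns 0 on success, -1 on failure. */
static int write_binary(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644);
    if (fd < 0)
        return -1;
    while (len > 0) {
        /* Write in bounded chunks so the count fits both POSIX and
           Windows write() signatures. */
        unsigned chunk = len > 65536 ? 65536u : (unsigned)len;
        long n = (long)write(fd, p, chunk);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd) == 0 ? 0 : -1;
}

/* Hypothetical helper: size of a file in bytes, or -1 on error. */
static long file_size(const char *path)
{
    long size = -1;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    if (fseek(f, 0, SEEK_END) == 0)
        size = ftell(f);
    fclose(f);
    return size;
}
```

Writing the four bytes a, LF, b, LF through write_binary should leave a 4-byte file on both Linux and Windows; in text mode on Windows the same write would typically produce 6 bytes. Including unistd.h (or io.h on Windows) also declares close, which addresses the implicit-declaration warning seen when building pack.c.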

As for compression options, the current settings for level -9 are the practical maximum. It is possible to compress better by a few bytes by increasing the values in the pack.c source, but compression quickly becomes extremely slow while the compression ratio stays practically the same.

Sanmayce commented 8 years ago

It seems you fixed it already:


D:\LZOMA___\lzoma-master>gcc -O2 -pipe pack.c divsufsort.c -o pack
pack.c: In function 'main':
pack.c:937:3: warning: implicit declaration of function 'close' [-Wimplicit-function-declaration]
   close(ifd);
   ^

D:\LZOMA___\lzoma-master>gcc -Os -fomit-frame-pointer -std=c99 -Os -pipe unpack.c -o unpack

D:\LZOMA___\lzoma-master>dir *.exe
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA___\lzoma-master

01/15/2016  07:36 PM           102,475 pack.exe
01/15/2016  07:36 PM            51,050 unpack.exe
               2 File(s)        153,525 bytes
               0 Dir(s)  85,609,353,216 bytes free

D:\LZOMA___\lzoma-master>cd..

D:\LZOMA___>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA___

01/15/2016  07:36 PM    <DIR>          .
01/15/2016  07:36 PM    <DIR>          ..
01/15/2016  07:36 PM    <DIR>          lzoma-master
01/15/2016  07:34 PM            51,390 lzoma-master_2016-Jan-15_19h35m.zip
01/15/2016  04:50 PM         2,091,543 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt
01/15/2016  05:07 PM         3,265,536 University_of_Canterbury_The_Calgary_Corpus.tar
               3 File(s)      5,408,469 bytes
               3 Dir(s)  85,603,987,456 bytes free

D:\LZOMA___>lzoma-master\pack.exe -9 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma
got 2091543 bytes, packing The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt into The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma...
stats noe8 2064343 e8 2064343
reverted e8
init done.
4095 left
res=4944198
res bytes=618025
out bytes=618024
closing files let=21327 lz=205920 olz=5750
bits lzlit=232997 let=170616 olz=18192 match=3487236 len=1035120

D:\LZOMA___>lzoma-master\unpack.exe The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.unpack

D:\LZOMA___>dir the*
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\LZOMA___

01/15/2016  04:50 PM         2,091,543 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt
01/15/2016  07:40 PM           618,033 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.lzoma
01/15/2016  07:40 PM         2,091,543 The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.unpack
               3 File(s)      4,801,119 bytes
               0 Dir(s)  85,601,275,904 bytes free

D:\LZOMA___>fc The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt.unpack /b
Comparing files The_Secret_Teachings_of_all_Ages_-_Manly_Palmer_Hall.epub.txt and THE_SECRET_TEACHINGS_OF_ALL_AGES_-_MANLY_PALMER_HALL.EPUB.TXT.UNPACK
FC: no differences encountered

D:\LZOMA___>

> Thanks for reporting that.

Oh, LZOMA kicked my ass, it is so cool. Hope you refine it to the point it becomes a paragonic performer.

> It is possible to compress better by a few bytes, by increasing values in pack.c source, but compression speed can quickly become extremely slow and compression ratio is practically the same.

Okay, I thought that replacing

{3,100,1000}

with higher values would tighten the ratio. Also, what is the size of your "window", and have you thought of making it 28-bit (256MB), like Nakamichi's?

alef78 commented 8 years ago

Window size is currently set to 16MB. It is possible to make it larger (this requires tuning some variables in lzoma.h; so far I have tried 64MB). Compressor memory usage is currently too large: 33 times the window size. I have some ideas for reducing it, and will probably implement an option to set the window size after that.
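To put the 33x figure in perspective, here is a rough sketch of the arithmetic in C. It assumes the factor applies uniformly to any window size and reads "MB" as MiB; the constant name LZOMA_MEM_FACTOR and both functions are hypothetical and do not come from lzoma.h or pack.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption taken from the comment above: the compressor needs about
   33 bytes of memory per byte of window. Hypothetical name. */
#define LZOMA_MEM_FACTOR 33u

/* Estimated compressor memory, in bytes, for a given window size. */
static uint64_t compressor_mem_bytes(uint64_t window_bytes)
{
    return (uint64_t)LZOMA_MEM_FACTOR * window_bytes;
}

/* Print the estimate for a window size given in MiB. */
static void print_estimate(unsigned window_mib)
{
    uint64_t window = (uint64_t)window_mib << 20;
    uint64_t mem_mib = compressor_mem_bytes(window) >> 20;
    printf("window %4u MiB -> about %5llu MiB of compressor memory\n",
           window_mib, (unsigned long long)mem_mib);
}
```

Under these assumptions, the current 16MB window costs about 528MB, the 64MB window already tried costs about 2,112MB, and a 256MB (28-bit) window would need about 8,448MB, which shows why memory usage has to come down before the window becomes a user option.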