ckolivas / lrzip

Long Range Zip
http://lrzip.kolivas.org
GNU General Public License v2.0

A new way to compare compression results #174

Closed pete4abw closed 3 years ago

pete4abw commented 4 years ago

Like many others, I tend to obsess over how well lrzip compresses, and I am always looking for ways to compare results. There are two major benchmarks for comparison.

  1. The obvious, compression ratio.
  2. Time to compress.

But how do you get an overall picture of the benefit of one compression method over another? How do you assess whether additional compression is worth additional time?

So, I decided on a (perhaps) mathematically shaky method of creating a compression index and time index for different compression methods and combining the two.

The long form of this analysis is here. [Edited to point to new version in main branch]

But here is a snippet which explains the methodology.

...an attempt is made to create an overall index and Rank for each method. For this the Compression Index and Time Index are ADDED and then divided by 2 to make the Overall Index scale to 100%. The lower the number the better!

The compression index is computed by comparing the size of a compressed file to the maximum (worst) size across all methods: MYSIZE/MAX(ALLSIZES). The time index compares the time to compress to the maximum (slowest) time to compress: MYTIME/MAX(ALLTIMES). The worst compression ratio will have an index of 100%. The slowest time to compress will have an index of 100%. All other compression and time indices will be relative to the worst compression and the slowest time.

Example

Compression size: 100
Worst Compression size: 120
Compression Index: 100/120 = 83.33% (percent relative to largest compressed size)

Time to Compress: 60 seconds
Slowest Time to Compress: 320 seconds
Time Index: 60/320 = 18.75% (percent relative to the slowest compression time)

Combined index: (83.33+18.75)/2 = 51.04. This number can be compared to all others in the set.
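The arithmetic in the example can be reproduced with a small sketch. The function names `index_pct` and `overall_index` are hypothetical helpers made up for illustration, not part of lrzip; the sketch only assumes a POSIX shell with awk.

```sh
#!/bin/sh
# Hypothetical helper: a value as a percentage of the worst (largest) value.
index_pct() {
    awk -v mine="$1" -v worst="$2" 'BEGIN { printf "%.2f\n", 100 * mine / worst }'
}

# Hypothetical helper: overall index = mean of compression index and time index.
# Arguments: size, worst size, time in seconds, slowest time in seconds.
overall_index() {
    awk -v s="$1" -v smax="$2" -v t="$3" -v tmax="$4" \
        'BEGIN { printf "%.2f\n", (100 * s / smax + 100 * t / tmax) / 2 }'
}

index_pct 100 120             # compression index, prints 83.33
index_pct 60 320              # time index, prints 18.75
overall_index 100 120 60 320  # combined index, prints 51.04
```

Lower is better for all three numbers, since each one is measured against the worst result in the set.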

Highlights of 11 different compression methods

Only the top three are shown here. The differences in compression among them were small, but the slowest took more than twice as long as the fastest!

Size  Name        Time
124M  LRZIP_ZPAQ  4:22
138M  ZPAQ_M4     5:00
145M  LRZIP       2:13

If we index these results, comparing them with one another:

Size         Name        Time       Comp Index  Time Index  Overall Index  Rank
145,853,453  LRZIP       02:13.294  100.00%     44.30%      72.15%         1
123,584,447  LRZIP_ZPAQ  04:22.010  84.73%      87.07%      85.90%         2
137,916,765  ZPAQ_M4     05:00.921  94.56%      100.00%     97.28%         3

Blending time and compression, LRZIP using LZMA comes out on top with an overall index score of 72%, vs 86% and 97% for the ZPAQ variants. Even though it had the worst compression ratio of the three, it had the best time by far, hence the better overall score.
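As a rough sketch, the ranking for these three results can be rebuilt from the raw sizes and times. The function `rank_methods` and its input layout (name, size in bytes, time as mm:ss.sss) are assumptions made for illustration; they are not part of lrzip or of the linked analysis.

```sh
#!/bin/sh
# Hypothetical helper: read "name size_bytes mm:ss.sss" rows on stdin and print
# "name comp_index time_index overall_index", lowest (best) overall index first.
rank_methods() {
    awk '
    {
        split($3, t, ":")                 # mm:ss.sss -> seconds
        name[NR] = $1
        size[NR] = $2 + 0
        secs[NR] = t[1] * 60 + t[2]
        if (size[NR] > smax) smax = size[NR]   # worst (largest) size
        if (secs[NR] > tmax) tmax = secs[NR]   # slowest time
    }
    END {
        for (i = 1; i <= NR; i++) {
            ci = 100 * size[i] / smax
            ti = 100 * secs[i] / tmax
            printf "%s %.2f %.2f %.2f\n", name[i], ci, ti, (ci + ti) / 2
        }
    }' | LC_ALL=C sort -k4,4n
}

rank_methods <<'EOF'
LRZIP_ZPAQ 123584447 04:22.010
ZPAQ_M4 137916765 05:00.921
LRZIP 145853453 02:13.294
EOF
```

Because both indexes are relative to the worst entry in the input, adding or removing a method changes every score, so results are only comparable within one set.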

How do you use this?

There is no best way. Obviously for smaller files, the important benchmark is time. For larger files, the important criterion is compression. Text files will generally compress faster than binary. With storage costs decreasing, SDRAM becoming faster and faster, and processing power ever-increasing, individual needs and requirements will vary. Hopefully, this helps.

pete4abw commented 4 years ago

LRZIP vs. LRZIP in Levels

Size         Name      Time       Comp Index  Time Index  Overall Index  Rank
183,084,179  LRZIP_L3  00:24.920  86.99%      15.13%      51.06%         1
178,224,370  LRZIP_L4  00:29.610  84.68%      17.97%      51.33%         2
192,488,812  LRZIP_L2  00:21.330  91.46%      12.95%      52.20%         3
210,463,000  LRZIP_L1  00:19.620  100.00%     11.91%      55.95%         4
150,801,305  LRZIP_L5  01:54.950  71.65%      69.77%      70.71%         5
149,548,952  LRZIP_L6  01:58.610  71.06%      71.99%      71.53%         6
145,853,453  LRZIP_L7  02:14.720  69.30%      81.77%      75.54%         7
145,288,873  LRZIP_L8  02:40.810  69.03%      97.61%      83.32%         8
145,105,437  LRZIP_L9  02:44.750  68.95%      100.00%     84.47%         9

Here it's clear that lrzip results can be split into two groups. Levels 1-4 had very fast times, between 19 and 30 seconds. Levels 5-9 had slower times, between 1:54 and 2:45. In the first group, even though the time index was between 11.9% and 18.0%, the compression index was between 84.7% and 100.0%. So, going from level 1 to level 4, you get a 15 point improvement in the compression index for only a 6 point increase in the time index. A good trade.

In the second group, the time index varies by 30 points (70-100), yet the compression index varies only slightly, between 72 and 69. Here, you get only a 3 point improvement in compression between levels 5 and 9, but with a time penalty of 30 points! Between levels 6 and 7 there is a roughly 1.8 point improvement in compression, but a 10 point penalty in time. This is why level 6 ranks higher than level 7.
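The level-to-level trade-off in the second group can be tabulated with a short sketch. The helper `marginal_tradeoff` is hypothetical; it takes rows of "level comp_index time_index" (copied here from the table for levels 5-9) and prints how many points of compression each step gains and how many points of time it costs.

```sh
#!/bin/sh
# Hypothetical helper: for each consecutive pair of rows, print the points of
# compression index gained and the points of time index given up.
marginal_tradeoff() {
    awk '
    NR > 1 {
        printf "%s->%s comp_gain=%.2f time_cost=%.2f\n", prev, $1, pc - $2, $3 - pt
    }
    { prev = $1; pc = $2; pt = $3 }'
}

marginal_tradeoff <<'EOF'
L5 71.65 69.77
L6 71.06 71.99
L7 69.30 81.77
L8 69.03 97.61
L9 68.95 100.00
EOF
```

A step is a good trade under this index only when the compression gain is larger than the time cost, since the two are weighted equally in the overall index.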

This tells me that if speed is important, choose level 4. If best compression for time is important, choose levels 6 or 7.

Here's a small shell script you can run. If running as root, uncomment the drop_caches lines. This flushes the memory caches which, along with sync, gives a truer speed comparison.

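#!/bin/sh
# lrzip speed test: compress one file at each level, 1 through 9
# if running as root, uncomment the drop_caches lines
usage() {
        echo "LRZIP Speed Test"
        echo "usage: $0 filename"
        exit 1
}

# quote "$1" so a missing argument or a filename with spaces is handled
[ -z "$1" ] && usage

for i in 1 2 3 4 5 6 7 8 9
do
        # flush pending writes so each run starts from a similar state
        sync
        sleep 1
#       echo 3 >/proc/sys/vm/drop_caches
#       sleep 1
        # use a per-level suffix so outputs do not overwrite each other;
        # stop at the first failure
        lrzip -L$i -S.$i.lrz "$1" || break
done

exit 0
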
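Once the loop has finished, the sizes needed for the compression index can be collected with something like the sketch below. `report_sizes` is a hypothetical helper, and it assumes the archives are named filename.LEVEL.lrz, which is what the -S suffix in the script implies; the times still have to be taken from lrzip's own output.

```sh
#!/bin/sh
# Hypothetical helper: print "L<level> <size_bytes>" for each archive found.
# Assumes outputs are named <filename>.<level>.lrz (implied by the -S suffix).
report_sizes() {
    for i in 1 2 3 4 5 6 7 8 9
    do
        out="$1.$i.lrz"
        [ -f "$out" ] || continue      # skip levels that were not produced
        # wc -c may pad its output with spaces on some systems, so strip them
        printf 'L%s %s\n' "$i" "$(wc -c < "$out" | tr -d ' ')"
    done
}

# Example (the filename is a placeholder):
# report_sizes bigfile.tar
```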
pete4abw commented 3 years ago

Just closing