atomicobject / heatshrink

data compression library for embedded/real-time systems
ISC License
1.33k stars 180 forks source link

Reducing break even point may lower compression ratio #32

Open unixdj opened 9 years ago

unixdj commented 9 years ago

Reducing the break even point, as done in f533992, should theoretically improve compression ratio. However, in some cases emitting a literal or two followed by a backref would make the encoding overall shorter than two shorter backrefs that the (naturally greedy) encoder finds. E.g., when compressing kennedy.xls from the Canterbury Corpus with -w 8 -l 7 with different break even points, one can observe many such occurences (- is break even point 3, + is 2):

 ## step_search, scan @ +15 (271/512), input size 256
 ss Match not found
 ## step_search, scan @ +16 (272/512), input size 256
-ss Match not found
-## step_search, scan @ +17 (273/512), input size 256
-ss Found match of 6 bytes at 8
+ss Found match of 2 bytes at 1
+## step_search, scan @ +18 (274/512), input size 256
+ss Found match of 5 bytes at 8
 ## step_search, scan @ +23 (279/512), input size 256
 ss Match not found
 ## step_search, scan @ +24 (280/512), input size 256

It would be nice to try to find a way to optimize this within the contraints of an embedded environment. If possible, without backtracking.

unixdj commented 9 years ago

A quick bechmark shows that the following moification of step_search() seems to improve compression ratio:

match = find_match(here);
if (match.found && find_match(here + 1).len <= match.len) {
    emit_backref();
else
    emit_literal();

(here + 1 seems to work better than here + break_even_point / 9 for some reason.)

Time to code it without calling find_match() twice as much. It's also completely meaningless when backref encoding is not longer than literal encoding.

unixdj commented 9 years ago

I haven't had much time to play with it lately, but here's an implementation of the above: https://github.com/atomicobject/heatshrink/compare/master...unixdj:look-forward?expand=1 It's still really slow. Really really slow. But compression ratio is higher.