Siorki / RegPack

Self-contained packer for size-constrained JS code

gain #91

Open da0ka opened 5 years ago

da0ka commented 5 years ago

regPack.js lines 210, 249 & 266: `Z = R*j - R - j - 2`; line 267: `Z1 = (R+1)*j - (R+1) - j - 2`

I think that -2 could be -1, because it sometimes gives a better result. It should be selectable through the options.

Siorki commented 5 years ago

Thanks for taking time to read and review the code. This part is legacy minified code, hence the non-explicit variable names: `R` is the length of the pattern and `j` its number of occurrences.

Thus the formula to compute the gain when replacing the pattern with a token: `Z = R*j - R - j - 2`

Testing for `Z > 0` means that the replacement will only be performed if it gains (shortens the input by) at least one byte. Using -1 instead of -2 means that a replacement would be performed with a net gain of zero. I do not see this improving the result on a theoretical basis; do you have any example where using -1 improves the result?
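For illustration, the crusher gain can be sketched as a small function (the names are mine; the actual code uses the minified variables above):

```javascript
// Sketch of the crusher gain formula Z = R*j - R - j - 2,
// generalized with tokenCost (1 for most tokens, more for
// characters that need escaping, such as "\").
// patternLen = R, occurrences = j.
function crusherGain(patternLen, occurrences, tokenCost = 1) {
  // Each occurrence shrinks from patternLen to tokenCost bytes;
  // one copy of the pattern and two copies of the token are
  // added to the packed output.
  return occurrences * (patternLen - tokenCost)
       - patternLen
       - 2 * tokenCost;
}

// Z > 0: replace only when at least one byte is gained.
console.log(crusherGain(2, 4)); // 0 -> rejected with -2, accepted with -1
console.log(crusherGain(2, 5)); // 1 -> accepted either way
```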

da0ka commented 5 years ago

Peter piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where's the peck of pickled peppers Peter Piper picked

The above text, RegPack'ed:

- using -1: 195 B to 148 B
- using -2: 195 B to 149 B

Siorki commented 5 years ago

All right, you found a very interesting edge case. Let's dive into the clockwork to explain what is happening here.

RegPack.findRedundancies()
Because of the value -1 instead of -2, the crusher will perform a token substitution even with a net gain of 0. This happens here with the string "er", which is replaced by y. Since this is done at break-even, after the crusher phase both the -1 and -2 versions have the same length: 140 bytes.
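A toy model (my own, not RegPack's actual code) of the crusher's bookkeeping shows why a gain of zero leaves the length unchanged:

```javascript
// Toy model of one crusher substitution: every occurrence of
// `pattern` becomes `token`, the pattern is appended once as the
// dictionary entry, and the token appears twice more in the
// output (one copy in the dictionary, one in the unpacking
// routine, per the -2 term).
function crushedLength(text, pattern, token) {
  const body = text.split(pattern).join(token);
  return body.length + pattern.length + 2 * token.length;
}

const text = "xxerxxerxxerxxer"; // "er" occurs 4 times
// Z = 2*4 - 2 - 4 - 2 = 0: break-even, length is unchanged.
console.log(crushedLength(text, "er", "y")); // 16, same as text.length
```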

RegPack.packToRegexpCharClass()
Now enters the packer. It reuses the matches from the previous phase, builds a dependency graph, reorders the matches according to the graph, identifies runs of consecutive tokens from the ASCII space, then starts assigning tokens to matches. At each step the gain for each string is recomputed and tested against a safeguard that discards any negative gain:

var gain = count*(packerData.matchesLookup[j].len-tokenCost)-packerData.matchesLookup[j].len-2*tokenCost;
var score = options.crushGainFactor*gain+options.crushLengthFactor*packerData.matchesLookup[j].len+options.crushCopiesFactor*count;
if (gain>=0) {

Notice the -2 in the expression (multiplied by tokenCost, usually 1 except for \). However, since the test is for a positive or null gain, our string "er" is kept even with a gain of zero. In one case we end up with a char class of [1-5] and 149 bytes, in the other [1-6] and 148 bytes. How is that possible, since the gain for the last string was zero and everything else is perfectly similar?

The reason is that the computed gain itself is off by one in the packer's case. The -2 represents two copies of the token, one in the dictionary, one in the unpacking routine, which is always correct for the crusher. However, it fails to capture the complexity of the regular expression for the packer: 5 or 6 tokens are included in a range costing 3 bytes, and that range will cost 3 bytes no matter how many consecutive characters it contains.
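That range cost can be sketched as follows (a hypothetical helper, not part of RegPack):

```javascript
// Byte cost of listing n consecutive characters inside a regex
// character class: one char is written as-is ("1"), two chars
// verbatim ("12"), and three or more collapse to a range ("1-6")
// that always costs 3 bytes.
function rangeCost(n) {
  return Math.min(n, 3);
}

// [1-5] and [1-6] cost the same 3 bytes, so the sixth token
// is free as far as the regular expression is concerned:
console.log(rangeCost(5), rangeCost(6)); // 3 3
```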

Replacing -2 with -1 in the packer is not a perfect solution either: I had an example from the benchmarks (need to retrieve it, coming later) where the last match, added thanks to the -1, needs a token from a new range, effectively growing the regular expression by one character.

Suggested solution: count the first three characters in a range as one byte each (thus -2 for them), then the subsequent ones in the same range are free (-1). What do you think of that?
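That accounting could look like the following sketch (my naming, assuming the same tokenCost as in the packer code above):

```javascript
// Sketch of the suggested fix: the constant in the gain formula
// depends on the token's position within its character-class
// range. The first three characters each grow the class by one
// byte ("1", "12", "1-3"), keeping the -2; later characters are
// free in the regex, leaving only -1 for the dictionary copy.
function gainConstant(indexInRange) {
  return indexInRange < 3 ? 2 : 1;
}

function packerGain(len, count, tokenCost, indexInRange) {
  return count * (len - tokenCost)
       - len
       - gainConstant(indexInRange) * tokenCost;
}

// A length-2 string with 4 copies, assigned as a sixth token:
console.log(packerGain(2, 4, 1, 5)); // 1 -> worth keeping
// The same string opening a new range:
console.log(packerGain(2, 4, 1, 0)); // 0 -> borderline
```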