Improve Token detection

neophob commented 6 years ago

token detection can be improved, here's a sample output of my project:

--- One extra occurrence needed for a gain --
   val=7, gain=0->1 (+1), N=4, str = 16
   val=7, gain=0->1 (+1), N=4, str = (g
   val=7, gain=0->1 (+1), N=4, str = +d
   val=7, gain=0->1 (+1), N=4, str = ]=
   val=7, gain=0->1 (+1), N=4, str = ,u
   val=7, gain=0->1 (+1), N=4, str = *s
   val=8, gain=-1->1 (+2), N=2, str = var
   val=8, gain=-1->1 (+2), N=2, str = ar 
   val=8, gain=-1->1 (+2), N=2, str = if(
   val=8, gain=-1->1 (+2), N=2, str = /51
   val=8, gain=-1->1 (+2), N=2, str = )%5
   val=8, gain=-1->1 (+2), N=2, str = 16,
   val=8, gain=-1->1 (+2), N=2, str = set
   val=8, gain=-1->1 (+2), N=2, str = 2e3
   val=8, gain=-1->1 (+2), N=2, str = *s,
   val=19, gain=0->3 (+3), N=2, str = var

this means:

var and var exist -> should be combined as var - maybe combine it even with ar
16 and 16, exist -> should be combined as 16
ditto for *s, and *s

Siorki commented 6 years ago

That output is produced by feature #48 introduced in v5.0. Those substrings are candidates patterns- meaning that they were considered to be replaced by 1-character tokens, but not selected since that would come at a loss, or just break even.

This section is intended as a help for the user to improve compression. For instance, if you have s* somewhere in your code, you could swap the operands to have an extra instance of *s which would then get compressed and gain 1 byte.

The line giving you this information is :
val=7, gain=0->1 (+1), N=4, str = *s In detail, it means that there are 4 instances of *s, and that replacing them with a token is useless (gain = 0). However if you could have an extra instance of the same substring, the compression would then bring a net gain of 1 byte.

As for merging 16 and 16,, your mileage may vary : you have 4 instances of the former and only 2 of the latter, so keeping the shortest one may or may not be the optimal solution. The gain upon packing depend on both string length and occurrence count, you can influence the algorithm by tweaking the score formula (default 1/0/0).

neophob commented 6 years ago

thanks @Siorki now that make sense!

about the formula - today i use a script which produce several regpack outputs with different settings. what about a brute force attempt to find the best compression? think thats feasible?

Siorki commented 6 years ago

Feasible yet difficult - that's exactly the point of issue #7

neophob commented 6 years ago

thanks!

Siorki / RegPack

Improve Token detection #80