Siorki / RegPack

Self-contained packer for size-constrained JS code
Other
299 stars 14 forks source link

Improve Token detection #80

Closed neophob closed 6 years ago

neophob commented 6 years ago

token detection can be improved, here's a sample output of my project:

--- One extra occurrence needed for a gain --
   val=7, gain=0->1 (+1), N=4, str = 16
   val=7, gain=0->1 (+1), N=4, str = (g
   val=7, gain=0->1 (+1), N=4, str = +d
   val=7, gain=0->1 (+1), N=4, str = ]=
   val=7, gain=0->1 (+1), N=4, str = ,u
   val=7, gain=0->1 (+1), N=4, str = *s
   val=8, gain=-1->1 (+2), N=2, str = var
   val=8, gain=-1->1 (+2), N=2, str = ar 
   val=8, gain=-1->1 (+2), N=2, str = if(
   val=8, gain=-1->1 (+2), N=2, str = /51
   val=8, gain=-1->1 (+2), N=2, str = )%5
   val=8, gain=-1->1 (+2), N=2, str = 16,
   val=8, gain=-1->1 (+2), N=2, str = set
   val=8, gain=-1->1 (+2), N=2, str = 2e3
   val=8, gain=-1->1 (+2), N=2, str = *s,
   val=19, gain=0->3 (+3), N=2, str = var 

this means:

Siorki commented 6 years ago

That output is produced by feature #48 introduced in v5.0. Those substrings are candidates patterns- meaning that they were considered to be replaced by 1-character tokens, but not selected since that would come at a loss, or just break even.

This section is intended as a help for the user to improve compression. For instance, if you have s* somewhere in your code, you could swap the operands to have an extra instance of *s which would then get compressed and gain 1 byte.

The line giving you this information is :
val=7, gain=0->1 (+1), N=4, str = *s In detail, it means that there are 4 instances of *s, and that replacing them with a token is useless (gain = 0). However if you could have an extra instance of the same substring, the compression would then bring a net gain of 1 byte.

As for merging 16 and 16,, your mileage may vary : you have 4 instances of the former and only 2 of the latter, so keeping the shortest one may or may not be the optimal solution. The gain upon packing depend on both string length and occurrence count, you can influence the algorithm by tweaking the score formula (default 1/0/0).

neophob commented 6 years ago

thanks @Siorki now that make sense!

about the formula - today i use a script which produce several regpack outputs with different settings. what about a brute force attempt to find the best compression? think thats feasible?

Siorki commented 6 years ago

Feasible yet difficult - that's exactly the point of issue #7

neophob commented 6 years ago

thanks!