Open Siorki opened 10 years ago
Just passing by to say: much hype for this feature! It's a little annoying to try jscrush / first crush, plus random values for gain, length and copies. I wonder if a "bruteforce" button, next to the "pack" button could help to find the best settings automatically :)
Experimentation not complete, postponing to v6
Hi
I did a very naive brute force loop in the runPacker
(see below - beware, code is just a Proof of concept and ugly), but it can be used as benchmark runner to find the best settings:
var currentData = inputList[inputIndex];
// first stage : configurable crusher
console.error('inputList.length',inputList.length,currentData);
var MAX_ITERATIONS = [1,4,4,4];
var bruteForceTable = [0,0,0,0];
var minimalSize = 99999;
var settings;
var start = Date.now();
while (bruteForceRunning) {
options.crushTiebreakerFactor = bruteForceTable[0];
options.crushGainFactor = bruteForceTable[1];
options.crushLengthFactor = bruteForceTable[2];
options.crushCopiesFactor = bruteForceTable[3];
var tmpData = currentData;
var tmpOutput = this.findRedundancies(tmpData, options);
tmpData.result.push(tmpOutput);
tmpOutput = this.packToRegexpCharClass(tmpData, options);
tmpData.result.push(tmpOutput);
tmpOutput = this.packToNegatedRegexpCharClass(tmpData)[0];
var size = tmpOutput;
if (size < minimalSize) {
minimalSize = size;
settings = bruteForceTable.slice(0);
console.error(bruteForceTable, 'minimalSize', minimalSize);
} else console.error(bruteForceTable, '!', size);
if (bruteForceTable[3]++ >= MAX_ITERATIONS[3]) {
bruteForceTable[3] = 0;
if (bruteForceTable[2]++ >= MAX_ITERATIONS[2]) {
bruteForceTable[2] = 0;
if (bruteForceTable[1]++ >= MAX_ITERATIONS[1]) {
bruteForceTable[1] = 0;
if (bruteForceTable[0]++ >= MAX_ITERATIONS[0]) {
bruteForceRunning = false;
}
}
}
}
}
console.error('duration:', (Date.now()-start)/1000);
console.error('minimalsize', minimalSize);
console.error('settings', settings);
options.crushTiebreakerFactor = settings[0];
options.crushGainFactor = settings[1];
options.crushLengthFactor = settings[2];
options.crushCopiesFactor = settings[3];
it takes quite a while to run all the settings - there are optimisations we could do (no logging, no details etc) but running that each time need more performance improvements.
What do you think, adding a "benchmark" step is usable?
Moved the first point (maximal patterns) to new issue #93.
Disjoint patterns - those that have no character in common between them in any of their occurrences - can be substituted in any order. Replacing pattern A
with a token, then pattern B
with another token, or starting with pattern B
then A
yield equivalent results : while the token assignment will differ, and so will the order in the dictionary string added to the compressed text, both results will be of the same length. Using vocabulary from group theory, we will say that patterns A
and B
commute.
Similarly, if pattern A
is fully included in pattern B
, they will also commute, and the replacement order will not matter.
However, if those patterns overlap by only a few characters, A
extending further on the left and B
on the right, then they will not usually commute : replacing pattern A
with a token, then pattern B
with another token, yield a different result than replacing B
then A
, which can be better (shorter), the same length, or worse (longer).
As an example, consider the following string to be packed :
bank*king-bank/king+bank-king,king*banking/banking;king
It has three maximal patterns (see definition above and in linked issue #93)
bank
: 5 copies, length 4king
: 7 copies, length 4banking
: 2 copies, length 7They overlap by the single character k
. It is obvious that if you start by replacing bank
with a token such as 0
, some occurrences of king
will also be removed at the same time. The resulting string features only 5 of them, instead of the initial 7 :
0*king-0/king+0-king,king*0ing/0ing;king
So after this step, we have the following options for the next pattern :
king
: 5 copies, length 4ing
: 7 copies, length 30ing
: 2 copies, length 4Basically, the other pattern can be selected either fully, but with less occurrences, or have the overlapping part removed, with as many occurrences as before. Both options yield a lower gain than having 7 occurrences of king
.
Had we started by replacing king
first, we would similarly end up choosing between bank
, ban
and ban0
.
In this case, opting for king
first, then ban
, gives a shorter output than starting by bank
then ing
. This can easily be inferred from the fact that king
features two more copies than bank
in the initial string, so 2 more overlapping k
will be removed while replacing by the token by choosing it first. But the choice is not always that simple.
Current crusher phase uses code originating from JSCrush / First Crush. The candidates strings for each token are ranked according to string length and copies count. Default settings are not always optimal and sometimes require hand tuning to obtain the best results.
Improvement : replace the crusher algorithm, the new one shall feature