Replace the crusher phase - explore combinations

Siorki commented 10 years ago

Current crusher phase uses code originating from JSCrush / First Crush. The candidates strings for each token are ranked according to string length and copies count. Default settings are not always optimal and sometimes require hand tuning to obtain the best results.

Improvement : replace the crusher algorithm, the new one shall feature

selection of the best substrings only, not all redundant strings. "Best" means that any string contained within another one, with the same redundancy count is discarded. To achieve this, always try to extend the string on each side until the copies count goes down.
identification of strings with overlap. On one side only, having strings entirely contained within another does not count (changing the order in which they are compressed does not impact the result). Create a graph of overlaps, then process each connected subset separately.
within a set, try all possible orders, unless the set is too big. Optimize the order that way.

xem commented 8 years ago

Just passing by to say: much hype for this feature! It's a little annoying to try jscrush / first crush, plus random values for gain, length and copies. I wonder if a "bruteforce" button, next to the "pack" button could help to find the best settings automatically :)

Siorki commented 7 years ago

Experimentation not complete, postponing to v6

neophob commented 6 years ago

Hi

I did a very naive brute force loop in the runPacker (see below - beware, code is just a Proof of concept and ugly), but it can be used as benchmark runner to find the best settings:

    var currentData = inputList[inputIndex];

     // first stage : configurable crusher

      console.error('inputList.length',inputList.length,currentData);
      var MAX_ITERATIONS = [1,4,4,4];
      var bruteForceTable = [0,0,0,0];
      var minimalSize = 99999;
      var settings;
      var start = Date.now();
      while (bruteForceRunning) {
        options.crushTiebreakerFactor = bruteForceTable[0];
        options.crushGainFactor = bruteForceTable[1];
        options.crushLengthFactor = bruteForceTable[2];
        options.crushCopiesFactor = bruteForceTable[3];
        var tmpData = currentData;

        var tmpOutput = this.findRedundancies(tmpData, options);
        tmpData.result.push(tmpOutput);

        tmpOutput = this.packToRegexpCharClass(tmpData, options);
        tmpData.result.push(tmpOutput);

        tmpOutput = this.packToNegatedRegexpCharClass(tmpData)[0];

        var size = tmpOutput;
        if (size < minimalSize) {
          minimalSize = size;
          settings = bruteForceTable.slice(0);
          console.error(bruteForceTable, 'minimalSize', minimalSize);
        } else console.error(bruteForceTable, '!', size);
        if (bruteForceTable[3]++ >= MAX_ITERATIONS[3]) {
          bruteForceTable[3] = 0;
          if (bruteForceTable[2]++ >= MAX_ITERATIONS[2]) {
            bruteForceTable[2] = 0;
            if (bruteForceTable[1]++ >= MAX_ITERATIONS[1]) {
              bruteForceTable[1] = 0;
              if (bruteForceTable[0]++ >= MAX_ITERATIONS[0]) {
                bruteForceRunning = false;
              }
            }
          }
        }
      }
      console.error('duration:', (Date.now()-start)/1000);
      console.error('minimalsize', minimalSize);
      console.error('settings', settings);
      options.crushTiebreakerFactor = settings[0];
      options.crushGainFactor = settings[1];
      options.crushLengthFactor = settings[2];
      options.crushCopiesFactor = settings[3];

it takes quite a while to run all the settings - there are optimisations we could do (no logging, no details etc) but running that each time need more performance improvements.

What do you think, adding a "benchmark" step is usable?

Siorki commented 5 years ago

Moved the first point (maximal patterns) to new issue #93.

Siorki commented 2 years ago

Disjoint patterns - those that have no character in common between them in any of their occurrences - can be substituted in any order. Replacing pattern A with a token, then pattern B with another token, or starting with pattern B then A yield equivalent results : while the token assignment will differ, and so will the order in the dictionary string added to the compressed text, both results will be of the same length. Using vocabulary from group theory, we will say that patterns A and B commute.

Similarly, if pattern A is fully included in pattern B, they will also commute, and the replacement order will not matter.

However, if those patterns overlap by only a few characters, A extending further on the left and Bon the right, then they will not usually commute : replacing pattern A with a token, then pattern B with another token, yield a different result than replacing B then A, which can be better (shorter), the same length, or worse (longer).

As an example, consider the following string to be packed :

bank*king-bank/king+bank-king,king*banking/banking;king

It has three maximal patterns (see definition above and in linked issue #93)

bank : 5 copies, length 4
king: 7 copies, length 4
banking: 2 copies, length 7

They overlap by the single character k. It is obvious that if you start by replacing bank with a token such as 0, some occurrences of king will also be removed at the same time. The resulting string features only 5 of them, instead of the initial 7 :

0*king-0/king+0-king,king*0ing/0ing;king

So after this step, we have the following options for the next pattern :

king: 5 copies, length 4
ing: 7 copies, length 3
0ing: 2 copies, length 4

Basically, the other pattern can be selected either fully, but with less occurrences, or have the overlapping part removed, with as many occurrences as before. Both options yield a lower gain than having 7 occurrences of king.

Had we started by replacing king first, we would similarly end up choosing between bank, banand ban0.

In this case, opting for king first, then ban, gives a shorter output than starting by bank then ing. This can easily be inferred from the fact that king features two more copies than bank in the initial string, so 2 more overlapping k will be removed while replacing by the token by choosing it first. But the choice is not always that simple.

Siorki / RegPack

Replace the crusher phase - explore combinations #7