Implement smarter qoipcrunch_encode

chocolate42 / qoipond

Lossless image format inspired by QOI “Quite OK Image” format

MIT License

Implement smarter qoipcrunch_encode #5

Open chocolate42 opened 2 years ago

chocolate42 commented 2 years ago

Currently qoipcrunch_encode encodes a set of combinations and picks the smallest representation, where the combinations have been pre-calculated and sorted by how well they compress a corpus (a combination of the QOI corpus and images-lance, a more alpha-orientated corpus). A smarter solution could do a stat pass over the input, process the stats to find the best combination, then do a single encode using the combination.

At first glance:

Looks like we'd have to do a run per index configuration (12 runs: 1 byte index 7/6/5/4/3/0 combined with 2 byte index 8/0). One monolithic solution doesn't seem viable, because varying the indexing changes which pixels are LUMA-encoded, which would make stat-gathering complex if not impossible
1 byte FIFO indexing is compatible with this approach if and when that gets implemented
2 byte FIFO indexing is not compatible with this approach (possibly incorrect here), as when to add to the cache depends on if a 1 byte encoding beats OP_INDEX8 to the punch. Possibly INDEX8FIFO could be reworked, possibly it's fine if INDEX8 remains a hash cache even if 1 byte indexing uses FIFO.

First glance smart encode entails:

Read input once for stat gathering
Process stats to find the best combination that uses a given indexing and return it
The calling function aggregates results of all index configs, picks the best combination and encodes it
INDEX8 being present complicates things slightly, as 1 byte LUMA-encoding may beat it to the punch. We'd have to split 1 byte LUMA-encodings from the rest, so test RUN1->INDEX7..3->LUMA1->INDEX8->LUMA2..4

edit: At second glance:

Stat pass including all ops including index ops. RLE is handled separately, non-run pixels have flags for each non-run op in a stat integer defining whether the pixel can be encoded with the op
Combination pass for every combination we're testing to find the size of the encoding. A pass involves calculating total RLE size and adding to total-other-ops size. Other-ops-size calculation involves generating masks for ops with lengths 1-4 in the combination, then testing every stat integer against the length masks.
A combination pass should be cheap, as most of the work is done in the stat pass. Currently each length mask 1-4 needs to be tested in turn in an if-else chain, which accounts for the majority of a combination pass's runtime. There may be a way to eliminate the chain
Concurrent combination passes are embarrassingly parallel, as are summing the other-ops if OpenMP overhead is small enough
Summing other-ops may be vectorisable, especially if the if-else chain can be eliminated
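
A minimal sketch of the stat integer and combination pass described above. The bit layout, the `combo_t` struct, the function names and the `fallback_len` for pixels that no listed op matches are all assumptions made for illustration, not the actual qoipond code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: bit i of a pixel's stat integer is set when
   op i can encode that pixel. */
typedef struct {
    uint64_t len_mask[4]; /* len_mask[k]: ops in this combination that emit k+1 bytes */
    size_t fallback_len;  /* assumed cost of a pixel that no listed op can encode */
} combo_t;

/* Combination pass over the non-run pixels. The shortest encodings are
   tested first, which is the if-else chain mentioned above. */
static size_t combo_other_ops_size(const uint64_t *stats, size_t n, const combo_t *c)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = stats[i];
        if (s & c->len_mask[0]) total += 1;
        else if (s & c->len_mask[1]) total += 2;
        else if (s & c->len_mask[2]) total += 3;
        else if (s & c->len_mask[3]) total += 4;
        else total += c->fallback_len;
    }
    return total;
}

/* RLE bytes are counted separately and added to the other-ops size. */
static size_t combo_total_size(const uint64_t *stats, size_t n,
                               size_t rle_bytes, const combo_t *c)
{
    return rle_bytes + combo_other_ops_size(stats, n, c);
}
```

The stat array is built once per image; each candidate combination then only costs one cheap loop over it.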

chocolate42 commented 2 years ago

Initial smart crunch function stats:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.053      6.221       114.52        74.61      401  24.4%: qoip(0).threads(1).entropy(0).smart(0)
    4.164     18.137       111.46        25.59      398  24.2%: qoip(1).threads(1).entropy(0).smart(0)
    4.106     32.747       113.05        14.17      393  24.0%: qoip(2).threads(1).entropy(0).smart(0)
    4.105     57.542       113.07         8.07      392  23.9%: qoip(3).threads(1).entropy(0).smart(0)
    4.088    107.643       113.55         4.31      390  23.8%: qoip(4).threads(1).entropy(0).smart(0)
    4.093    212.393       113.40         2.19      390  23.8%: qoip(5).threads(1).entropy(0).smart(0)
    4.079    409.721       113.80         1.13      390  23.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.006     24.224       115.87        19.16      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.084     25.670       113.67        18.08      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.053     26.903       114.54        17.25      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.042     29.296       114.85        15.84      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.048     33.813       114.65        13.73      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.025     42.536       115.33        10.91      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.017     60.857       115.56         7.63      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.365     31.705       160.10       128.09     1664  10.7%: qoip(0).threads(1).entropy(0).smart(0)
   25.698     71.780       158.03        56.58     1566  10.1%: qoip(1).threads(1).entropy(0).smart(0)
   25.604    143.126       158.61        28.37     1541   9.9%: qoip(2).threads(1).entropy(0).smart(0)
   25.436    284.000       159.66        14.30     1531   9.8%: qoip(3).threads(1).entropy(0).smart(0)
   25.564    538.957       158.86         7.54     1528   9.8%: qoip(4).threads(1).entropy(0).smart(0)
   25.362   1042.374       160.13         3.90     1527   9.8%: qoip(5).threads(1).entropy(0).smart(0)
   25.252   2051.279       160.82         1.98     1527   9.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.372    120.906       160.06        33.59     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.833    123.804       157.21        32.80     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.575    125.917       158.79        32.25     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.382    132.474       160.00        30.66     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.262    144.818       160.76        28.04     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.171    168.852       161.34        24.05     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.105    218.135       161.76        18.62     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)

chocolate42 commented 2 years ago

Potential stat pass optimisations:

[x] Smarter LUMA handling. Instead of calling a function per op, many ops can be checked at once by exploiting the fact that LUMA ops are often a superset of other LUMA ops
[ ] Split stat pass into single-threaded and concurrent sections
[ ] SIMD in stat pass
- May be possible for indexing
- May be possible for LUMA
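
One way the superset idea could look, as a sketch only: work out how many bits each channel delta needs once, then use a precomputed table to flag every LUMA op that is wide enough in a single lookup. The `luma_op_t` description, the symmetric signed ranges and all function names are assumptions; the real ops may use different ranges or biases:

```c
#include <stdint.h>

#define MAX_BITS 9 /* needs of 1..8 bits cover a wrapped 8-bit delta; index 0 unused */

typedef struct { int g_bits, rb_bits; } luma_op_t; /* hypothetical op description */

/* Smallest n with -2^(n-1) <= v < 2^(n-1), for v in -128..127 */
static int bits_needed(int v)
{
    int n = 1;
    while (v < -(1 << (n - 1)) || v >= (1 << (n - 1))) ++n;
    return n;
}

/* fit[g][rb] = bitmask of ops that can hold a pixel needing g green bits
   and rb red/blue bits. Built once per op list (at most 32 ops here). */
static void build_fit_table(const luma_op_t *ops, int op_count,
                            uint32_t fit[MAX_BITS][MAX_BITS])
{
    for (int g = 0; g < MAX_BITS; ++g)
        for (int rb = 0; rb < MAX_BITS; ++rb) {
            fit[g][rb] = 0;
            for (int i = 0; i < op_count; ++i)
                if (ops[i].g_bits >= g && ops[i].rb_bits >= rb)
                    fit[g][rb] |= UINT32_C(1) << i;
        }
}

/* vg is the green delta; vg_r and vg_b are the red and blue deltas relative
   to green, all already wrapped to -128..127. One lookup flags every op. */
static uint32_t luma_flags(int vg, int vg_r, int vg_b,
                           uint32_t fit[MAX_BITS][MAX_BITS])
{
    int need_r = bits_needed(vg_r), need_b = bits_needed(vg_b);
    int need_rb = need_r > need_b ? need_r : need_b;
    return fit[bits_needed(vg)][need_rb];
}
```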

Potential combination pass optimisations:

[x] Somehow remove branches
[ ] SIMD (especially if branches can be eliminated) (edit: AVX512 has lzcnt functions that would work, but AVX2 apparently doesn't, so a different algorithm has to be found for AVX2. AVX2 is the main interest because most modern x86 supports it; Intel just removed AVX512 from consumer CPUs and AMD don't support it yet)
[ ] Caching of length 1 op total sizes so work doesn't need to be redone. Length 1 ops if present are guaranteed to be used if matching
- [ ] Taken a step further, caching of length 2 sizes (in the presence of combinations of length 1 ops)
- [ ] ... caching of length 3 sizes
- The above caching could possibly be replaced with recursion? Caching and concurrency could be a minefield, but then so could concurrency and recursion
[ ] Concurrency
- [x] Trivial OpenMP combination concurrency, OR
- [ ] Concurrency per-length-1-set. If there are enough length 1 sets to saturate threads, it would be a way to implicitly cache that may be simpler than proper caching in the presence of concurrency
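
A rough sketch of the trivial OpenMP option: every combination pass only reads the shared stats, so the passes can be split across threads and the minimum picked afterwards. `estimate_size` is a deliberately simplified stand-in for a real combination pass, and all names here are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a combination pass: a pixel costs 1 byte if any
   op in the combination matches it, otherwise 4. Illustrative only. */
static size_t estimate_size(const uint32_t *stats, size_t n, uint32_t combo_mask)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += (stats[i] & combo_mask) ? 1 : 4;
    return total;
}

/* Evaluate every candidate independently, then pick the smallest serially.
   When built without OpenMP the pragma is compiled out and the loop runs
   on a single thread. */
static int best_combination(const uint32_t *stats, size_t n,
                            const uint32_t *combos, int combo_count, size_t *sizes)
{
    int i, best = 0;
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (i = 0; i < combo_count; ++i)
        sizes[i] = estimate_size(stats, n, combos[i]);
    for (i = 1; i < combo_count; ++i)
        if (sizes[i] < sizes[best]) best = i;
    return best;
}
```

Each thread writes only its own `sizes[i]` slots, so no locking is needed; the cost is thread start-up, which matters when individual passes are very short.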

chocolate42 commented 2 years ago

Post smarter LUMA handling in stat pass:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.177     18.876       111.13        24.59      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.129     19.461       112.41        23.85      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.132     20.534       112.35        22.60      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.089     23.097       113.52        20.10      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.079     27.716       113.81        16.75      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.088     37.323       113.53        12.44      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.028     56.392       115.24         8.23      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.657     91.775       158.29        44.25     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.648     94.039       158.34        43.19     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.502     96.235       159.24        42.20     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.357    103.525       160.16        39.23     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.362    117.562       160.13        34.54     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.143    144.470       161.52        28.11     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.245    199.111       160.87        20.40     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)

chocolate42 commented 2 years ago

Post trivial concurrency for combination passes with OpenMP:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    1.949      2.339       238.20       198.45      463  28.2%: qoi-c04a975
    4.044      6.202       114.77        74.85      401  24.4%: qoip(0).threads(8).entropy(0).smart(1)
    4.079     19.305       113.78        24.04      398  24.2%: qoip(1).threads(8).entropy(0).smart(1)
    4.158     19.973       111.64        23.24      393  24.0%: qoip(2).threads(8).entropy(0).smart(1)
    4.208     30.461       110.32        15.24      392  23.9%: qoip(3).threads(8).entropy(0).smart(1)
    4.258     32.080       109.02        14.47      390  23.8%: qoip(4).threads(8).entropy(0).smart(1)
    4.324     35.779       107.34        12.97      390  23.8%: qoip(5).threads(8).entropy(0).smart(1)
    4.406     41.746       105.36        11.12      390  23.8%: qoip(6).threads(8).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   14.939     13.446       271.85       302.04     2109  13.6%: qoi-c04a975
   25.660     31.910       158.26       127.26     1664  10.7%: qoip(0).threads(8).entropy(0).smart(1)
   26.069     94.753       155.78        42.86     1566  10.1%: qoip(1).threads(8).entropy(0).smart(1)
   26.034     94.895       155.99        42.80     1541   9.9%: qoip(2).threads(8).entropy(0).smart(1)
   25.906    107.086       156.76        37.92     1531   9.8%: qoip(3).threads(8).entropy(0).smart(1)
   25.901    111.122       156.79        36.55     1528   9.8%: qoip(4).threads(8).entropy(0).smart(1)
   25.815    119.115       157.31        34.09     1527   9.8%: qoip(5).threads(8).entropy(0).smart(1)
   25.891    134.683       156.85        30.15     1527   9.8%: qoip(6).threads(8).entropy(0).smart(1)

It's puzzling that the middle effort levels don't have better results, whereas effort level 6 is a big improvement over level 5. Maybe the single-threaded stat pass is heavily hampered by the slower clocks from a recent SMT chunk, overcome only when there's enough SMT work to justify existing.

chocolate42 commented 2 years ago

Post eliminating branching in combination passes:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    1.957      2.344       237.22       198.06      463  28.2%: qoi-c04a975
    4.085      6.334       113.62        73.28      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.129     13.470       112.42        34.46      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.087     13.757       113.57        33.74      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.063     14.378       114.24        32.28      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.073     15.670       113.96        29.62      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.080     18.138       113.76        25.59      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.068     23.192       114.11        20.01      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   15.194     13.696       267.28       296.52     2109  13.6%: qoi-c04a975
   25.444     32.182       159.61       126.19     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.974     68.963       156.35        58.89     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.569     69.303       158.83        58.60     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.302     72.116       160.50        56.31     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.313     78.076       160.44        52.01     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.343     90.290       160.24        44.98     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.536    113.415       159.04        35.81     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.278     27.860       108.51        16.66      390  23.8%: qoip(6).threads(8).entropy(0).smart(1)

Effort level 0 now has an escape hatch; it's representative of roughly how long a single encode takes
Estimating based on images, a current stat pass takes ~10% longer than a normal encode, and a current combination pass is ~44x quicker than a normal encode
Because the combination passes are much more efficient now, concurrency actually hurts performance even for effort level 6, as the clocks are lowered for the stat pass thanks to not waiting long enough between image benchmarks for the CPU to recover from SMT. It might take trialling hundreds of combinations for concurrency (as implemented) to be worthwhile for batch processing. It should be worthwhile for burst processing now, as most of the single-threaded work is done first.
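
For illustration, one possible way to drop the if-else chain from a combination pass, assuming each pixel has a stat integer whose set bits mark the ops that can encode it. This uses cumulative length masks and is not necessarily the method used in the repository; the function name and `fallback_len` are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* len_mask[k] holds the ops in the combination that emit k+1 bytes.
   fallback_len is the assumed cost of an unmatched pixel and must be >= 4. */
static size_t combo_size_branchless(const uint64_t *stats, size_t n,
                                    const uint64_t len_mask[4], size_t fallback_len)
{
    /* Cumulative masks: uptoK holds every op of length <= K */
    uint64_t upto1 = len_mask[0];
    uint64_t upto2 = upto1 | len_mask[1];
    uint64_t upto3 = upto2 | len_mask[2];
    uint64_t upto4 = upto3 | len_mask[3];
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = stats[i];
        /* Each comparison yields 0 or 1, so the length is a plain sum:
           start at 1 byte and add one for every length class that misses. */
        total += 1 + ((s & upto1) == 0) + ((s & upto2) == 0) + ((s & upto3) == 0)
               + ((s & upto4) == 0) * (fallback_len - 4);
    }
    return total;
}
```

With no data-dependent branches in the loop body, a compiler has a better chance of auto-vectorising the sum.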

chocolate42 commented 2 years ago

Mild smart crunch optimisations:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    1.945      2.335       238.71       198.81      463  28.2%: qoi-c04a975
    4.017      6.304       115.54        73.63      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.002     12.927       115.99        35.91      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.027     13.268       115.27        34.98      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.112     14.187       112.88        32.72      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.065     15.308       114.20        30.32      391  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.077     17.892       113.87        25.94      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.100     23.038       113.20        20.15      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

Added many LUMA ops, LUMA ops are now generated instead of manually written. There is now an RGB LUMA op for every data bit length from 6..22 bits, and an RGBA op for every data bit length from 8..29 bits. There are now 50 explicit ops in total, 54 including the implicit ops
Replaced OP_LUMA4_7777 with LUMA4_7876. This is done to fit the pattern of how other LUMA ops are generated, which allows the stat pass to do less branching and perform better
Haven't updated the qoipcrunch_unified list. Removing OP_LUMA4_7777 looks like a mild regression when comparing effort level 4, however this is almost certainly because the list hasn't been retrained
Haven't updated smart crunch or dumb crunch to use the new ops (aside from using 7876 instead of 7777)
Smart crunch stat pass now uses calculated magnitude of avg* variables to reduce comparisons. This introduces extra forced branches and is only worthwhile for the stat pass which does a lot of comparisons (not worth it as-is for op encode functions)
Default opstring disabled for now
Tidy qoipbench.c
Better memory management in qoipconv.c
A custom string of "t" now uses a test string which is a combination for every explicit op

chocolate42 commented 2 years ago

A smarter smart function at first glance:

Ops are grouped according to their byte length and what they target (1..3 byte RGB ops, 2..4 byte RGBA ops, 1..2 byte INDEX ops, OP_A). There is large overlap between ops in a group, so at most one op is chosen from each group
Do a stat pass as before
Detect if alpha changes to split RGB and RGBA properly (RGB passed as 4 channel will be correctly detected as RGB, something we can do now that multiple passes are used)
- Most images are RGB, this reduces the search space for them by a few orders of magnitude
- To the point where an exhaustive RGB search with combination passes is reasonably practical
If RGBA source, do an OP_A pass to see if it's useful and either include it or don't (using some metric of efficiency)
We can probably ignore delta/diff/luma ops with 8 bit tags, for the cost of one more opcode we can use a better 7-bit-tag op. Doubling op coverage for one extra op always seems good
- Might be able to get away with the same logic to ignore 7 bit tags too
Do an index pass to potentially eliminate inefficient index caching ops

With the ops we have now (50 explicit) the search space would look something like this:

Worst-case combination search space
                         fixed_OP_A   optional_OP_A fixed_OP_A      optional_OP_A
                         fixed_INDEX8 fixed_INDEX8  optional_INDEX8 optional_INDEX8
RGB DELTA FULL             2016          2016          4032            4032
RGB DELTA minus 8 bit      1512          1512          3024            3024
RGB DELTA minus 8/7 bit    1080          1080          2160            2160
RGBA DELTA FULL          774144       1548288       1548288         3096576
RGBA DELTA minus 8 bit   370440        740880        740880         1481760
RGBA DELTA minus 8/7 bit 155520        311040        311040          622080

An exhaustive RGB search taking 4032 combination passes is not unreasonable for an extreme setting (or even a high setting)
An exhaustive RGBA search taking up to 3.1e6 combination passes is unreasonable even for an extreme setting (>1 hour for the average "images" image)
Even 7.4e5 (no 8 bit, fixed OP_A optional INDEX8 or vice-versa) or 7.7e5 (fixed OP_A fixed INDEX8) RGBA is excessive. Estimating with the average "images" run that would be ~30 minutes for one image
1.6e5 (fixed OP_A, fixed INDEX8, no 8/7 bit ops) RGBA is excessive but perhaps fine for an extreme setting (~7 minutes per "images" image)
Every op we can eliminate from the search space before searching reduces this worst-case search space by at least 7/8, as all ops have a multiplicative effect on the worst case

A smarterer smart function at first glance: All smart functions to date have been concerned with finding exact representations and picking the smallest. A different smart function could compare stats from individual ops and possibly other things to make heuristic guesses at near-optimal combinations:

Greedily pick the most efficient ops from each group (1 byte RGB ops, 2 byte RGB ops, etc) and build a search space from them
Greedily pick additively
Iterate on a combination by replacing ops one at a time and doing a combination pass to see if there is an improvement.
LUMA processing has been split into the discrete ops we are using. If instead we did something clever involving a continuous distribution we might be able to quickly find optimal LUMA combinations without lots of trial and error. This is made more complicated by having non-LUMA 1 byte delta ops, but maybe that just multiplies the required effort by the number of 1 byte delta ops
Multiple cheap approaches could be tried simultaneously with the best result used
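
The "replace ops one at a time" idea is a local search. A minimal sketch, assuming one op is chosen per group and that a combination pass is available as a cost callback; `GROUPS`, `combo_cost_fn` and `toy_cost` are hypothetical names, and `toy_cost` only stands in for a real combination pass:

```c
#include <stddef.h>

#define GROUPS 3 /* e.g. 1-byte, 2-byte and 3-byte RGB ops; the count is illustrative */

/* Hypothetical callback: runs one combination pass, returns the encoded size */
typedef size_t (*combo_cost_fn)(const int choice[GROUPS], void *ctx);

/* Swap one group's op at a time, keep any change that shrinks the output,
   and stop when a full sweep finds no improvement. The result is a local
   optimum, not necessarily the global one. */
static size_t refine_combination(int choice[GROUPS], const int options[GROUPS],
                                 combo_cost_fn cost, void *ctx)
{
    size_t best = cost(choice, ctx);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int g = 0; g < GROUPS; ++g) {
            int keep = choice[g];
            for (int o = 0; o < options[g]; ++o) {
                choice[g] = o;
                size_t sz = cost(choice, ctx);
                if (sz < best) { best = sz; keep = o; improved = 1; }
            }
            choice[g] = keep;
        }
    }
    return best;
}

/* Toy stand-in for a combination pass, for demonstration only */
static size_t toy_cost(const int choice[GROUPS], void *ctx)
{
    const int *target = ctx;
    size_t sz = 100;
    for (int g = 0; g < GROUPS; ++g) {
        int d = choice[g] - target[g];
        sz += (size_t)(d * d);
    }
    return sz;
}
```

Each sweep costs one combination pass per option per group, so the work grows with the sum of the group sizes rather than their product.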

chocolate42 commented 2 years ago

Very early days results for a smarter crunch function that tests with all the new ops:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.017    170.711       115.55         2.72      387  23.6%: qoip(5).threads(1).entropy(0).smart(2)

Encode time has ballooned again, but the search space has ballooned faster (RGBA input tries ~200K combinations in the above test, RGB input only has a few thousand combinations to try). Encode time only increasing by an order of magnitude when the search space has increased by 3 orders of magnitude seems fine, and no optimisation has been attempted yet. tl;dr the smarter-crunch-function uses tables of pixel logs instead of using a stat integer for each pixel. There is a table for each index1/index2/delta1 combination, and every pixel not handled by a run1/run2/index1/index2/delta1 op is put into the table, which essentially aggregates all pixels that can be handled with the same LUMA op.

edit: OP_A handling fixed and done in stat pass. Naively clamping lumalog values to (a=0..8, g=3..8, rb=2..8) roughly halves the size of the lumalog tables which does this to the encode time:

decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.031     91.899       115.16         5.05      387  23.6%: qoip(5).threads(1).entropy(0).smart(2)
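
A sketch of how a clamped lumalog table could be laid out, under the assumption that each entry counts pixels by the number of bits needed for alpha, green and red/blue, and that values below the clamp minimum are simply raised to it. The type and function names are hypothetical. With the clamps above the table has 9*6*7 = 378 entries, against 729 if every axis ran 0..8:

```c
#include <stdint.h>
#include <string.h>

enum { A_MIN = 0, G_MIN = 3, RB_MIN = 2, BITS_MAX = 8 };
enum { A_N = BITS_MAX - A_MIN + 1,   /* 9 */
       G_N = BITS_MAX - G_MIN + 1,   /* 6 */
       RB_N = BITS_MAX - RB_MIN + 1  /* 7 */ };

typedef struct { uint32_t count[A_N][G_N][RB_N]; } lumalog_t;

static int clamp_bits(int v, int lo)
{
    return v < lo ? lo : (v > BITS_MAX ? BITS_MAX : v);
}

static void lumalog_clear(lumalog_t *t) { memset(t, 0, sizeof *t); }

/* Log one pixel by the bits each channel group needs */
static void lumalog_add(lumalog_t *t, int a, int g, int rb)
{
    t->count[clamp_bits(a, A_MIN) - A_MIN]
            [clamp_bits(g, G_MIN) - G_MIN]
            [clamp_bits(rb, RB_MIN) - RB_MIN]++;
}

/* Number of logged pixels a LUMA op with the given bit capacity can encode */
static uint32_t lumalog_covered(const lumalog_t *t, int a_bits, int g_bits, int rb_bits)
{
    uint32_t n = 0;
    for (int a = A_MIN; a <= a_bits && a <= BITS_MAX; ++a)
        for (int g = G_MIN; g <= g_bits && g <= BITS_MAX; ++g)
            for (int rb = RB_MIN; rb <= rb_bits && rb <= BITS_MAX; ++rb)
                n += t->count[a - A_MIN][g - G_MIN][rb - RB_MIN];
    return n;
}
```

Testing a LUMA op then sums a few hundred counters at most instead of revisiting every pixel.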