chocolate42 / qoipond

Lossless image format inspired by QOI “Quite OK Image” format
MIT License
1 stars 1 forks source link

Implement smarter qoipcrunch_encode #5

Open chocolate42 opened 2 years ago

chocolate42 commented 2 years ago

Currently qoipcrunch_encode encodes a set of combinations and picks the smallest representation, where the combinations have been pre-calculated and sorted by how well they compress a corpus (a combination of the QOI corpus and images-lance, a more alpha-orientated corpus). A smarter solution could do a stat pass over the input, process the stats to find the best combination, then do a single encode using the combination.

At first glance:

First glance smart encode entails:

edit: At second glance:

chocolate42 commented 2 years ago

Initial smart crunch function stats:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.053      6.221       114.52        74.61      401  24.4%: qoip(0).threads(1).entropy(0).smart(0)
    4.164     18.137       111.46        25.59      398  24.2%: qoip(1).threads(1).entropy(0).smart(0)
    4.106     32.747       113.05        14.17      393  24.0%: qoip(2).threads(1).entropy(0).smart(0)
    4.105     57.542       113.07         8.07      392  23.9%: qoip(3).threads(1).entropy(0).smart(0)
    4.088    107.643       113.55         4.31      390  23.8%: qoip(4).threads(1).entropy(0).smart(0)
    4.093    212.393       113.40         2.19      390  23.8%: qoip(5).threads(1).entropy(0).smart(0)
    4.079    409.721       113.80         1.13      390  23.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.006     24.224       115.87        19.16      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.084     25.670       113.67        18.08      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.053     26.903       114.54        17.25      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.042     29.296       114.85        15.84      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.048     33.813       114.65        13.73      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.025     42.536       115.33        10.91      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.017     60.857       115.56         7.63      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.365     31.705       160.10       128.09     1664  10.7%: qoip(0).threads(1).entropy(0).smart(0)
   25.698     71.780       158.03        56.58     1566  10.1%: qoip(1).threads(1).entropy(0).smart(0)
   25.604    143.126       158.61        28.37     1541   9.9%: qoip(2).threads(1).entropy(0).smart(0)
   25.436    284.000       159.66        14.30     1531   9.8%: qoip(3).threads(1).entropy(0).smart(0)
   25.564    538.957       158.86         7.54     1528   9.8%: qoip(4).threads(1).entropy(0).smart(0)
   25.362   1042.374       160.13         3.90     1527   9.8%: qoip(5).threads(1).entropy(0).smart(0)
   25.252   2051.279       160.82         1.98     1527   9.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.372    120.906       160.06        33.59     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.833    123.804       157.21        32.80     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.575    125.917       158.79        32.25     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.382    132.474       160.00        30.66     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.262    144.818       160.76        28.04     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.171    168.852       161.34        24.05     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.105    218.135       161.76        18.62     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)
chocolate42 commented 2 years ago

Potential stat pass optimisations:

Potential combination pass optimisations:

chocolate42 commented 2 years ago

Post smarter LUMA handling in stat pass:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.177     18.876       111.13        24.59      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.129     19.461       112.41        23.85      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.132     20.534       112.35        22.60      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.089     23.097       113.52        20.10      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.079     27.716       113.81        16.75      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.088     37.323       113.53        12.44      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.028     56.392       115.24         8.23      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.657     91.775       158.29        44.25     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.648     94.039       158.34        43.19     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.502     96.235       159.24        42.20     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.357    103.525       160.16        39.23     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.362    117.562       160.13        34.54     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.143    144.470       161.52        28.11     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.245    199.111       160.87        20.40     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)
chocolate42 commented 2 years ago

Post trivial concurrency for combination passes with OpenMP:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    1.949      2.339       238.20       198.45      463  28.2%: qoi-c04a975
    4.044      6.202       114.77        74.85      401  24.4%: qoip(0).threads(8).entropy(0).smart(1)
    4.079     19.305       113.78        24.04      398  24.2%: qoip(1).threads(8).entropy(0).smart(1)
    4.158     19.973       111.64        23.24      393  24.0%: qoip(2).threads(8).entropy(0).smart(1)
    4.208     30.461       110.32        15.24      392  23.9%: qoip(3).threads(8).entropy(0).smart(1)
    4.258     32.080       109.02        14.47      390  23.8%: qoip(4).threads(8).entropy(0).smart(1)
    4.324     35.779       107.34        12.97      390  23.8%: qoip(5).threads(8).entropy(0).smart(1)
    4.406     41.746       105.36        11.12      390  23.8%: qoip(6).threads(8).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   14.939     13.446       271.85       302.04     2109  13.6%: qoi-c04a975
   25.660     31.910       158.26       127.26     1664  10.7%: qoip(0).threads(8).entropy(0).smart(1)
   26.069     94.753       155.78        42.86     1566  10.1%: qoip(1).threads(8).entropy(0).smart(1)
   26.034     94.895       155.99        42.80     1541   9.9%: qoip(2).threads(8).entropy(0).smart(1)
   25.906    107.086       156.76        37.92     1531   9.8%: qoip(3).threads(8).entropy(0).smart(1)
   25.901    111.122       156.79        36.55     1528   9.8%: qoip(4).threads(8).entropy(0).smart(1)
   25.815    119.115       157.31        34.09     1527   9.8%: qoip(5).threads(8).entropy(0).smart(1)
   25.891    134.683       156.85        30.15     1527   9.8%: qoip(6).threads(8).entropy(0).smart(1)

Puzzling why the middle effort levels don't have better results, effort level 6 is a big improvement over level 5. Maybe the single-threaded stat pass is heavily hampered by the slower clocks from a recent SMT chunk, overcome only when there's enough SMT work to justify existing.

chocolate42 commented 2 years ago

Post eliminating branching in combination passes:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    1.957      2.344       237.22       198.06      463  28.2%: qoi-c04a975
    4.085      6.334       113.62        73.28      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.129     13.470       112.42        34.46      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.087     13.757       113.57        33.74      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.063     14.378       114.24        32.28      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.073     15.670       113.96        29.62      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.080     18.138       113.76        25.59      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.068     23.192       114.11        20.01      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   15.194     13.696       267.28       296.52     2109  13.6%: qoi-c04a975
   25.444     32.182       159.61       126.19     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.974     68.963       156.35        58.89     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.569     69.303       158.83        58.60     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.302     72.116       160.50        56.31     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.313     78.076       160.44        52.01     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.343     90.290       160.24        44.98     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.536    113.415       159.04        35.81     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    4.278     27.860       108.51        16.66      390  23.8%: qoip(6).threads(8).entropy(0).smart(1)
chocolate42 commented 2 years ago

Mild smart crunch optimisations:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    1.945      2.335       238.71       198.81      463  28.2%: qoi-c04a975
    4.017      6.304       115.54        73.63      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.002     12.927       115.99        35.91      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.027     13.268       115.27        34.98      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.112     14.187       112.88        32.72      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.065     15.308       114.20        30.32      391  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.077     17.892       113.87        25.94      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.100     23.038       113.20        20.15      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)
chocolate42 commented 2 years ago

A smarter smart function at first glance:

With the ops we have now (50 explicit) the search space would look something like this:

Worst-case combination search space 
                         fixed_OP_A   optional_OP_A fixed OP_A      optional_OP_A
                         fixed_INDEX8 fixed_INDEX8  optional_INDEX8 optional_INDEX8
RGB DELTA FULL             2016          2016          4032            4032
RGB DELTA minus 8 bit      1512          1512          3024            3024
RGB DELTA minus 8/7 bit    1080          1080          2160            2160
RGBA DELTA FULL          774144       1548288       1548288         3096576
RGBA DELTA minus 8 bit   370440        740880        740880         1481760
RGBA DELTA minus 8/7 bit 155520        311040        311040          622080

A smarterer smart function at first glance: All smart functions to date have been concerned with finding exact representations and picking the smallest. A different smart function could compare stats from individual ops and possibly other things to make heuristic guesses at near-optimal combinations:

chocolate42 commented 2 years ago

Very early days results for a smarter crunch function that tests with all the new ops:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.017    170.711       115.55         2.72      387  23.6%: qoip(5).threads(1).entropy(0).smart(2)

Encode time has ballooned again, but the search space has ballooned faster (RGBA input tries ~200K combinations in the above test, RGB input only has a few thousand combinations to try). Encode time only increasing by an order of magnitude when the search space has increased by 3 orders of magnitude seems fine, and no optimisation has been attempted yet. tl;dr the smarter-crunch-function uses tables of pixel logs instead of using a stat integer for each pixel. There is a table for each index1/index2/delta1 combination, and every pixel not handled by a run1/run2/index1/index2/delta1 op is put into the table, which essentially aggregates all pixels that can be handled with the same LUMA op.

edit: OP_A handling fixed and done in stat pass. Naively clamping lumalog values to (a=0..8, g=3..8, rb=2..8) roughly halves the size of the lumalog tables which does this to the encode time:

decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.031     91.899       115.16         5.05      387  23.6%: qoip(5).threads(1).entropy(0).smart(2)