Open chocolate42 opened 2 years ago
Initial smart crunch function stats:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
4.053 6.221 114.52 74.61 401 24.4%: qoip(0).threads(1).entropy(0).smart(0)
4.164 18.137 111.46 25.59 398 24.2%: qoip(1).threads(1).entropy(0).smart(0)
4.106 32.747 113.05 14.17 393 24.0%: qoip(2).threads(1).entropy(0).smart(0)
4.105 57.542 113.07 8.07 392 23.9%: qoip(3).threads(1).entropy(0).smart(0)
4.088 107.643 113.55 4.31 390 23.8%: qoip(4).threads(1).entropy(0).smart(0)
4.093 212.393 113.40 2.19 390 23.8%: qoip(5).threads(1).entropy(0).smart(0)
4.079 409.721 113.80 1.13 390 23.8%: qoip(6).threads(1).entropy(0).smart(0)
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
4.006 24.224 115.87 19.16 401 24.4%: qoip(0).threads(1).entropy(0).smart(1)
4.084 25.670 113.67 18.08 398 24.2%: qoip(1).threads(1).entropy(0).smart(1)
4.053 26.903 114.54 17.25 393 24.0%: qoip(2).threads(1).entropy(0).smart(1)
4.042 29.296 114.85 15.84 392 23.9%: qoip(3).threads(1).entropy(0).smart(1)
4.048 33.813 114.65 13.73 390 23.8%: qoip(4).threads(1).entropy(0).smart(1)
4.025 42.536 115.33 10.91 390 23.8%: qoip(5).threads(1).entropy(0).smart(1)
4.017 60.857 115.56 7.63 390 23.8%: qoip(6).threads(1).entropy(0).smart(1)
# Grand total for images-lance
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
25.365 31.705 160.10 128.09 1664 10.7%: qoip(0).threads(1).entropy(0).smart(0)
25.698 71.780 158.03 56.58 1566 10.1%: qoip(1).threads(1).entropy(0).smart(0)
25.604 143.126 158.61 28.37 1541 9.9%: qoip(2).threads(1).entropy(0).smart(0)
25.436 284.000 159.66 14.30 1531 9.8%: qoip(3).threads(1).entropy(0).smart(0)
25.564 538.957 158.86 7.54 1528 9.8%: qoip(4).threads(1).entropy(0).smart(0)
25.362 1042.374 160.13 3.90 1527 9.8%: qoip(5).threads(1).entropy(0).smart(0)
25.252 2051.279 160.82 1.98 1527 9.8%: qoip(6).threads(1).entropy(0).smart(0)
# Grand total for images-lance
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
25.372 120.906 160.06 33.59 1664 10.7%: qoip(0).threads(1).entropy(0).smart(1)
25.833 123.804 157.21 32.80 1566 10.1%: qoip(1).threads(1).entropy(0).smart(1)
25.575 125.917 158.79 32.25 1541 9.9%: qoip(2).threads(1).entropy(0).smart(1)
25.382 132.474 160.00 30.66 1531 9.8%: qoip(3).threads(1).entropy(0).smart(1)
25.262 144.818 160.76 28.04 1528 9.8%: qoip(4).threads(1).entropy(0).smart(1)
25.171 168.852 161.34 24.05 1527 9.8%: qoip(5).threads(1).entropy(0).smart(1)
25.105 218.135 161.76 18.62 1527 9.8%: qoip(6).threads(1).entropy(0).smart(1)
Potential stat pass optimisations:
Potential combination pass optimisations:
Post smarter LUMA handling in stat pass:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
4.177 18.876 111.13 24.59 401 24.4%: qoip(0).threads(1).entropy(0).smart(1)
4.129 19.461 112.41 23.85 398 24.2%: qoip(1).threads(1).entropy(0).smart(1)
4.132 20.534 112.35 22.60 393 24.0%: qoip(2).threads(1).entropy(0).smart(1)
4.089 23.097 113.52 20.10 392 23.9%: qoip(3).threads(1).entropy(0).smart(1)
4.079 27.716 113.81 16.75 390 23.8%: qoip(4).threads(1).entropy(0).smart(1)
4.088 37.323 113.53 12.44 390 23.8%: qoip(5).threads(1).entropy(0).smart(1)
4.028 56.392 115.24 8.23 390 23.8%: qoip(6).threads(1).entropy(0).smart(1)
# Grand total for images-lance
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
25.657 91.775 158.29 44.25 1664 10.7%: qoip(0).threads(1).entropy(0).smart(1)
25.648 94.039 158.34 43.19 1566 10.1%: qoip(1).threads(1).entropy(0).smart(1)
25.502 96.235 159.24 42.20 1541 9.9%: qoip(2).threads(1).entropy(0).smart(1)
25.357 103.525 160.16 39.23 1531 9.8%: qoip(3).threads(1).entropy(0).smart(1)
25.362 117.562 160.13 34.54 1528 9.8%: qoip(4).threads(1).entropy(0).smart(1)
25.143 144.470 161.52 28.11 1527 9.8%: qoip(5).threads(1).entropy(0).smart(1)
25.245 199.111 160.87 20.40 1527 9.8%: qoip(6).threads(1).entropy(0).smart(1)
Post trivial concurrency for combination passes with OpenMP:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
1.949 2.339 238.20 198.45 463 28.2%: qoi-c04a975
4.044 6.202 114.77 74.85 401 24.4%: qoip(0).threads(8).entropy(0).smart(1)
4.079 19.305 113.78 24.04 398 24.2%: qoip(1).threads(8).entropy(0).smart(1)
4.158 19.973 111.64 23.24 393 24.0%: qoip(2).threads(8).entropy(0).smart(1)
4.208 30.461 110.32 15.24 392 23.9%: qoip(3).threads(8).entropy(0).smart(1)
4.258 32.080 109.02 14.47 390 23.8%: qoip(4).threads(8).entropy(0).smart(1)
4.324 35.779 107.34 12.97 390 23.8%: qoip(5).threads(8).entropy(0).smart(1)
4.406 41.746 105.36 11.12 390 23.8%: qoip(6).threads(8).entropy(0).smart(1)
# Grand total for images-lance
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
14.939 13.446 271.85 302.04 2109 13.6%: qoi-c04a975
25.660 31.910 158.26 127.26 1664 10.7%: qoip(0).threads(8).entropy(0).smart(1)
26.069 94.753 155.78 42.86 1566 10.1%: qoip(1).threads(8).entropy(0).smart(1)
26.034 94.895 155.99 42.80 1541 9.9%: qoip(2).threads(8).entropy(0).smart(1)
25.906 107.086 156.76 37.92 1531 9.8%: qoip(3).threads(8).entropy(0).smart(1)
25.901 111.122 156.79 36.55 1528 9.8%: qoip(4).threads(8).entropy(0).smart(1)
25.815 119.115 157.31 34.09 1527 9.8%: qoip(5).threads(8).entropy(0).smart(1)
25.891 134.683 156.85 30.15 1527 9.8%: qoip(6).threads(8).entropy(0).smart(1)
Puzzling why the middle effort levels don't have better results, effort level 6 is a big improvement over level 5. Maybe the single-threaded stat pass is heavily hampered by the slower clocks from a recent SMT chunk, overcome only when there's enough SMT work to justify existing.
Post eliminating branching in combination passes:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size kb rate
1.957 2.344 237.22 198.06 463 28.2%: qoi-c04a975
4.085 6.334 113.62 73.28 401 24.4%: qoip(0).threads(1).entropy(0).smart(1)
4.129 13.470 112.42 34.46 398 24.2%: qoip(1).threads(1).entropy(0).smart(1)
4.087 13.757 113.57 33.74 393 24.0%: qoip(2).threads(1).entropy(0).smart(1)
4.063 14.378 114.24 32.28 392 23.9%: qoip(3).threads(1).entropy(0).smart(1)
4.073 15.670 113.96 29.62 390 23.8%: qoip(4).threads(1).entropy(0).smart(1)
4.080 18.138 113.76 25.59 390 23.8%: qoip(5).threads(1).entropy(0).smart(1)
4.068 23.192 114.11 20.01 390 23.8%: qoip(6).threads(1).entropy(0).smart(1)
# Grand total for images-lance
decode_ms encode_ms decode_mpps encode_mpps size kb rate
15.194 13.696 267.28 296.52 2109 13.6%: qoi-c04a975
25.444 32.182 159.61 126.19 1664 10.7%: qoip(0).threads(1).entropy(0).smart(1)
25.974 68.963 156.35 58.89 1566 10.1%: qoip(1).threads(1).entropy(0).smart(1)
25.569 69.303 158.83 58.60 1541 9.9%: qoip(2).threads(1).entropy(0).smart(1)
25.302 72.116 160.50 56.31 1531 9.8%: qoip(3).threads(1).entropy(0).smart(1)
25.313 78.076 160.44 52.01 1528 9.8%: qoip(4).threads(1).entropy(0).smart(1)
25.343 90.290 160.24 44.98 1527 9.8%: qoip(5).threads(1).entropy(0).smart(1)
25.536 113.415 159.04 35.81 1527 9.8%: qoip(6).threads(1).entropy(0).smart(1)
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size kb rate
4.278 27.860 108.51 16.66 390 23.8%: qoip(6).threads(8).entropy(0).smart(1)
Mild smart crunch optimisations:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size kb rate
1.945 2.335 238.71 198.81 463 28.2%: qoi-c04a975
4.017 6.304 115.54 73.63 401 24.4%: qoip(0).threads(1).entropy(0).smart(1)
4.002 12.927 115.99 35.91 398 24.2%: qoip(1).threads(1).entropy(0).smart(1)
4.027 13.268 115.27 34.98 393 24.0%: qoip(2).threads(1).entropy(0).smart(1)
4.112 14.187 112.88 32.72 392 23.9%: qoip(3).threads(1).entropy(0).smart(1)
4.065 15.308 114.20 30.32 391 23.8%: qoip(4).threads(1).entropy(0).smart(1)
4.077 17.892 113.87 25.94 390 23.8%: qoip(5).threads(1).entropy(0).smart(1)
4.100 23.038 113.20 20.15 390 23.8%: qoip(6).threads(1).entropy(0).smart(1)
A smarter smart function at first glance:
With the ops we have now (50 explicit) the search space would look something like this:
Worst-case combination search space
fixed_OP_A optional_OP_A fixed OP_A optional_OP_A
fixed_INDEX8 fixed_INDEX8 optional_INDEX8 optional_INDEX8
RGB DELTA FULL 2016 2016 4032 4032
RGB DELTA minus 8 bit 1512 1512 3024 3024
RGB DELTA minus 8/7 bit 1080 1080 2160 2160
RGBA DELTA FULL 774144 1548288 1548288 3096576
RGBA DELTA minus 8 bit 370440 740880 740880 1481760
RGBA DELTA minus 8/7 bit 155520 311040 311040 622080
A smarterer smart function at first glance: All smart functions to date have been concerned with finding exact representations and picking the smallest. A different smart function could compare stats from individual ops and possibly other things to make heuristic guesses at near-optimal combinations:
Very early days results for a smarter crunch function that tests with all the new ops:
# Grand total for images
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
4.017 170.711 115.55 2.72 387 23.6%: qoip(5).threads(1).entropy(0).smart(2)
Encode time has ballooned again, but the search space has ballooned faster (RGBA input tries ~200K combinations in the above test, RGB input only has a few thousand combinations to try). Encode time only increasing by an order of magnitude when the search space has increased by 3 orders of magnitude seems fine, and no optimisation has been attempted yet. tl;dr the smarter-crunch-function
uses tables of pixel logs instead of using a stat integer for each pixel. There is a table for each index1/index2/delta1 combination, and every pixel not handled by a run1/run2/index1/index2/delta1 op is put into the table, which essentially aggregates all pixels that can be handled with the same LUMA op.
edit: OP_A handling fixed and done in stat pass. Naively clamping lumalog values to (a=0..8, g=3..8, rb=2..8) roughly halves the size of the lumalog tables which does this to the encode time:
decode_ms encode_ms decode_mpps encode_mpps size_kb rate
4.031 91.899 115.16 5.05 387 23.6%: qoip(5).threads(1).entropy(0).smart(2)
Currently qoipcrunch_encode encodes a set of combinations and picks the smallest representation, where the combinations have been pre-calculated and sorted by how well they compress a corpus (a combination of the QOI corpus and images-lance, a more alpha-orientated corpus). A smarter solution could do a stat pass over the input, process the stats to find the best combination, then do a single encode using the combination.
At first glance:
First glance smart encode entails:
edit: At second glance: