dankamongmen / notcurses

blingful character graphics/TUI library. definitely not curses.
https://nick-black.com/dankwiki/index.php/Notcurses

color divergence in sixel with fewer colors than color registers #2503

Closed dankamongmen closed 2 years ago

dankamongmen commented 2 years ago

with quantanal improved in #2502, we're now seeing divergence on a more complex, but still simple, image:

[schwarzgerat](0) $ ./quantanal ../data/chunli23.png
analyzing ../data/chunli23.png...
 source pixels: 99x46
 rendered: 99x46 18216B
 control sequence: 3134 bytes (17.2%)
 00004554 pixels analyzed Δr 318498 Δg 3546 Δb 3600
 Colors: 17 vs 27
 4554px Δr 318498 (69.9) Δg 3546 (0.779) Δb 3600 (0.791)
 avg diff per pixel: 71.5
done with ../data/chunli23.png.
[schwarzgerat](0) $

we're for some reason using 10 more colors, and missing by a wide 71.5 per pixel. adding NOINTERPOLATION makes things worse:

[schwarzgerat](0) $ ./quantanal ../data/chunli23.png
analyzing ../data/chunli23.png...
 source pixels: 99x46
 rendered: 99x46 18216B
 control sequence: 3134 bytes (17.2%)
 00004554 pixels analyzed Δr 315107 Δg 5968 Δb 3489
 Colors: 17 vs 97
 4554px Δr 315107 (69.2) Δg 5968 (1.31) Δb 3489 (0.766)
 avg diff per pixel: 71.3
done with ../data/chunli23.png.
[schwarzgerat](0) $

meanwhile, on another, we drastically reduce the number of colors:

[schwarzgerat](0) $ ./quantanal ../data/worldmap.png
analyzing ../data/worldmap.png...
 source pixels: 475x860
 rendered: 475x860 1634000B
 control sequence: 130695 bytes (7.998%)
 00408500 pixels analyzed Δr 828526 Δg 899641 Δb 877151
 Colors: 233 vs 43
 408500px Δr 828526 (2.03) Δg 899641 (2.2) Δb 877151 (2.15)
 avg diff per pixel: 6.38
done with ../data/worldmap.png.
[schwarzgerat](0) $
dankamongmen commented 2 years ago

even more oddly, i only see 11 colors in chunli23.png, fewer than we count in quantanal. ahhh, perhaps because of the initial reduction from rgb to sixel space? we really oughtn't be doing that so early--it loses precision!

dankamongmen commented 2 years ago

yep that's why

dankamongmen commented 2 years ago
[0->10] hi: 18 6 0 lo: 18 6 0
[10->8] hi: 94 75 62 lo: 94 75 62
[10->8] hi: 94 87 69 lo: 94 75 62
[11->4] hi: 94 94 94 lo: 94 94 94
[12->12] hi: 18 40 72 lo: 18 40 72
[13->13] hi: 43 18 9 lo: 43 18 9
[14->14] hi: 94 87 69 lo: 94 87 69
[15->15] hi: 31 12 3 lo: 31 12 3
[1->9] hi: 0 21 53 lo: 0 21 53
[2->0] hi: 0 31 56 lo: 0 31 56
[2->0] hi: 18 40 72 lo: 0 31 56
[3->3] hi: 25 6 0 lo: 25 6 0
[3->3] hi: 31 12 3 lo: 25 6 0
[3->3] hi: 43 18 9 lo: 25 6 0
[4->11] hi: 43 28 9 lo: 43 28 9
[5->2] hi: 37 56 81 lo: 37 56 81
[6->5] hi: 56 28 15 lo: 56 28 15
[7->6] hi: 69 40 28 lo: 69 40 28
[8->1] hi: 59 72 90 lo: 59 72 90
[9->7] hi: 84 53 43 lo: 84 53 43
FOUND COLOR 0 0 25 50
FOUND COLOR 10 0 0 0
FOUND COLOR 11 25 25 0
FOUND COLOR 1 50 50 75
FOUND COLOR 2 25 50 75
FOUND COLOR 3 25 0 0
FOUND COLOR 4 75 75 75
FOUND COLOR 5 50 25 0
FOUND COLOR 6 50 25 25
FOUND COLOR 7 75 50 25
FOUND COLOR 8 75 75 50
FOUND COLOR 9 0 0 50

those deets entries look a bit askance, though maybe i'm forgetting something...

dankamongmen commented 2 years ago

yeah these hi/lo values are fubar:

FOUND COLOR 11 25 25 0
FOUND COLOR 11 25 25 0
FOUND COLOR 11 25 25 0
FOUND COLOR 11 25 25 0
000 206 hi 18/40/72 lo 0/31/56
001 69 hi 59/72/90 lo 59/72/90
002 78 hi 37/56/81 lo 37/56/81
003 343 hi 43/18/9 lo 25/6/0
004 72 hi 94/94/94 lo 94/94/94
005 187 hi 56/28/15 lo 56/28/15
006 155 hi 69/40/28 lo 69/40/28
007 124 hi 84/53/43 lo 84/53/43
008 178 hi 94/87/69 lo 94/75/62
009 83 hi 0/21/53 lo 0/21/53
010 35 hi 18/6/0 lo 18/6/0
011 226 hi 43/28/9 lo 43/28/9
dankamongmen commented 2 years ago

ok the 56 is floor(0x90*100/255). so that's sixel space. likewise floor(0x50*100/255) == 31...though how we're ending up with different hi vs lo remains confusing.
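
for reference, that mapping is just a truncating scale from the 8-bit channel range to sixel's 0..100 range; a minimal sketch:

```c
// scaling an 8-bit channel into sixel's 0..100 space; this reproduces the
// values above: 0x90 (144) -> 56, 0x50 (80) -> 31.
static inline unsigned
rgb_to_sixel(unsigned c){ // c in [0..255]
  return c * 100 / 255;   // truncating division, i.e. floor()
}
```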

dankamongmen commented 2 years ago

so is this all due to our preclumping, where we're applying that large mask?

FOUND COLOR 0 0 25 50 (orig 0xff905000)
000 4 hi 0/31/56 lo 0/31/56
FOUND COLOR 0 0 25 50 (orig 0xffb86830)
000 5 hi 18/40/72 lo 0/31/56

differences of 40, 24, and 48. we start with a mask of 0xc0, aka two bits, aka enclumpenings of 64, so yeah, that's wide enough to take them. still, if we're storing with full precision, we ought be able to get them out losslessly. so it looks like that's where our breakage is.

dankamongmen commented 2 years ago

yeah we're always averaging in refine_color(), stupid stupid stupid.

dankamongmen commented 2 years ago

definite bug in extract_color_table(), argh!

dankamongmen commented 2 years ago

alright, i think i've found one trick that wins us back a good bit of quality, testing for performance regression...

dankamongmen commented 2 years ago

yeah that seems to improve the visual a decent bit, though our average error changes very little...

dankamongmen commented 2 years ago

alright, bold new idea for the relaxation algorithm: initially, we take a 323 MSB liftup from the channels. use that as a key into a 256-element table. now the full extraction is O(1) per pixel (currently O(lgn)), at the cost of a possibly sparse color table. but use that same sparsity as a topological guide to our relaxation, relying on some locality. if need be at the end, compress the table down -- this can be delayed until encoding time, and done on the fly. this ought be a very substantial improvement, i think.
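
(a minimal sketch of that liftup, reading "323" as 3 bits of red, 2 of green, 3 of blue -- the exact assignment being an assumption -- just to show the O(1) keying:)

```c
// hypothetical helper, not the notcurses API: build an 8-bit key from the
// top 3 bits of red, 2 of green, and 3 of blue, indexing a 256-element table.
static inline unsigned
liftup323(unsigned r, unsigned g, unsigned b){
  return (r & 0xe0u)           // red MSBs   -> bits 7..5
       | ((g & 0xc0u) >> 3)    // green MSBs -> bits 4..3
       | ((b & 0xe0u) >> 5);   // blue MSBs  -> bits 2..0
}
```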

ghost commented 2 years ago

I would really like to port your encoder sometime to Java for the vastly superior speed and quality over other sixel encoders. Do you have a planned milestone spot in the future that would be a good place to start from?

dankamongmen commented 2 years ago

progress thus far indicates about a 50% cut in sixel quantization runtime, fuckin' sweet

dankamongmen commented 2 years ago

> I would really like to port your encoder sometime to Java for the vastly superior speed and quality over other sixel encoders. Do you have a planned milestone spot in the future that would be a good place to start from?

right now the encoder's no good =] i'm bringing it up to at least libsixel territory now. libsixel uses straight-up maxmedian cut (as you do, iirc), of which i'm not a tremendous fan. i think octrees are likely the way to go, though i've been playing around with kohonen neural networks. my deep dislike for statistical/ML methods means i probably won't go that way, either. right now i'm experimenting with static octree preclustering + my own "relaxation" approach, and it is showing very positive results. #1452 then suggests getting rid of the final iteration of relaxation, which i think is a great idea, but we'll see.

i'm down to a bit under 50% of my old rendering time with just the octree preclustering, but i've got a bug in there for images that are fewer than 256 colors. i should have it knocked out tonight, hopefully.

on the left is current algorithm, running in .9s±.2s when coming from page cache. on the right is the new algorithm, running in .5s±.1s when coming from page cache.

[screenshot: 2021-12-29-152524_1787x1440_scrot]

dankamongmen commented 2 years ago

(of that .5s, about .35s is due to writing out to the terminal, and the terminal rendering, so we're talking about .4s vs .1s for actual quantization, so a bit more speedup than advertised)

dankamongmen commented 2 years ago

i'm on it

dankamongmen commented 2 years ago

ok, i think i just got that bug resolved. things are looking very good...

dankamongmen commented 2 years ago

so we're seeing equal quality on most images now, while preserving nice speed pickups....except for notcursesIII.mov, on which we seem noticeably slower and crappier.

dankamongmen commented 2 years ago

likewise we're definitely slower and crappier on samoa.avi. very strange that this correlates with video, though probably not informative. huh.

dankamongmen commented 2 years ago

new algorithm:

[schwarzgerat](0) $ time ./ncplayer ../data/chunli* -t0 -q

real    0m7.650s
user    0m5.349s
sys     0m0.228s
[schwarzgerat](0) $ 

old:

[schwarzgerat](0) $ time ncplayer ../data/chunli* -t0 -q

real    0m10.964s
user    0m5.285s
sys     0m0.325s
[schwarzgerat](0) $ 

but then on worldmap.png, old beats new (and looks better). old:

[schwarzgerat](0) $ time ncplayer ../data/worldmap.png -t0 -q

real    0m0.241s
user    0m0.180s
sys     0m0.012s
[schwarzgerat](0) $ time ncplayer ../data/worldmap.png -t0 -q

real    0m0.244s
user    0m0.175s
sys     0m0.020s
[schwarzgerat](0) $

new:

[schwarzgerat](0) $ time ./ncplayer ../data/worldmap.png -t0 -q

real    0m0.267s
user    0m0.197s
sys     0m0.012s
[schwarzgerat](0) $ time ./ncplayer ../data/worldmap.png -t0 -q

real    0m0.269s
user    0m0.203s
sys     0m0.008s
[schwarzgerat](0) $

new pastes old on AllRGB images:

[schwarzgerat](0) $ time ./ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m6.671s
user    0m5.923s
sys     0m0.392s
[schwarzgerat](0) $ time ./ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m6.681s
user    0m5.955s
sys     0m0.348s
[schwarzgerat](0) $ time ./ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m6.712s
user    0m5.945s
sys     0m0.379s
[schwarzgerat](0) $ 

old:

[schwarzgerat](0) $ time ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m7.575s
user    0m6.826s
sys     0m0.372s
[schwarzgerat](0) $ time ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m7.507s
user    0m6.821s
sys     0m0.391s
[schwarzgerat](0) $ time ncplayer /media/chungus/images/allrgb/* -t0 -q

real    0m7.502s
user    0m6.827s
sys     0m0.398s
[schwarzgerat](0) $ 

soooooooooo AllRGB is going to fill up the color table on the initial pass, causing the second pass to abort immediately (it's also going to be dominated by writing/processing time, as they're 64MB RGB). meanwhile, very few colors are going to likewise cause very little second phase, since most will get slotted out to different octets in the first phase. so that points to our first phase being good, and our second phase being ass. it doesn't explain why we look shittier in that case, though. maybe a remaining bug in the second phase?

well, we ought be able to cut a chunk off that ass with the #1452 proposal. let's apply that and see where we get. if that doesn't win in all cases, i think we move to an all-octree solution, and call it a day.

dankamongmen commented 2 years ago

pure octree is, what, 2NlgN? you're doing a lgn insert for each of N, then a lgN lookup for each of N (N pixels). that's assuming you don't want arbitrary fidelity, though. if you do a relaxation following, you're talking what an additional PN (P maxcolors) worst case (for each free color pick a node, and expand that node). 9 levels get you full fidelity on 24-bit color (each level is 3 bits), so really it's 16N + P²lgP for relaxation-free pure-octree (keep a sum-over-subtree in each node, increment it as you traverse for insertion, and then pick your P by breadth-first-search with a sorted frontier, PlgP to sort the frontier....actually, no, at each point traverse down on the most populous...but then selecting the most populous requires searching a frontier; a frontier is implicit i think). BUT you avoid the N*P of relaxation, which is absolutely going to overwhelm P²lgP for all but the smallest images (this might not be true for 1024 colors, a case we have been sedulously ignoring).

if you do pure octree out to, say, 7 levels (losing 6 bits of detail--in a no-relaxation approach, this would be maybe 1 bit of red, 2 of green, 3 of blue) you need 256K nodes. since there's no relaxation, you needn't keep a hi/lo, and only need 3 words of color tracking per leaf node (one each for colorsums; we already have count). this is small enough to do a pointer-free implementation, i think -- each node is 256B with 64-bit words or 128B with 32-bit words (32 bit words guarantee overflow-free operation only though 2^24 summed pixels, aka 4096x4096, but that might be enough?). so you're talking 128K==2^17 nodes at 2^7==128 bytes + 128K==2^17 nodes at 2^5==32 bytes == 16MB + 4MB == 20MB for a static octree of 7 levels. a bit extravagant, not too wonderful on an embedded system, not great for on-die cache (there's a compulsory miss for the entirety due to initial zorching).

at 5 levels you lose 12 bits of detail, half your colordepth. that's still way better than preclustering on 256, where you lose 16 bits of detail, and pretty much have to apply relaxation. indeed, at 5 levels in the densest case you're throwing out 93.75% of your samples, as you lose 15/16 of your buckets (you're keeping 256 of 4096). on an even split you're losing 4 bits per channel, i.e. you have clumps 16-wide on each channel. not so bad. might not even need relaxation most of the time on that. in that case you have 2K heavy nodes and 2K light nodes for a static 327,680 byte octree. now we're talkin'!

just for fun, let's look at 5 levels on 1024 registers. well now you're just throwing out all but 1024 of 4096, aka 3/4 of your solutions. you still clump out at 16 on your leaves, but you can take more leaves. space cost remains 320KB--this is independent of P (and N) in all cases. the lack of relaxation allows us to scale P (and N!) out arbitrarily.

our shared object is well over 100KB. if you're running Notcurses, i think you can afford to toss 320KB at quantization (it's not necessary to keep the octree beyond encoding time).

let's fuckin' do it you sons of bitches arrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

dankamongmen commented 2 years ago

while this is great for PN scaling, it still doesn't explain why e.g. worldmap.png was slower.

i think relaxation just totally sucks? definitely when we're doing the iterated O(N) approach rather than a final sort for NlgP.

fuck relaxation. Octrees for Victory!

don't bring the bang out unless you plan to bang

dankamongmen commented 2 years ago

it would be a fine thing if we could get this 320KB all under the auspices of one HugeTLB entry

dankamongmen commented 2 years ago

I think I was using bytes up there where I wanted bits (for the node sizes), so cut all that by a factor of 8, which is obviously good =]

dankamongmen commented 2 years ago

how hard would it be to add a recombination pass? octree for initial structure, recombine following? in one dimension, recombination works like: we have 4-wide sets on [1..20]. 20 samples at 1, 10 at 4, 10 at 5, 20 at 15, and three registers. well we probably want to treat the 20 at 4,5 as a group, rather than the 30 in the first set. recombination looks across node boundaries. does this end up just being an optimization problem? in that case, fuck it

dankamongmen commented 2 years ago

ehhh to do that effectively you've pretty much gotta keep deep structure within your nodes, I think. so aye fuck that.

relaxation is cheaper here than our initial implementation, though, since you can just deal it out from your frontier. you do then have a search at the end...actually, you always have one, since you need to find your ancestor that made it into the table (you are guaranteed to only have one).

so let's lock this down:

  1. heavy nodes are 16B, light nodes are 4B
  2. we will have 2^11 each of light and heavy nodes in a pointer-free static structure, so 2^15+2^13, aka 40KB(!). might as well do 64 bits for 80KB.
  3. N insertions into octree at (fixed) 6 steps each

now we either prune down or meld up. pruning down requires a frontier. you're looking for the most populous P nodes such that all N pixels are included and no node in P also has an ancestor in P, right? because we must account for all N pixels, and want to combine (losing resolution) on as few as possible (though a distance*population metric is better if we're trying to minimize total error, I think). so pruning down means, starting at the root, if the node is populated, etc etc. best for sparse octrees, where you only need touch a subset of the leaves. in a dense octree, though, this is worse than melding up, as you must touch all nodes rather than just leaves. it's also more complicated, so fuck it. meld up.

to meld up, you have to touch all leaves, to find which ones are occupied. then you relax or fold until you have P.

what if we combined the two? we can track the number of populated leaves in a single word while inserting them, so O(1) state and negligible time assuming cache presence (which we can force by layout). as soon as you exceed some number of populated leaves, meld up entirely. before that, DFS to find your populated leaf set. or fuck it, throw another 2^12 bits (2^9 bytes, a mere 512B) as an occupancy bitmap...that's actually not a win assuming reasonably packed leaf nodes, as you can check the node pop in O(1) just as you can the bitmap, dumb.
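
a minimal sketch of that single-word occupancy tracking (hypothetical helper, not committed code):

```c
#include <stdint.h>

// bump a leaf's population, and bump the global occupancy count only when
// the leaf goes from empty to occupied.
static inline void
leaf_bump(uint32_t *leafpop, unsigned *occupied){
  if((*leafpop)++ == 0){
    ++*occupied; // first pixel to land in this leaf
  }
}
```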

either way you end up with a set of occupied leaf nodes partitioning N, with no guaranteed relation of this set's size to P. we need to be less than or equal to P, and ideally equal to P when we have at least P actual source colors.

so that's just relax and fold in the desired direction, using whatever metric we want. we can do this greedily; doing it optimally for the population metric sounds difficult and doing it optimally for a distance*population metric sounds right fucking unpleasant.

dankamongmen commented 2 years ago

alright so to lock the second phase down:

  1. select the set of between 1 and 2048 leaf nodes which are occupied. call this L, and its size |L|.
  2. if |L| == P, goto 6
  3. if |L| > P, find the smallest sum population of adjacent (within L) nodes, and combine them. goto 2
  4. (|L| < P) find the largest node in L with at least two source colors
  5. if there is such a node, split it, goto 2
  6. second phase done

first-order adjustment to this would, i think, be a metric that is the product of sum population and true (i.e. not logical) distance between nodes. this requires no intranode structure, but seeks to minimize global error (it also carries a hint of recombination as described earlier). there are a lot of multiplications, though. ehhh, we can afford some multiplies.

when you combine two nodes, you....hrmmm. that's actually kinda tricky, since we don't have pointers (and thus aliases), and it's not a binary tree (thus you can't simply take over the parent node). hrmmmmmmmm. we could just zorch out the lower of the two combined nodes, and always lean to the right when our true path is zorched? or with a state bit we could indicate the node we were combined into in the count word, and roll up a lgN chain to the final combined node. yeah, i think that latter's the way to do it.
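
a sketch of that count-word forwarding (an assumption about mechanics, not the shipped code):

```c
#include <stdint.h>

// when node b is melded into node a, b's count word becomes a forwarding
// link to a, marked by a state bit (costing us the top bit of population).
// lookups chase the chain up to the surviving node.
#define QMELDED 0x80000000u

static void
meld(uint32_t *counts, unsigned a, unsigned b){
  counts[a] += counts[b];  // b's population rolls up into a
  counts[b] = QMELDED | a; // b now forwards to a
}

static unsigned
resolve(const uint32_t *counts, unsigned n){
  while(counts[n] & QMELDED){ // roll up the chain of melds
    n = counts[n] & ~QMELDED;
  }
  return n;
}
```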

at step 6, we have a set L where |L| <= P, and no node in L has an ancestor that's also in L, which was our goal for quantization. now we must look up each pixel's quantized color, which can be done in 2NlgN (for each pixel, descend to our leaf node, and possibly hop over to our merged node). we then have a quantized set of colors, and have mapped each pixel to one of them. Encode. Done.

dankamongmen commented 2 years ago

qemfd let's do it

dankamongmen commented 2 years ago

alright so the first detail is the actual machinations for this static octree.

operations we will need include:

...is that the only one? do we need ancestor given a node, or siblings? i think we do for melding, yeah. so...

level 0: light node at 0..0 x 8
level 1: 8x light nodes at 1..8 x 8
level 2: 64x light nodes at 9..72 x 8
level 3: 512x light nodes at 73..584 x 8
level 4: 4096x light nodes at 585..4680 x 8
level 5: 32768x heavy nodes at 4681..37448 x 16

so that's 4681 light nodes and 32768 heavy nodes, losing 9 (not 12) bits of depth. at 8 bytes for lights and 32 bytes for heavies, that's 37448 + 1048576 == 1086024B, a little over 1MB. not great, not terrible.
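
to make the traversal concrete, a sketch of inserting one pixel into that flat 6-level structure (illustrative only: both node types are flattened into a single array of population counts, and the channel-to-child-bit assignment is arbitrary):

```c
#include <stdint.h>

// five light levels (root, 8, 64, 512, 4096) followed by the heavy leaf
// level; each step consumes the next MSB of r, g, and b. the computed
// indices land on the ranges listed above (leaves at 4681..37448).
static unsigned
octree_insert(uint32_t *pops, unsigned r, unsigned g, unsigned b){
  unsigned base = 0; // index of the first node on the current level
  unsigned off = 0;  // offset of the current node within its level
  for(unsigned level = 0 ; level < 5 ; ++level){
    ++pops[base + off];           // sum-over-subtree for this light node
    unsigned shift = 7 - level;   // bits 7, 6, 5, 4, 3 of each channel
    unsigned child = (((r >> shift) & 1u) << 2)
                   | (((g >> shift) & 1u) << 1)
                   |  ((b >> shift) & 1u);
    base += 1u << (3 * level);    // skip past this level: 1, 8, 64, 512, 4096
    off = (off << 3) | child;     // descend to the chosen child
  }
  ++pops[base + off];             // the heavy leaf itself
  return base + off;              // 4681..37448
}
```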

5 levels only means 585 lights and 4096 heavies, losing 12 bits of depth. that's 135752B, a little over 128KB.

given that at 5 levels we're already tossing 93.75%, i see no advantage in going deeper. let's make up some insertion keys:

level 0: n/a
level 1: 1R, 1G, 1B (777 remain)
level 2: 2R, 1G, 0B (567 remain)
level 3: 1R, 1G, 1B (456 remain)
level 4: 1R, 1G, 1B (345 remain)

what's the point of a non-uniform distribution here? it's only going to affect melding under a pop-distance metric if we weigh RGB differently in said metric (well, the distance part anyway).

so to define our operations:

path given a node at address A:

let's code this up and see what happens.

dankamongmen commented 2 years ago

fuck you can just do (A - base) / nodesize(L) to get a 12-bit index and then chop it up into 4 3-bit path sections, i'm fucking retarded

dankamongmen commented 2 years ago

> fuck you can just do (A - base) / nodesize(L) to get a 12-bit index and then chop it up into 4 3-bit path sections, i'm fucking retarded

for leaf nodes, anyway. your base would be address of octree + 4681 * 8.
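
a minimal illustration of that (leafnode here is a hypothetical stand-in; pointer subtraction does the (A - base) / nodesize division):

```c
#include <stdint.h>

typedef struct leafnode { uint32_t pop; } leafnode; // stand-in type

// decompose a leaf's 12-bit index into 4 3-bit path sections, deepest last.
static void
leaf_path(const leafnode *leaf, const leafnode *leafbase, unsigned path[4]){
  unsigned idx = (unsigned)(leaf - leafbase); // (A - base) / nodesize
  for(int l = 3 ; l >= 0 ; --l){
    path[l] = idx & 0x7u; // low 3 bits are the deepest step
    idx >>= 3;
  }
}
```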

dankamongmen commented 2 years ago

for the address of descendants you always come in knowing your L, base, and path thus far. so you just need say base + 2^3L*nodesize(L) + path[L]*nextoff(path[0..L-1]) where nextoff shifts current 3 left and adds path[i] through L-1.

dankamongmen commented 2 years ago

one remaining question is when do we map RGB space [0..255] down to Sixel space [0..100]? i guess you want to carry precision as far forwards as it is useful?

dankamongmen commented 2 years ago

you know...i don't think we even need the light nodes. we're not pruning down, but instead melding up, so there's no point in tracking the population at higher levels. this is basically just a 2^24 -> 2^11 masking.

dankamongmen commented 2 years ago

yeah. =]

dankamongmen commented 2 years ago

ok so true heavy nodes (the only node we have) will be 14B each. we'll have some power of 2 of them. we choppa and map. store hi and lo for each, and 32-bit pop for each (64-bit pop means 22B each, maybe do that). recombine/relax to P, using fast relaxation (no pixel moves). take the lower of all recombinations, since the eye distinguishes dark better than light. use the idea from above where we rewrite the popcount with the node that replaced it. finally, map down, following the popcount links.
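
one layout that reproduces those figures (an assumption, not necessarily what gets committed) is a hi color, a lo color, and a population for each, i.e. 3+3+4+4 == 14B with 32-bit pops and 3+3+8+8 == 22B with 64-bit:

```c
#include <stdint.h>

// hypothetical heavy node layout; note the compiler will pad this to 16
// bytes unless the struct is packed.
typedef struct heavynode {
  uint8_t hi[3];  // highest color melded into this bucket
  uint8_t lo[3];  // lowest color melded into this bucket
  uint32_t hipop; // population attributed to hi
  uint32_t lopop; // population attributed to lo
} heavynode;
```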

imho this ties together everything best from relaxation, sweetmasks, and octrees.

dankamongmen commented 2 years ago

one last thing: maybe write out the color table and use binary search, rather than searching the octree. it's the same NlgN either way, but the former has much better locality for cache.
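
a sketch of that flat-table lookup (entry layout and helper names are assumptions): dump one entry per surviving color, sort it, and bsearch per pixel, keeping the lg-time lookup but walking contiguous memory:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct palent {
  uint32_t key; // masked/quantized color key of the surviving node
  uint32_t reg; // sixel color register it maps to
} palent;

static int
palcmp(const void *va, const void *vb){
  const palent *a = va, *b = vb;
  return a->key < b->key ? -1 : a->key > b->key;
}

// table must have been sorted with qsort(table, n, sizeof(*table), palcmp);
// returns the register for key, or -1 if it somehow isn't present.
static int
lookup_reg(const palent *table, size_t n, uint32_t key){
  palent probe = { .key = key, .reg = 0 };
  const palent *e = bsearch(&probe, table, n, sizeof(*table), palcmp);
  return e ? (int)e->reg : -1;
}
```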

dankamongmen commented 2 years ago

alright, just got our first light off the new scheme, looking great thus far, i've got high hopes here

dankamongmen commented 2 years ago

have notcurses-info working once more under the new regime...

dankamongmen commented 2 years ago

so here's what we've got in the octrees-for-victory branch:

so it's looking positive. need to finish out merge_color_table() at a minimum.

dankamongmen commented 2 years ago

but yeah the original motivation for this bug, the divergence, is eliminated.

dankamongmen commented 2 years ago

uncontrollable

dankamongmen commented 2 years ago

alright, we're now functional on images with more leaves than color registers, though it's a first-order merge and tends to look like shit. that ends functional necessities; i'm going to clean out the diagnostics and start some fine-tuning and numeric comparison. i expect to merge this tonight.

dankamongmen commented 2 years ago

here's some crude notcurses-demo data:

xterm 80x61 (880x1403)

new

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║                
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣                
 1│   intro│   4.14s│    210│   3.79Mi│   50.7│ 6│ 1│47│  90.00║                
 2│    xray│  15.54s│    217│  14.91Mi│   14.0│ 1│ 0│19│  65.65║                
 3│     box│   6.88s│    100│   9.88Mi│   14.5│ 3│ 0│50│  26.56║                
 4│  keller│  13.72s│    866│   3.45Mi│   63.1│ 7│ 2│ 7│ 356.05║                
 5│    view│  22.05s│    962│  54.07Mi│   43.6│ 3│ 1│37│ 102.58║                
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝                
              62.34s│   2355│  86.11Mi│                 
             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║                
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣                
 1│   intro│   3.96s│    321│   6.71Mi│   81.2│ 9│ 2│39│ 156.45║                
 2│    xray│  15.48s│    233│  15.48Mi│   15.1│ 1│ 0│16│  85.47║                
 3│     box│   6.57s│    100│   9.36Mi│   15.2│ 5│ 0│46│  29.05║                
 4│  keller│  13.71s│    878│   3.46Mi│   64.1│ 6│ 2│ 7│ 385.39║                
 5│    view│  21.21s│    963│  54.08Mi│   45.4│ 3│ 1│36│ 110.45║                
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝                
              60.92s│   2495│  89.09Mi│                       

current

             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║                
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣                
 1│   intro│   4.18s│    327│   7.87Mi│   78.3│ 8│ 2│42│ 145.51║                
 2│    xray│  15.52s│    207│  14.80Mi│   13.3│ 0│ 0│17│  69.76║                
 3│     box│   7.07s│    100│  12.06Mi│   14.1│ 6│ 0│48│  25.60║                
 4│  keller│  13.63s│    864│   3.43Mi│   63.4│ 7│ 2│ 5│ 407.31║                
 5│    view│  39.62s│    961│  73.55Mi│   24.3│ 1│ 0│18│ 114.95║                
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝                
              80.01s│   2459│ 111.70Mi│               
             runtime│ frames│output(B)│    FPS│%r│%a│%w│TheoFPS║                
══╤════════╤════════╪═══════╪═════════╪═══════╪══╪══╪══╪═══════╣                
 1│   intro│   4.13s│    317│   7.71Mi│   76.7│ 8│ 2│42│ 143.30║                
 2│    xray│  15.53s│    206│  15.01Mi│   13.3│ 0│ 0│18│  68.25║                
 3│     box│   6.63s│    100│  10.13Mi│   15.1│ 6│ 0│45│  28.70║                
 4│  keller│  13.62s│    864│   3.43Mi│   63.5│ 7│ 2│ 5│ 404.63║                
 5│    view│  39.33s│    963│  73.55Mi│   24.5│ 1│ 0│18│ 117.42║                
══╧════════╧════════╪═══════╪═════════╪═══════╧══╧══╧══╧═══════╝                
              79.24s│   2450│ 109.82Mi│                        

contour is kinda unstable right now, and i'm not smart enough to copy-and-paste from xwayland into chrome on x so we only have xterm numbers. but alright! the only one where we would expect to see big timing changes is view, where indeed we kick it in the balls for a healthy 50% reduction in runtime. we'd expect some total frame count changes in xray, where time is constant and we drop frames, and we indeed have about a 10% gain there.

alright, this is looking good, performance-wise! let's get some detailed info from quantsixel.

dankamongmen commented 2 years ago

(from the same runs as above)

current

Bitmap emits:elides: 1420:306 (17.73%) 66.70MiB (59.70%) SuM: 0 (0.00%)
Bitmap emits:elides: 1376:299 (17.85%) 65.00MiB (59.18%) SuM: 0 (0.00%)

new

Bitmap emits:elides: 1444:304 (17.39%) 44.32MiB (49.74%) SuM: 0 (0.00%)

so it looks like we're emitting significantly smaller bitmaps, which surprises me. we'll want to look into that and determine why, i think. i mean, it's a good effect, but it needs to come from a righteous cause.

dankamongmen commented 2 years ago

alright, so we see a slight increase in accuracy on chunli23.png:

current

analyzing ../data/chunli23.png...
 source pixels: 99x46 rendered: 99x46 18216B
 control sequence: 3133 bytes (17.2%)
 00004554 pixels analyzed Δr 319236 Δg 2799 Δb 2877
 Colors: 17 vs 17
 4554px Δr 319236 (70.1) Δg 2799 (0.615) Δb 2877 (0.632)
 avg diff per pixel: 71.3
done with ../data/chunli23.png in 81.068ms.

new

analyzing ../data/chunli23.png...
 source pixels: 99x46 rendered: 99x46 18216B
 control sequence: 3116 bytes (17.11%)
 00004554 pixels analyzed Δr 317209 Δg 1271 Δb 1626
 Colors: 17 vs 17
 4554px Δr 317209 (69.7) Δg 1271 (0.279) Δb 1626 (0.357)
 avg diff per pixel: 70.3
done with ../data/chunli23.png in 81.758ms.

worldmap, as expected, looks worse (though it is slightly faster and smaller)

current

analyzing ../data/worldmap.png...
 source pixels: 475x860 rendered: 475x860 1634000B
 control sequence: 149102 bytes (9.125%)
 00408500 pixels analyzed Δr 773700 Δg 747897 Δb 840216
 Colors: 233 vs 54
 408500px Δr 773700 (1.89) Δg 747897 (1.83) Δb 840216 (2.06)
 avg diff per pixel: 5.78
done with ../data/worldmap.png in 500.359ms.

new

analyzing ../data/worldmap.png...
 source pixels: 475x860 rendered: 475x860 1634000B
 control sequence: 129291 bytes (7.913%)
 00408500 pixels analyzed Δr 1352655 Δg 1352043 Δb 1154968
 Colors: 233 vs 47
 408500px Δr 1352655 (3.31) Δg 1352043 (3.31) Δb 1154968 (2.83)
 avg diff per pixel: 9.45
done with ../data/worldmap.png in 439.945ms.

time to run ncplayer ../data/notcurses.png -q -t0

current

[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m0.998s
user    0m0.481s
sys     0m0.021s
[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m0.997s
user    0m0.497s
sys     0m0.008s
[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m1.017s
user    0m0.494s
sys     0m0.012s
[schwarzgerat](0) $ 

new

[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m0.834s
user    0m0.308s
sys     0m0.016s
[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m0.801s
user    0m0.311s
sys     0m0.013s
[schwarzgerat](0) $ time ./ncplayer ../data/notcurses.png -q -t0

real    0m0.805s
user    0m0.311s
sys     0m0.012s
[schwarzgerat](0) $ 

both current and new seem to generate a busted ncvisual on intake from sixel in notcurses.png using NCSCALE_STRETCH, generating:

analyzing ../data/notcurses.png...
 Image too large, scaling to display
 source pixels: 320x1280 rendered: 1380x880 4857600B
 control sequence: 805394 bytes (16.58%)
 00281600 pixels analyzed Δr 28101341 Δg 15579849 Δb 1586032
error getting pixel at 320/0 (1380/880)
error getting pixel at 320/0 (1380/880)
error comparing sixels

ncplayer /media/chungus/images/allrgb/* -t0 -q:

current

real    0m29.891s
user    0m29.277s
sys     0m0.605s

new

real    0m37.361s
user    0m36.704s
sys     0m0.648s

so current beats new on these pathological images.

time to run quantanal on everything in ../data:

new

real    0m9.174s
user    0m8.779s
sys     0m0.392s
real    0m9.242s
user    0m8.815s
sys     0m0.423s

current

real    0m9.618s
user    0m9.134s
sys     0m0.477s
real    0m9.527s
user    0m9.171s
sys     0m0.353s
dankamongmen commented 2 years ago

ok, i'm thinking i'm gonna merge this, and work to improve the two cases that have regressed (smooth gradients result in more banding due to eliminating relaxation; large numbers of leaves lead to slowness and quality droppage). the causes of both problems are understood, and this is a much better foundation to work from going forward.

the first problem will be easier to solve -- we just need to add fast relaxation.

on the latter, we need better merging, but that's not going to give us any perf back, and we already have a perf deficit. we'll have to science up that bitch.

dankamongmen commented 2 years ago

i still don't understand why we're seeing such a high average diff on chunli23.png. we ought be able to match those 17 colors exactly, right? how many colors are truly in the image? identify -format %k says 17. in that case, if we have 17 colors (check this), they ought be pretty much exact, no?

dankamongmen commented 2 years ago
0xff001030 -> 784
0xff001040 -> 1040
0xff082050 -> 1312
0xff183070 -> 1840
0xff184870 -> 1864
0xff284890 -> 2377
0xff4868b0 -> 2922
0xff7088d8 -> 3467
0xff883800 -> 60
0xff905000 -> 84
0xffa0c0f0 -> 4037
0xffb0e0f0 -> 4069
0xffb86830 -> 877
0xffd09060 -> 1686
0xffe8b898 -> 2495
0xfff0f0f0 -> 4087
dankamongmen commented 2 years ago

0;2;44;28;9#1;2;56;28;16#2;2;44;19;9#3;2;69;41;28#4;2;85;53;44#5;2;94;75;63