better integrate pixel graphics with render/raster pipeline

dankamongmen commented 3 years ago

With the completion of #1378, our Sixel encoder seems to be working properly. It's now time to better integrate pixel graphics with the render and raster pipelines, so it's useful in Notcurses as something other than a novelty.

[x] we've absolutely got to get input quantized down to 256 colors, or do it ourselves, as we reject most media right now due to this constraint
[x] we need quantize down to a palette of 256 colors (xterm only supports 16 by default)
[x] always do a hard cup after emitting pixel graphics -- i think we're already doing this for rendered mode
[x] we need to be setting the width field correctly (we currently always set it to 1)...though i'm not sure how useful this is,
[x] what's up with scaling? see below
[ ] see if we can use the DEC macro functionality to only write these things once (DECDMAC and DEVINVM)
[ ] rigourize corner cases -- what happens if pixel is last output on frame? first? with both pixel and otherwise coming next/before?
[X] (updated: moved this to #1388) see what can be done about transparency, in both directions (pixel on top, pixel below)
[x] we absolutely must have some unit tests on this
[x] speed up the sixel encoder, which is aggressively suboptimal

with regards to scaling, currently if i run

./ncplayer -b pixel ../data/worldmap.png -k -q

in a 78x80 xterm, i get no output. if i run instead

./ncplayer -b pixel ../data/worldmap.png -k -q -s none

i get about two empty pages, with the image in the middle, with part of the right chopped off.

stretch, none, et. al. ought work for pixel like they do anything else, and this excess vertical space has got to go. i think the latter comes from creation of a n/6-row plane for n pixels going into the sixel.

dankamongmen commented 3 years ago

we know we can print over a sixel graphic, but how we do that in the current rendering system is not at all obvious, since the whole thing is based around bringing down a single egc. hrmmm. that's gonna be...tough. erp.

for instance, right now you must supply -q if not using -k with ncplayer -b pixel, or else the frame count results in it not being printed at all, due to not being selected in paint(). hrmmmm hrmmm hrmmm hrmmmm.

how the hell do you erase one of these things?

dankamongmen commented 3 years ago

i think a good acceptance criteria for this bug would be, say, ncneofetch can use the NCBLIT_PIXEL support when available.

dankamongmen commented 3 years ago

for quantizing, unless we can find code to do it in oiio/ffmpeg, we could just try building a table, and if we overflow, make a nonlinear or even linear one, and restart the scan with closest match. that succccccccccccccks, but it would work for now.

dankamongmen commented 3 years ago

resolved the extreme vertical space issue by restricting ncvisual_options->rows to 1 for NCBLIT_PIXEL in ncdirect, yay!

dankamongmen commented 3 years ago

I've implemented a simple, stupid quantization algorithm that at least lets us load things. We can now properly display everything I throw at it, including video =] =] =].

dankamongmen commented 3 years ago

(make no mistake--we still want a better quantizer than this baby solution)

dankamongmen commented 3 years ago

https://user-images.githubusercontent.com/143473/109427008-5f9ffe00-79be-11eb-94f2-298463ba09ad.mp4

slow as shit, and the quantizer sucks, but we're way past where we were last night =]

dankamongmen commented 3 years ago

here's the current quantization vs adding back half the mask

2021-02-28-124710_733x942_scrot

dankamongmen commented 3 years ago

top is run through gimp reduction to 255 colors with normal floyd-steinberg dithering, allowing a maximum mask of 0xff. bottom is our current solution.

2021-02-28-140855_909x1184_scrot

dankamongmen commented 3 years ago

scaling now works in rendered mode

dankamongmen commented 3 years ago

scaling now works in direct mode

dankamongmen commented 3 years ago

ncneofetch

2021-03-02-074408_1044x1415_scrot

dankamongmen commented 3 years ago

I think we're at a point where it's worth committing to some optimization. If we speed up what we have, it's pretty much ready for release.

Color table: right now we've got up to 8 runs using progressively looser masks. furthermore, we have a linear insertion sort to build the table.
- replace linear search with binary search, since we're sorted -- trivial
- replace moves of array half with bidirectional index-linked (as opposed to raw pointer, for cache reasons) -- obnoxious
- rather than rerunning on table fill, keep the table we have, but go through and merge entries on the looser mask. this allows us to continue from where we overflowed the table. it would involve a lot of moves to hew out the colliding entries...
- perhaps better would be a scheme where we remove a single item on overflow (possibly the new candidate item). currently we tear it down, loosen the mask, and start back over. eventually we reach some mask that allows us to fit the colors in the table. so whenever we overflow, find the maximum mask that matches both the candidate and an existing item. if that maximum mask is less than any other maximum mask, we can replace the latter with the former. yessssssssss. and this allows us to always keep the maximum number of colors once we achieve it, rather than potentially dropping with the looser mask. yes!!!
Data table: right now we do a pass over all content for each color. note that the scheme above might exacerbate this due to maintaining the maximum number of colors
- build the data table while we build the color table, easy enough, pretty obvious

I bet these three steps will substantially speed things up. A further Tip: try to loosen on the actual squarespace result rather than the rgb source, for a more even distribution.

dankamongmen commented 3 years ago

I've thus far been able to get a ~70% reduction by overlapping color and data table extraction, not bad at all. It's currently got a bug, and I'm going to go to sleep rather than try to force this tonight. Tomorrow i expect to pick this up, figure out what's going on along the top here, and reap the big perf get.

2021-03-03-011208_1124x1055_scrot

i'd like to get down to about 5% max of the original running time. this is a good start.

dnkl commented 3 years ago

Hi, I'm the author of foot, a sixel capable Wayland terminal emulator. I'm really looking forward to this!

Is there anything I can do to help? Testing things work in foot is obviously a given; but is there anything in particular I should be running? Anything I should look out for, focus more on etc?

I haven't (yet) used notcurses in any applications, but really liking the tech :)

(If you are interested in running notcurses in foot, I'd recommend running the git master branch)

dankamongmen commented 3 years ago

Hi, I'm the author of foot, a sixel capable Wayland terminal emulator. I'm really looking forward to this!

awesome, i will add it to my terminal matrix!

Is there anything I can do to help? Testing things work in foot is obviously a given; but is there anything in particular I should be running? Anything I should look out for, focus more on etc?

don't even look at the sixel stuff just yet; it's going to be a great deal faster in a few hours =]. ummm you can run different images through it, but rendering a sixel image is trivial; what's difficult is integrating it with a dynamic TUI, which involes writing a bunch of TUI code =]. umm just bang on it! if you'd like to test both your terminal and notcurses, write a script around xdotool that resizes a bunch. or submit code =]. but yeah, any hammering you can do is appreciated!

one thing i was thinking earlier today was that it might be interesting to build a shell around notcurses. i don't see any real applications within a terminal emulator itself for it, though.

but just commenting alone made me feel good, and me feeling good means code gets written =D. so thanks!

(If you are interested in running notcurses in foot, I'd recommend running the git master branch)

i will definitely check it out, if i get my wayland configuration working on my nvidia box again! which i ought do.

dankamongmen commented 3 years ago

w00t w00t.

before: real 0m1.377s real 0m1.372s real 0m1.373s real 0m1.369s real 0m1.380s

after: real 0m0.311s real 0m0.340s real 0m0.339s real 0m0.343s real 0m0.346s

that's about a 75% cut by uniting the first two passes

dankamongmen commented 3 years ago

moved to binary search in find_colors():

before: real 0m0.320s real 0m0.327s real 0m0.309s real 0m0.315s real 0m0.345s

after: real 0m0.216s real 0m0.194s real 0m0.183s real 0m0.210s real 0m0.185s

that's about a 40% further decrease, or about 85% down from the original =].

dankamongmen commented 3 years ago

with this second iteration of optimization, write_sixel_data() is now taking more time than extract_ctable(), so time to work on that for a bit, methinks.

dnkl commented 3 years ago

I did do a couple of quick tests before your optimization work, and the difference compared to then is already very noticable..

I've looked through the sixel code (very quickly, mind you), and just a couple of comments:

I see you use private mode 80 (cursor positioning), which I think is appropriate here. Have you looked into private mode 1070? It switches between shared and private color tables. The default is private, where each sixel (i.e. everything between the \EPq and the terminating \E\\) gets its own set of color registers.
Saw your note about 8-bit control sequences (regarding \EPq); very few terminals support 8-bit control sequences (CC1) as they can cause decoding problems in UTF-8 mode, depending on how the terminal implements decoding + VT parsing. Using the 7-bit equivalents (CC0) is encouraged.

Also seeing this (the black area before, and under the mech), with ./ncplayer -b pixel ~/src/notcurses/data/warmech.bmp -k -q -s none (in foot) warmech-black

This is slightly different from XTerm, though there are some weird glitches there too.

I haven't looked into why this is happening. Do you know why the mech ends up all the way to the right?

WSLUser commented 3 years ago

I'd like to add that the Windows console and Windows Terminal (which uses the open source version of the console as the backend for VT) support both the C1 and C0 controls. However discussions there from experts suggests that 7 bits is actually the norm. The comments in this WIndows Terminal PR (and the relevant references including the issues associated and the xterm docs) should help understand more about the support for C0 and C1 controls: https://github.com/microsoft/terminal/pull/7340 (Note, you could probably still support 8 bits if you wanted and do something similar as was done in that PR to default to 7 bits first and then use 8 bits if the 7 bit support isn't there for some reason ex. you're using some legacy DEC terminal, emulated or otherwise, but still want to run notcurses with sixel support on it)

dankamongmen commented 3 years ago

I did do a couple of quick tests before your optimization work, and the difference compared to then is already very noticable..

I've looked through the sixel code (very quickly, mind you), and just a couple of comments:

I see you use private mode 80 (cursor positioning), which I think is appropriate here. Have you looked into private mode 1070? It switches between shared and private color tables. The default is private, where each sixel (i.e. everything between the \EPq and the terminating \E\\) gets its own set of color registers.

Saw your note about 8-bit control sequences (regarding \EPq); very few terminals support 8-bit control sequences (CC1) as they can cause decoding problems in UTF-8 mode, depending on how the terminal implements decoding + VT parsing. Using the 7-bit equivalents (CC0) is encouraged.

Also seeing this (the black area before, and under the mech), with ./ncplayer -b pixel ~/src/notcurses/data/warmech.bmp -k -q -s none (in foot)

This is slightly different from XTerm, though there are some weird glitches there too.

I haven't looked into why this is happening. Do you know why the mech ends up all the way to the right?

great feedback; i will respond at length later. for now, the mech being on the far right is intended behavior with -k. if you use some other blitter (you can just leave off -b) you ought see the same behavior, ala:

2021-03-04-152945_1124x1055_scrot

dnkl commented 3 years ago

I see you use private mode 80 (cursor positioning), which I think is appropriate here

Sorry, was mixing up private mode 80 and 8452... 80 is for scrolling.

You might want to take a look at private mode 8452 too; when enabled, the cursor is left to the right, on the same line as the sixel, instead of on the next line. This is assuming mode 80 is enabled (the default, by the way).

or now, the mech being on the far right is intended behavior with -k

Ah, ok. I assume the black area isn't intended? XTerm looked better, but still not "clean". Should I spend some time looking at the escapes your emitting and see if there's something that needs to be changed?

dankamongmen commented 3 years ago

optimization notes:

i spent a few hours this evening looking at sixel_blit_inner(): the actual output is only about 1/4 of the total time of this function, but it can likely be completely eliminated iff we can cut out the libc *printf()ery entirely. i replaced the memstream with a huge buffer for experimentation, and wrote directly into it except for %d. the remaining stdio just for the %ds still took about the entirety of that 1/4. take a look at Alexandrescu's uint64_to_ascii() from a few years back (we can't just reuse the method of term_esc_rgb()--direct index into a pregenerated table of stringified integers--because we might have arbitrarily much RLE). i suspect that anything done here will end up without proportional returns.
just about all of our remaining time in ctable construction is being spent in memcmp(). i would have thought memcmp() would have been inlined away given the known comparison size (3 bytes), but i guess not? quite possible we could handroll the comparison (hell, if we align it, it's two 32-bit values (the unused byte would be set to 0 to avoid a masking requirement)) to force the issue. if we're seriously taking a function call for those comparisons...yuck. probably ought align each ctable entry at 8 bytes: zero byte, red, green, blue, four bytes for dtable index (keeping this fixed at the current 2 means we can't support more than 256 color registers). then do a four-byte aligned sub for compare, requiring at most a single mask of the other 4 bytes. this might be a win. memcmp() is roughly 20% of our cycles; memmove() barely shows up; the extra moveweight may well be worth aligned comparisons...
i fucked around the other night trying to do a merge-in-place scheme as outlined in an earlier comment. couldn't get a win off of it. timing analysis did make clear that ctable attempts failing due to too strict a mask tend to fail quickly, relative to the final success. typically on e.g. covid19.jpg all failed attempts added up would be on the order of the successful attempt (mask of 0xe0 iirc). so on one hand, cut them to 0 time (by e.g. starting with a loose mask), and cut half the ctable construction time. on the other hand, all you cut is half the ctable construction time, and you lose color detail. ooooh what if we started with a high mask, locked in the used color range, and then backfilled the lower bits (and unused entries)? that's a very exciting idea!!

our sixel rendering remains pretty much flat-out unacceptable for rendering even double-digit FPS, which i consider personally embarrassing. here's our current lbr profile (./ncplayer -b pixel ../data/fm6.mkv -q, 80x37 xterm 1124x1055):

-   99.62%     0.00%  ncplayer  libnotcurses.so.2.2.2       [.] ffmpeg_stream
   + 35.79% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit▒
   + 18.73% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;memcmp@plt
   + 10.33% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;sixel_blit_inner
   + 10.02% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sws_scale;0x7f644742337e;0x7f6447435da1;0x7f6447411890
   + 4.15% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098;0xffffffff9cc00ac0
   + 1.49% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;sixel_blit_inner;fprintf;__vfprintf_internal
   + 1.44% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098
   + 1.38% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098;0xffffffff9cc00ade;0xffffffff9ca6e251
   + 1.36% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098;0xffffffff9cc00ade;0xffffffff9ca6dcc7;0xffffffff9c07836f;
   + 0.74% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sws_getCachedContext;sws_init_context;0x7f64473e0fb5
   + 0.64% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098;0xffffffff9cc00ade;0xffffffff9ca6dcc7;0xffffffff9c07836f;
   + 0.62% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;cfree@GLIBC_2.2.5;munmap_chunk;0xffffffff9cc0008c;0xffffffff9ca6a1e3;0xf
   + 0.62% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;0x7f6449272098;0xffffffff9cc01382;0xffffffff9ca6b031
   + 0.52% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;sixel_blit_inner;write_rle.isra.0
   + 0.51% ffmpeg_stream;ncvisual_render;ncvisual_render_pixels;ncvisual_blit;ffmpeg_blit;sixel_blit;sixel_blit_inner;fprintf;__vfprintf_internal;_IO_default_xsputn

dankamongmen commented 3 years ago

Sorry, was mixing up private mode 80 and 8452... 80 is for scrolling.

which i haven't even grokked the semantics of yet. i'm taking notes as i build this up at https://nick-black.com/dankwiki/index.php?title=Sixel, since there seems to be no authoritative study anywhere else.

You might want to take a look at private mode 8452 too; when enabled, the cursor is left to the right, on the same line as the sixel, instead of on the next line. This is assuming mode 80 is enabled (the default, by the way).

i think this is irrelevant to us, since we do a hard cursor positioning operation after any sprixel emission. though, we will most often admittedly want to be immediately to the (top) right of the sixel (not sure if you mean bottom or top right), so that could potentially save us a cup. with that said, cup is cheap (relative to e.g. RGB escapes), and i'm not terribly confident in pixel-to-cell conversions, and thus i might continue to cup unconditionally to ensure i'm always where i want to be.

but yes, this seems a saner mode of operation. if it can save me some cups, i'll use it, but i suspect that i'll continue cupping up, in which case this would just be dead byte baggage in the already grotesque sprixel control sequence.

Ah, ok. I assume the black area isn't intended? XTerm looked better, but still not "clean". Should I spend some time looking at the escapes your emitting and see if there's something that needs to be changed?

i'm not sure precisely what you mean here -- i probably need reread your earlier comment. my head is full of optimization thoughts at the moment (also dayjob thoughts, not to mention standard dankhaze), and i was not yet going to open up that can of worms =]. with that said, you clearly know what you're doing, and i fully encourage you to study my code and output, and suggest or even implement fixes, and appreciate all help and wisdom you might provide =].

dnkl commented 3 years ago

It's not so much about saving some cursor escapes, but more to avoid unwanted terminal content scrolling when the sixel is at the bottom of the screen.

With ?80h (which you're already doing, and should be doing), terminal content scrolls when the sixel continues past the screen edge. I.e. sixels behave like any other terminal content. With 8452, the cursor is left just outside the sixel, in its bottom right corner. Think of sixels as being printed on a line printer.

Without 8452, it's as if you hit enter after printing the sixel - the cursor ends up in column 0 on the next line, potentially scrolling all output.

Btw, nice performance work! Keeping my fingers crossed and hoping for something that outperforms libsixel... :)

dnkl commented 3 years ago

Also seeing this (the black area before, and under the mech),

Turns out the black area is caused by ncplayer setting the background color to (0,0,0) (i.e. black) during its initialization(?)

When the prompt is at the bottom of the screen, emitting the sixel requires scrolling the terminal. Scrolled in lines are erased with the current background color (assuming the terminal has the bce terminfo capability) - hence the black area.

XTerm is doing this as well, but for some reason, the color it's using when scrolling a sixel is different. I haven't figured out why yet.

This isn't really specific to the sixel/pixel blitter. But the effect is probably hidden by the fact that other blitters then change, and most importantly, reset, the background color while blitting.

dnkl commented 3 years ago

https://user-images.githubusercontent.com/14031526/110212005-98961200-7e99-11eb-9612-d14fc4f707af.mp4

I tried playing a video file (with ncplayer). Mostly just for kicks. It's pretty slow (see attached screen recording).

One thing I noticed was that foot uses much more CPU when decoding the ncplayer output, compared to mpv's --vo=sixel output. With the same window size, foot uses ~10-20% CPU when decoding mpv's output, and ~60-70% when decoding ncplayer's.

A high level comparison of the encoded sixel data shows that both mpv and ncplayer configures roughly the same number of color registers for each frame, but ncplayer emits longer sixel sequences. Looking at the DCS strings sent to the terminal, ncplayer's output isn't that much longer that mpv's. But, after RLE expansion (i.e. the ! encoded sixel characters), the number of sixel characters emitted from ncplayer is often more than 10x compared to mpv (for same-sized frames).

dankamongmen commented 3 years ago

A high level comparison of the encoded sixel data shows that both mpv and ncplayer configures roughly the same number of color registers for each frame, but ncplayer emits longer sixel sequences. Looking at the DCS strings sent to the terminal, ncplayer's output isn't that much longer that mpv's. But, after RLE expansion (i.e. the ! encoded sixel characters), the number of sixel characters emitted from ncplayer is often more than 10x compared to mpv (for same-sized frames).

fascinating! ok let me run some mpv -vo sixel and do the differential analysis and get to the bottom of this. great find! if i'm emitting more sixels, am i not emitting a larger image?

dankamongmen commented 3 years ago

lol, mpv -vo sixel just assert()s out for me, hrmm:

[schwarzgerat](0) $ mpv -vo sixel ../data/fm6.mkv 
 (+) Video --vid=1 (*) (vp9 640x480 29.970fps)
File tags:
 Title: output
[autoconvert] Converting yuv420p -> rgb24
mpv: ../video/zimg.c:242: repack_entrypoint: Assertion `!(i & (mp_repack_get_align_y(r->repack) - 1))' failed.
Aborted (core dumped)
[schwarzgerat](134) $

dankamongmen commented 3 years ago

lol, mpv -vo sixel just assert()s out for me.

i was able to get it working on my Arch laptop. jeezum crow, yeah, this whips our ass. alright, i've broken out #1391 to address this. at one time, mpv -vo tct was better than ncplayer, too :D.

dankamongmen commented 3 years ago

I'm closing this issue up; I've expanded it into several other issues to address particular problems.

dankamongmen / notcurses

better integrate pixel graphics with render/raster pipeline #1380