chafa sixel-encodes AllRGB images substantially faster than notcurses

dankamongmen commented 2 years ago

In the course of stunting on https://github.com/hpjansson/chafa/issues/27, I discovered that chafa 1.8.0 renders AllRGB/balloon.jpg in about half the time of ncplayer on my 3970X. Representative timings:

[schwarzgerat](0) $ time chafa /media/chungus/images/allrgb/balloon.png  -f sixel > /dev/null

real    0m0.420s
user    0m0.744s
sys     0m0.057s
[schwarzgerat](0) $

vs

[schwarzgerat](0) $ time ~/src/dankamongmen/notcurses/build-3.0.5/ncplayer /media/chungus/images/allrgb/balloon.png  -sscale -k > /dev/null

real    0m0.774s
user    0m0.720s
sys     0m0.053s
[schwarzgerat](0) $

in addition, the chafa output is slightly better -- look at the purple sections, and their borders with the blue sections to their right. we have some weird green pixels in there, unlike chafa.

ncplayer:

2022-01-23-150607_883x891_scrot

chafa:

2022-01-23-150616_880x922_scrot

note that this is on amd64, where chafa AFAIK uses hand-coded SIMD. i'm not sure if that's true on other architectures, but who cares; i tolerate defeat on no platform. well, i suppose there's only one thing to do now.

THERE CAN BE ONLY ONE

dankamongmen commented 2 years ago

this is a case ripe for threading, which chafa makes use of. i bet tossing some in could catch us up pretty quickly. we'll then want to revalidate on small, simple images to ensure we haven't added much overhead.

dankamongmen commented 2 years ago

as for the image quality, chafa's using mincut, which is working on a 3-dimensional metric. if we expanded to three dimensions of metric, we'd eliminate this annoying tendency to prefer a close green match. that ought close the quality gap.

dankamongmen commented 2 years ago

also, gcc12 introduced autovectorization at O2, from which we might benefit. if we don't, it's probably time to take a look at SIMDizing this shit ourselves via intrinsics. it's always good to have competition! even better to then blow said competition out of the water.

dankamongmen commented 2 years ago

i'm thinking we ought be able to do a fixed-size three-dimensional pseudometric and eliminate the green preference in constant time. if the difference is beyond first order, it's barely going to show up anyway. there's no need to search the entire space.

right now we have constant time with a single dimension by checking high and lo over the channel sum. expanding that into three dimensions would mean eight bounds instead of two, but all of them are in cache, and you could likely SIMDize the actual comparisons thus eliminating any branches. hell, we could probably do that and improve our performance as it exists simply by reducing two comparisons to a single SIMD instruction and killing the branch. hot damn, pepper your angus!

dankamongmen commented 2 years ago

looking at a single-instance run over the folder, we get:

real    0m5.382s
user    0m7.287s
sys     0m0.939s

for chafa and

real    0m8.572s
user    0m8.030s
sys     0m0.423s

for notcurses

dankamongmen commented 2 years ago

so yeah i think a good first step would be...let's see, we get 1024 RGBA pixels to a page. that's a 32x32 chunk or any equivalent rectangle. we'd want contiguous sections in memory to go to the same thread, so it just takes 1024 (or some multiple thereof, N). next thread takes the next N. you can work out the starting pixel in O(1) from your pixel offset (we'll need adapt this a bit for linestrides). threads grab chunks until it's done.

we could spin up some worker threads ahead of time maybe, upon detecting sixel support, so they're ready by the time we actually get a sixel encode request. they can be applied to any sixel encoding, even multiple concurrent ones, so long as job info is present in the work queue. there are three phases for the purposes of a concurrency graph:

loading qstates (parallel with random contention)
refinement/merging (serialized)
building up data table (embarrassingly parallel)

so it would be good to see how much of our time each phase is occupying. if a majority of it is refinement/merging, threads probably aren't going to get the job done (Amdahl's Law and all that). if it's 1 and 3, especially 3, they'll help very much.

i'm inclined to think 8192 or 16384 are good chunk sizes (2048 or 4096 pixels). they don't trounce the entire L1 no matter what uarch properties it has (so long as it's 32KB+). we don't want an abundance of threads--the more threads, the more contention in phase 1.

dankamongmen commented 2 years ago

then, within a thread, it would be a fine thing to vectorize at least qnode_keys(). i think that would be the most fruitful place to start applying SIMD, anyway.

dankamongmen commented 2 years ago

shifting to -O3 brought negligible gains, so i don't think autovectorization is doing much for us. looks like we'll have to get in thar ourselves.

dankamongmen commented 2 years ago

hrmmm, that's not going to be the best way to do it given how extract_color_table() currently works:

for each 6y
- for each x
- for each of 6 lines
  - do a bunch of crap with the TAM
  - call insert_color()

so we'd need parallelize the TAM crap as well. which i think we can do -- there are no carried dependencies. but we'd need arrange the workers differently; we'd want each one to do a cell at a time, or some multiple thereof.

so if we wanted to do that, i think the thing to do would be to factor out an extract_cell_color_table() which took the current args plus a cell's y/x coordinate within the image. this will update the TAM for the cell (or a NxM rectangle of cells with its origin at the specified cell) and insert all colors encountered (yes, we're sounding dangeously close to mosaics, aren't we?).

but overall this is good -- it means more of the total computation ought be parallelizable. remember, we have to do a lot more stuff than chafa, what with those other planes and such.

alright, let's get to it, you've got your marching orders!

dankamongmen commented 2 years ago

so the first question -- can we eliminate the sixel-based loop in extract_color_table()? sixels ought not have any relevance here, and it would simplify things tremendously.

dankamongmen commented 2 years ago

i've restructured extract_color_table() to be cell-based. it's working almost perfectly; we have a weird bit of regression in the [intro] demo, but only following the first move (i.e. the orca is first drawn correctly, but on subsequent draws to the left, she has some cruft above her dorsal fin).

while doing this, i found what i thought to be a bug (lastrow is being set based on it being the last row of a sixel, not a cell), and fixed it -- was i misunderstanding something, and that's the cause of the regression? let's get this ironed out before moving on.

dankamongmen commented 2 years ago

oh ok lastrow doesn't refer to the last row of a sixel, but the last sixel of the image, duh. and firstpix doesn't refer to the first pixel of the image, but of each cell.

dankamongmen commented 2 years ago

oh, motherfucker, i bet i had the firstpix=false following a bail on transparency...yeah that's almost certainly it, motherfucker motherfucker

dankamongmen commented 2 years ago

yep, works perfectly now, w00t w00t

dankamongmen commented 2 years ago

alright, we now have an extract_color_table() which works on a per-cell basis. in addition to being suitable for parallelism, it's significantly more readable, with a full level less of looping.

dankamongmen commented 2 years ago

hrmmm. initial profiling looks...not promising. the majority of our time seems to be spent in sixel_blit_inner() especially write_sixel_payload() (at least for the AllRGB images). which makes sense; that's handling 409640961024, an immense 2^34 operations. ack! extract_color_table(), inlined, primarily shows up as untracked sixel_blit().

(actually, we spend more time in inflate_fast() than anything, presumably due to PNG decoding)

here's a profile for covid19.jpg and aidsrobots.jpg:

+   57.90%    56.29%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload                         ◆
+   20.42%    14.92%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit                                  ▒
+    6.09%     0.00%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault                             ▒
+    6.09%     0.00%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault                           ▒
+    5.78%     0.08%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault                          ▒
+    5.70%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault                              ▒
+    5.70%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault                          ▒
+    5.40%     5.30%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep                              ▒
+    5.30%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page                  ▒
+    4.77%     0.00%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages                               ▒
+    4.77%     0.10%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist                      ▒
+    4.73%     0.00%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma                             ▒
+    3.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe              ▒
+    3.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64                               ▒
     2.27%     2.27%  ncplayer  libc-2.33.so                [.] __vfprintf_internal

and AllRGB:

+   37.88%    37.30%  ncplayer  libz.so.1.2.11              [.] inflate_fast
+   22.40%    21.88%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload
+    8.87%     6.60%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit
+    4.06%     0.06%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault
+    4.04%     0.01%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault
+    4.01%     0.01%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault
+    3.99%     0.01%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault
+    3.99%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault
+    3.81%     0.00%  ncplayer  [unknown]                   [.] 0000000000000000
+    3.71%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page
+    3.70%     3.66%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep
+    3.04%     0.01%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages
+    3.03%     0.04%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist
+    3.01%     0.00%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma
+    2.71%     2.65%  ncplayer  libz.so.1.2.11              [.] adler32_z
+    1.83%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe
+    1.83%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64
     1.50%     1.49%  ncplayer  libz.so.1.2.11              [.] inflate
+    1.27%     0.82%  ncplayer  libc-2.33.so                [.] __memmove_avx_unaligned_erms
+    1.19%     0.00%  ncplayer  [kernel.kallsyms]           [k] __irq_exit_rcu
+    1.19%     0.01%  ncplayer  [kernel.kallsyms]           [k] __softirqentry_text_start
+    1.08%     0.00%  ncplayer  libpthread-2.33.so          [.] __libc_write

dankamongmen commented 2 years ago

so for the former case, infinite ideal parallelism in extract_color_table() can eliminate only 20% of runtime, and in the latter case, it's an even smaller ~9%.

well, good thing we ran that. so the big wins are going to come from parallelizing write_sixel_payload(), where we could knock out ~60% of runtime in the former case and ~22% in the latter.

so, what is write_sixel_payload()? we iterate over

six-row sections (a row of sixels) (horizontal width in pixels)
for each color
until we've handled all six-row sections

the data is stored as vectors for each color, with each vector equal in size to the total number of sixels. cache performance ought be pretty reasonable, then -- we're reading left to right for the duration of a sixel line (one byte per sixel, so 64 pixels == cacheline). worst case is horizontal width congruent to 1 modulo 64, in which case we're doing a cacheline fill of 1.5% efficiency, per color, per sixel row. when width is congruent to 0 modulo 64, cacheline fills are 100% efficient.

we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.

normally we would expect to make significant use of sixel RLE (not applicable to e.g. the AllRGB images, though). if we admitted dynamic lengths, we could pre-RLE this data. this helps us not at all in the worst case, though.

in the case where we have many colors (and thus the slowest case), it's unlikely that any given color will be in a sixel row. can we not preprocess these away? one bit per color per sixel row, set in the previous step, would allow us to skip the full width for that color. this might be a very nice, very easy pickup. try it. you could place these in their own bitvector for cache friendliness.

so my first thought would be parallelizing on sixel rows. a chunk is then width * colors data, each of size 1B. for 1024 colors, you hit an x86 page at 4 pixels wide (geez, no matter we're going through so many pages -- this is a huge structure!). so a thread has plenty of work. each writes to their own internal buffer, and then you just lay those into the fbuf. ought be perfect parallelism.

alright. between these last two ideas, i think we can get somewhere. beyond that, i think we need to just shrink this elephantine structure down. too big! 819KB per sixel row for 800 pixels across @ 1024 colors -- too mas!

dankamongmen commented 2 years ago

so if we wanted to add the 1-bit "color absent" signal, that's best done in build_data_table(), no?

dankamongmen commented 2 years ago

so you need (colors sixel rows) bits total for this map, right? yeah. initialize it to 0. whenever you load stab->map->data, set `map[sixelrow colors + color` to 1. a 1 here means we must inspect the sixel row for the color.

so you know we've really built a three-pass encoder, not a two-pass. can we not integrate these two pieces of shit?

dankamongmen commented 2 years ago

ok the actionmap proposed above is looking good initially, about a 20% pickup on AllRGB, need flush out a bug though so who knows maybe it actually makes us 30000 times slower

dankamongmen commented 2 years ago

we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.

this is probably an easy pickup as well, and it would imply 100% cachefill efficiency for all but the last color in a given sixelrow. almost certainly a win. let's get down and do the do.

dankamongmen commented 2 years ago

ok the actionmap proposed above is looking good initially, about a 20% pickup on AllRGB, need flush out a bug though so who knows maybe it actually makes us 30000 times slower

fixed bug. gainz went away but we've got some bullshit going on, along the lines of say 7194 writes to the actionmap for the first sixel row...no, that's right, we're 1199px wide * 6 == 7194. oh, duh, i had enlarged my temrinal. ok.

before, allrgb: 14.192, 14.164, 14.231 after, allrgb: 12.697, 12.708, 12.312

37.717 / 42.587 == 88.56%, so about an 11% pickup for allrgb.

before, aids+covid: 0.789, 0.798, 0.778 after, aids+covid: 0.712, 0.672, 0.683

2.067 / 2.365 == 87.40%, so a similar win there.

ok! well it's a start! we got some hats now, motherfuckers! we shall continue.

dankamongmen commented 2 years ago

ok actiontable is committed. let's pump up the cache efficiency for write_sixel_payload() by striping by sixelrows rather than colors. i've got a good feeling there.

dankamongmen commented 2 years ago

reducing the actionmap from a byte per to a bit per had negligible impact. let's go ahead and do it, just to save the memory, but no perf change results.

dankamongmen commented 2 years ago

you know, looking at this, a lot of our delay seems to be coming not from drawing or computing, but scrolling. i suspect if we scrolled prior to drawing (see #2361) we could cut a big chunk (approaching real - user - sys), as that's just XTerm playing with itself

dankamongmen commented 2 years ago

ok, imma take a walk and go buy cheap Publix newports. for when i get back, here's a rollup gameplan:

[x] ~improve cache performance of write_sixel_payload() via changing map layout~
- [ ] no wins. what's eating all the time, then, in write_sixel_payload()?
[ ] parallelize write_sixel_payload()
[ ] scroll before drawing
[x] try to shrink map, or use memory hinting to prefault most of it in; maybe use explicit hugepages here

some work of noble note may yet be done, not unbecoming men who strove with Gods!

dankamongmen commented 2 years ago

it occurs to me that ncls ought be significantly faster than ncplayer for a bunch of images like this. and indeed, indeed it is. it handles the AllRGB directory in about 2s. not quite relevant, but it did make me happy to see.

dankamongmen commented 2 years ago

hrmmm not much of a pickup from changing the map layout -- even in a case chosen for it (width % 64 was large, terminal was large), any improvement was lost in noise.

dankamongmen commented 2 years ago

equations for layout change:

-  size_t dsize = sizeof(*stab->map->data) * colors * stab->map->sixelcount;
+  size_t dsize = sizeof(*stab->map->data) * colors * lenx * ((leny + 5) / 6);

and

+      long mapidx = sixelrow * map->colors * lenx + i * lenx;

in place of i * map->sixelcount + m.

oh and

-        stab->map->data[cidx * stab->map->sixelcount + pos] |= (1u << (sy - visy));
+        const long mapidx = sixelrow * colors * lenx + cidx * lenx + (visx - begx);
+//fprintf(stderr, "MAPIDX WRITE: %ld\n", mapidx);
+        stab->map->data[mapidx] |= (1u << (sy - visy));

dankamongmen commented 2 years ago

-  size_t dsize = sizeof(*stab->map->data) * colors * stab->map->sixelcount;
+  size_t dsize = sizeof(*stab->map->data) * colors * lenx * ((leny + 5) / 6);

hrmmm, oughtn't these be the same? wasn't that the whole point -- it's all just a layout change?

dankamongmen commented 2 years ago

sizeof(stab->map->data) colors * stab->map->sixelcount;

yeah, killing this diff leaves everything unchanged

dankamongmen commented 2 years ago

oh hrmm our profile shape has actually changed a good bit from earlier work. here's allrgb/balloon.png:

+   42.52%    42.52%  ncplayer  libz.so.1.2.11              [.] inflate_fast                             ◆
+    9.96%     7.52%  ncplayer  libnotcurses-core.so.3.0.5  [.] sixel_blit                               ▒
     9.82%     9.82%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload                      ▒
+    5.61%     0.27%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault                       ▒
+    5.32%     0.00%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault                          ▒
+    5.31%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault                           ▒
+    5.31%     0.03%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault                       ▒
+    5.30%     0.01%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault                        ▒
+    4.71%     4.68%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep                           ▒
+    4.61%     0.00%  ncplayer  [unknown]                   [.] 0000000000000000                         ▒
+    4.17%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page               ▒
+    4.03%     0.03%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages                            ▒
+    4.00%     0.17%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist                   ▒
+    3.96%     0.00%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma                          ▒
+    2.74%     2.74%  ncplayer  libz.so.1.2.11              [.] adler32_z                                ▒
+    2.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe           ▒
+    2.42%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64                            ▒
+    2.13%     0.63%  ncplayer  libc-2.33.so                [.] __memmove_avx_unaligned_erms             ▒
+    2.11%     0.00%  ncplayer  [unknown]                   [.] 0x000055f02ff91770                       ▒
     2.07%     2.07%  ncplayer  libz.so.1.2.11              [.] inflate                                  ▒
+    1.73%     1.73%  ncplayer  libz.so.1.2.11              [.] inflate_table                            ▒
+    1.41%     0.00%  ncplayer  [unknown]                   [.] 0x000055f02ff916d8                       ▒
+    1.29%     0.00%  ncplayer  libpthread-2.33.so          [.] __libc_write                             ▒
+    1.29%     0.00%  ncplayer  [kernel.kallsyms]           [k] ksys_write                               ▒
+    1.29%     0.00%  ncplayer  [kernel.kallsyms]           [k] vfs_write                                ▒
+    1.29%     0.00%  ncplayer  [kernel.kallsyms]           [k] new_sync_write                           ▒
+    1.29%     0.00%  ncplayer  [kernel.kallsyms]           [k] file_tty_write.constprop.0               ▒
+    1.26%     0.93%  ncplayer  [kernel.kallsyms]           [k] n_tty_write                              ▒
     1.22%     1.22%  ncplayer  libswscale.so.6.5.100       [.] 0x0000000000020918                       ▒
+    1.22%     0.00%  ncplayer  libswscale.so.6.5.100       [.] 0x00007f7111308918                       ▒
+    1.11%     0.03%  ncplayer  [kernel.kallsyms]           [k] clear_huge_page                          ▒
+    1.02%     0.09%  ncplayer  libc-2.33.so                [.] __memset_avx2_unaligned_erms

dankamongmen commented 2 years ago

so we're now burning a decent amount of time, relatively, in sixel_blit(), which means extract_color_table() i'm pretty sure...yeah, puilling it out of the inline yields:

+   41.91%    41.91%  ncplayer  libz.so.1.2.11              [.] inflate_fast
     9.48%     9.48%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload
+    7.55%     7.52%  ncplayer  libnotcurses-core.so.3.0.5  [.] extract_color_table
+    5.77%     0.21%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault
+    5.60%     0.00%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault
+    5.53%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault
+    5.53%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault
+    5.49%     0.06%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault
+    5.18%     0.00%  ncplayer  [unknown]                   [k] 0000000000000000

dankamongmen commented 2 years ago

so extract_color_table() isn't as heavy as the entirety of write_sixel_payload(), but it's substantial. we'll likely end up parallelizing both of them.

dankamongmen commented 2 years ago

as for what's taking all the time in write_sixel_payload():

Percent│     ↓ jge    22d
       │       mov    0x4(%rsp),%ebx
       │       mov    0x8(%r14),%rsi
       │     unsigned char crle = 0; // character being repeated
       │       xor    %r8d,%r8d
       │     int seenrle = 0; // number of repetitions
       │       xor    %ecx,%ecx
       │     if(write_rle(&printed, i, f, seenrle, crle, &needclosure)){
       │       lea    0x18(%rsp),%r12
       │     ↓ jmp    e8
       │       nop
       │     if(map->data[i * map->sixelcount + m] == crle){
       │ d0:   cmp    %al,%r8b
  0.64 │     ↓ jne    1b0
       │     ++seenrle;
       │       add    $0x1,%ecx
       │     for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
  2.57 │ dc:   add    $0x1,%ebx
  0.64 │       cmp    %edx,%ebx
  1.28 │     ↓ jge    10a
       │ e3:   cmp    %r15d,%ebx
  0.64 │     ↓ je     10a
       │     if(map->data[i * map->sixelcount + m] == crle){
       │ e8:   mov    %edx,%eax
       │       imul   %r13d,%eax
 29.58 │       add    %ebx,%eax
       │       cltq
  0.96 │       movzbl (%rsi,%rax,1),%eax
       │     if(seenrle){
 52.73 │       test   %ecx,%ecx
       │     ↑ jne    d0
       │     for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
       │       add    $0x1,%ebx
       │     if(map->data[i * map->sixelcount + m] == crle){
       │       movzbl %al,%r8d
       │     seenrle = 1;
       │       mov    $0x1,%ecx
       │     for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
       │       cmp    %edx,%ebx
       │     ↑ jl     e3
       │     if(crle){
       │10a:   test   %r8b,%r8b
  0.64 │     ↓ jne    1f0
       │     needclosure = needclosure | printed;
  0.32 │       mov    0x1c(%rsp),%eax
       │       mov    (%r14),%edi
       │     for(int i = 0 ; i < map->colors ; ++i){
       │11a:   add    $0x1,%r13d
       │     needclosure = needclosure | printed;
       │       or     %eax,0x18(%rsp)
       │     for(int i = 0 ; i < map->colors ; ++i){
       │       cmp    %r13d,%edi
       │     ↑ jg     75
       │12b:   mov    %r14,%rax
       │       mov    %r15d,%r14d

dankamongmen commented 2 years ago

over a set of smaller images from data, the profile isn't too surprising:

    27.45%    27.39%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload                      ◆
+   19.59%    19.54%  ncplayer  libnotcurses-core.so.3.0.5  [.] extract_color_table                      ▒
+   17.61%     2.23%  ncplayer  libc-2.33.so                [.] __memset_avx2_unaligned_erms             ▒
+   15.87%     0.01%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault                          ▒
+   15.86%     0.04%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault                        ▒
+   15.47%     0.05%  ncplayer  [kernel.kallsyms]           [k] asm_exc_page_fault                       ▒
+   15.41%     0.00%  ncplayer  [kernel.kallsyms]           [k] exc_page_fault                           ▒
+   15.41%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_user_addr_fault                       ▒
+   15.13%     0.01%  ncplayer  [kernel.kallsyms]           [k] do_huge_pmd_anonymous_page               ▒
+   14.96%    14.88%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep                           ▒
+   11.98%     0.04%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages                            ▒
+   11.93%     0.06%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist                   ▒
+   11.86%     0.01%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma                          ▒
+    3.61%     0.09%  ncplayer  [kernel.kallsyms]           [k] clear_huge_page                          ▒
+    3.36%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe           ▒
+    3.36%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64                            ▒
+    2.10%     0.00%  ncplayer  libswscale.so.6.5.100       [.] 0x00007f40dd051918                       ▒
     2.09%     2.09%  ncplayer  libswscale.so.6.5.100       [.] 0x0000000000020918                       ▒
     2.00%     2.00%  ncplayer  libc-2.33.so                [.] __vfprintf_internal                      ▒
+    1.96%     0.01%  ncplayer  libpthread-2.33.so          [.] __libc_write                             ▒
+    1.95%     0.00%  ncplayer  [kernel.kallsyms]           [k] ksys_write                               ▒
+    1.94%     0.00%  ncplayer  [kernel.kallsyms]           [k] vfs_write                                ▒
+    1.94%     0.00%  ncplayer  [kernel.kallsyms]           [k] new_sync_write                           ▒
+    1.94%     0.01%  ncplayer  [kernel.kallsyms]           [k] file_tty_write.constprop.0               ▒
+    1.89%     1.54%  ncplayer  [kernel.kallsyms]           [k] n_tty_write

goddamn that's a lot of soft page faults

dankamongmen commented 2 years ago

here's an interesting result: when i don't stretch the images to occupy the full terminal (generally yielding far smaller images), write_sixel_payload() drops far, far below extract_color_table(), which begins to rudely dominate. makes sense. so takeaways:

improving extract_color_table() ought improve all cases
write_sixel_payload() is strongly coupled to pixel count

+   25.80%    25.80%  ncplayer  libnotcurses-core.so.3.0.5  [.] extract_color_table                      ◆
+   21.11%     0.00%  ncplayer  [kernel.kallsyms]           [k] entry_SYSCALL_64_after_hwframe           ▒
+   21.11%     0.00%  ncplayer  [kernel.kallsyms]           [k] do_syscall_64                            ▒
+   13.99%    13.99%  ncplayer  libz.so.1.2.11              [.] inflate_fast                             ▒
+   13.41%     0.00%  ncplayer  [kernel.kallsyms]           [k] vm_mmap_pgoff                            ▒
+   13.15%     0.00%  ncplayer  libc-2.33.so                [.] __mmap                                   ▒
+   12.87%     0.00%  ncplayer  [kernel.kallsyms]           [k] __mm_populate                            ▒
+   12.87%     0.00%  ncplayer  [kernel.kallsyms]           [k] populate_vma_page_range                  ▒
+   12.87%     0.44%  ncplayer  [kernel.kallsyms]           [k] __get_user_pages                         ▒
+   12.54%     0.00%  ncplayer  [kernel.kallsyms]           [k] handle_mm_fault                          ▒
+   12.40%     1.36%  ncplayer  [kernel.kallsyms]           [k] __handle_mm_fault                        ▒
+    9.67%     0.00%  ncplayer  [unknown]                   [k] 0000000000000000                         ▒
     7.39%     7.23%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload                      ▒
+    7.10%     0.99%  ncplayer  [kernel.kallsyms]           [k] alloc_pages_vma                          ▒
+    6.24%     0.49%  ncplayer  [kernel.kallsyms]           [k] __alloc_pages                            ▒
+    5.44%     0.35%  ncplayer  [kernel.kallsyms]           [k] get_page_from_freelist                   ▒
+    4.39%     4.39%  ncplayer  [kernel.kallsyms]           [k] clear_page_rep                           ▒
+    4.04%     3.92%  ncplayer  libnotcurses-core.so.3.0.5  [.] paint

btw welcome back paint(); for the first time, we're seeing the non-bitmap elements of the render core on profiles.

dankamongmen commented 2 years ago

yeah let's go back to the plan of parallelizing extract_color_table() first.

dankamongmen commented 2 years ago

alright, i've got extract_color_table() ready to go for multiple threads. i've gone ahead and inserted all necessary locking, blocking, and signaling. it all added negligible time in the singly-threaded case. let's see what happens when we let 32 cores get at it...

dankamongmen commented 2 years ago

alright, threaded it out. results are poor so far, but this is strictly PoC with very coarse locking. we're running immediately into contention walls, as expected. the firmament is now there, though. i'm going to sleep for a bit. we'll develop upon this theme tomorrow. well, today, but later.

dankamongmen commented 2 years ago

alright. i've now got sixel's extract_color_table() arbitrarily data-parallelized, and it works with any number of threads, so far as i can tell. the performance curves i'm getting are disheartening; 2 threads are slower than 1, and in general N+1 are slower than N. we've almost certainly got gross contention (this seems self-evident, though i ought go ahead and prove it with profiling).

i mean, we've got a single mutex in the qstate, and we hold it across the length of insert_color() and while picking a new workchunk. workchunk selection was originally lock-free, but that wasn't behaving as expected -- i need to look at that and figure out what the problem was (we had an atomic long chunkstaken on which this turned). we'll need either more granular locking in insert_color(), or a lockfree solution (maybe possible, maybe not).

so that's disheartening. but! this whole effort has massively cleaned up and improved this code, and i think we'll be able to get somewhere with the parallelism quickly now.

dankamongmen commented 2 years ago

also, about those soft page faults...

ALLOCING 40
ALLOCING 24288
ALLOCING 8192
ALLOCING 4800
ALLOCING 10560
ALLOCING 178 112 000 <-----------
ALLOCING 2640
ALLOCING 25392

we're suffering a 178M allocation on a 111K image (this is coupled to the scaled size). 178M is 46K pages at 4K per page. grotesque. still, i'm surprised that this shows up so prominently in our profiles....

dankamongmen commented 2 years ago

mutrace: Showing 4 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       9  2743279   232173   231205      259.141        0.000        0.014 Mx.--.
      10       83        2        0        0.006        0.000        0.001 M-.--.
       5        7        2        0        0.002        0.000        0.001 Mx.--.
       1        3        2        0        0.011        0.004        0.005 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

yep, that's 259ms of contention (btw, mutrace is fantastic shit!)

dankamongmen commented 2 years ago

we're suffering a 178M allocation on a 111K image (this is coupled to the scaled size). 178M is 46K pages at 4K per page. grotesque. still, i'm surprised that this shows up so prominently in our profiles....

you know, quad-channel 3970x@3000 ought have about 90GB/s of memory bandwidth, and 178M would therefore expect to take about 2ms to move, which is quite plausible here (assuming this ever actually goes out to memory, which at least a significant portion of it surely does). yeah, we really want to shrink that allocation. let's think of a good sparse solution. we're already using the action bitmap -- what if we only allocated for entries in the action map? that could cut things down by two orders of magnitude easy.

dankamongmen commented 2 years ago

so with regards to more granular locking on insert_color(), i don't think we can do the front end locklessly. we need a lock per N elements, where N=1 for the most granular locking. the backend, though (any path that pulls down a new qnode), can probably be done losslessly, which is good since that needs be shared across the object.

soooooooo let's start with, hell, a lock per static qnode, and see where that takes us.

dankamongmen commented 2 years ago

ok, i've now got active parallelism, but barely. still massive contention on the _free counters. we've got to make those lockless, or we're pretty much done here.

dankamongmen commented 2 years ago

yeah we're churning locks hard:

  50.41%  ncplayer  [kernel.kallsyms]           [k] native_queued_spin_lock_slow
  11.39%  ncplayer  libz.so.1.2.11              [.] inflate_fast
   6.35%  ncplayer  libpthread-2.33.so          [.] __pthread_mutex_lock
   2.60%  ncplayer  [kernel.kallsyms]           [k] delay_halt_mwaitx
   2.02%  ncplayer  [kernel.kallsyms]           [k] futex_wake
   1.94%  ncplayer  libpthread-2.33.so          [.] __pthread_mutex_unlock_userc
   1.93%  ncplayer  libnotcurses-core.so.3.0.5  [.] write_sixel_payload
   1.79%  ncplayer  libnotcurses-core.so.3.0.5  [.] qstate_work

dankamongmen commented 2 years ago

alright, i want to switch away from that for a minute, and consider sparse data structures for our core data table. it really shouldn't be much if at all larger than the actual image source; we're currently three orders of magnitude larger. it must be intensely sparse.

dankamongmen commented 2 years ago

so as a first step, simply converting to an array of linked lists (array has one list per sixel row) ought significantly cut down the total size. the only difficulty would be finding a colors' row while building it up. if we really need o(1) lookup there, just gather them at the end of a sixelrow's processing. we only need an additional word of state per sixelrow. in the best case we cut space to 1/colors + a word per sixelrow; in the worst case we add the word per sixelrow and cut nothing (this would be a pathological image).

dankamongmen commented 2 years ago

shitfire! rather than doing all that, let's just do a sixelrow at a time. at 1024 pixels across, that's an even meg for 1024 colors. hell, that'll likely all fit in LLC and never need be written out to memory at all. we just need to interleave build_data_table() with write_sixel_payload(), which will be a slight mess, but well worth it to never hit RAM! yessssssssssssssssssssssss.

dankamongmen / notcurses

chafa sixel-encodes AllRGB images substantially faster than notcurses #2573