this is a case ripe for threading, which chafa makes use of. i bet tossing some in could catch us up pretty quickly. we'll then want to revalidate on small, simple images to ensure we haven't added much overhead.
also, gcc12 introduced autovectorization at O2, from which we might benefit. if we don't, it's probably time to take a look at SIMDizing this shit ourselves via intrinsics. it's always good to have competition! even better to then blow said competition out of the water.
i'm thinking we ought be able to do a fixed-size three-dimensional pseudometric and eliminate the green preference in constant time. if the difference is beyond first order, it's barely going to show up anyway. there's no need to search the entire space.
right now we have constant time with a single dimension by checking high and low over the channel sum. expanding that into three dimensions would mean eight bounds instead of two, but all of them are in cache, and you could likely SIMDize the actual comparisons, thus eliminating any branches. hell, we could probably do that and improve our performance as it exists simply by reducing two comparisons to a single SIMD instruction and killing the branch. hot damn, pepper your angus!
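for concreteness, here's a minimal sketch of the branchless bounds test i have in mind, assuming SSE2; the lo/hi buffers and the function name are hypothetical, not anything currently in the tree:

```c
#include <emmintrin.h>  // SSE2
#include <stdbool.h>
#include <stdint.h>

// Hypothetical sketch: test an RGB triple against per-channel lo/hi bounds
// with packed compares and a movemask, rather than scalar compares and
// branches. 'lo' and 'hi' are assumed to be 16-byte buffers holding the
// bounds in their low three bytes.
static inline bool
in_bounds_sse2(const uint8_t rgb[3], const uint8_t lo[16], const uint8_t hi[16]){
  // place the pixel in the low three lanes (the rest are don't-cares)
  __m128i p = _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 0, rgb[2], rgb[1], rgb[0]);
  __m128i l = _mm_loadu_si128((const __m128i*)lo);
  __m128i h = _mm_loadu_si128((const __m128i*)hi);
  // unsigned "p >= l" is (max(p, l) == p); likewise "p <= h" is (min(p, h) == p)
  __m128i gel = _mm_cmpeq_epi8(_mm_max_epu8(p, l), p);
  __m128i leh = _mm_cmpeq_epi8(_mm_min_epu8(p, h), p);
  unsigned mask = _mm_movemask_epi8(_mm_and_si128(gel, leh));
  return (mask & 0x7u) == 0x7u; // low three lanes (r, g, b) all in range
}
```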
looking at a single-instance run over the folder, we get:
real 0m5.382s
user 0m7.287s
sys 0m0.939s
for chafa and
real 0m8.572s
user 0m8.030s
sys 0m0.423s
for notcurses
so yeah i think a good first step would be...let's see, we get 1024 RGBA pixels to a page. that's a 32x32 chunk or any equivalent rectangle. we'd want contiguous sections in memory to go to the same thread, so it just takes 1024 (or some multiple thereof, N). the next thread takes the next N. you can work out the starting pixel in O(1) from your pixel offset (we'll need to adapt this a bit for linestrides). threads grab chunks until it's done.
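a rough sketch of that chunk-grabbing scheme, assuming C11 atomics (the struct and names are mine, purely illustrative; linestride handling is elided as noted):

```c
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical sketch: N pixels per chunk, a shared atomic counter, and O(1)
// recovery of the starting coordinate from the chunk index.
#define CHUNK_PIXELS 1024  // one 4KB page of RGBA

struct chunker {
  atomic_int next;   // next chunk index to hand out
  int totalchunks;   // (lenx * leny + CHUNK_PIXELS - 1) / CHUNK_PIXELS
  int lenx;          // image width in pixels
};

// returns the chunk index claimed, or -1 if the image is exhausted;
// *y/*x receive the starting pixel of the chunk
static int
grab_chunk(struct chunker *c, int *y, int *x){
  int idx = atomic_fetch_add(&c->next, 1);
  if(idx >= c->totalchunks){
    return -1;
  }
  long startpix = (long)idx * CHUNK_PIXELS;
  *y = startpix / c->lenx;
  *x = startpix % c->lenx;
  return idx;
}
```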
we could spin up some worker threads ahead of time maybe, upon detecting sixel support, so they're ready by the time we actually get a sixel encode request. they can be applied to any sixel encoding, even multiple concurrent ones, so long as job info is present in the work queue. there are three phases for the purposes of a concurrency graph: (1) extracting colors (and updating the TAM) from the pixels, (2) refining/merging the color table down to the palette, and (3) building and writing out the sixel payload.
so it would be good to see how much of our time each phase is occupying. if a majority of it is refinement/merging, threads probably aren't going to get the job done (Amdahl's Law and all that). if it's 1 and 3, especially 3, they'll help very much.
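just to put a number on the Amdahl point (my paraphrase, nothing measured here): with parallelizable fraction p and n threads, the best-case speedup is 1 / ((1 - p) + p / n). so if refinement/merging were, say, a purely serial 40% of the wall time, even infinite threads on the other phases cap us at 1 / 0.4 = 2.5x.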
i'm inclined to think 8192 or 16384 are good chunk sizes (2048 or 4096 pixels). they don't trounce the entire L1 no matter what uarch properties it has (so long as it's 32KB+). we don't want an abundance of threads--the more threads, the more contention in phase 1.
then, within a thread, it would be a fine thing to vectorize at least qnode_keys(). i think that would be the most fruitful place to start applying SIMD, anyway.
shifting to -O3 brought negligible gains, so i don't think autovectorization is doing much for us. looks like we'll have to get in thar ourselves.
hrmmm, that's not going to be the best way to do it given how extract_color_table() currently works: it updates the TAM as it goes while feeding insert_color(). so we'd need to parallelize the TAM crap as well. which i think we can do -- there are no carried dependencies. but we'd need to arrange the workers differently; we'd want each one to do a cell at a time, or some multiple thereof.
so if we wanted to do that, i think the thing to do would be to factor out an extract_cell_color_table() which took the current args plus a cell's y/x coordinate within the image. this will update the TAM for the cell (or an NxM rectangle of cells with its origin at the specified cell) and insert all colors encountered (yes, we're sounding dangerously close to mosaics, aren't we?).
but overall this is good -- it means more of the total computation ought be parallelizable. remember, we have to do a lot more stuff than chafa, what with those other planes and such.
alright, let's get to it, you've got your marching orders!
so the first question -- can we eliminate the sixel-based loop in extract_color_table()? sixels ought not have any relevance here, and it would simplify things tremendously.
i've restructured extract_color_table() to be cell-based. it's working almost perfectly; we have a weird bit of regression in the [intro] demo, but only following the first move (i.e. the orca is first drawn correctly, but on subsequent draws to the left, she has some cruft above her dorsal fin).
while doing this, i found what i thought to be a bug (lastrow is being set based on it being the last row of a sixel, not a cell), and fixed it -- was i misunderstanding something, and is that the cause of the regression? let's get this ironed out before moving on.
oh ok, lastrow doesn't refer to the last row of a sixel, but the last sixel of the image, duh. and firstpix doesn't refer to the first pixel of the image, but of each cell.
oh, motherfucker, i bet i had the firstpix=false following a bail on transparency...yeah that's almost certainly it, motherfucker motherfucker
yep, works perfectly now, w00t w00t
alright, we now have an extract_color_table() which works on a per-cell basis. in addition to being suitable for parallelism, it's significantly more readable, with a full level less of looping.
hrmmm. initial profiling looks...not promising. the majority of our time seems to be spent in sixel_blit_inner(), especially write_sixel_payload() (at least for the AllRGB images). which makes sense; that's handling 4096*4096*1024, an immense 2^34 operations. ack! extract_color_table(), inlined, primarily shows up as untracked sixel_blit().
(actually, we spend more time in inflate_fast() than anything, presumably due to PNG decoding)
here's a profile for covid19.jpg and aidsrobots.jpg:
+ 57.90% 56.29% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload ◆
+ 20.42% 14.92% ncplayer libnotcurses-core.so.3.0.5 [.] sixel_blit ▒
+ 6.09% 0.00% ncplayer [kernel.kallsyms] [k] handle_mm_fault ▒
+ 6.09% 0.00% ncplayer [kernel.kallsyms] [k] __handle_mm_fault ▒
+ 5.78% 0.08% ncplayer [kernel.kallsyms] [k] asm_exc_page_fault ▒
+ 5.70% 0.00% ncplayer [kernel.kallsyms] [k] exc_page_fault ▒
+ 5.70% 0.00% ncplayer [kernel.kallsyms] [k] do_user_addr_fault ▒
+ 5.40% 5.30% ncplayer [kernel.kallsyms] [k] clear_page_rep ▒
+ 5.30% 0.00% ncplayer [kernel.kallsyms] [k] do_huge_pmd_anonymous_page ▒
+ 4.77% 0.00% ncplayer [kernel.kallsyms] [k] __alloc_pages ▒
+ 4.77% 0.10% ncplayer [kernel.kallsyms] [k] get_page_from_freelist ▒
+ 4.73% 0.00% ncplayer [kernel.kallsyms] [k] alloc_pages_vma ▒
+ 3.42% 0.00% ncplayer [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒
+ 3.42% 0.00% ncplayer [kernel.kallsyms] [k] do_syscall_64 ▒
2.27% 2.27% ncplayer libc-2.33.so [.] __vfprintf_internal
and AllRGB:
+ 37.88% 37.30% ncplayer libz.so.1.2.11 [.] inflate_fast
+ 22.40% 21.88% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload
+ 8.87% 6.60% ncplayer libnotcurses-core.so.3.0.5 [.] sixel_blit
+ 4.06% 0.06% ncplayer [kernel.kallsyms] [k] asm_exc_page_fault
+ 4.04% 0.01% ncplayer [kernel.kallsyms] [k] handle_mm_fault
+ 4.01% 0.01% ncplayer [kernel.kallsyms] [k] __handle_mm_fault
+ 3.99% 0.01% ncplayer [kernel.kallsyms] [k] exc_page_fault
+ 3.99% 0.00% ncplayer [kernel.kallsyms] [k] do_user_addr_fault
+ 3.81% 0.00% ncplayer [unknown] [.] 0000000000000000
+ 3.71% 0.00% ncplayer [kernel.kallsyms] [k] do_huge_pmd_anonymous_page
+ 3.70% 3.66% ncplayer [kernel.kallsyms] [k] clear_page_rep
+ 3.04% 0.01% ncplayer [kernel.kallsyms] [k] __alloc_pages
+ 3.03% 0.04% ncplayer [kernel.kallsyms] [k] get_page_from_freelist
+ 3.01% 0.00% ncplayer [kernel.kallsyms] [k] alloc_pages_vma
+ 2.71% 2.65% ncplayer libz.so.1.2.11 [.] adler32_z
+ 1.83% 0.00% ncplayer [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.83% 0.00% ncplayer [kernel.kallsyms] [k] do_syscall_64
1.50% 1.49% ncplayer libz.so.1.2.11 [.] inflate
+ 1.27% 0.82% ncplayer libc-2.33.so [.] __memmove_avx_unaligned_erms
+ 1.19% 0.00% ncplayer [kernel.kallsyms] [k] __irq_exit_rcu
+ 1.19% 0.01% ncplayer [kernel.kallsyms] [k] __softirqentry_text_start
+ 1.08% 0.00% ncplayer libpthread-2.33.so [.] __libc_write
so for the former case, infinite ideal parallelism in extract_color_table() can eliminate only 20% of runtime, and in the latter case, it's an even smaller ~9%. well, good thing we ran that. so the big wins are going to come from parallelizing write_sixel_payload(), where we could knock out ~60% of runtime in the former case and ~22% in the latter.
so, what is write_sixel_payload()? we iterate over each sixel row; within a row, over every color; and within each color, over that row's sixels, RLE-encoding runs as we go.
the data is stored as vectors for each color, with each vector equal in size to the total number of sixels. cache performance ought be pretty reasonable, then -- we're reading left to right for the duration of a sixel line (one byte per sixel, so 64 pixels == cacheline). worst case is horizontal width congruent to 1 modulo 64, in which case we're doing a cacheline fill of 1.5% efficiency, per color, per sixel row. when width is congruent to 0 modulo 64, cacheline fills are 100% efficient.
we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.
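to make the two layouts concrete, a quick sketch of the indexing (macro names are mine, just for illustration):

```c
#include <stddef.h>

/* Sketch of the two layouts. The current table stores one full-length vector
 * per color; the proposed one groups a sixel row's worth of every color
 * together, so walking a row touches a contiguous colors*lenx region instead
 * of strided slices. */
#define IDX_BY_COLOR(color, sixelidx, sixelcount) \
  ((size_t)(color) * (sixelcount) + (sixelidx))
#define IDX_BY_SIXELROW(sixelrow, color, x, colors, lenx) \
  (((size_t)(sixelrow) * (colors) + (color)) * (lenx) + (x))
```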
normally we would expect to make significant use of sixel RLE (not applicable to e.g. the AllRGB images, though). if we admitted dynamic lengths, we could pre-RLE this data. this helps us not at all in the worst case, though.
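for reference, sixel RLE is just the Graphics Repeat Introducer: '!' followed by a decimal count and the sixel character. a toy emitter (not the actual write_rle()) looks something like:

```c
#include <stdio.h>

// Sketch of sixel run-length emission: '!' + count + character. Short runs
// aren't worth the introducer (break-even is a run of 3).
static int
emit_run(FILE *f, int count, unsigned char sixelchar){
  if(count >= 3){
    return fprintf(f, "!%d%c", count, sixelchar) < 0 ? -1 : 0;
  }
  while(count-- > 0){
    if(fputc(sixelchar, f) == EOF){
      return -1;
    }
  }
  return 0;
}
```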
in the case where we have many colors (and thus the slowest case), it's unlikely that any given color will be in a sixel row. can we not preprocess these away? one bit per color per sixel row, set in the previous step, would allow us to skip the full width for that color. this might be a very nice, very easy pickup. try it. you could place these in their own bitvector for cache friendliness.
so my first thought would be parallelizing on sixel rows. a chunk is then width * colors data, each of size 1B. for 1024 colors, you hit an x86 page at 4 pixels wide (geez, no wonder we're going through so many pages -- this is a huge structure!). so a thread has plenty of work. each writes to their own internal buffer, and then you just lay those into the fbuf. ought be perfect parallelism.
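the work unit for that scheme might look something like this (hypothetical names, just a sketch):

```c
#include <stddef.h>

// Hypothetical sketch: each worker owns a band of sixel rows and encodes it
// into a private buffer; the buffers are later written into the fbuf in row
// order, so the emitted stream matches the serial encoder byte for byte.
struct payload_chunk {
  int firstrow;   // first sixel row owned by this worker
  int rowcount;   // number of sixel rows in the band
  char *buf;      // worker-private encoded output
  size_t used;    // bytes encoded so far
};
```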
alright. between these last two ideas, i think we can get somewhere. beyond that, i think we need to just shrink this elephantine structure down. too big! 819KB per sixel row for 800 pixels across @ 1024 colors -- too mas!
so if we wanted to add the 1-bit "color absent" signal, that's best done in build_data_table(), no?
so you need (colors * sixel rows) bits total for this map, right? yeah. initialize it to 0. whenever you load stab->map->data, set `map[sixelrow * colors + color]` to 1. a 1 here means we must inspect the sixel row for the color.
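something like this, as a sketch of the indexing (a bit-per-entry variant; names are mine, not the actual notcurses code):

```c
#include <stdlib.h>

// Hypothetical sketch of the "color absent" map: one bit per color per sixel
// row. build_data_table() sets a bit whenever it writes data for that
// (sixelrow, color); write_sixel_payload() can then skip the whole row width
// for any color whose bit is clear.
static inline unsigned char*
actionmap_create(int colors, int sixelrows){
  size_t bits = (size_t)colors * sixelrows;
  return calloc((bits + 7) / 8, 1);
}

static inline void
actionmap_set(unsigned char *map, int colors, int sixelrow, int color){
  size_t bit = (size_t)sixelrow * colors + color;
  map[bit / 8] |= 1u << (bit % 8);
}

static inline int
actionmap_test(const unsigned char *map, int colors, int sixelrow, int color){
  size_t bit = (size_t)sixelrow * colors + color;
  return (map[bit / 8] >> (bit % 8)) & 1u;
}
```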
so you know we've really built a three-pass encoder, not a two-pass. can we not integrate these two pieces of shit?
ok, the actionmap proposed above is looking good initially, about a 20% pickup on AllRGB. need to flush out a bug though, so who knows, maybe it actually makes us 30000 times slower.
> we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.
this is probably an easy pickup as well, and it would imply 100% cachefill efficiency for all but the last color in a given sixelrow. almost certainly a win. let's get down and do the do.
> ok, the actionmap proposed above is looking good initially, about a 20% pickup on AllRGB. need to flush out a bug though, so who knows, maybe it actually makes us 30000 times slower.
fixed the bug. gainz went away, but we've got some bullshit going on, along the lines of, say, 7194 writes to the actionmap for the first sixel row...no, that's right, we're 1199px wide * 6 == 7194. oh, duh, i had enlarged my terminal. ok.
before, allrgb: 14.192, 14.164, 14.231
after, allrgb: 12.697, 12.708, 12.312
37.717 / 42.587 == 88.56%, so about an 11% pickup for allrgb.
before, aids+covid: 0.789, 0.798, 0.778
after, aids+covid: 0.712, 0.672, 0.683
2.067 / 2.365 == 87.40%, so a similar win there.
ok! well it's a start! we got some hats now, motherfuckers! we shall continue.
ok, actiontable is committed. let's pump up the cache efficiency for write_sixel_payload() by striping by sixelrows rather than colors. i've got a good feeling there.
reducing the actionmap from a byte per entry to a bit per entry had negligible impact. let's go ahead and do it anyway, just to save the memory, but no perf change results.
you know, looking at this, a lot of our delay seems to be coming not from drawing or computing, but scrolling. i suspect if we scrolled prior to drawing (see #2361) we could cut a big chunk (approaching real - user - sys), as that's just XTerm playing with itself.
ok, imma take a walk and go buy cheap Publix newports. for when i get back, here's a rollup gameplan:
- ~write_sixel_payload() via changing map layout~
- write_sixel_payload()?
- write_sixel_payload()
some work of noble note may yet be done, not unbecoming men who strove with Gods!
it occurs to me that ncls ought be significantly faster than ncplayer for a bunch of images like this. and indeed, indeed it is. it handles the AllRGB directory in about 2s. not quite relevant, but it did make me happy to see.
hrmmm not much of a pickup from changing the map layout -- even in a case chosen for it (width % 64 was large, terminal was large), any improvement was lost in noise.
equations for layout change:
- size_t dsize = sizeof(*stab->map->data) * colors * stab->map->sixelcount;
+ size_t dsize = sizeof(*stab->map->data) * colors * lenx * ((leny + 5) / 6);
and
+ long mapidx = sixelrow * map->colors * lenx + i * lenx;
in place of i * map->sixelcount + m.
oh and
- stab->map->data[cidx * stab->map->sixelcount + pos] |= (1u << (sy - visy));
+ const long mapidx = sixelrow * colors * lenx + cidx * lenx + (visx - begx);
+//fprintf(stderr, "MAPIDX WRITE: %ld\n", mapidx);
+ stab->map->data[mapidx] |= (1u << (sy - visy));
> - size_t dsize = sizeof(*stab->map->data) * colors * stab->map->sixelcount;
> + size_t dsize = sizeof(*stab->map->data) * colors * lenx * ((leny + 5) / 6);
hrmmm, oughtn't these be the same? wasn't that the whole point -- it's all just a layout change?
sizeof(*stab->map->data) * colors * stab->map->sixelcount;
yeah, killing this diff leaves everything unchanged
oh hrmm, our profile shape has actually changed a good bit from earlier work. here's allrgb/balloon.png:
+ 42.52% 42.52% ncplayer libz.so.1.2.11 [.] inflate_fast ◆
+ 9.96% 7.52% ncplayer libnotcurses-core.so.3.0.5 [.] sixel_blit ▒
9.82% 9.82% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload ▒
+ 5.61% 0.27% ncplayer [kernel.kallsyms] [k] asm_exc_page_fault ▒
+ 5.32% 0.00% ncplayer [kernel.kallsyms] [k] handle_mm_fault ▒
+ 5.31% 0.00% ncplayer [kernel.kallsyms] [k] exc_page_fault ▒
+ 5.31% 0.03% ncplayer [kernel.kallsyms] [k] do_user_addr_fault ▒
+ 5.30% 0.01% ncplayer [kernel.kallsyms] [k] __handle_mm_fault ▒
+ 4.71% 4.68% ncplayer [kernel.kallsyms] [k] clear_page_rep ▒
+ 4.61% 0.00% ncplayer [unknown] [.] 0000000000000000 ▒
+ 4.17% 0.00% ncplayer [kernel.kallsyms] [k] do_huge_pmd_anonymous_page ▒
+ 4.03% 0.03% ncplayer [kernel.kallsyms] [k] __alloc_pages ▒
+ 4.00% 0.17% ncplayer [kernel.kallsyms] [k] get_page_from_freelist ▒
+ 3.96% 0.00% ncplayer [kernel.kallsyms] [k] alloc_pages_vma ▒
+ 2.74% 2.74% ncplayer libz.so.1.2.11 [.] adler32_z ▒
+ 2.42% 0.00% ncplayer [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒
+ 2.42% 0.00% ncplayer [kernel.kallsyms] [k] do_syscall_64 ▒
+ 2.13% 0.63% ncplayer libc-2.33.so [.] __memmove_avx_unaligned_erms ▒
+ 2.11% 0.00% ncplayer [unknown] [.] 0x000055f02ff91770 ▒
2.07% 2.07% ncplayer libz.so.1.2.11 [.] inflate ▒
+ 1.73% 1.73% ncplayer libz.so.1.2.11 [.] inflate_table ▒
+ 1.41% 0.00% ncplayer [unknown] [.] 0x000055f02ff916d8 ▒
+ 1.29% 0.00% ncplayer libpthread-2.33.so [.] __libc_write ▒
+ 1.29% 0.00% ncplayer [kernel.kallsyms] [k] ksys_write ▒
+ 1.29% 0.00% ncplayer [kernel.kallsyms] [k] vfs_write ▒
+ 1.29% 0.00% ncplayer [kernel.kallsyms] [k] new_sync_write ▒
+ 1.29% 0.00% ncplayer [kernel.kallsyms] [k] file_tty_write.constprop.0 ▒
+ 1.26% 0.93% ncplayer [kernel.kallsyms] [k] n_tty_write ▒
1.22% 1.22% ncplayer libswscale.so.6.5.100 [.] 0x0000000000020918 ▒
+ 1.22% 0.00% ncplayer libswscale.so.6.5.100 [.] 0x00007f7111308918 ▒
+ 1.11% 0.03% ncplayer [kernel.kallsyms] [k] clear_huge_page ▒
+ 1.02% 0.09% ncplayer libc-2.33.so [.] __memset_avx2_unaligned_erms
so we're now burning a decent amount of time, relatively, in sixel_blit(), which means extract_color_table(), i'm pretty sure...yeah, pulling it out of the inline yields:
+ 41.91% 41.91% ncplayer libz.so.1.2.11 [.] inflate_fast
9.48% 9.48% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload
+ 7.55% 7.52% ncplayer libnotcurses-core.so.3.0.5 [.] extract_color_table
+ 5.77% 0.21% ncplayer [kernel.kallsyms] [k] asm_exc_page_fault
+ 5.60% 0.00% ncplayer [kernel.kallsyms] [k] handle_mm_fault
+ 5.53% 0.00% ncplayer [kernel.kallsyms] [k] exc_page_fault
+ 5.53% 0.00% ncplayer [kernel.kallsyms] [k] do_user_addr_fault
+ 5.49% 0.06% ncplayer [kernel.kallsyms] [k] __handle_mm_fault
+ 5.18% 0.00% ncplayer [unknown] [k] 0000000000000000
so extract_color_table() isn't as heavy as the entirety of write_sixel_payload(), but it's substantial. we'll likely end up parallelizing both of them.
as for what's taking all the time in write_sixel_payload():
Percent│ ↓ jge 22d
│ mov 0x4(%rsp),%ebx
│ mov 0x8(%r14),%rsi
│ unsigned char crle = 0; // character being repeated
│ xor %r8d,%r8d
│ int seenrle = 0; // number of repetitions
│ xor %ecx,%ecx
│ if(write_rle(&printed, i, f, seenrle, crle, &needclosure)){
│ lea 0x18(%rsp),%r12
│ ↓ jmp e8
│ nop
│ if(map->data[i * map->sixelcount + m] == crle){
│ d0: cmp %al,%r8b
0.64 │ ↓ jne 1b0
│ ++seenrle;
│ add $0x1,%ecx
│ for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
2.57 │ dc: add $0x1,%ebx
0.64 │ cmp %edx,%ebx
1.28 │ ↓ jge 10a
│ e3: cmp %r15d,%ebx
0.64 │ ↓ je 10a
│ if(map->data[i * map->sixelcount + m] == crle){
│ e8: mov %edx,%eax
│ imul %r13d,%eax
29.58 │ add %ebx,%eax
│ cltq
0.96 │ movzbl (%rsi,%rax,1),%eax
│ if(seenrle){
52.73 │ test %ecx,%ecx
│ ↑ jne d0
│ for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
│ add $0x1,%ebx
│ if(map->data[i * map->sixelcount + m] == crle){
│ movzbl %al,%r8d
│ seenrle = 1;
│ mov $0x1,%ecx
│ for(int m = p ; m < map->sixelcount && m < p + lenx ; ++m){
│ cmp %edx,%ebx
│ ↑ jl e3
│ if(crle){
│10a: test %r8b,%r8b
0.64 │ ↓ jne 1f0
│ needclosure = needclosure | printed;
0.32 │ mov 0x1c(%rsp),%eax
│ mov (%r14),%edi
│ for(int i = 0 ; i < map->colors ; ++i){
│11a: add $0x1,%r13d
│ needclosure = needclosure | printed;
│ or %eax,0x18(%rsp)
│ for(int i = 0 ; i < map->colors ; ++i){
│ cmp %r13d,%edi
│ ↑ jg 75
│12b: mov %r14,%rax
│ mov %r15d,%r14d
over a set of smaller images from data, the profile isn't too surprising:
27.45% 27.39% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload ◆
+ 19.59% 19.54% ncplayer libnotcurses-core.so.3.0.5 [.] extract_color_table ▒
+ 17.61% 2.23% ncplayer libc-2.33.so [.] __memset_avx2_unaligned_erms ▒
+ 15.87% 0.01% ncplayer [kernel.kallsyms] [k] handle_mm_fault ▒
+ 15.86% 0.04% ncplayer [kernel.kallsyms] [k] __handle_mm_fault ▒
+ 15.47% 0.05% ncplayer [kernel.kallsyms] [k] asm_exc_page_fault ▒
+ 15.41% 0.00% ncplayer [kernel.kallsyms] [k] exc_page_fault ▒
+ 15.41% 0.00% ncplayer [kernel.kallsyms] [k] do_user_addr_fault ▒
+ 15.13% 0.01% ncplayer [kernel.kallsyms] [k] do_huge_pmd_anonymous_page ▒
+ 14.96% 14.88% ncplayer [kernel.kallsyms] [k] clear_page_rep ▒
+ 11.98% 0.04% ncplayer [kernel.kallsyms] [k] __alloc_pages ▒
+ 11.93% 0.06% ncplayer [kernel.kallsyms] [k] get_page_from_freelist ▒
+ 11.86% 0.01% ncplayer [kernel.kallsyms] [k] alloc_pages_vma ▒
+ 3.61% 0.09% ncplayer [kernel.kallsyms] [k] clear_huge_page ▒
+ 3.36% 0.00% ncplayer [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒
+ 3.36% 0.00% ncplayer [kernel.kallsyms] [k] do_syscall_64 ▒
+ 2.10% 0.00% ncplayer libswscale.so.6.5.100 [.] 0x00007f40dd051918 ▒
2.09% 2.09% ncplayer libswscale.so.6.5.100 [.] 0x0000000000020918 ▒
2.00% 2.00% ncplayer libc-2.33.so [.] __vfprintf_internal ▒
+ 1.96% 0.01% ncplayer libpthread-2.33.so [.] __libc_write ▒
+ 1.95% 0.00% ncplayer [kernel.kallsyms] [k] ksys_write ▒
+ 1.94% 0.00% ncplayer [kernel.kallsyms] [k] vfs_write ▒
+ 1.94% 0.00% ncplayer [kernel.kallsyms] [k] new_sync_write ▒
+ 1.94% 0.01% ncplayer [kernel.kallsyms] [k] file_tty_write.constprop.0 ▒
+ 1.89% 1.54% ncplayer [kernel.kallsyms] [k] n_tty_write
goddamn that's a lot of soft page faults
here's an interesting result: when i don't stretch the images to occupy the full terminal (generally yielding far smaller images), write_sixel_payload() drops far, far below extract_color_table(), which begins to rudely dominate. makes sense. so takeaways:
- improving extract_color_table() ought improve all cases
- write_sixel_payload() is strongly coupled to pixel count
+ 25.80% 25.80% ncplayer libnotcurses-core.so.3.0.5 [.] extract_color_table ◆
+ 21.11% 0.00% ncplayer [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒
+ 21.11% 0.00% ncplayer [kernel.kallsyms] [k] do_syscall_64 ▒
+ 13.99% 13.99% ncplayer libz.so.1.2.11 [.] inflate_fast ▒
+ 13.41% 0.00% ncplayer [kernel.kallsyms] [k] vm_mmap_pgoff ▒
+ 13.15% 0.00% ncplayer libc-2.33.so [.] __mmap ▒
+ 12.87% 0.00% ncplayer [kernel.kallsyms] [k] __mm_populate ▒
+ 12.87% 0.00% ncplayer [kernel.kallsyms] [k] populate_vma_page_range ▒
+ 12.87% 0.44% ncplayer [kernel.kallsyms] [k] __get_user_pages ▒
+ 12.54% 0.00% ncplayer [kernel.kallsyms] [k] handle_mm_fault ▒
+ 12.40% 1.36% ncplayer [kernel.kallsyms] [k] __handle_mm_fault ▒
+ 9.67% 0.00% ncplayer [unknown] [k] 0000000000000000 ▒
7.39% 7.23% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload ▒
+ 7.10% 0.99% ncplayer [kernel.kallsyms] [k] alloc_pages_vma ▒
+ 6.24% 0.49% ncplayer [kernel.kallsyms] [k] __alloc_pages ▒
+ 5.44% 0.35% ncplayer [kernel.kallsyms] [k] get_page_from_freelist ▒
+ 4.39% 4.39% ncplayer [kernel.kallsyms] [k] clear_page_rep ▒
+ 4.04% 3.92% ncplayer libnotcurses-core.so.3.0.5 [.] paint
btw, welcome back, paint(); for the first time, we're seeing the non-bitmap elements of the render core on profiles.
yeah, let's go back to the plan of parallelizing extract_color_table() first.
alright, i've got extract_color_table() ready to go for multiple threads. i've gone ahead and inserted all necessary locking, blocking, and signaling. it all added negligible time in the singly-threaded case. let's see what happens when we let 32 cores get at it...
alright, threaded it out. results are poor so far, but this is strictly PoC with very coarse locking. we're running immediately into contention walls, as expected. the firmament is now there, though. i'm going to sleep for a bit. we'll develop upon this theme tomorrow. well, today, but later.
alright. i've now got sixel's extract_color_table() arbitrarily data-parallelized, and it works with any number of threads, so far as i can tell. the performance curves i'm getting are disheartening; 2 threads are slower than 1, and in general N+1 are slower than N. we've almost certainly got gross contention (this seems self-evident, though i ought go ahead and prove it with profiling).
i mean, we've got a single mutex in the qstate, and we hold it across the length of insert_color() and while picking a new workchunk. workchunk selection was originally lock-free, but that wasn't behaving as expected -- i need to look at that and figure out what the problem was (we had an atomic long chunkstaken on which this turned). we'll need either more granular locking in insert_color(), or a lockfree solution (maybe possible, maybe not).
so that's disheartening. but! this whole effort has massively cleaned up and improved this code, and i think we'll be able to get somewhere with the parallelism quickly now.
also, about those soft page faults...
ALLOCING 40
ALLOCING 24288
ALLOCING 8192
ALLOCING 4800
ALLOCING 10560
ALLOCING 178112000 <-----------
ALLOCING 2640
ALLOCING 25392
we're suffering a 178M allocation on a 111K image (this is coupled to the scaled size). 178M is 46K pages at 4K per page. grotesque. still, i'm surprised that this shows up so prominently in our profiles....
mutrace: Showing 4 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
9 2743279 232173 231205 259.141 0.000 0.014 Mx.--.
10 83 2 0 0.006 0.000 0.001 M-.--.
5 7 2 0 0.002 0.000 0.001 Mx.--.
1 3 2 0 0.011 0.004 0.005 M-.--.
(flags legend: Object: M = Mutex, W = RWLock; State: x = dead, ! = inconsistent; Use: R = used in realtime thread; Mutex Type: r = RECURSIVE, e = ERRORCHECK, a = ADAPTIVE; Mutex Protocol: i = INHERIT, p = PROTECT; RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC)
yep, that's 259ms of contention (btw, mutrace is fantastic shit!)
> we're suffering a 178M allocation on a 111K image (this is coupled to the scaled size). 178M is 46K pages at 4K per page. grotesque. still, i'm surprised that this shows up so prominently in our profiles....
you know, quad-channel 3970x@3000 ought have about 90GB/s of memory bandwidth, so 178MB would be expected to take about 2ms to move, which is quite plausible here (assuming this ever actually goes out to memory, which at least a significant portion of it surely does). yeah, we really want to shrink that allocation. let's think of a good sparse solution. we're already using the action bitmap -- what if we only allocated for entries in the action map? that could cut things down by two orders of magnitude easy.
so with regards to more granular locking on insert_color(), i don't think we can do the front end locklessly. we need a lock per N elements, where N=1 for the most granular locking. the backend, though (any path that pulls down a new qnode), can probably be done locklessly, which is good since that needs be shared across the object.
soooooooo let's start with, hell, a lock per static qnode, and see where that takes us.
ok, i've now got active parallelism, but barely. still massive contention on the _free counters. we've got to make those lockless, or we're pretty much done here.
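the direction i have in mind for those counters, sketched with C11 atomics (names are mine, not the actual notcurses fields):

```c
#include <stdatomic.h>

// Hypothetical sketch: claim a slot from a shared free counter without a
// mutex. Assumes 'free_remaining' counts unclaimed qnodes; returns the index
// claimed, or -1 if none remain.
static int
claim_qnode(atomic_int *free_remaining){
  int old = atomic_load_explicit(free_remaining, memory_order_relaxed);
  while(old > 0){
    if(atomic_compare_exchange_weak_explicit(free_remaining, &old, old - 1,
                                             memory_order_acq_rel,
                                             memory_order_relaxed)){
      return old - 1; // claimed slot
    }
  }
  return -1;
}
```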
yeah we're churning locks hard:
50.41% ncplayer [kernel.kallsyms] [k] native_queued_spin_lock_slow
11.39% ncplayer libz.so.1.2.11 [.] inflate_fast
6.35% ncplayer libpthread-2.33.so [.] __pthread_mutex_lock
2.60% ncplayer [kernel.kallsyms] [k] delay_halt_mwaitx
2.02% ncplayer [kernel.kallsyms] [k] futex_wake
1.94% ncplayer libpthread-2.33.so [.] __pthread_mutex_unlock_userc
1.93% ncplayer libnotcurses-core.so.3.0.5 [.] write_sixel_payload
1.79% ncplayer libnotcurses-core.so.3.0.5 [.] qstate_work
alright, i want to switch away from that for a minute, and consider sparse data structures for our core data table. it really shouldn't be much if at all larger than the actual image source; we're currently three orders of magnitude larger. it must be intensely sparse.
so as a first step, simply converting to an array of linked lists (array has one list per sixel row) ought significantly cut down the total size. the only difficulty would be finding a color's row while building it up. if we really need O(1) lookup there, just gather them at the end of a sixelrow's processing. we only need an additional word of state per sixelrow. in the best case we cut space to 1/colors + a word per sixelrow; in the worst case we add the word per sixelrow and cut nothing (this would be a pathological image).
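something like this, structurally (field names are mine):

```c
// Rough sketch of the sparse layout: one list per sixel row, each node
// holding the band for a color that actually shows up in that row. Rows
// where a color is absent cost nothing.
struct rowcolor {
  int color;               // palette index
  unsigned char *band;     // lenx bytes of sixel data for this color/row
  struct rowcolor *next;
};

struct sparse_table {
  int sixelrows;
  struct rowcolor **rows;  // one (possibly empty) list per sixel row
};
```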
shitfire! rather than doing all that, let's just do a sixelrow at a time. at 1024 pixels across, that's an even meg for 1024 colors. hell, that'll likely all fit in LLC and never need be written out to memory at all. we just need to interleave build_data_table() with write_sixel_payload(), which will be a slight mess, but well worth it to never hit RAM! yessssssssssssssssssssssss.
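the interleaved driver would look roughly like this (the per-row builder and emitter are hypothetical stand-ins for the refactored build_data_table()/write_sixel_payload()):

```c
#include <string.h>

// hypothetical per-row builder/emitter -- not actual notcurses functions
void build_data_row(unsigned char *rowbuf, int sixelrow);
void write_sixel_row(const unsigned char *rowbuf, int sixelrow);

// Sketch: build one sixel row's worth of data, emit it, reuse the buffer.
// At 1024 colors and 1024 pixels across, rowbuf is an even meg, small enough
// to stay resident in LLC.
static void
encode_by_sixelrow(unsigned char *rowbuf, int colors, int lenx, int sixelrows){
  for(int sixelrow = 0 ; sixelrow < sixelrows ; ++sixelrow){
    memset(rowbuf, 0, (size_t)colors * lenx); // reset this row's bands
    build_data_row(rowbuf, sixelrow);         // phase 2, this row only
    write_sixel_row(rowbuf, sixelrow);        // phase 3, this row only
  }
}
```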
In the course of stunting on https://github.com/hpjansson/chafa/issues/27, I discovered that chafa 1.8.0 renders AllRGB/balloon.jpg in about half the time of ncplayer on my 3970X. Representative timings: [chafa timings] vs [ncplayer timings].
in addition, the chafa output is slightly better -- look at the purple sections, and their borders with the blue sections to their right. we have some weird green pixels in there, unlike chafa.
ncplayer: [screenshot]
chafa: [screenshot]
note that this is on amd64, where chafa AFAIK uses hand-coded SIMD. i'm not sure if that's true on other architectures, but who cares; i tolerate defeat on no platform. well, i suppose there's only one thing to do now. THERE CAN BE ONLY ONE