genodelabs / genode

Genode OS Framework
https://genode.org/

Decreased packet throughput between block-file-system and part_block #4616

Closed. ttcoder closed this issue 1 year ago.

ttcoder commented 2 years ago

This is with Genode 21.08.

Still working on FUSE file-system support, optimizing for speed. It turns out my previous attempts at optimization were the smaller part of the problem: the lion's share of CPU cycles is spent in packet_stream.h, in support of a vfs plug-in like "<block name=/sda>":

Indeed, playing with GENODE_LOG_TSC I 'peeled the onion' until I got this:

[init -> part_block]  TSC submit: 3827K (10 calls, last 46124)
[init -> part_block]  TSC acknowledge_completed: 5025K (40 calls, last 15292)
[init -> basic-layers]  TSC     wait_for_signal(): 265M (10 calls, last 16192K)
[init -> basic-layers]  TSC    packet_stream.h : rx(): 269M (10 calls, last 18660K)
[init -> basic-layers]  TSC   get_acked_packet: 273M (10 calls, last 21366K)
[init -> basic-layers]  TSC _block_io: 313M (10 calls, last 27662K)

That seems to show that part_block is not at fault (it only spends cycles in the kilo range, at most), whereas wait_for_signal() amounts to mega- and giga-cycles all told. In the above snippet alone, _block_io() takes 313M to complete, of which 265M (the bulk of it) are spent waiting for the signal.
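As an aside, here is a minimal sketch of how I understand such numbers can be gathered by hand with Genode::Trace::timestamp(); the probe structure and names below are my own illustration, not the actual GENODE_LOG_TSC implementation:

/* hand-rolled TSC probe, for illustration only */
#include <base/log.h>
#include <base/stdint.h>
#include <trace/timestamp.h>

struct Tsc_probe
{
	char const       *name;
	Genode::uint64_t  total = 0;  /* accumulated cycles             */
	unsigned          calls = 0;  /* number of measured invocations */

	Tsc_probe(char const *name) : name(name) { }

	template <typename FN>
	void measure(FN const &fn)
	{
		Genode::Trace::Timestamp const start = Genode::Trace::timestamp();
		fn();
		Genode::Trace::Timestamp const last  = Genode::Trace::timestamp() - start;

		total += last;
		calls++;
		Genode::log("TSC ", name, ": ", total, " (", calls, " calls, last ", last, ")");
	}
};

Used, e.g., as Tsc_probe probe { "_block_io" }; probe.measure([&] { /* ... the block I/O ... */ });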

I'm in unfamiliar territory here, so I'm not sure what code is involved in between the two ends, i.e. between wait_for_signal() and part_block. EDIT: I also wasn't sure what to trace/log, as part_block implements an interface different from the 'packet' one I've been studying the past couple of days, so I picked submit() and acknowledge_completed() on a hunch.

Will try to get familiar with it and continue digging, but figured now would be a good time to post a ticket with an interim SitRep :-)

I looked at genodelabs/genode#2263 but I'm not sure whether any of it would explain the performance shortfall here.

EDIT: reproducibility-wise, luckily all the performance issues I'm seeing occur both on bare metal and in Qemu, so I'm in a good place: I can debug this in Qemu with fast turn-around times. Don't hesitate if you have ideas I can experiment with!

ttcoder commented 2 years ago

Note 1: the buffer count of block-fs is configurable, so I increased it from 1 to 128 buffers. It helps, but the speed-up tops out at around 40%.
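For reference, the vfs fragment I'm using looks roughly like this (the attribute name is quoted from memory; see block_file_system.h for the authoritative spelling):

<vfs>
	<dir name="dev">
		<!-- raise the block-buffer count from the default of 1 -->
		<block name="sda" block_buffer_count="128"/>
	</dir>
</vfs>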

Note 2: for reference, here is one of the sequences of read() calls issued by the ntfs driver (while mounting a partition):

[init -> basic-layers] read 1024 bytes, seek() = 174411776
[init -> basic-layers] read 1024 bytes, seek() = 174412800
[init -> basic-layers] read 4096 bytes, seek() = 174411776
[init -> basic-layers] read 4096 bytes, seek() = 8192
[init -> basic-layers] read 1024 bytes, seek() = 174417920
[init -> basic-layers] read 1024 bytes, seek() = 174422016
[init -> basic-layers] read 131072 bytes, seek() = 12288
[init -> basic-layers]  TSC   Catch: 347K (1 calls, last 347K)
[init -> basic-layers] read 1024 bytes, seek() = 174414848
[init -> basic-layers]  TSC   submit_packet: 12858K (10 calls, last 232K)
[init -> basic-layers]  TSC     wait_for_signal(): 205M (10 calls, last 13490K)
[init -> basic-layers]  TSC    packet_stream.h : rx(): 210M (10 calls, last 16239K)

jschlatow commented 2 years ago

@ttcoder Since you are working with release 21.08, I suspect you are witnessing the implicit blocking semantics of the packet-stream interface that has been deprecated with release 22.02. Please have a look at #4390.
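For readers unfamiliar with the interface, the round trip in question looks roughly like the following (a simplified sketch, not the actual block_file_system.h code; exact types and signatures vary between releases):

/* simplified, implicitly blocking packet-stream round trip (sketch only) */
#include <block_session/connection.h>

static void blocking_read(Block::Connection<> &block, Block::sector_t nr,
                          Genode::size_t count, Genode::size_t block_size)
{
	/* may block until space in the shared packet buffer is available */
	Block::Packet_descriptor const p {
		block.tx()->alloc_packet(count*block_size),
		Block::Packet_descriptor::READ, nr, count };

	block.tx()->submit_packet(p);

	/* implicit blocking: sleeps in wait_for_signal() until the server's ack arrives */
	Block::Packet_descriptor const ack = block.tx()->get_acked_packet();

	/* payload would be accessed via block.tx()->packet_content(ack) */
	block.tx()->release_packet(ack);
}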

ttcoder commented 2 years ago

Looking at the changes listed in #4390, I see that block_file_system.h was indeed adjusted -- though if that component is just an "innocent downstream bystander", this adjustment might not be the critical bit:

https://github.com/genodelabs/genode/commit/df2e7fa8421020e16e6883a9f5aef24e305905fd

The (much bigger) change to part_block is, presumably, the one that ought to improve performance in my scenario, by allowing part_block to respond more quickly to block-fs requests:

https://github.com/genodelabs/genode/commit/3995d2f4a20c07566237581cc38d0f388a638d6e

I have to upgrade to a newer Genode anyway, so I will do so in order to benefit from the above changes and report back. Thanks!

EDIT: I'm now set up for building 21.08 with the up-to-date toolchain; it feels good to compile with GCC 10 easily. Next week I'll catch up on Genode up to 22.x and come back to this ticket, fingers crossed.

ttcoder commented 2 years ago

I've successfully upgraded to Genode 22.02 and did a before-and-after comparison using a minimal scenario, without part_block or a vfs server. Unfortunately I'm still seeing high TSC counts in block_file_system.h, in both Genode revs:

Genode 21.11 <local changes>
[init -> basic-layers]  TSC _block_io: 7848M (1390 calls, last 3607K)
Genode 22.02 <local changes>
[init -> basic-layers]  TSC _block_io: 8552M (1390 calls, last 4306K)

That's on Qemu though -- I'm going to test on bare metal and look for a speed-up there before drawing conclusions and coming back to ask for help :-)

nfeske commented 2 years ago

Thanks @ttcoder for updating, so we can talk about the same code. Please note that fd9648f9192dbe24552a69f02d95cc4f3f78845f is likely still missing from your branch.

Please keep in mind that the _block_io call includes the I/O latency of the device. So the accumulated costs do not correspond to the consumed CPU cycles alone but also include the cycle time spent waiting for I/O. To get a grasp on the actual CPU consumption, you may include the top component in your scenario, which shows the stats of the CPU users every few seconds. For I/O-bound workloads, one would expect the idle threads to take the top spots.
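A start node roughly along these lines is usually enough (the RAM quantum and the period_ms attribute are from memory; see os/src/app/top):

<start name="top">
	<resource name="RAM" quantum="2M"/>
	<!-- print CPU-load statistics every few seconds -->
	<config period_ms="5000"/>
	<route>
		<service name="TRACE"> <parent/> </service>
		<any-service> <parent/> <any-child/> </any-service>
	</route>
</start>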

Looking at the numbers, do you think that the total of 1390 calls is reasonable given your test scenario? Does this value loosely correspond to the number of file-system operations issued? Or are we observing an unreasonable split of transactions into tiny operations due to the limits of one of the involved I/O buffers (e.g., at the file-system-session client or the block-session client)?

ttcoder commented 2 years ago

OK, I think this one might indeed be solved now that I've upgraded to 22.02 (I also applied the https://github.com/genodelabs/genode/commit/fd9648f9192dbe24552a69f02d95cc4f3f78845f patch to compare apples to apples, but after adding some tracing I see that the try's catch clause is almost never exercised anyway).

I tried on bare metal (T530) with my real-world jamfiles (which configure <block> with 256 cache blocks instead of the default of one), and despite the high TSC counts it takes just a few seconds (in most cases) for BFilePanel to display the file listing. That's without the patch from #4603! An impressive improvement already compared to the same real-world testing with 21.08. So indeed the TSC count does not map to speed in and of itself. I will also have to add 'top' to my scenarios, hopefully soon.

However, the upgrade to 22.02 brings a couple of new problems, though I suspect they will be solved by further upgrades to 22.05 and 22.08:

For usb_host, I get a few errors like this:

/GenCharlie/nimble_TMP_XPND/_d_hog_gcc10/genode/contrib/linux-b00fd4b8dc40ca35d3a9f8f27bfd7ec8c671bd4b/src/linux/include/linux/page-flags-layout.h:6:10: fatal error: generated/bounds.h: No such file or directory
    6 | #include <generated/bounds.h>

EDIT: I suspect the dde/linux build process might be using (and assuming the existence of) Linux headers outside of the contrib/ directory. I'm building on Haiku, hence no Linux header files. I will try to confirm and file a ticket (in case it's of interest, even to those who build in the normal Linux environment?)

Might come back and close this ticket in the not-too-distant future.

ttcoder commented 1 year ago

So there was some remaining slowness, but it turns out it had nothing to do with Genode!

The multi-second delay on file ops was due to a runaway thread executing my quick-and-dirty implementation of Hai-OS semaphores: namely, acquire_sem() was spending most of its time busy-looping in a while (error == EAGAIN) loop (sketched below). So the issue was not being I/O-bound but CPU-bound, with a "CPU hog" slowing everything down. Norman: I should have followed your advice about launching "top" sooner! When I finally did, last week, I saw "pthread.3" hogging 85% of the CPU in Qemu, activated pthread verbosity to figure out which thread that was, and zeroed in on acquire_sem() fairly quickly.

The BFilePanel class works great now: when I double-click a directory to drill down, it takes less than a second for the folder to start listing its contents, versus 5-10 seconds before. Any remaining optimization is now far down my todo list instead of being top priority. I'm a happy camper.
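For the record, the anti-pattern boils down to something like the following (an illustrative POSIX sketch, not the actual Hai-OS compatibility code):

#include <semaphore.h>

/* the CPU hog: burns a full core whenever the semaphore is contended */
int acquire_sem_spinning(sem_t *sem)
{
	while (sem_trywait(sem) != 0)
		;                       /* busy-loops as long as EAGAIN is returned */
	return 0;
}

/* a blocking variant: the thread sleeps inside the libc until a matching post */
int acquire_sem_blocking(sem_t *sem)
{
	return sem_wait(sem);       /* no polling, no wasted cycles */
}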

Closing.

nfeske commented 1 year ago

Thanks a lot for updating and closing the issue. It is always a relief when a mystery like this is solved.