NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Using ZFS on desktop with stock kernel is bad experience #169457

Open poelzi opened 2 years ago

poelzi commented 2 years ago

Describe the bug

ZFS on a desktop system with the default kernel, which is compiled with PREEMPT_VOLUNTARY, causes terrible lag, short hangs, and very bad realtime behaviour. This is easy to see with jackd and mixxx, for example.

If the kernel is compiled with these changes, the system behaves much better:

boot.kernelPatches = [ {
        name = "enable RT_FULL";
        patch = null;
        extraConfig = ''
            PREEMPT y
            PREEMPT_BUILD y
            PREEMPT_VOLUNTARY n
            PREEMPT_COUNT y
            PREEMPTION y
            '';
     } ];

Steps To Reproduce

Steps to reproduce the behavior:

  1. Do any ZFS file io
  2. Run mixxx + jackd for example
  3. Observe the stuttering and underruns

Expected behavior

Behaviour more similar to other filesystems

Additional context

Upstream ticket: https://github.com/openzfs/zfs/issues/13128

Notify maintainers

@wizeman @hmenke @jcumming @jonringer @fpletz @globin

Metadata

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.16.20, NixOS, 21.11 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4`
 - channels(poelzi): `"home-manager-21.11, nixos-21.05.4726.530a53dcbc9"`
 - channels(root): `"nixos-21.11.335665.0f316e4d72d"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
Mindavi commented 2 years ago

Hmm, so according to the flag documentation this is recommended for desktop usage:

"Select this if you are building a kernel for a desktop system."

https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re152.html

poelzi commented 2 years ago

Yes. I'm not sure how much my nvidia card also plays a role in this mess of a PC, but switching to a different preemption setting just feels so much smoother. We should at least provide a Linux kernel derivative with desktop settings and warn the user if ZFS is enabled with the default kernel, or document how to switch kernels.

poelzi commented 2 years ago

Using the rt kernel is not always an option, unfortunately. The nvidia driver doesn't like it, and the open driver is just not good enough on hidpi multi-monitor setups. When I need super low latencies, I use cpuset cgroups to isolate one core and disable hyperthreading on it, then move the audio thread there. This is good enough ;)
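The isolation approach described above can be sketched in NixOS configuration. This is a hedged illustration using kernel boot parameters rather than poelzi's cpuset cgroups; the core index is an assumption:

```nix
{
  # Sketch only: reserve CPU core 3 for latency-sensitive work.
  # "isolcpus" keeps the general scheduler off the core and
  # "nohz_full" reduces timer ticks on it.
  boot.kernelParams = [ "isolcpus=3" "nohz_full=3" ];
}
```

The audio process can then be pinned to the isolated core, e.g. with `taskset -c 3 <command>`.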

poelzi commented 2 years ago

@Mindavi https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re153.html is the better option

Shawn8901 commented 2 years ago

I am also super interested in having a responsive system with ZFS. When my system gets over some level of load, the amount of short lags and sound stuttering (especially while recording) increases heavily. Before Nix I was on an Arch install, and at least I hadn't noticed similar things there. I've got a 2700X in this machine, which should handle my workloads very easily. Sadly, I have not (yet) taken a deep dive into the issue. At least from what I could tell when I tried it, the zen kernel produced less stuttering, or at least I noticed less of it (but as said, I haven't done benchmarks or anything to measure it).

But at some point I switched back, as I currently don't understand how to keep the ZFS modules and the zen kernel in sync, and the zfs module has a nice variable for pointing to compatible kernel versions, which is useful for newbies like me. :)

hmenke commented 2 years ago

Ubuntu's kernel, which officially supports ZFS, is compiled with the same options.

$ uname -a
Linux ubuntu 5.13.0-39-generic #44~20.04.1-Ubuntu SMP Thu Mar 24 16:43:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ awk '/^#/ { next } /PREEMPT/ { print }' /boot/config-5.13.0-39-generic 
CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640

Are you sure it's not something else on your system?

Of course this is anecdotal, but I have not had latency problems running ZFS on a desktop (with sufficient RAM!). But then again, I'm not doing realtime audio either.

Shawn8901 commented 2 years ago

For me personally, I would not bet on it being exactly that setting. I noticed sound stuttering in situations of higher IO load combined with higher memory pressure (e.g. ~30% free memory on a system with 32G), which I did not have on my old installation. So that's just an observation on my side, but it sounds similar to what the OP described in the upstream ticket.

But for me the combination of memory pressure AND IO load is needed for that to happen, which differs a bit from the OP's description in detail, or rather adds an additional layer which may be the root cause.

Sadly, I haven't yet had the time for a kernel compilation with the suggested settings to check whether that also "fixes" my issue, or whether it's something different, but a system which freaks out on IO while still having "just" 10G of RAM free sounds like a bad experience to me... :)

bryanasdev000 commented 2 years ago

> For me personally, I would not bet on it being exactly that setting. I noticed sound stuttering in situations of higher IO load combined with higher memory pressure (e.g. ~30% free memory on a system with 32G), which I did not have on my old installation. So that's just an observation on my side, but it sounds similar to what the OP described in the upstream ticket.
>
> But for me the combination of memory pressure AND IO load is needed for that to happen, which differs a bit from the OP's description in detail, or rather adds an additional layer which may be the root cause.
>
> Sadly, I haven't yet had the time for a kernel compilation with the suggested settings to check whether that also "fixes" my issue, or whether it's something different, but a system which freaks out on IO while still having "just" 10G of RAM free sounds like a bad experience to me... :)

You can try the Liquorix, Xanmod or Zen kernel patch sets; they are all available in nixpkgs. Note that support from ZFS or NVIDIA may lag a bit.

I run ZFS on some of my machines, mainly with Xanmod or Zen, and I do not see any lag, except when I have heavy IO operations on some old HDDs.

Shawn8901 commented 2 years ago

> You can try the Liquorix, Xanmod or Zen kernel patch sets; they are all available in nixpkgs. Note that support from ZFS or NVIDIA may lag a bit.

As mentioned in my very first post, I did try Zen at some point, and that gave a much better experience. But I switched back to stock, as I find config.boot.zfs.package.latestCompatibleLinuxPackages very handy for ensuring a compatible kernel is installed. And that is sadly not Zen.

How do you ensure that the ZFS modules are at a compatible version, or is that something I don't have to care about too much? As I am using ZFS on root, it's crucial that it works in the end.

About the user experience under IO: when scrubbing kicks in (which is of course high IO, and some lag is expected), the system is not even usable any longer, as it's mostly unresponsive. The zfs pool is hosted on a Samsung SSD 860 EVO, which is not exactly some old HDD.

Not sure if it's really the same "issue" poelzi described, as this is not really a near-RT scenario, and I don't want to take over the issue here with my "problems", but short feedback on how you properly keep the kernel matching ZFS would be appreciated.

edit: I now understand how the kernelPackages are tied to the kernel in NixOS, so the question is now cleared up.
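For reference, the variable mentioned above can be wired up like this; a minimal sketch (the same option also appears later in the thread):

```nix
{ config, ... }:
{
  boot.supportedFilesystems = [ "zfs" ];
  # Select the newest kernel the nixpkgs ZFS module is known to build against.
  boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
}
```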

jcumming commented 2 years ago

Scheduling under load is a difficult problem to solve.

Can you isolate where the block latency is coming from? `iostat -x` and `zpool iostat -vl` are good debugging tools for identifying whether the latency comes from the kernel or the device.

Shawn8901 commented 1 year ago

I somehow forgot about this issue and was talking about it today. I nailed it down quite clearly to write IO from sending ZFS datasets around my network. When the affected PC is the sender, everything works fine; as soon as it is the receiver, sound sometimes begins to stutter.

The iostats look like the following:

Every 2.0s: zpool iostat -vl                                                                                                                                               pointalpha: Tue Feb 14 18:09:27 2023

                                                       capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                                 alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rpool                                                 461G   467G     28     82   535K  4.89M    1ms  163ms  384us  754us  197us     1s    4ms   12ms    2ms      -
  ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K403903D-part2   461G   467G     28     82   535K  4.89M    1ms  163ms  384us  754us  197us     1s    4ms   12ms    2ms      -
---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
Every 2.0s: iostat -x                                                                                                                                                      pointalpha: Tue Feb 14 18:11:02 2023

Linux 6.1.7-xanmod1 (pointalpha)        02/14/23        _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.00    0.01    3.80    0.19    0.00   77.01

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda             30.66    540.40     0.06   0.20    0.45    17.63   84.26   5212.12     0.79   0.93    0.57    61.86    0.00      0.00     0.00   0.00    0.00     0.00    1.21    2.89    0.07   3.56
sr0              0.00      0.00     0.00   0.00    6.00     2.22    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
zd0              0.01      0.25     0.00   0.00    0.13    27.62    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
zd16             0.01      0.18     0.00   0.00    0.08    24.87    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

I am absolutely no expert in reading this, but I see a high f_await in iostat and a high syncq_wait in zpool iostat. When another machine is the receiver (where I sadly cannot test the behaviour, as it's a server and I don't know how to verify it there), those two numbers are a lot lower. But I am not sure how to interpret the numbers, to be honest. Judging from %util, the device should be chilling.

IvanVolosyuk commented 1 year ago
> Every 2.0s: zpool iostat -vl                                                                                                                                               pointalpha: Tue Feb 14 18:09:27 2023
> 
>                                                        capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
> pool                                                 alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
> ---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
> rpool                                                 461G   467G     28     82   535K  4.89M    1ms  163ms  384us  754us  197us     1s    4ms   12ms    2ms      -
>   ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0K403903D-part2   461G   467G     28     82   535K  4.89M    1ms  163ms  384us  754us  197us     1s    4ms   12ms    2ms      -
> ---------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
>

163 milliseconds to do the write, when the actual drive latency is ~1 millisecond. That's clearly CPU bound. On a non-preemptive kernel, all of this runs in high-priority zfs io threads doing a lot of compression/encryption/checksumming. I wonder, what are the recordsize, compression algorithm, and encryption settings for the system? There might not be enough cond_resched() calls in one or more of the corresponding code paths, or the zfs kernel thread priority is too high for the audio thread to be able to preempt it.

Before switching to a preemptive kernel, I was playing with zfs module parameters like the following, with inconsistent levels of success:

spl.spl_taskq_thread_bind=0
spl.spl_taskq_thread_priority=0

These should be set at module load, I think. Binding the threads may be especially bad for audio, as usually only one CPU core handles audio interrupts, and if zfs occupies that core and doesn't preempt in time, it will cause stuttering.
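On NixOS, these parameters can be set at module load time via modprobe configuration; a sketch, assuming the spl parameter names quoted above are correct:

```nix
{
  # Pass the taskq tuning options to the spl module when it loads.
  boot.extraModprobeConfig = ''
    options spl spl_taskq_thread_bind=0 spl_taskq_thread_priority=0
  '';
}
```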

Shawn8901 commented 1 year ago

> 163 milliseconds to do the write, when the actual drive latency is ~1 millisecond. That's clearly CPU bound. On a non-preemptive kernel, all of this runs in high-priority zfs io threads doing a lot of compression/encryption/checksumming. I wonder, what are the recordsize, compression algorithm, and encryption settings for the system?

 zfs get recordsize,compression,encryption rpool
NAME   PROPERTY     VALUE           SOURCE
rpool  recordsize   128K            default
rpool  compression  zstd            local
rpool  encryption   off             default

It is kept at the defaults, besides using zstd compression.

> Before switching to a preemptive kernel, I was playing with zfs module parameters like the following, with inconsistent levels of success:
>
> spl.spl_taskq_thread_bind=0
> spl.spl_taskq_thread_priority=0
>
> These should be set at module load, I think. Binding the threads may be especially bad for audio, as usually only one CPU core handles audio interrupts, and if zfs occupies that core and doesn't preempt in time, it will cause stuttering.

I'll try out whether that results in a better experience.

edit:

Most of my bad UX has been resolved by setting

services.udev.extraRules = ''
  ACTION=="add|change", KERNEL=="sd[a-z]*[0-9]*|mmcblk[0-9]*p[0-9]*|nvme[0-9]*n[0-9]*p[0-9]*", ENV{ID_FS_TYPE}=="zfs_member", ATTR{../queue/scheduler}="none"
'';

Which I found here:

EDIT 2: The udev change has been in NixOS since 23.11.

Artturin commented 1 year ago

https://github.com/NixOS/nixpkgs/pull/250308

magnetophon commented 10 months ago

@poelzi Have you tried the proposed solution? Can we close the issue?

numkem commented 10 months ago

@poelzi Have you tried the proposed solution? Can we close the issue?

I tried it before it was merged, and even after, and it didn't make a difference for me.

My principal problem is using atuin with a ZFS root, where the shell hangs while atuin does an insert into SQLite. It's already been referenced above.

The real-time patch at the top made the most difference, but I still see it from time to time.

nazarewk commented 10 months ago

This is actually a ZFS bug causing ftruncate hangs (affecting everything using SQLite as a database, not just atuin), as noted in https://github.com/atuinsh/atuin/issues/952#issuecomment-1537884120, which links to https://github.com/openzfs/zfs/issues/14290

There are two "fixes" so far:

  1. Put the SQLite database files on tmpfs and synchronize them (with litestream?) to persistent storage, as described in https://github.com/atuinsh/atuin/issues/952#issuecomment-1645436046
  2. Disable sync on the dataset holding the SQLite database: https://github.com/atuinsh/atuin/issues/952#issuecomment-1783676117
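The second workaround is a per-dataset property change; a sketch, with `rpool/home` as a hypothetical dataset name holding the SQLite files (note that `sync=disabled` trades durability of the most recent writes for lower latency):

```shell
# Hypothetical dataset name; point it at wherever the SQLite files live.
zfs set sync=disabled rpool/home
# Confirm the property is now active.
zfs get sync rpool/home
```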
generic-github-user commented 7 months ago

I'm experiencing the same issue -- though (as far as I can tell) not only with Atuin, but also Firefox, Konsole, KDE Plasma, etc. For reference, I have compression, deduplication, and encryption all disabled. Neither setting autotrim=on nor adding boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages; to my configuration seems to have made any difference. I would be happy to create a new dataset with sync=disabled for specific applications if I could isolate them, but at this point it seems the issue is system-wide.

generic-github-user commented 7 months ago

Some other things I have tried (unsuccessfully) -- I'm interested in hearing if any of these worked for others, and what other options I should consider:

At this point I am beginning to doubt that my problem is with ZFS itself (or the desktop environment, for that matter), though I'm not sure where else I should be looking.

Shawn8901 commented 7 months ago

> * the `udev` modifications mentioned by Shawn8901

FYI, in case someone else comes across the udev changes: they have been applied by default since https://github.com/NixOS/nixpkgs/pull/250308 (https://github.com/NixOS/nixpkgs/issues/169457#issuecomment-1705486693), which should be in NixOS stable since 23.11. So for that part there should be no need for any manual changes.

generic-github-user commented 7 months ago

Thanks, I wasn't aware of that. Rather embarrassingly, the main issue for me turned out to be a power-saving setting my laptop had automatically enabled without me noticing. Opening files/applications still lags sometimes, but it usually resolves itself after the first time (so I assume it's caching-related).

illode commented 6 months ago

> Opening files/applications still lags sometimes, but it usually resolves itself after the first time (so I assume it's caching-related).

That could be https://discourse.nixos.org/t/plasma-emojier-too-slow-episode-iv/40130, so it might be fixed in Plasma 6.

SuperSandro2000 commented 6 months ago

I've also run into this. It turned out I had battery saver on.