alsa-project / alsa-lib

The Advanced Linux Sound Architecture (ALSA) - library
GNU Lesser General Public License v2.1

starting from kernel 5.10.x latency test fails systematically on multiple platforms #285

Open · bondagit opened this issue 1 year ago

bondagit commented 1 year ago

If I run the latency test application (test/latency.c) with alsa-lib v1.2.4 on a Linux kernel 5.9.x, using for example the following audio card: PCH [HDA Intel PCH], device 1: ALC269VC Analog [ALC269VC Analog], I get the following result:

sudo ./latency -C hw:1 -P hw:1 -f S16_LE -r 48000 -c 2  -s 10 -m 128 -M 128 
Scheduler set to Round Robin with priority 99...
Playback device is hw:1
Capture device is hw:1
Parameters are 48000Hz, S16_LE, 2 channels, non-blocking mode
Poll mode: no
Loop limit is 480000 frames, minimum latency = 128, maximum latency = 128
Hardware PCM card 1 'HDA Intel PCH' device 0 subdevice 0
Its setup is:
  stream       : PLAYBACK
  access       : RW_INTERLEAVED
  format       : S16_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 16
  buffer_size  : 128
  period_size  : 64
  period_time  : 1333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 64
  period_event : 0
  start_threshold  : 2147483647
  stop_threshold   : 128
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Hardware PCM card 1 'HDA Intel PCH' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S16_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 16
  buffer_size  : 128
  period_size  : 64
  period_time  : 1333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 64
  period_event : 0
  start_threshold  : 2147483647
  stop_threshold   : 128
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Trying latency 128 frames, 2666.667us, 2.666667ms (375.0000Hz)
Success
Playback:
*** frames = 480128 ***
  state       : RUNNING
  trigger_time: 1907393.672215
  tstamp      : 0.000000
  delay       : 80
  avail       : 48
  avail_max   : 64
Capture:
*** frames = 480000 ***
  state       : RUNNING
  trigger_time: 1907393.672215
  tstamp      : 0.000000
  delay       : 0
  avail       : 0
  avail_max   : 16
Maximum read: 16 frames
Maximum read latency: 333.333us, 0.333333ms (3000.0000Hz)
Hardware sync
Playback time = 1907393.672215, Record time = 1907393.672215, diff = 0

The test succeeded and the final state is RUNNING.

If I run the same test on the same audio device and distribution, but with a Linux kernel >= 5.10.x, it fails systematically and I get:

sudo ./latency -C hw:0 -P hw:0 -f S16_LE -r 48000 -c 2 -s 10 -m 128 -M 128 
Scheduler set to Round Robin with priority 99...
Playback device is hw:0
Capture device is hw:0
Parameters are 48000Hz, S16_LE, 2 channels, non-blocking mode
Poll mode: no
Loop limit is 480000 frames, minimum latency = 128, maximum latency = 128
Hardware PCM card 0 'HDA Intel PCH' device 0 subdevice 0
Its setup is:
  stream       : PLAYBACK
  access       : RW_INTERLEAVED
  format       : S16_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 16
  buffer_size  : 128
  period_size  : 64
  period_time  : 1333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 64
  period_event : 0
  start_threshold  : 2147483647
  stop_threshold   : 128
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Hardware PCM card 0 'HDA Intel PCH' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S16_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 16
  buffer_size  : 128
  period_size  : 64
  period_time  : 1333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 64
  period_event : 0
  start_threshold  : 2147483647
  stop_threshold   : 128
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Trying latency 128 frames, 2666.667us, 2.666667ms (375.0000Hz)
Failure
Playback:
*** frames = 90720 ***
  state       : XRUN
  trigger_time: 967.658489
  tstamp      : 0.000000
  delay       : 0
  avail       : 216
  avail_max   : 216
Capture:
*** frames = 90592 ***
  state       : XRUN
  trigger_time: 967.658489
  tstamp      : 0.000000
  delay       : 0
  avail       : 88
  avail_max   : 88
Maximum read: 16 frames
Maximum read latency: 333.333us, 0.333333ms (3000.0000Hz)
Hardware sync

I get an XRUN during the test execution. I can reproduce the same error on multiple boards, with multiple audio cards, and on different Linux distributions, just by changing the Linux kernel version. The problem was originally reported at: AES67 Linux daemon: Latency test fails with Linux kernels starting from 5.10.x

I am interested in testing with low latency (-m 128 -M 128). My guess is that something changed in the Linux audio core and the latency application has to be adapted. Any ideas?
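
For context, the core of test/latency.c is a full-duplex read/write loop. The sketch below is a heavily condensed illustration of that pattern using the public alsa-lib API, not the actual latency.c code: the hw:1 device name and the 2666 us (128 frames at 48 kHz) latency mirror the command line above, the error handling is minimal, and the real test additionally links the streams and pre-fills the playback buffer with silence.

/* Minimal full-duplex loop sketch (illustrative, not test/latency.c itself).
 * Build: gcc -o duplex duplex.c -lasound */
#include <stdio.h>
#include <alsa/asoundlib.h>

#define RATE     48000
#define CHANNELS 2
#define FRAMES   16  /* frames moved per readi/writei call */

int main(void)
{
    snd_pcm_t *play, *cap;
    short buf[FRAMES * CHANNELS];   /* S16_LE, interleaved */
    long total = 0;
    int err;

    if ((err = snd_pcm_open(&play, "hw:1", SND_PCM_STREAM_PLAYBACK, SND_PCM_NONBLOCK)) < 0 ||
        (err = snd_pcm_open(&cap,  "hw:1", SND_PCM_STREAM_CAPTURE,  SND_PCM_NONBLOCK)) < 0) {
        fprintf(stderr, "open: %s\n", snd_strerror(err));
        return 1;
    }
    /* 2666 us =~ 128 frames at 48 kHz, matching -m 128 -M 128 above */
    if ((err = snd_pcm_set_params(play, SND_PCM_FORMAT_S16_LE,
                                  SND_PCM_ACCESS_RW_INTERLEAVED,
                                  CHANNELS, RATE, 0, 2666)) < 0 ||
        (err = snd_pcm_set_params(cap, SND_PCM_FORMAT_S16_LE,
                                  SND_PCM_ACCESS_RW_INTERLEAVED,
                                  CHANNELS, RATE, 0, 2666)) < 0) {
        fprintf(stderr, "set_params: %s\n", snd_strerror(err));
        return 1;
    }
    snd_pcm_start(cap);                      /* capture must be started explicitly */

    while (total < 10 * RATE) {              /* -s 10: ten seconds of frames */
        snd_pcm_sframes_t r = snd_pcm_readi(cap, buf, FRAMES);
        if (r == -EAGAIN)
            continue;                        /* non-blocking: spin until data arrives */
        if (r < 0)
            break;                           /* -EPIPE here is the capture XRUN */
        snd_pcm_sframes_t w;
        do {
            w = snd_pcm_writei(play, buf, r);
        } while (w == -EAGAIN);              /* non-blocking write may need a retry */
        if (w < 0)
            break;                           /* -EPIPE here is the playback XRUN */
        total += r;
    }
    printf("frames moved: %ld\n", total);
    snd_pcm_close(play);
    snd_pcm_close(cap);
    return 0;
}

This busy loop is also what produces the repeated SNDRV_PCM_IOCTL_READI_FRAMES / EAGAIN pattern visible in the strace output later in this thread.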

perexg commented 1 year ago

It may not be an audio issue at all. Some other task / kernel driver may eat CPU time, preventing low-latency operation. From the result:

   avail       : 216
   avail_max   : 216

This means that the playback missed 88 frames (216 - 128) at the under-run check: avail is larger than the whole 128-frame buffer, i.e. the hardware consumed the complete buffer and a further 88 frame slots elapsed before the task could refill it.

You can trace the syscalls with strace, e.g. strace -r -s 256 -o trace.txt ./latency -C hw:2 -P hw:2 -f S16_LE -r 48000 -c 2 -m 128 -M 128 -X. With -X:

     0.000027 write(1, "[146516us] POS: p=72 (min=40, rt=817us) c=8 (max=40, rt=-183us)\n", 64) = 64
     0.000030 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffcc4c7c000) = 0
     0.000029 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffcc4c7bfd0) = 0
     0.000026 ioctl(5, SNDRV_PCM_IOCTL_HWSYNC, 0) = 0
     0.000029 ioctl(4, SNDRV_PCM_IOCTL_HWSYNC, 0x1) = 0
     0.000028 write(1, "[146660us] POS: p=72 (min=40, rt=840us) c=8 (max=40, rt=-160us)\n", 64) = 64
     0.000030 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffcc4c7c000) = 0
     0.000097 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffcc4c7bfd0) = 0
     0.000027 ioctl(5, SNDRV_PCM_IOCTL_HWSYNC, 0) = 0
     0.000136 ioctl(4, SNDRV_PCM_IOCTL_HWSYNC, 0x1) = 0
     0.000270 write(1, "[147220us] POS: p=64 (min=40, rt=613us) c=8 (max=40, rt=-553us)\n", 64) = 64
     0.000059 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffcc4c7c000) = 0
     0.000042 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffcc4c7bfd0) = 0
     0.000054 ioctl(5, SNDRV_PCM_IOCTL_HWSYNC, 0) = 0
*    0.000871 ioctl(4, SNDRV_PCM_IOCTL_HWSYNC, 0x1) = 0         
     0.000061 write(1, "[148305us] POS: p=16 (min=16, rt=695us) c=32 (max=40, rt=-972us)\n", 65) = 65
     0.000074 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffcc4c7c000) = 0
     0.000101 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffcc4c7bfd0) = -1 EPIPE (Broken pipe)
Without -X:
     0.000020 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = 0
     0.000025 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffda8f7c980) = 0
     0.000019 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = -1 EAGAIN (Resource temporarily unavailable)
     0.000022 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = -1 EAGAIN (Resource temporarily unavailable)
     0.000023 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = -1 EAGAIN (Resource temporarily unavailable)
     0.000022 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = -1 EAGAIN (Resource temporarily unavailable)
     0.000022 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = -1 EAGAIN (Resource temporarily unavailable)
*    0.001841 ioctl(5, SNDRV_PCM_IOCTL_READI_FRAMES, 0x7ffda8f7c9b0) = 0
     0.000081 ioctl(4, SNDRV_PCM_IOCTL_WRITEI_FRAMES, 0x7ffda8f7c980) = -1 EPIPE (Broken pipe)

Legend: I marked the scheduling time gaps with *.

Perhaps it may be worth testing with the vanilla kernel and bisecting to the commit which affects this behavior. Another option is kernel profiling.

bondagit commented 1 year ago

Thanks for the info. I can reproduce this by simply changing the kernel from any version <= 5.9.x to any version >= 5.10.x: nothing else changes, just the kernel. I'm using the precompiled kernels from the Ubuntu distro (with very similar configurations), and the problem reproduces systematically on both x86_64 and ARM64 platforms. Other people using various Linux distributions have reported the same, so my guess is that the problem is related to some kernel change introduced in version 5.10.x that impacts low-latency behaviour.

perexg commented 1 year ago

I did some debugging using ftrace (trace-cmd) and it really seems that the system scheduler blocks the latency task in favor of a kworker task:

    latency-7395  [002] 22719.778930: hwptr:                pcmC0D0c/sub0: POS: pos=16, old=26768, base=26752, period=64, buf=128
    latency-7395  [002] 22719.778932: hwptr:                pcmC0D0c/sub0: POS: pos=16, old=26768, base=26752, period=64, buf=128
    latency-7395  [002] 22719.778934: hwptr:                pcmC0D0c/sub0: POS: pos=16, old=26768, base=26752, period=64, buf=128
    latency-7395  [002] 22719.778937: hwptr:                pcmC0D0c/sub0: POS: pos=48, old=26768, base=26752, period=64, buf=128
    latency-7395  [002] 22719.778938: applptr:              pcmC0D0c/sub0: prev=26768, curr=26800, avail=0, period=64, buf=128
    latency-7395  [002] 22719.778941: hwptr:                pcmC0D0p/sub0: POS: pos=96, old=26816, base=26752, period=64, buf=128
    latency-7395  [002] 22719.778942: applptr:              pcmC0D0p/sub0: prev=26896, curr=26928, avail=48, period=64, buf=128
    latency-7395  [002] 22719.778949: sched_switch:         latency:7395 [0] S ==> kworker/2:1:5386 [120]
     <idle>-0     [007] 22719.778971: irq_handler_entry:    irq=182 name=snd_hda_intel:card0
     <idle>-0     [007] 22719.779013: hwptr:                pcmC0D0c/sub0: IRQ: pos=48, old=26800, base=26752, period=64, buf=128
     <idle>-0     [007] 22719.779070: irq_handler_exit:     irq=182 ret=handled
     <idle>-0     [007] 22719.779073: irq_handler_entry:    irq=182 name=snd_hda_intel:card0
     <idle>-0     [007] 22719.779084: irq_handler_exit:     irq=182 ret=unhandled
kworker/2:1-5386  [002] 22719.779162: sched_wakeup:         latency:7395 [0] CPU:002
     <idle>-0     [007] 22719.780474: irq_handler_entry:    irq=182 name=snd_hda_intel:card0
     <idle>-0     [007] 22719.780524: sched_wakeup:         kworker/7:1:5655 [120] CPU:007
     <idle>-0     [007] 22719.780567: hwptr:                pcmC0D0p/sub0: IRQ: pos=48, old=26848, base=26752, period=64, buf=128
     <idle>-0     [007] 22719.780567: xrun:                 pcmC0D0p/sub0: XRUN: old=26928, base=26880, period=64, buf=128
     <idle>-0     [007] 22719.780699: irq_handler_exit:     irq=182 ret=handled
     <idle>-0     [007] 22719.780701: irq_handler_entry:    irq=182 name=snd_hda_intel:card0
     <idle>-0     [007] 22719.780729: irq_handler_exit:     irq=182 ret=unhandled
     <idle>-0     [007] 22719.780736: sched_switch:         swapper/7:0 [120] R ==> kworker/7:1:5655 [120]
kworker/7:1-5655  [007] 22719.780744: sched_switch:         kworker/7:1:5655 [120] W ==> swapper/7:0 [120]
kworker/2:1-5386  [002] 22719.790242: sched_switch:         kworker/2:1:5386 [120] R ==> latency:7395 [0]
    latency-7395  [002] 22719.790455: hw_mask_param:        pcmC0D0p:0 000/025 ACCESS 0000000000000000ffffffffffffffff 00000000000000000000000000000009
    latency-7395  [002] 22719.790455: hw_mask_param:        pcmC0D0p:0 000/025 FORMAT 0000000000000000ffffffffffffffff 00000000000000000000000000000404
    latency-7395  [002] 22719.790456: hw_mask_param:        pcmC0D0p:0 000/025 SUBFORMAT 0000000000000000ffffffffffffffff 00000000000000000000000000000001

As you can see, the latency task was scheduled out at 22719.778949 and the system gave the CPU back to it at 22719.790242, so the time difference is 0.011293 seconds (~11.3 ms). @tiwai: FYI - could you check my interpretation, please?

It really looks like a problem in the Linux scheduler or the kworker task.

perexg commented 1 year ago

I think I found the upstream scheduler change between the 5.9 and 5.10 kernels which causes this behavior:

https://lore.kernel.org/lkml/c596a06773658d976fb839e02843a459ed4c2edf.1479204252.git.bristot@redhat.com/

See the patch description. You can restore the old behaviour using this command:

sysctl -w kernel.sched_rt_runtime_us=1000000

Note that this command may be dangerous - see the patch description. By default, sched_rt_runtime_us is 950000 out of a sched_rt_period_us of 1000000, i.e. realtime tasks are throttled to 95% of CPU time so non-realtime tasks always get a share; setting the runtime equal to the period disables this throttling, so a realtime task can eat the whole CPU and prevent other tasks from running.
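
For illustration only: sysctl keys are plain files under /proc/sys, so the same (potentially dangerous) setting can be applied programmatically by writing to the corresponding file. A minimal sketch of the mechanism behind sysctl -w, with the same caveats as the command above:

/* Sketch: equivalent of `sysctl -w kernel.sched_rt_runtime_us=1000000`.
 * Requires root; read the patch description before using this. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/sched_rt_runtime_us", "w");
    if (!f) {
        perror("sched_rt_runtime_us");
        return 1;
    }
    /* 1000000 equals the default kernel.sched_rt_period_us, so realtime
     * throttling is effectively disabled: an RT task may then monopolize
     * the CPU and starve non-realtime tasks. */
    fprintf(f, "%d\n", 1000000);
    fclose(f);
    return 0;
}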

perexg commented 1 year ago

For archiving purposes - the trace-cmd commands:

LATENCY=128 ; USLEEP=200
trace-cmd start -b 10000 -e 'sched_wakeup*' -e sched_switch -e gpio_value -e irq_handler_entry -e irq_handler_exit -e snd_pcm \
/home/redhat/git/alsa-lib/test/.libs/latency -C hw:0 -P hw:0 -f S16_LE -r 48000 -c 2 -s 10 -m "${LATENCY}" -M "${LATENCY}" -y "${USLEEP}"
trace-cmd stop
trace-cmd extract

or alternatively (to check workqueues):

trace-cmd start -b 10000 -e 'sched_wakeup*' -e sched_switch -e workqueue ...

Use kernelshark or trace-cmd report to analyze the trace.dat file.

bondagit commented 1 year ago

Great catch! This seems to work for me. I will test it on some more platforms and report back here. Thanks a lot for the prompt support!

tiwai commented 1 year ago

Would a different RT setup in the latency program change the behavior? It sounds like a kernel regression if it shows such a latency even with the highest RT priority.

perexg commented 1 year ago

> Would a different RT setup in the latency program change the behavior? It sounds like a kernel regression if it shows such a latency even with the highest RT priority.

latency.c sets the scheduler to round-robin (SCHED_RR) with the maximal priority 99. The mentioned kernel patch allows even tasks at this highest priority to be interrupted by non-realtime tasks, and it seems to do so in a large time window (from the realtime perspective) - tens of milliseconds. From my analysis, the workqueue tasks are fine (up to 1 ms), but the scheduler wakes the latency task (a busy loop) too late.
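
For reference, the scheduler setup that prints the "Scheduler set to Round Robin with priority 99..." line in the logs above boils down to roughly the following (a condensed sketch, not the verbatim latency.c code):

/* Sketch: switch the current process to SCHED_RR at the highest realtime
 * priority, as test/latency.c does. Requires root or CAP_SYS_NICE. */
#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param p;

    p.sched_priority = sched_get_priority_max(SCHED_RR);  /* 99 on Linux */
    if (sched_setscheduler(0, SCHED_RR, &p) < 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("Scheduler set to Round Robin with priority %d...\n", p.sched_priority);
    return 0;
}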

I added a new option to the latency test code to allow using SCHED_FIFO, but the results are not good either. It seems that the scheduler penalty for busy-loop programs is high in current kernels, and the scheduler would rather interrupt this busy task than use "free" CPUs.

For reference (pipewire discussion): https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/1271

bondagit commented 1 year ago

> Would a different RT setup in the latency program change the behavior? It sounds like a kernel regression if it shows such a latency even with the highest RT priority.

On the platforms I am currently testing, if I change the latency application to skip the setscheduler() call entirely, the latency app works fine on kernels >= 5.10.x (I mean even without setting the kernel.sched_rt_runtime_us param). The final behaviour with the new kernel still looks unexpected to me.

With a kernel <= 5.9.x I see the opposite behaviour: if I skip the setscheduler() call, the latency test fails.

perexg commented 1 year ago

Wow, I did not expect that. It seems that SCHED_OTHER works better than FIFO/RR with the current kernels. I added SCHED_OTHER policy support to latency.c as -O other (75f8e2e73e03f628a4f2ba55ca8aa3e9f50cdbd9), so we can now select the policy without modifying the code.

bondagit commented 1 year ago

It sounds OK to me. Just to summarise: according to my tests, SCHED_OTHER is required to make the latency test work with a kernel >= 5.10.x, and SCHED_RR is required with a kernel <= 5.9.x.

bondagit commented 1 year ago

In addition to allowing the user to specify the scheduler policy, could the latency application apply a default scheduler according to the above findings?

perexg commented 1 year ago

I think it would be bad to change the application defaults according to the kernel version. Also, it's clearly a kernel scheduler problem, which should be reported to the LKML (Linux kernel mailing list) and the scheduler maintainers - see https://github.com/torvalds/linux/blob/c3eb11fbb826879be773c137f281569efce67aa8/MAINTAINERS#L18348 .