allenpais / for-6.9-bh-conversions

0 stars 0 forks source link

firewire: ohci: conversion of tasklet to work: tasklet_disable_in_atomic still present #1

Open recallmenot opened 6 months ago

recallmenot commented 6 months ago

This may be distantly related to your conversion effort: https://github.com/allenpais/for-6.9-bh-conversions/blob/69c651e0620fed46dc5719a77c2d906895c256b9/0013-firewire-Convert-from-tasklet-to-BH-workqueue.patch#L178 Currently I'm hunting down a bug in ohci.c where, after 5-30 minutes of streaming to an RME FireFace 800 audio interface via ALSA, this call to tasklet_disable_in_atomic() leads to a hardlock with a full system freeze as a different CPU is already running the tasklet, leading to infinitely waiting for tasklet_unlock_spin_wait(). Since you're doing a conversion to workqueue but leaving the call to tasklet_disable_in_atomic() in-place, would you happen to have any advice on how this could be resolved?

recallmenot commented 6 months ago

Also, wouldn't threaded irqs be more suitable for FireWire? I doubt many people are using it for external hard drives and cameras these days, those are well-served by USB now but there still are quite a lot of FireWire audio interfaces out there that are excellent. This is to say that latency may be more important for FireWire moving forward than throughput efficiency. RME, the manufacturer of the one I have still updates their Windows drivers for Interfaces that effectively came out 20 years ago. These interfaces have no issue running at sample rates of 192kHz and buffer sizes of just 32 samples if the PC can supply the blocks quickly enough. That's 6000 blocks per second. Even one block that is not on time means audible crackling. Smaller buffer sizes start to become important when one wants to record an instrument while monitoring it on headphones, maybe with digital effects like distortion on.

recallmenot commented 6 months ago

maybe @tiwai can give us guidance

tiwai commented 6 months ago

There are special handling for soft irq in firewire sound driver. Might it be that? e.g.

--- a/sound/firewire/amdtp-stream.c
+++ b/sound/firewire/amdtp-stream.c
@@ -1853,7 +1853,7 @@ unsigned long amdtp_domain_stream_pcm_pointer(struct amdtp_domain *d,
    struct amdtp_stream *irq_target = d->irq_target;

    // Process isochronous packets queued till recent isochronous cycle to handle PCM frames.
-   if (irq_target && amdtp_stream_running(irq_target)) {
+   if (0 && irq_target && amdtp_stream_running(irq_target)) {
        // In software IRQ context, the call causes dead-lock to disable the tasklet
        // synchronously.
        if (!in_softirq())
recallmenot commented 6 months ago

@tiwai thank you! yes according to the call trace that is correct. I've disabled the call to fw_iso_context_flush_completions() like you showed but I'm afraid it is necessary: I get no audio output and in the case of ALSA client, the program freezes (but no hardlock). Pulse apps just have no audio output. So I've now reverted. That the freeze only happens after 5-30 minutes tells me it works 99.9% of the time but then something goes wrong and I get a hardlock, freezing my system.

I've looked at the blame for those lines, maybe @takaswie has an idea what to try next as he tried to solve this in the past?

tasklet_disable_in_atomic() bears this warning: "Do not use in new code. Disabling tasklets from atomic contexts is error prone and should be avoided.", which is also present in the contained call to tasklet_unlock_spin_wait() where the infinite spin is happening.

Which is why I thought rewriting to threaded irqs (or workqueue if that is more suitable) might also solve this issue as disabling the tasklet like ohci_flush_iso_completions() currently does would not be necessary any more.

takaswie commented 6 months ago

Linux FireWire subsystem allows two types of contexts to process the queued payloads of isochronous packets; softIRQ (tasklet) and process contexts. The former is scheduled in hardIRQ context of 1394 OHCI driver, and the latter is called by either unit drivers or user processes. The packet processing should be in critical section. In ohci_flush_iso_completions(), the pair of tasklet_disable_in_atomic() and tasklet_enable() is used to intervene between the softIRQ contexts, and the pair of test_and_set_bit_lock() and clear_bit_unlock() is used to intervene between process contexts. The softIRQ is itself a type of IRQ context, thus softIRQ contexts and process contexts are already intervened. This feature has been introduced by Clemens Ladisch at https://git.kernel.org/torvalds/linux/c/d1bbd2097293. The drivers in ALSA firewire stack utilize the feature to reduce the time to wait for transmission, and to allow no-period-wakeup PCM runtime for ALSA PCM applications.

The irq_target is mandatory. Many devices supported by ALSA firewire stack expect corresponding drivers to recover media clock according to the series of time stamp in the transferred isochronous packets. At present, IEC 61883-1/6 packet streaming engine in ALSA firewire stack processes all of 1394 OHCI IR/IT contexts in softIRQ contexts for one of the IT context (=irq_target) when ALSA no-period-wakeup runtime is disabled. When it is enabled, these contexts are processed by ALSA PCM ioctl for all of relevant ALSA PCM playback/capture character devices. The removal of function call could break PCM runtime for some applications; e.g. PulseAudio and PipeWire.

allenpais commented 6 months ago

@recallmenot Thank you very testing.

I haven't had the opportunity to really stress test the change. Perhaps, In the case of firewire, use of threaded ira's are better. I will test this change and write back.

allenpais commented 6 months ago

@recallmenot Could you please test the following complete patch when find sometime. 0001-firewire-Convert-from-tasklet-to-BH-workqueue.patch

recallmenot commented 6 months ago

@allenpais , thank you very much for the patch! At first I didn't realize you built against a different linux branch, I was building against Linus' linux main which of course couldn't work.

This is what I ended up doing:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
git checkout disable_work-v1
git apply 0001-firewire-Convert-from-tasklet-to-BH-workqueue.patch
make

git did not complain while applying the patch.

but then I get these errors (copied from clangd but gcc finds the same):

drivers/firewire/ohci.c|973 col 27-36 error| Call to undeclared function 'from_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|973 col 45-49 error| Use of undeclared identifier 'work'
drivers/firewire/ohci.c|1098 col 24-33 error| Call to undeclared function 'from_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|1098 col 42-46 error| Use of undeclared identifier 'work'
drivers/firewire/ohci.c|1468 col 2-23 error| Call to undeclared function 'enable_and_queue_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|2622 col 2-23 error| Call to undeclared function 'enable_and_queue_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|3549 col 2-23 error| Call to undeclared function 'enable_and_queue_work'; ISO C99 and later do not support implicit function declarations

What am I missing here?

allenpais commented 6 months ago

@recallmenot I am sorry, I should have given you more details.

The branch that this patch applies on is disable_work-v3, Note, we need a one more patch on top of this for enable_and_queue_work(). The patch can be found at: https://lore.kernel.org/lkml/ZfIIUyip4U-hGZ4l@slm.duckdns.org/t/

recallmenot commented 6 months ago

I was building the v1 branch as the patch specified it. Now I tried:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
git checkout disable_work-v3
git apply 0001-firewire-Convert-from-tasklet-to-BH-workqueue.patch
git apply 0002-enable-and-queue-work.patch
make

with 0002-enable-and-queue-work.patch being the patch you said to add The branch is currently at a4c80b1573b5e448a98b3a0e5ccff0a154b9d5aa r8152: Convert from tasklet to BH workqueue

these two errors are left according to my LSP (clang) and gcc when I compile:

drivers/firewire/ohci.c|973 col 27-36 error| Call to undeclared function 'from_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|973 col 45-49 error| Use of undeclared identifier 'work'
drivers/firewire/ohci.c|1098 col 24-33 error| Call to undeclared function 'from_work'; ISO C99 and later do not support implicit function declarations
drivers/firewire/ohci.c|1098 col 42-46 error| Use of undeclared identifier 'work'

Is there anything else I should add / do?

allenpais commented 6 months ago

@recallmenot My bad. You also need apply https://lore.kernel.org/lkml/20240227191037.5839-1-apais@linux.microsoft.com/

recallmenot commented 6 months ago

Ah, so now finally it built.

To make this reproducible: 0001-firewire-Convert-from-tasklet-to-BH-workqueue.patch 0002-enable-and-queue-work.patch 0003-from-work.patch

git clone https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
git checkout disable_work-v3
git apply 0001-firewire-Convert-from-tasklet-to-BH-workqueue.patch
git apply 0002-enable-and-queue-work.patch
git apply 0003-from-work.patch
make -j #cores
sudo make modules_install
make bzImage
cp arch/x86/boot/bzImage /boot/vmlinuz-6.7-wq-x86_64
sudo cp /etc/mkinitcpio.d/linux66.preset /etc/mkinitcpio.d/linux67-wq.preset
sudo nvim /etc/mkinitcpio.d/linux67-wq.preset
sudo mkinitcpio -p linux67-wq
sudo grub-mkconfig
sudo update-grub

Then in grub I chose the 6.7-wq kernel.

So I did some testing:

Playback sources were clementine (pulse), Firefox (pulse) and REAPER (ALSA).

Playback using pulse has a 10% success rate: Most of the time, the first 0.2 seconds can be heard, then the last audio buffer gets repeated indefinetely. This can be resolved by stopping audio playback, after 5 seconds the looping stops and playback can be attempted again. In case it does work, switching to another workspace will cause the same looping of the last buffer. Or just letting playback continue for long enough, one minute is the most I managed to get.

Trying to access the device directly with ALSA in an audio application like REAPER has the same success rate. After opening the audio device, usually the application freezes. In the cases I do get to play anything, after a while the program freezes and the buffer starts repeating. Switching workspaces will also cause the same freeze + repeat. Killing reaper immediately ends the repeating. At least I haven't gotten the driver to lock up my system yet but I haven't been able to get long periods of playback with it either. In current linux mainline, playback works, aside from switching workspaces sometimes causing the same freeze I described above and the inevitable full system lockup after 5-30 minutes of ALSA playback through FireFace 800 under higher processing load (likely resulting in an underrun) due to infinite spin lock from tasklet_disable_in_atomic() in ohci_flush_iso_completions(). Maybe @takaswie has an idea what this behavior could be and how we could fix it in your port to workqueue-bh.

allenpais commented 6 months ago

@recallmenot Thanks for testing it out. Let me look it a little more and write back.

recallmenot commented 4 months ago

Hi @allenpais, did you have any time to look into it again?