libgpiod seems to miss edge interrupts

brgl / libgpiod

This is a mirror of the original repository over at kernel.org. This github page is for discussions and issue reporting only. PRs can be discussed here but the patches need to go through the linux-gpio mailing list.

https://git.kernel.org/pub/scm/libs/libgpiod/libgpiod.git/

libgpiod seems to miss edge interrupts #83

rweickelt closed this issue 1 month ago

rweickelt commented 5 months ago

I am using libgpiod 1.6.2 on a RPi ~3B~ 4B with Raspbian Bullseye. I have implemented an interrupt monitor strictly following this example (triggers on falling edges). Each interrupt is acknowledged by the host before the peripheral can generate a new one. The interrupt frequency is about 100-200 Hz.

My program usually runs for half a minute, sometimes longer. At some point it stops because libgpiod hangs at gpiod::line::event_wait() and doesn't recognize the interrupt. I am surprised that this is even possible. How can the interrupt be lost?

I am tempted to implement a simple polling loop instead.

brgl commented 5 months ago

What do you mean by "doesn't recognize the interrupt"? The library is just a wrapper around the kernel uAPI, so it's most likely the Linux driver that for some reason doesn't detect the interrupt. Or do you mean that once it stops, no more interrupts are triggered at all?

rweickelt commented 5 months ago

Thanks for the quick reply.

> What do you mean by "doesn't recognize the interrupt"?

I can see an edge on the GPIO line that should have triggered an event, but I don't get the event in software. It happens only rarely, after thousands of interrupts have been handled successfully. I can also rule out that the RPi is overloaded with interrupts, because my program has to acknowledge every interrupt before the peripheral deasserts the interrupt line.

I don't know anything about the underlying kernel uAPI and how edge detection is implemented (on the RPi). Is it done in hardware or in software? If it's done in software, then I see how we could miss an edge.

brgl commented 5 months ago

I don't know, I am not familiar with every single GPIO driver in the kernel. Do you see any warnings in the kernel log?

rweickelt commented 5 months ago

I switched to the C API now. Same behavior.

Maybe I am using libgpiod the wrong way? My program waits on interrupts on 2 different IO pins concurrently in 2 different threads. Is that valid?

brgl commented 5 months ago

Did you request them separately?

rweickelt commented 5 months ago

Yes, everything is completely separated. No objects are shared.

The kernel log does not contain any warnings btw.

warthog618 commented 5 months ago

You also asked this question on Stack Overflow. Might've been nice to reference that.

> Each interrupt is acknowledged by the host before the peripheral can generate a new one.

What you are seeing is consistent with the interrupt not being acked at the peripheral, no? So, how is that done? How can you be sure that is working correctly?

rweickelt commented 5 months ago

When the host sees an event (falling edge on IO line), it acknowledges the interrupt with a SPI message. After the message, the peripheral will deassert the IO line and produce a new falling edge after some time. That’s IMO safe and it works well when using a FT4222 as interface or when the host is a bare metal program on a microcontroller. It’s just an RPi that acts up.

I’ll check if I can reproduce the problem with a single peripheral as well. So far I’ve only been using two peripherals concurrently. However, it’s probably not a libgpiod problem.

warthog618 commented 5 months ago

It probably isn't a libgpiod problem, but hopefully we can still help you track it down.

The fact that the SPI and interrupt handshaking works on other platforms doesn't prove anything. New platform, new problems.

To prove it is an interrupt problem with the RPi, you need to monitor the SPI and interrupt line and demonstrate that the interrupt is deasserted, then re-asserted but you don't get an event from libgpiod. A scope on the SPI clock line and the interrupt line would probably be sufficient to show that. As I stated earlier, what you are seeing is consistent with the SPI ack failing and, especially given you are driving two concurrently (hopefully on separate buses??), that would seem to me to be the more likely culprit and I would be wanting to eliminate that as a possibility.
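To illustrate why a scope trace is needed to tell the two cases apart, here is a minimal sketch of the handshake as a toy state machine. Everything in it (`struct handshake`, `peripheral_raise_irq`, `host_round`, `run_rounds`) is hypothetical and only models the protocol as described in this thread: the peripheral can produce a new falling edge only after the previous interrupt was acked. It is not the peripheral's firmware and uses no libgpiod calls.

```c
#include <stdbool.h>

/* Hypothetical model of the IRQ/ack handshake described in this thread. */
struct handshake {
    bool irq_low;        /* IRQ line is active low: true = asserted */
    unsigned edges_seen; /* falling edges the host actually received */
    unsigned acks_ok;    /* SPI acks that reached the peripheral */
};

/* Peripheral has new data. A falling edge is only possible if the line
 * is currently high, i.e. the previous interrupt has been acked. */
static bool peripheral_raise_irq(struct handshake *h)
{
    if (h->irq_low)
        return false;    /* still asserted: no new edge can appear */
    h->irq_low = true;
    return true;         /* a falling edge appeared on the wire */
}

/* Host side of one round. edge_delivered models whether the edge event
 * reached the program; ack_delivered models whether the SPI ack worked. */
static void host_round(struct handshake *h, bool edge_delivered,
                       bool ack_delivered)
{
    if (!edge_delivered)
        return;              /* host keeps waiting and never acks */
    h->edges_seen++;
    if (ack_delivered) {
        h->acks_ok++;
        h->irq_low = false;  /* peripheral deasserts after the ack */
    }
}

/* Run `rounds` rounds and inject one fault in round `fail_at`: either the
 * edge event is lost (lose_edge) or the ack is lost (!lose_edge).
 * Returns the number of falling edges that appeared on the wire. */
static unsigned run_rounds(struct handshake *h, unsigned rounds,
                           unsigned fail_at, bool lose_edge)
{
    unsigned wire_edges = 0;

    for (unsigned i = 0; i < rounds; i++) {
        bool fault = (i == fail_at);
        bool edge = peripheral_raise_irq(h);

        if (edge)
            wire_edges++;
        host_round(h, edge && !(fault && lose_edge),
                   !(fault && !lose_edge));
    }
    return wire_edges;
}
```

In this model both faults end the same way from the program's point of view: the line stays low and no further events arrive. The only difference is whether the last edge on the wire was seen by the host (lost ack) or not (lost edge), which is what a trace of the IRQ line together with the SPI clock can show.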

rweickelt commented 5 months ago

@warthog618 Here's a trace of the problem:

The trace consists of:

- 2 peripherals connected to the same SPI bus (SCLK, PICO and POCI)
- distinct CS lines A and B
- distinct IRQ lines A and B
- one debug output signal showing when my program sees the falling edge event on IRQ line A
- the peripherals are handled in 2 different threads; no libgpiod objects are shared between the threads

A successful communication round:

1. Falling edge on IRQ A.
2. The RPi program sees the libgpiod event and toggles the debug pin.
3. The RPi reads out data from peripheral A and eventually sends the interrupt ack message. Peripheral A then immediately deasserts the IRQ line.

Peripheral B is handled in a different thread and does exactly the same. After another communication round, we see:

1. Falling edge on IRQ A.
2. The RPi program sees the libgpiod event and toggles the debug pin.
3. The RPi program acknowledges the interrupt, and peripheral A deasserts the IRQ line.
4. Peripheral A asserts the IRQ line for the next round, but the host doesn't see the event and hangs in gpiod_line_event_wait() until the timeout (I switched to the C API now).

I can reproduce this problem only with 2 peripherals, not with a single one.
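Because the peripheral holds the IRQ line low until it is acked, one possible stopgap short of pure polling is to sample the line level whenever the wait times out. The sketch below only shows the decision logic. `enum wait_outcome` and `classify_wait` are hypothetical names, and it assumes that `gpiod_line_get_value()` may be called on a line requested for edge events and that the line was requested without the active-low flag, so 0 means electrically low.

```c
/* Outcome of one wait cycle. Illustrative names, not libgpiod API. */
enum wait_outcome {
    WAIT_ERROR,        /* the wait or the level read failed */
    WAIT_EVENT,        /* an edge event is queued and can be read */
    WAIT_MISSED_EDGE,  /* timed out, but the active-low line is asserted */
    WAIT_IDLE          /* timed out and the line is high: nothing pending */
};

/*
 * wait_rc: return value of gpiod_line_event_wait()
 *          (1 = event pending, 0 = timeout, -1 = error).
 * level:   return value of gpiod_line_get_value(), sampled after a timeout
 *          (0 = low, 1 = high, -1 = error). Ignored unless wait_rc == 0.
 */
static enum wait_outcome classify_wait(int wait_rc, int level)
{
    if (wait_rc < 0)
        return WAIT_ERROR;
    if (wait_rc > 0)
        return WAIT_EVENT;
    if (level < 0)
        return WAIT_ERROR;
    /* Timeout: a low line means the peripheral is waiting for an ack,
     * so the falling edge was most likely missed. */
    return level == 0 ? WAIT_MISSED_EDGE : WAIT_IDLE;
}
```

On `WAIT_MISSED_EDGE` the program could service the peripheral exactly as it does after a normal event. This hides the symptom rather than explaining it, so it is better suited to keeping a test setup running than to closing the investigation.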

warthog618 commented 5 months ago

Interesting. If it only happens with two peripherals then you've probably found a race somewhere - the problem is finding where. It would be good to get some debugging from the kernel to see if it is seeing that interrupt. (Hopefully Bart can suggest how best to do that.) If the kernel is seeing it then we can track down where in the kernel it is getting lost. If the kernel isn't seeing it then the problem is in the driver or hardware itself.
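One low-effort way to check whether the kernel sees the interrupt, offered here as an assumption rather than a confirmed recipe, is to compare the counter for the line's IRQ in /proc/interrupts with the number of events the program has read. This assumes the requested line shows up there under the consumer label passed to libgpiod. The helper below is a hypothetical sketch that sums the per-CPU counters of one such line. It treats the leading purely numeric columns after the `NN:` prefix as the counters.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters of one /proc/interrupts line of the form
 * "NN:  c0  c1 ...  chip-name  hwirq  type  label".
 * Returns -1 if the line has no "NN:" prefix. */
static long long irq_line_total(const char *line)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    if (!p)
        return -1;
    p++;
    for (;;) {
        char *end;
        unsigned long long n;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;              /* reached the chip name or end of line */
        n = strtoull(p, &end, 10);
        /* A counter is a pure number followed by whitespace. Anything
         * else (e.g. a chip name that starts with a digit) ends the
         * counter columns. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}
```

For a made-up line such as ` 55:  1200  34  0  0  pinctrl-bcm2835  17 Edge  irq-monitor` the helper returns 1234. If that total stops increasing at the moment the program hangs, the edge never reached the kernel's IRQ path. If it keeps increasing, the event was lost further up.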

It would also be good to be able to reproduce the problem without additional hardware - just with pins looped back on the RPi. Unfortunately I don't have a Pi 3B to play with.

Other background questions: How do you drive the SPI (including the chip selects) and the debug pin?

Your two peripherals are synchronised so they assert their IRQs at about the same time (within the resolution of the scope image)? At around 250Hz? Reminds me of a project I worked on with two SPI ADCs, though that had a much higher data rate to the point it required separate SPI buses and bare metal rather than Linux.

brgl commented 5 months ago

Any chance that GPIO1's interrupt fires when interrupts are disabled as the one for GPIO2 is being handled?

warthog618 commented 5 months ago

Sure, but it is up to the driver to handle that situation, right? That is, interrupts are only masked while in a handler so the second interrupt will just be delayed. Though it is always possible the driver is somehow managing to accidentally clear the second while handling the first. Which is why I wanted to find out if the interrupt is getting as far as the kernel. If not, then there is definitely an issue with the driver or below, rather than somewhere in the irq/gpio path.
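As an illustration of how a handler could accidentally clear a second interrupt while handling the first, here is a toy model of an event status register. It is purely hypothetical: `struct gpio_bank`, `handle_safe` and `handle_clear_all` are invented for this sketch and are not taken from the RPi GPIO driver or any other kernel code. The `between` callback stands in for an edge that arrives on another pin after the handler has read the status but before it clears it.

```c
#include <stdint.h>

/* Hypothetical event status register, one bit per pin. */
struct gpio_bank {
    uint32_t status;
};

/* Hardware latches an edge on `pin`. */
static void hw_edge(struct gpio_bank *b, unsigned pin)
{
    b->status |= 1u << pin;
}

/* Careful handler: clears only the bits it actually read, so an edge
 * that arrives in between stays pending for the next invocation. */
static uint32_t handle_safe(struct gpio_bank *b,
                            void (*between)(struct gpio_bank *))
{
    uint32_t pending = b->status;

    if (between)
        between(b);          /* edge arriving mid-handler */
    b->status &= ~pending;   /* models a write-1-to-clear of `pending` */
    return pending;
}

/* Buggy handler: clears the whole register and thereby drops any edge
 * that was latched after the read. */
static uint32_t handle_clear_all(struct gpio_bank *b,
                                 void (*between)(struct gpio_bank *))
{
    uint32_t pending = b->status;

    if (between)
        between(b);
    b->status = 0;           /* the mid-handler edge is lost here */
    return pending;
}

/* Example of a late edge on another pin (pin 2). */
static void edge_on_pin2(struct gpio_bank *b)
{
    hw_edge(b, 2);
}
```

With an edge already latched on pin 1, `handle_safe` leaves the pin 2 bit set afterwards, while `handle_clear_all` leaves the register empty. Because the peripheral in this thread waits for an ack before producing another edge, one such dropped bit would be enough to stall that peripheral indefinitely.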

rweickelt commented 5 months ago

Today I found this topic, which indicates that the issue discussed here might be a hardware bug. A workaround is in place.

Surprising that any modern MCU would fail on such a basic task.

warthog618 commented 5 months ago

Yep, that could be it. It would be good to confirm that a patched kernel fixes your problem.

I'm more surprised that it has taken this long to be picked up and fixed. Even if it is specific to the 3B, that has been out a good while.