NACKed message locks transmit queue

rjp5th commented 2 years ago

Bug Description

Sending a CAN frame with no other devices on the network (resulting in no acknowledgement to the frame) will lock up the transmit queue. Even after another device joins the network, no messages can be transmitted from the Pico.

Test Setup

Raspberry Pi Pico connected to an MCP2562 transceiver (with external pull-up on the TX pin) with no other devices on the network. This bug also occurs if the TX and RX pins are directly connected together on the Pico with no transceiver.

At startup, can2040 is initialized, and a CAN frame (DLC of 1) is sent every second, incrementing the data byte.

Example code demonstrating this code can be found in this repository (rjp5th/can2040-loopback).

Expected Result

The first CAN frame is re-transmitted until a device on the network ACKs the frame.

Actual Result

The first CAN frame is transmitted twice. After this point no additional CAN frames are transmitted, even if another device is added to the network. Because the packet is never re-transmitted, an acknowledgment cannot be received, so the first message will never be marked as complete.

A logic trace shows this behavior below: Logic Trace Zoomed Logic Trace Wide

In the example repository, logging was added into the CAN2040 (using Trice for low-latency logging in interrupts), with the log output captured here.

Detailed Description of Events

According to the logs, everything works as expected for the first transmitted message (Target timestamp 1162751 us). Before this point, can2040 was initialized (1152092 us) and the first heartbeat message is sent. The message is sent (1162389 us), and received back (1162652 us). During the processing of the ACK bits, the parse state enters MS_DISCARD (1162721 us) as the message was not ACKed. The maytx irq is fired (1162748 us) since the bus is idle from the last transmit, so the previous frame is queued again (1162749 us). The irq routine now exits (1162751 us).

After this point, on the second transmission, the bug appears. The parse state is in MS_DISCARD from the previous frame's ACK error. data_state_line_passive is called from the EOF/inter-frame spacing before the new transmission (1162758 us). However, instead of detecting a new start-of-frame from the current transmission, the data is dropped and continues to wait for a new start-of-frame (1162759 us). Because the frame has been transmitted but not detected by the input state tracking, it continues to wait for the transmitted frame.

This results in a deadlock between the transmit and receive states, with the transmit waiting for the transmitted message to be received by the input state tracking, and the input state tracking waiting for the transmitter to send the queued message. This will never recover as tx_state=TS_QUEUED, but pio_tx_did_conflict will return false (since it was able to successfully send the frame onto the bus), so the transmit will not send another message until the previous message is received, thus locking the TX queue.

KevinOConnor commented 2 years ago

Thank you for your detailed analysis. I agree that the can2040 code could have done better here.

There seems to be two issues with the can2040 code:

The unstuf_clear_state() code successfully avoids unwanted bitstuffing and prevents unwanted line passive events during end of frame processing. However, altering stuffed_bits makes it possible for data_state_line_passive() to get out of sync with the pio "sync" state machine. In addition to the error reported in this ticket, should the second retransmit actually be ack'ed then the code may send the message a third time - likely causing an unwanted duplicate transmission.
The tx_schedule_transmit() check for an active transmission is fragile. The pio_tx_did_conflict() code only checks for a dominant/passive bit conflict. If the tx queue is fully drained and the can2040 parsing code fails to transition to a TS_CONFIRM_TX state then the tx queue could get stuck (at least until some other transmitter results in a transition to TS_ACKING_RX state).

I will look into the above.

-Kevin

KevinOConnor commented 2 years ago

I was able to reproduce the issue locally. I created PR #15 that fixes the issue for me (via addressing point 1 above).

-Kevin

rjp5th commented 2 years ago

Thank you for your quick fix! I just tested the PR and it fixes the issue with the NACK blocking TX.

I do agree with your noted issue 2, as during testing I also just hit a case where having the transceiver unplugged causes a similar bug with the TX queue blocking. Although the transceiver event is unlikely, ideally it would not fail in a similar way, and I also feel there may be more edge cases that might cause similar results. Thank you again for your help with this.

KevinOConnor commented 2 years ago

I have created PR #16 to improve the handling of point 2 above. In my testing, either PRs #15 or #16 fix the issue, but the intent is to commit both of these PRs.

-Kevin

rjp5th commented 2 years ago

I just checked and PR #16 also fixes the issue on my setup. Thanks again

KevinOConnor commented 2 years ago

I have committed PR #15 and #16 to the master can2040 branch.

-Kevin

KevinOConnor / can2040