KevinOConnor / can2040

Software CAN bus implementation for rp2040 micro-controllers
GNU General Public License v3.0
666 stars 66 forks source link

After Ack when received #30

Closed furusawata closed 1 year ago

furusawata commented 1 year ago

Hello. I have a phenomenon that I don't understand. After can2040 is in receiving state and returns Ack to other device's transmission packet, Ack-like waveform appears again.

Waveform during normal and error. (250kbps / 2 nodes) compare

This phenomenon occurs once every 700000~1000000 packets. What could be the cause?

KevinOConnor commented 1 year ago

That's certainly unexpected.

Can you provide some additional information: What can2040 revision are you using? What was the rp2040 system clock speed? In the picture above, it seems like the spurious transmit is 6 passive bits after the ack - when this error occurs does it seem to occur with that same timing? When the error occurs does it seem that there is always just one spurious dominant bit (not multiple dominant bits)?

Also, can you confirm that the application using can2040 is only calling the library functions from a single rp2040 core and that the functions are not being invoked "reentrant"?

-Kevin

furusawata commented 1 year ago

Before the additional information, write the following information: I also confirmed the phenomenon with Klipper. The hardware is Raspberry Pi Pico and MAX3051.

I am running two commands. cangen can0 -g 10 -e -I 00EF0102 -L 8 -D 0000000000000000 Device Response [EID: 00EF0201 DLC: 8 Data: 00 00 00 00 70 70 00 00]

cangen can0 -g 11 -e -I 00EA0102 -L 3 -D 2EFF00 Device Response [EID: 00FF2E01 DLC: 8 Data: 00 00 00 00 00 00 00 80]

Log

0002000278  EF0102   E      [8]  00 00 00 00 00 00 00 00    ........
0002000279  EF0201   E      [8]  00 00 00 00 70 70 00 00    ....pp..
0002000286  EA0102   E      [3]  2E FF 00                   ...
0002000287  FF2E01   E      [8]  00 00 00 00 00 00 00 80    ........
0002000289  EF0102   E      [8]  00 00 00 00 00 00 00 00    ........
0002000289  EF0201   E      [8]  00 00 00 00 70 70 00 00    ....pp..
0002000297  EA0102   E      [3]  2E FF 00                   ...
0002000298  FF2E01   E      [8]  00 00 00 00 00 00 00 80    ........
0002000299  EF0102   E      [8]  00 00 00 00 00 00 00 00    ........
0002000299  EF0201   E      [8]  00 00 00 00 70 70 00 00    ....pp..
0002000309  EA0102   E      [3]  2E FF 00                   ...
0002000309  EF0102   E      [8]  00 00 00 00 00 00 00 00    ........
0002000310  FF2E01   E      [8]  00 00 00 00 00 00 00 80    ........
0002000310  EF0201   E      [8]  00 00 00 00 70 70 00 00    ....pp..

Trigger on error klipper_1

Expand klipper_2

Expand more klipper_3

In the picture above, it seems like the spurious transmit is 6 passive bits after the ack - when this error occurs does it seem to occur with that same timing?

Yes, it's always the same timing.

When the error occurs does it seem that there is always just one spurious dominant bit (not multiple dominant bits)?

Yes, always once.

The change in "Issues #23" was irrelevant.

thank you.

KevinOConnor commented 1 year ago

Okay, thanks. I reviewed the code and did not find anything suspicious. I'll try to reproduce the issue locally.

-Kevin

KevinOConnor commented 1 year ago

I have been able to reproduce this issue locally.

dualack

I used your cangen commands: cangen can1 -g 11 -e -I 00EA0102 -L 3 -D 2EFF00 (rp2040 on can1) and cangen can0 -g 10 -e -I 00EF0102 -L 8 -D 0000000000000000 (stm32f072 on can0). Errors also reported by stm32f072 (running candlelight) as seen with candump -t z -Ddex can0,#FFFFFFFF:

 (015.186853)  can0  TX - -  00EF0102   [8]  00 00 00 00 00 00 00 00
 (015.196975)  can0  TX - -  00EF0102   [8]  00 00 00 00 00 00 00 00
 (015.197102)  can0  RX - -  2000000C   [8]  00 00 10 00 00 00 00 00   ERRORFRAME
        controller-problem{}
        protocol-violation{{tx-recessive-bit-error}{}}
 (015.198474)  can0  RX - -  00EA0102   [3]  2E FF 00

In a ~15 minute capture I got 27 errors. I was able to capture about 10 of those on my logic analyzer. All occurrences of error seem to be when can2040 acks a packet and has a packet ready for transmit. Spurious transmission always seems to be one dominant bit in the 7th slot of EOF sub-frame.

Not sure root cause at this point. I will try to track it down.

-Kevin

KevinOConnor commented 1 year ago

This appears to be a logic error in the "report" state machine handling. The following change fixes the issue for me:

--- a/src/can2040.c
+++ b/src/can2040.c
@@ -817,6 +817,9 @@ report_note_parse_error(struct can2040 *cd)
 static void
 report_line_ackdone(struct can2040 *cd)
 {
+    // An ackdone event indicates an ack is present
+    if (cd->report_state == RS_NEED_RX_ACK)
+        cd->report_state |= RS_NEED_EOF_FLAG;
     // Setup "matched" irq for fast rx callbacks
     uint32_t bits = (cd->parse_crc_bits << 8) | 0x7f;
     pio_match_check(cd, pio_match_calc_key(bits, cd->parse_crc_pos + 8));

Without the above change, it is possible that a call to can2040_transmit() that occurs during an rx message, after an ack is generated, but before the 6th eof bit is received, could lead to incorrect behavior in report_line_txpending(). Depending on the timing of the can2040_transmit() it could result in a race condition in the PIO. Advancing the state in report_line_ackdone() should avoid the issue.

-Kevin

furusawata commented 1 year ago

I fixed it and confirmed that the problem does not occur. Thanks for the great software and the quick fix.

KevinOConnor commented 1 year ago

Thanks for testing. After further thought, I believe the change at #31 is a better fix.

-Kevin

KevinOConnor commented 1 year ago

I've merged in the fix contained in PR #31. (And I also tagged that code as "v1.4.1".)

Thanks again for reporting and testing. -Kevin