SpenceKonde / DxCore

Arduino core for AVR DA, DB, DD, EA and future DU-series parts - Microchip's latest and greatest AVRs. Library maintainers: Porting help and adviccee is available.
Other
189 stars 50 forks source link

Help troubleshooting library (interrupts) #277

Closed obdevel closed 2 years ago

obdevel commented 2 years ago

I have a problem with a library I use for the MCP2515 SPI CAN controller IC. It has worked on every other core I've thrown it at (AVR, Tiny, Arm, Pico, ESP, etc) but I get a hang with DxCore. The library's ISR uses SPI to send and receive CAN messages and manage memory buffers. I agree with the general principle that ISRs should do as little as possible but in this case it's important to handle incoming messages quickly so as not to miss any traffic (the chip has only two hardware buffers and there can be 100s of messages per second). The library is generally well written and documented, and the author (who is French) is active and amenable to PRs if we can identify the problem.

The ISR is 'just' doing an SPI.beginTransaction, some SPI.transfers and some data buffer management. Nothing too exotic or obviously dangerous.

Source file: https://github.com/pierremolinaro/acan2515/blob/master/src/ACAN2515.cpp

Basically, depending on the contents of a status register, the chip either has a new message, an available output buffer, or some error condition.

The ISR isr() starts at line 618 and calls isr_core() at line 625. I've have hacked in some debug code to flash an LED and it definitely gets into isr_core() but hangs somewhere thereafter.

I've tried both the 'old' and 'new' versions of attachInterrupt(). It is also possible to explicitly poll the chip by calling isr_core() from user code and this works fine. So, the problem is only when executing this code from an interrupt context.

Is there anything obviously dodgy in the code path ? Any ideas for further debugging ?

Thanks.

SpenceKonde commented 2 years ago

Oh krutz, seriously? But wait, you say even with tools submenu set to use the old attach Interrupt mode (which should literally have the exact same code as before I started making a mess of that) this doesn't work? o_O

And you are using 1.4.10? (1.4.9 has a horrible bug in pin interrupts, though it should only impact the new implementation)

As a test, I would also suggest trying the last 1.3.x version. If that works, then we know there's a problem with the new interrupt attaching code.

If not - then we have a less critical but far more surprising bug. In either event, the goal is to get it to work using new interrupt mode, Use digitalWriteFast(LED_PIN,HIGH); as your virtual test probe (assuming pin LED_PIN is connected to an led that turns on when the pin is driven high) to find the exact line in the library that it hangs on. That will tell us a great deal about what is going on. That line is a MUCH better test than digitalWrite() since it executes in a single clock cycle. If you have a bunch of leds, you can expedite the process of finding the line it hangs on with a construction like: void isr_core() { digitalWriteFast(LED_PIN1,HIGH); someotherline digitalWriteFast(LED_PIN2,HIGH); anotherline digitalWriteFast(LED_PIN3,HIGH); thirdline digitalWriteFast(LED_PIN4,HIGH);

If it doesn't always happen, or doesn't happen on the first pass, use CHANGE instead of HIGH to toggle the pin.

digitalWriteFast, even with "CHANGE" compiles to a single instruction, unlike digitalWrite() which compiles to.... a lot more than that.... because digitalWrite can handle the pin or what state to write the pin at runtime, and knows about PWM and how to know whether a pin has PWM (if PORTMUX.TCAROUTEA & 0x07 == digitalPinToPort(pin) for TCA0, (digitalPinToPort(pin)==0 && !(PORTMUX.TCAROUTEA&0x31) || digitalPinToPort(pin)==6 && PORTMUX.TCAROUTEA&0x31 == 0x11)) for TCA1 otherwise check the lookup table) and how to turn that off (and if it's a TCD0 pin we're writing, and we saw that it was on, we also have to clear the pin invert enable). Really it's a grotesquely long process that digitalWrite embarks on, and easily changes the behavior of the code you're investigating.

I don't have any hardware to test this on, see....

On Thu, Apr 28, 2022 at 8:19 PM obdevel @.***> wrote:

I have a problem with a library I use for the MCP2515 SPI CAN controller IC. It has worked on every other core I've thrown it at (AVR, Tiny, Arm, Pico, ESP, etc) but I get a hang with DxCore. The library's ISR uses SPI to send and receive CAN messages and manage memory buffers. I agree with the general principle that ISRs should do as little as possible but in this case it's important to handle incoming messages quickly so as not to miss any traffic (the chip has only two hardware buffers and there can be 100s of messages per second). The library is generally well written and documented, and the author (who is French) is active and amenable to PRs if we can identify the problem.

The ISR is 'just' doing an SPI.beginTransaction, some SPI.transfers and some data buffer management. Nothing too exotic or obviously dangerous.

Source file: https://github.com/pierremolinaro/acan2515/blob/master/src/ACAN2515.cpp

Basically, depending on the contents of a status register, the chip either has a new message, an available output buffer, or some error condition.

The ISR isr() starts at line 618 and calls isr_core() at line 625. I've have hacked in some debug code to flash an LED and it definitely gets into isr_core() but hangs somewhere thereafter.

I've tried both the 'old' and 'new' versions of attachInterrupt(). It is also possible to explicitly poll the chip by calling isr_core() from user code and this works fine. So, the problem is only when executing this code from an interrupt context.

Is there anything obviously dodgy in the code path ? Any ideas for further debugging ?

Thanks.

— Reply to this email directly, view it on GitHub https://github.com/SpenceKonde/DxCore/issues/277, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTXEWZKDLUCZY37WXK6VX3VHMTH3ANCNFSM5UUEXUVA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

--


Spence Konde Azzy’S Electronics

New products! Check them out at tindie.com/stores/DrAzzy GitHub: github.com/SpenceKonde ATTinyCore https://github.com/SpenceKonde/ATTinyCore: Arduino support for all pre-2016 tinyAVR with >2k flash! megaTinyCore https://github.com/SpenceKonde/megaTinyCore: Arduino support for all post-2016 tinyAVR parts! DxCore https://github.com/SpenceKonde/DxCore: Arduino support for the AVR Dx-series parts, the latest and greatest from Microchip! Contact: @.***

SpenceKonde commented 2 years ago

@obdevel any assistance you can provide here would be greatly valued.

obdevel commented 2 years ago

It's been a holiday weekend here and I had to assemble a board with enough external IO pins to attach some LEDs, and have a 328P-based board for comparison.

Long story short, it dies somewhere in SPI.transfer but the ISR calls this method multiple times and it doesn't hang on the previous calls.

My guess is there's an insidious overwrite hiding somewhere in my code or the library I'm using, which is only triggered by this core and compiler combination. My code, which is itself a library used by 100s of people across 1000s of boards, has been stable for a few years (it's a higher-level messaging protocol based on CAN bus). Annoyingly, no other hardware/library/compiler combination has exposed this problem so far.

I'll try some code refactoring to see if I can move the problem around. I can use a different MCP2515 library for the time being. It's less functional but does the job and C++ (yay) means I can do this by changing two lines in my code. It still does SPI transfers in an ISR but I've soak tested it and it's all fine.

Thanks for the debugging ideas. I'll close this issue for now and reopen it if anything new comes to light.

SpenceKonde commented 2 years ago

This is a very interesting issue for exactly the reasons you're mentioning.

Long story short, it dies somewhere in SPI.transfer but the ISR calls this method multiple times and it doesn't hang on the previous calls.

Okay, well stick leds on a bunch of pins, set them as outputs, and digitalWriteFast one of them high before each SPI transfer. Now you've caught the code in the act of failing, you can further instrument it and examine it more closely. There may very well be a bug in the core contributing here

SpenceKonde commented 2 years ago

I'd really like my core to not break code that works on "normal" arduino boards. if tjere os a bug of any sort in the code or any files supplied with it, this is a high priority issue

obdevel commented 2 years ago

So, the code hangs in SPI.transfer. Using an LED to debug (pinMode call in setup), the LED lights but doesn't extinguish:

  /*
  * The following NOP introduces a small delay that can prevent the wait
  * loop from iterating when running at the maximum speed. This gives
  * about 10% more speed, even if it seems counter-intuitive. At lower
  * speeds it is unnoticed.
  */

  asm volatile("nop");

  SPI_MODULE.DATA = data;
  digitalWriteFast(PIN_PC3, HIGH);
  while ((SPI_MODULE.INTFLAGS & SPI_RXCIF_bm) == 0);  // wait for complete send
  digitalWriteFast(PIN_PC3, LOW);
  return SPI_MODULE.DATA;                             // read data back
}

I have 5 LEDs attached, so I can visualise the code flow in the library when a new CAN message is received and the chip asserts the interrupt, i.e.

isr() -> isr_core() -> handleRXBInterrupt() -> bitModify2515Register() -> SPI.transfer() -> never returns

Explicitly polling the chip (calling isr_core() directly) succeeds when there is no message waiting but hangs if there is (= more SPI transfers). Also, there is plenty of SPI comms in the begin() routines earlier on.

I've swapped out both the AVR128DA28 and the MCP2515 chips, but I can't discount poor PCB layout affecting the SPI bus, although never had a problem on countless other boards, or with the other CAN library on this board. Also, the problem is absolutely repeatable.

FWIW, the SPI chip select pin is PIN_PD7 and the interrupt is attached to PIN_PF1.

Next steps ?

SpenceKonde commented 2 years ago

So PC3 ends up high.... in the hung state? just before calling SPI.transfer, can you: Serial.printHex(SPI.CTRLA); Serial.print(':'); Serial.printHexlbn(SPI.CTRLB); Serial.print(':'); Serial.printHex(SPI.INTCTRL); Serial.print(':'); Serial.printHex(SPI.INTFLAGS); Serial.flush();

(use the highest baud rate that works I wouldsugest to minimize the timing impact.

SpenceKonde commented 2 years ago

I want to know exactly how that SPI port is configured.

obdevel commented 2 years ago

SPI.transfer will return if I add a timeout, so it's otherwise just spinning in that while loop forever, e.g.

unsigned long c = 0UL;
while ((SPI_MODULE.INTFLAGS & SPI_RXCIF_bm) == 0 && (c < 100UL)) (++c);

It then hangs elsewhere in the ISR but debugging by moving digitalWriteFast calls around and recompiling is painfully slow, so it'll take some time to get more info.

I can only get one data byte to print when it hangs, even at 500000 baud, so I compiled the four items in one at a time and ran it a few times.

Normal operation, when called from main code, before and after SPI.transfer:

Before: 21:04:00:00 After: 21:04:00:80

When called from the interrupt context, the second item, SPI_MODULE.CTRLB = 00. That's the only difference. So,

Before: 21:00:00:00 After: unknown

I'll carry on debugging the library ISR with LEDs.

SpenceKonde commented 2 years ago

Badabing!

config is clearing the SS Disable bit, and SS is set as an input; if it floats low, SPI leaves master mode, everyting hits fam

(you only get one byte out even if you put a Serial.flush() after all the Serial.print statements, becfore the line that crashes it? That is unexpected. If you flush, it should send eveything in the serial buffer before proceeding.

SpenceKonde commented 2 years ago

Oh but not if the interrupt can't fire because it's in an infintie while loop...

obdevel commented 2 years ago

I've been leaving the external SS pull-up resistor out of my recent designs on the expectation that the chip or the code will be driving that signal correctly. Can't find an official SPI spec to get a definitive opinion on this and most references don't mention it (e.g. MCP appnotes). But rather like doing SPI stuff in an ISR, it may be inadvisable but it should still work, and does with other cores. Anything else to test ?

SpenceKonde commented 2 years ago

Yeah there will be something to test as soon as I commit my fix. I'm about to post a request for perverse pieces of code with no dependencies on libraries not included with the core, ut which to exercise as much of the core a possible (tjere was no SPI.begomtTramsactpom tets.

There's sort of a package of fixes to libraries coming to both cores soon. Logic. SPI. Wire,. amd some amalaog stiff

On Mon, May 16, 2022 at 8:54 PM obdevel @.***> wrote:

I've been leaving the external SS pull-up resistor out of my recent designs on the expectation that the chip or the code will be driving that signal correctly. Can't find an official SPI spec to get a definitive opinion on this and most references don't mention it (e.g. MCP appnotes). But rather like doing SPI stuff in an ISR, it may be inadvisable but it should still work, and does with other cores. Anything else to test ?

— Reply to this email directly, view it on GitHub https://github.com/SpenceKonde/DxCore/issues/277#issuecomment-1128285992, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTXEW2KF66OIKNCBEWC7UTVKLU4VANCNFSM5UUEXUVA . You are receiving this because you modified the open/close state.Message ID: @.***>

--


Spence Konde Azzy’S Electronics

New products! Check them out at tindie.com/stores/DrAzzy GitHub: github.com/SpenceKonde ATTinyCore https://github.com/SpenceKonde/ATTinyCore: Arduino support for all pre-2016 tinyAVR with >2k flash! megaTinyCore https://github.com/SpenceKonde/megaTinyCore: Arduino support for all post-2016 tinyAVR parts! DxCore https://github.com/SpenceKonde/DxCore: Arduino support for the AVR Dx-series parts, the latest and greatest from Microchip! Contact: @.***