Intermittent i2c errors, byte count incorrect and missing bytes in onReceive interrupt handler

earlephilhower / arduino-pico

Raspberry Pi Pico Arduino core, for all RP2040 and RP2350 boards

GNU Lesser General Public License v2.1

2.02k stars 421 forks source link

Intermittent i2c errors, byte count incorrect and missing bytes in onReceive interrupt handler #979

Closed chrisckc closed 1 year ago

chrisckc commented 1 year ago

I am posting this issue here first as I am using Arduino-Pico, but this could well be an issue in the underlying Pico SDK or even a hardware issue. I am going to do further testing but I will describe what I have tried so far below.

This is a bit long winded, mainly because writing it out is a good way for me to make sure I have covered everything I can think of and has helped me reach the conclusion that there must actually be an issue here somewhere.

Ok, so I have noticed some strange behaviour when sending data from one Pico to another using the i2c interface. I am sending 2 arrays of data from one pico to the other, the first array is 84 bytes long and the second is 48 bytes long, I understand that the wire library implementation for the Pico uses a 128 byte transmit buffer so this is within the buffer size.

The format of the data I am sending isn't really relevant, but it helped me spot the issue. The data I am sending is 2 arrays of 32bit float values, the first array is 21 floats which equals 84 bytes and the second array is 12 floats which equals 48 bytes. For each float array I send each group of 4 bytes sequentially using the write function until all the values have been placed in the buffer before calling endTransmission to send the data. The sequence I use for each array is like this: beginTransmission(slave_address); write(byte1); write(byte2); write(byte3); write(byte4); write(byte1); …. Rest-of-bytes … endTransmission(); So basically I am sending 2 separate i2c bus transmissions, the first one being 84 bytes followed immediately by another of 48 bytes.

On the receiving pico i check the expected size of the data as reported in the onReceive interrupt then depending on the size i then copy the bytes into one of 2 correctly sized byte arrays for later processing in the main loop.

In the main loop I convert the bytes from each array back into 2 arrays of floats so that I have the original float arrays, the first one being 21 floats = 84 bytes. And the second one being 12 floats = 48 bytes. I am using i2c1 (Wire1) on both Pico’s as this makes the pins better suit my requirements and layout.

When testing I started to notice that the second array on the receiving side was not being populated sometimes, digging further I found that this was because the onReceive interrupt sometimes only reported 47 bytes for the second transmission rather than 48 bytes, resulting in the transmission being ignored by the routine inside the onReceive interrupt . To see what was missing I modified the code so it filled the second array even if the received bytes was less then the expected value of 48. What I found was that on the occasions when only 47 bytes was reported, all of the floats were corrupted, being way off value, close to their max values. This indicates that the missing data was at the start of the i2c transmission which messed up the expected boundaries of the 4 byte groups which make up every float value.

So far I had mainly been using serial print debugging for this issue so used some more spare pins as debug signal outputs and connected up the scope to find out what was ging on. On the receiving side I setup one of the new debug pins to output a low pulse when the bytes received was less than expected for the second i2c transmission.

ScreenImg-31 The purple trace is the i2c clock, Blue SDA, yellow low pulse is the error marker from the debug pin, green is the data processing and error detection code running.

The pair of transmissions are here, rather than sending the second immediately after the first, in this trace they are separated by a 1mS gap which is one of the things I tried changing later. ScreenImg-30

By trigging off the debug pin I was able to determine from measuring the clock pulses for the address and data transfers that the correct number of bytes was being sent on the wire despite what was being reported in the onReceive interrupt. I also had some serial debugging on the sending Pic to report any errors from the endTransmission call, for which none were ever reported.

The error rate was high, around 50% of the second transmissions reported 47 bytes rather than 48. I was using an i2c clock of 1MHz (fast mode plus) as stated in the documentation as the fastest supported speed. I had also changed the pull up resistors from 4.7k to 1k to help shape the pulses, this also increased the actual clock speed as I was only getting around 700KHz (presumably due to clock stretching), with 1k pull-ups I am getting 868KHz. The 1k pull-ups made no difference to the error rate, but the rising edge slope was now better.

ScreenImg-29

I was sending the pair of transmissions at a frequency of 50Hz, so at 20 mS intervals, this left plenty space between each pair of i2c transmissions and I could confirm via use of debug output pulses that everything had finished its work well before the next onReceive interrupt. Some of the processing work was being done in core1 using loop1() so I eliminated this and ran everything in core0, commenting out loop1(), this made no difference.

On the scope I noticed that there was quite a delay between the end of the clock pulse block and the start of the onReceive interrupt (around 220uS). This meant that the next i2c transmission was arriving before the interrupt had fired. This resulted in a short break in the SCL pulses during the interrupt handling, I thought maybe this was the issue, but why was it only intermittent?

To rule this out, on the transmitting side, I added a 500uS delay between the first and second transmissions to see what difference it would make, this made the error rate drop to around 10%. I was able to confirm on the scope that all processing related to the first transmission was complete before the second transmission started, but the missing data was still occurring. I then increased this to 600uS just for the heck of it to create a larger gap and found the error rate dropped further to around 5%.

Having previously had issues with random interrupt firing delays on pins when using the USB-UART serial comms on a Pico, due to that also using interrupts, I wondered if the USB-UART serial communications on the receiving side was having an effect as there was some serial debugging output being done in between the first and second transmission blocks. Previously this would have overlapped with the reception of the second transmission until I added the 500uS delay on the transmitter. I moved this serial debug output code completely outside of the pair of transmission receptions so that it occurred in the dead 20ms space between each pair of transmissions, this further reduced the error rate to around 3%

Stage 2:

So far I have been able to confirm that the transmitting side was working correctly and the scope showed the correct number of clock pulses. Having spent a whole bunch of time debugging this issue and making a mess of the code branch with all the extra debugging stuff I decided to create a test harness project for this with a dedicated sketch for each pico.

The test harness I created to debug this issue is here on my GitHub account: https://github.com/chrisckc/TestHarness-I2C

For the test harness created 2 sequential float arrays (1.1, 1.2, 1.3 etc.) of the same sizes I was using in my project. I dumped them out on the serial port as their respective byte arrays in hex and binary format to aid debugging on the scope. I used a lower 10Hz frequency for transmission of the 2 arrays and decoded them back into floats on the receiving side as before. This time I also allowed for missing data in the first transmission to be processed I had purely been focusing on what was going on in the second transmission so far. I also maintained the delay between the first and second transmissions, increasing it to 1000uS

The only real differences I noticed with the new test code was that I sometimes had 2 missing bytes rather than 1 missing, also the missing byte(s) are at the end rather than at the beginning as only the last float value was corrupted.

My test code has confirmed that the issue is still there, and is also occurring in the first transmission, but only at a round half the error rate of the second transmission. I also used the protocol decoder on my scope to check the bytes against the hex values that I output over serial debugging. By triggering off a low pulse from a debug pin indicating missing data I was able to confirm that all of the the bytes were indeed correct and present, despite what the onReceive interrupt reported.

The 48 bytes transmitted in the second transmission are as follows, separated by dashes and comma for clarity: ( float1byte1-float1byte2-float1byte3-float1byte4,float2byte1-float2byte2-float2byte3-float2byte4, .... ) 66-66-46-40,CD-CC-4C-40,33-33-53-40,9A-99-59-40,00-00-60-40,66-66-66-40,CD-CC-6C-40,33-33-73-40,9A-99-79-40,00-00-80-40,33-33-83-40,66-66-86-40

Here is the protocol decoder output from the wire: ScreenImg-26 In the above scope image, an error has been detected on the second transmission of the pair, the yellow trace low pulse is the error marker, purple is i2c clock and blue is i2c SDA, green is the marker for the code which processes to data and also generates the yellow pulse if there is an error. As can be seen, the protocol decoder has decoded the data, showing the data in the second transmission. The last byte is 0x40 which is clearly visible in the blue trace. The interrupt handler only reported 47 bytes for this transmission and the decoded float value array had a corrupted last value due to the 0x40 not being received.

Ok, so now armed with the test harness and having confirmed that the correct bytes are always being sent, regardless of an error being detected at the receiver, I tried the following:

Reduced the transmission rate (the gap between each pair of transmissions), to 1Hz, errors still occurred around same percentage, I changed it back to 10Hz for the rest of the changes below.
Reduced the i2c clock to 400KHz (actually about 365KHz as measured), errors were now down to around 0.5 to 1%
Changed some serial debug output so that it occurred in-between the first and second transmission blocks as it was originally, errors now increased to around 1 to 2%
Reduced the i2c clock further to 100KHz (actually about 95 KHz as measured), errors were now down to around 0.007% and I had to leave it running for a while to get a decent amount of failures to generate the average failure percentage.

The circuit that I originally encountered this issue on is built on a protoboard with soldered wires, the connections between the i2c pins on each pico are very short (around 15mm) however there is an ADC connected to the transmitting Pico which is also being clocked by it using the Pico’s internal 12MHz clock output on a pin. The receiving pico was actually a PicoW and I also wanted to rule that out of the picture, even though there was no Wifi code being used at this stage.

Stage 3:

To rule out my hardware configuration I rebuilt a simple version on a breadboard using 2 brand new Pico’s (no PicoW involved here) this time the i2c wires are bit longer at around 50mm due to the breadboard layout constraints. As before I connected the VBUS pins together so that the USB connected to one of the Pico’s also powers the other Pico. Obviously the ground pins are also connected together.

Breadboard test setup: 20221111_192715

My actual project prototyping hardware which I originally encountered this issue on: 20221112_195312

Results were the same on the breadboard setup, the issue still existed with the failure rates around the same percentages.

I then connected the VSYS pins of each Pico together and powered VSYS from 4 NiMH cells in series, around 5v. I used VSYS rather than VBUS to take advantage of the diode to avoid back-powering the USB bus on the laptop when connecting it to the USB port on the Pico.

I left it running with nothing else connected, no USB cable, no scope, no test wires sticking out of the breadboard etc. After some time had passed I connected the USB cable to the receiving Pico to check the serial output and found many errors had built up, at around the same rate as when i was powering from USB and had the scope connected. I left this running over night and it accumulated around 12,000 errors across both the first and second transmissions in the pair.

Minimal battery operated setup: 20221112_160808

The reason for the above test was to rule out any noisy USB power issues, although seeing as the 3.3v reg on the Pico is a buck-boost converter this was unlikely to make any difference, and sure enough did not.

Next steps:

Having so far not determined if this is a hardware or software issue, my next steps will be as follows:

Attempt to read the expected number of bytes inside the onReceive interrupt rather than what is supplied by the interrupt handler function, Although this would not account for data missing at the start of the transmission as observed initially.
Bypass the Arduino Wire API and make use of the underlying Pico SDK directly, while also having a good look at the Wire API implementation.
Swap out just the receiving Pico for a different dev board, say a Teensy4
Swap out just the transmitting Pico for a different dev board, say a Teensy4
As I am currently using i2c1 on both of the Pico’s (Wire1) change the code and pins to use i2c0 (Wire)
Go back to the 2 Pico’s and power them both from an external 3.3v linear reg, bypassing the switching regulator.
Play around with the amount of data being sent in each transmission so see what difference it makes.

chrisckc commented 1 year ago

Ok, so I have ruled out item 1 in my next steps by updating then code to read the expected number of bytes regardless of what is reported by the interrupt handler function.

According to the Wire API code, if Wire.read() returns -1 it means data is not available, so I have captured this error condition and outputted over serial. I can confirm that Wire.read() does indeed return -1 when attempting to read past the byte count reported by the interrupt handler.

The updated code has been pushed to a new branch: dev/readexpectedbytes https://github.com/chrisckc/TestHarness-I2C/tree/dev/readexpectedbytes

chrisckc commented 1 year ago

Ok, so back on the master branch I have updated to test harness i2c receiver code to also compile for Arduino MBED core and added an entry in the PlatformIO.ini for the target (pico-mbed)

I had to change the receiver to use i2c0 because there is no supported or easy way to use i2c1 on the Arduino MBED core. This meant I had to change the debug output pins because they clashed with the i2c0 default pins, and there is also no sensible way to change the i2c pins on the Arduino MBED core. I also had to create a Macro to deal with the missing Serial.printf function on the Arduino MBED core.

After re-wiring the breadboard for the new i2c and debug pins, and before testing it out on the MBED core i checked to make sure the issue still existed using i2c0 on Arduino-Pico, which it does, so no difference between using Wire1 and Wire, so that's another item off the checklist.

So how did it work for the MBED core? well it didn't work at all, the onReceive interrupt handler function reported seemingly random numbers for the number of bytes received, anything from 114 bytes to 256 bytes. Another issue was that none of the debug output pins are firing, not even the one triggered inside the onReceive interrupt handler, which was clearly firing in response to received data. The debug pins are all permanently high as was configured during setup(), nothing showing or triggering on the scope traces.

To assess the debug pins issue, I created another project for output testing using the same pins, it is basically the same code but with all the i2c stuff taken out and a pin interrupt configured for testing. Using this code I was able to see output on the scope for the same debug pins when compiled for the MBED core. I was also able to change the outputs during the ISR I added which is triggered off another debug pin.

I don't know why this is happening, it doesn't make much sense, the Wire API is basically the same bar some limitations, that's the whole point of having the API definition in the first place, so it should behave the same way for both targets. The debug pins not working for the MBED target is also a strange one, my testing so far seems to suggest its related to i2c being used, which makes no sense.

Note: I had previously tried the Arduino MBED core for my project but found it to be unusable for high speed interrupt handling, well not actually that high speed, only 12KHz, however even this was too much for the MBED RTOS, way to much delay and jitter in the interrupt handler timing and everything just ran significantly slower. However this should not be an issue for simply reading some i2c data sent at intervals of 10Hz as in the i2c receiver code.

chrisckc commented 1 year ago

I have now tried item 3 in my list: "Swap out just the receiving Pico for a different dev board, say a Teensy4" I have updated the master branch in my test harness, the i2c-receiver code now compiles for 3 targets, pico, pico-mbed and teensy40

I have the Teensy 4.0 wired into the breadboard with the i2c wires moved from the Pico receiver to the Teensy4.0 20221115_114548

After leaving it running for 50 minutes I only had 1 failure recorded, this is several orders of magnitude better than what I have seen so far.

Seconds: 0003014 
LoopRate: 9283477
receiveRate: 0000010
drCount: 0059537
dr1Count: 0029769 dr1SuccessCount: 0029768 dr1FailureCount: 0000001
dr2Count: 0029768 dr2SuccessCount: 0029768 dr2FailureCount: 0000000
dr1FailureRate:   0.0033592 percent
dr2FailureRate:   0.0000000 percent

So out of 29769 transmission pairs I had a single failure, this time on the first transmission, which was a missing byte at the start of the first transmission resulting in all of the float values corrupted. So far until swapping out the receiver for the Teensy4.0 I had only ever seen missing bytes at the end of each transmission.

Considering that I was powering the setup from the USB port of a thunderbolt docking station and I also had the scope connected to both i2c lines I might expect to encounter the odd failure, especially using a breadboard.

There is also the possibility that while I was away from my desk making coffee, the cat could have jumped onto my desk and trod on it, he likes to do this with electronics. That was definitely not the reason for what I have seen up to now though as I can see those failures occurring while I am set looking at it as they are so frequent.

chrisckc commented 1 year ago

Ok, so now after more than an hour I still only had a single error using the Teensy4.0 as a receiver, so I decided to crank everything up a notch...

I updated the i2c-transmitter to use a 1MHz i2c clock instead of the default 400KHz clock and increased the send rate to be 50Hz rather than 10 Hz as this was my original configuration when I spotted the issue.

Having made sure that the cat, who is now sleeping, could not get into the office when I was not there, I left the Teensy4.0 running the same code to see how it did.

With the Teensy4.0 receiver at 1MHz i2c clock, zero failures after 42 minutes and over 120,000 transmission pairs:

Seconds: 0002541
LoopRate: 8924780
receiveRate: 0000047
drCount: 0241018
dr1Count: 0120509 dr1SuccessCount: 0120509 dr1FailureCount: 0000000
dr2Count: 0120509 dr2SuccessCount: 0120509 dr2FailureCount: 0000000
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.0000000 percent

Based on this I think I can rule out the single failure at 400KHz as a signal anomaly likely caused by disturbance of the breadboard wires, blaming the cat here.

I found that my scope, Siglent SDS1104X-E could no longer properly decode the i2c data though, only the address and then 1 or 2 bytes.

Actual clock speed was around 862KHz: ScreenImg-32

chrisckc commented 1 year ago

For the sake of consistency i reconnected the the Pico as the receiver using the same code, unmodified and compiled for the Pico using this library, Arduino-Pico. 20221115_132007

I saw failures straight away, same as before but interestingly the error rate is now less, despite that fact that I am using 1MHz i2c clock and 50 Hz send rate. The only difference here in the hardware from previous tests is that the 1k pull-up resistors are now connected to the transmitting Pico's 3.3v line rather than the receiving Pico. This was the same as when testing the Teensy4.0 as the receiver earlier with no errors at 1MHz.

After around 10 minutes I encountered 8 failures, here are the results for the Pico receiver at 1MHz:

Seconds: 0000767
LoopRate: 0598019
receiveRate: 0000048
drCount: 0072454
dr1Count: 0036227 dr1SuccessCount: 0036227 dr1FailureCount: 0000000
dr2Count: 0036227 dr2SuccessCount: 0036219 dr2FailureCount: 0000008
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.0220830 percent

chrisckc commented 1 year ago

Pico receiver at 400KHz i2c clock, and 10Hz send rate, the output below shows around half the number of transmissions as in the previous output, it takes longer to gather the data.

3 failures, so more or less the same at slower i2c clock and send rate.

Seconds: 0001870
LoopRate: 0640086
receiveRate: 0000010
drCount: 0036822
dr1Count: 0018411 dr1SuccessCount: 0018410 dr1FailureCount: 0000001
dr2Count: 0018411 dr2SuccessCount: 0018409 dr2FailureCount: 0000002
dr1FailureRate:   0.0054315 percent
dr2FailureRate:   0.0108631 percent

chrisckc commented 1 year ago

I swapped the pull-up resistors back to the receiving Pico's 3.3v line, this seemed to increase the error rate slightly:

Seconds: 0001925
LoopRate: 0640088
receiveRate: 0000010
drCount: 0037910
dr1Count: 0018955 dr1SuccessCount: 0018953 dr1FailureCount: 0000002
dr2Count: 0018955 dr2SuccessCount: 0018950 dr2FailureCount: 0000005
dr1FailureRate:   0.0105513 percent
dr2FailureRate:   0.0263783 percent

I also noticed a failure where data was missing at the start of the second transmission, which is normally much less common.I am fairly convinced by now that this is some kind of hardware issue.

chrisckc commented 1 year ago

According to the documentation: https://datasheets.raspberrypi.com/pico/getting-started-with-pico.pdf Page 57: "Using Pico Probe" Powering one pico from another using the VSYS pins is a supported configuration, this is how I had it configured when running from batteries and still observed errors. At the moment running from USB power, the VBUS pins are connected together, the only difference being the Schottky diode in between VBUS and VSYS.

chrisckc commented 1 year ago

I would like to try powering the receiving Pico from the 3.3v line of the transmitting Pico, However according to this documentation: https://datasheets.raspberrypi.com/rp2040/hardware-design-with-rp2040.pdf Page 16:

The 3.3V pin is an output from Raspberry Pi Pico or Raspberry Pi Pico W and should not be connected to an external power source. It is intended to be used as an output to provide power to external circuits.

There is however the 3V3_EN pin which disables the onboard SMPS. Page18:

The SMPS EN pin is pulled up to VSYS by a 100kΩ resistor and made available on Pico pin 37. Shorting this pin to ground will disable the switcher and put it into a low power state.

So in theory I can disable the SMPS on the receiving Pico and power it's 3.3v line from the transmitters 3.3v line. Having both supplies active with the outputs connected together is not going to be a good idea.

The is also the power save feature on the SMPS:

GPIO23 controls the RT6150 PS (Power Save) pin. When PS is low (the default on Pico) the regulator is in Pulse Frequency Modulation mode, which, at light loads, saves considerable power by only turning on the switching MOSFETs occasionally to keep the output capacitor topped up. Setting PS high forces the regulator into Pulse Width Modulation (PWM) mode. PWM mode forces the SMPS to switch continuously, which reduces the output ripple considerably at light loads (which can be good for some use cases) but at the expense of much worse efficiency. Note that under heavy load the switcher will be in PWM mode irrespective of the PS pin state.

So it would be worth setting PS to high on both Pico's in order to reduce the output ripple, I will try this before powering via the 3.3v pin.

chrisckc commented 1 year ago

Ok, so I added the following code to put the SMPS into PWM mode.

  #ifdef ARDUINO_RASPBERRY_PI_PICO
        pinMode(23, OUTPUT);
        digitalWrite(23, HIGH);
    #endif

The above lines were added to the setup of the both the i2c-transmitter and i2c-receiver code. I used ARDUINO_RASPBERRY_PI_PICO as it is defined when using either Arduino-Pico or Arduino-MBED. Results below , as before compiled for Arduino-Pico.

I saw 11 failures after about1 hour and around 37,000 transmission pairs:

Seconds: 0003804
LoopRate: 0640098
receiveRate: 0000010
drCount: 0075146
dr1Count: 0037573 dr1SuccessCount: 0037569 dr1FailureCount: 0000004
dr2Count: 0037573 dr2SuccessCount: 0037566 dr2FailureCount: 0000007
dr1FailureRate:   0.0106459 percent 
dr2FailureRate:   0.0186304 percent

chrisckc commented 1 year ago

Update: I pulled the 3V3_EN pin low to disable the SMPS on the receiving Pico and powered it from the 3.3V output of the transmitting Pico, this made no difference, if anything slightly more errors. I also tried swapping the pull-up resistors back over to the transmitting Pico's side of the i2c wires and 3.3V output, again no real difference, to be expected as the 3.3V lines are now connected together.

UPDATE: It has been running in this configuration now for several hours, I2c clock at 400KHz and send rate at 10Hz Over a thousand failures, worse than before, this is averaged over a longer time period so more accurate.

Seconds: 0009917
LoopRate: 0508664
receiveRate: 0000010
drCount: 0196174
dr1Count: 0098087 dr1SuccessCount: 0097633 dr1FailureCount: 0000454
dr2Count: 0098087 dr2SuccessCount: 0097495 dr2FailureCount: 0000592
dr1FailureRate:   0.4628544 percent
dr2FailureRate:   0.6035458 percent

chrisckc commented 1 year ago

Here is a capture on the scope of the 3.3V line during a pair of i2c transmissions, the yellow low pulse indicates an error in the second transmission.

The noise on the 3.3V line is quite bad at around 100mV during i2c activity. ScreenImg-33

The clock and SDA data pulses are pretty clean though, in the traces below, yellow is connected to SDA, purple to Clock and blue to 3.3V: ScreenImg-34

The last resort test now is the run both Pico's from a linear regulator, disabling both of their onboard switched-mode power supplies.

chrisckc commented 1 year ago

So now I have wired in a TC1264 800mA LDO, overkill I know, but it's what I had, the SMPS is disabled on both Pico's and 3.3V pins are wired together to the LDO output complete with recommended smoothing caps.

The 3.3V line is now cleaner than before, also the voltage is now exactly 3.3v rather than 3.456v with the SMPS ScreenImg-35

I noticed previously that each Pico had a different voltage on the the 3.3V pins when I had them both using their own 3.3V lines, one Pico had 3.45v and the other 3.28V. This is a good reason not to connect them together without first disabling the onboard SMPS.

I will post the failure rate once it has been running for some time.

chrisckc commented 1 year ago

I left the new setup running overnight with batteries powering the 3.3V linear regulator which is now powering both boards via the 3.3V pin with the onboard SMPS disabled. Nothing else was connected. As with the previous test, 400KHz i2c clock and 10Hz send rate. 20221116_110551

Results still show many failures and as usual, the failure rate for the second transmission is higher than the first at 3.5%

Seconds: 0041118
LoopRate: 0508635
receiveRate: 0000010
drCount: 0813418
dr1Count: 0406709 dr1SuccessCount: 0402957 dr1FailureCount: 0003735
dr2Count: 0406709 dr2SuccessCount: 0392305 dr2FailureCount: 0014387
dr1FailureRate:   0.9183470 percent
dr2FailureRate:   3.5374186 percent

chrisckc commented 1 year ago

I have just tried again compiling for the official ArduinoCore-MBED Core to try and figure out why I am not seeing the debug pulse outputs and getting seemingly random numbers between 114 bytes to 256 bytes as the byte count inside the onReceive interrupt handler function.

I updated the i2c-receiver code to make the serial output easier to read due to the way ArduinoCore-MBED handles serial communication (only sends the serial data upon encountering a carriage return char apparently).

From looking at the output it was clear that the onReceive interrupt handler was only firing between 3 and 4 times per second by watching the drCount (data received count) go up compared to the Seconds counter. Each pair of data transmissions are being sent 10 times per second so most of the interrupts are missing.

Seconds: 0000049
LoopRate: 0048038
receiveRate: 0000003
drCount: 0000135
dr1Count: 0000135 dr1SuccessCount: 0000000 dr1FailureCount: 0000131
dr2Count: 0000000 dr2SuccessCount: 0000000 dr2FailureCount: 0000000
dr1FailureRate:  97.0370407 percent
dr2FailureRate:   0.0000000 percent

dr1: Error!!, received: 172 expected: 84 bytes! | dr1: Error!!, received: 120 expected: 84 bytes! | dr1: Error!!, received: 119 expected: 84 bytes! | dr1: Error!!, received: 181 expected: 84 bytes! |

To help debug this I slowed down the transmission rate to 1Hz and set the scope to 500mS per division. I found the debug pulses which I previously thought were missing, they were nowhere near where I expected to find them as the delay between the end of the pair of i2c transmissions and the start of the onReceive interrupt is 264 mS !!! This is why I was only seeing between 3 and 4 i2c interrupts per second, it seems that this is all that the MBED RTOS can handle?

In this zoomed out scope trace below covering several seconds, purple is the i2c clock, blue is the onReceive interrupt, green is some processing, yellow is the error marker. the purple cursors are measuring the time between the i2clock ending and the start of the interrupt: ScreenImg-36

The gap between each pair of transmissions is still set at 1mS and the Arduino-Pico core can handle this no problem with the interrupt firing within around 20 microseconds of the the end of the first transmission, as would be expected as it is running on "bare metal", or more like "bare-silicon".

As for ArduinoCore-MBED, I would expect an RTOS to add a delay to this but certainly not a quarter of a second, this is hardly "real time" for a "Real Time Operating System"

This means that the assumptions made in the i2c-receiver for testing purposes are not valid for ArduinoCore-MBED because in theory the data for both transmissions should be available inside the onReceive interrupt. If it is not all read, which is what is happening with the code as it is now then presumably there will be data left over in the buffer for the next interrupt and hence result in random numbers for the subsequent byte counts.

chrisckc commented 1 year ago

Ok, I think I am on to something, I implemented the i2c slave mode using the SDK functions, via small library I found which was written against the SDK, to save time. Using this code I am still getting errors.

This is how i2c slave mode works using the SDK: Using the i2c functions in the SDK, the receive interrupt handler is fired immediately after each byte is received on the wire, with the occasional small delay. The Wire API as implemented in Arduino-Pico is aggregating these interrupts and firing a single interrupt (onReceive) the end of the transmission to match the intended Wire API behaviour.

The receive interrupt handler as used by the SDK provides an event variable to distinguish what the interrupt is for. Inside the receive interrupt handler there needs to be a switch case statement on this event to check if the data has been received, has been requested or has finished being sent. To debug this I setup 2 debug pins on each of these conditions so I can capture when each data byes arrives and when the transmission has finished.

The problem:

What I have found so far is that when the error condition occurs, there is an extra firing of the i2c receive interrupt immediately after the interrupt which signalled the end of the transmission. The extra interrupt firing is a "data received" event, indicating that the "end of transmission" event interrupt occurred before the interrupt signalling the reception of the last byte. This looks like an out-of-order interrupt firing which is happening sometimes.

After the error condition occurs, there is then a spurious "end of transmission" interrupt at the start of the reception of the next transmission.

For successful data reception, the "end of transmission" interrupt occurs 3.4uS after the interrupt for the last byte, just slightly longer than the i2c clock period of 2.7uS

Rather than continue this here, as it is now clear that there is an underlying SDK or hardware issue, I posted an issue on the Pico-SDK repo: https://github.com/raspberrypi/pico-sdk/issues/1102

earlephilhower commented 1 year ago

Lots of debug, sorry for not responding as I was traveling until last night.

It has been a while since I wrote the I2C driver here, but IIRC there is only 1 I2C interrupt and we need to examine flag bits in a register to figure out why we got the IRQ.

My first guess is that this is a silly problem due to the order in which I check the flags in the handle.

I think I just need to move the "read data" IRQ handling bit to the top of my IRQ routine. and then the missing byte will be present. Right now if both EOT and RX bits are set, I ignore that RX bit. Race condition, but in my code not in the HW.

More concretely, why not try moving https://github.com/earlephilhower/arduino-pico/blob/2062f94adc179b23e9bcead9abc9432ecdc37a66/libraries/Wire/src/Wire.cpp#L179-L186 to the top of the function, before this https://github.com/earlephilhower/arduino-pico/blob/2062f94adc179b23e9bcead9abc9432ecdc37a66/libraries/Wire/src/Wire.cpp#L144-L150

Can you give that a try?

chrisckc commented 1 year ago

Thanks for looking at this, I think I have effectively tried what you are suggesting in this version of the i2c-receiver code which bypasses the Wire interface, using the SDK functions instead, via this helper library: https://github.com/vmilea/pico_i2c_slave

Here is the test code I created against the SDK using the above library: https://github.com/chrisckc/TestHarness-I2C/blob/a765f54aae7c3f891392bf00dcf2a20112684018/i2c-receiver-sdk/src/main-receiver.cpp#L109

I logged this separately on the Pico-SDK repo: https://github.com/raspberrypi/pico-sdk/issues/1102

The i2c_slave_handler handler function in the above code is called after each byte is received, I first check the event type to check if it is data received, this is done before the event is checked for being a finish event. What I see on the scope when I encounter a missing byte is that there is an extra firing of the interrupt immediately after the finish event.

I don't know what it looks like on the occasions when there are 2 missing bytes at the end as I have not been able to capture that yet on the scope as it is much less frequent, but I would guess that there would be 2 extra firings of the interrupt after the finish event.

I will try what you have suggested by modifying the code in the Wire API in Arduino-Pico ~~but in theory it should match what I am doing in the above code using the SDK directly.~~ EDIT: I just noticed in the helper library I am using for the SDK i2c stuff that it is also checking for a finish event before it is checking for a receive event before passing the event type to the interrupt in my code. I will also try modifying that as you have suggested. Even so, it doesn't make sense that the issue is intermittent.

I can see that the Wire API Code is just aggregating the interrupts for each byte received and presenting it as a single interrupt after the finish event is seen, which makes perfect sense but can only work if the SDK or the Silicon is behaving itself and signalling the end of the data transfer correctly.

chrisckc commented 1 year ago

Ok, I tried moving the check for the data received flag to the top of the function like this:

void TwoWire::onIRQ() {
    if (_i2c->hw->intr_stat & (1 << 2)) {
        // RX_FULL
        if (_slaveStartDet && (_buffLen < (int)sizeof(_buff))) {
            _buff[_buffLen++] = _i2c->hw->data_cmd & 0xff;
        } else {
            _i2c->hw->data_cmd;
        }
    }
.......

No difference, still getting errors:

Seconds: 0000642
LoopRate: 0609568
receiveRate: 0000010
drCount: 0012586
dr1Count: 0006293 dr1SuccessCount: 0006293 dr1FailureCount: 0000000
dr2Count: 0006293 dr2SuccessCount: 0006285 dr2FailureCount: 0000008
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.1271254 percent

chrisckc commented 1 year ago

I have also just tried this, re-arranging it further by putting the flag checks which result in the _onReceiveCallback callback being called at the end of the check list:

void TwoWire::onIRQ() {
    if (_i2c->hw->intr_stat & (1 << 2)) {
        // RX_FULL
        if (_slaveStartDet && (_buffLen < (int)sizeof(_buff))) {
            _buff[_buffLen++] = _i2c->hw->data_cmd & 0xff;
        } else {
            _i2c->hw->data_cmd;
        }
    }
    if (_i2c->hw->intr_stat & (1 << 5)) {
        // RD_REQ
        if (_onRequestCallback) {
            _onRequestCallback();
        }
        _i2c->hw->clr_rd_req;
    }
    if (_i2c->hw->intr_stat & (1 << 6)) {
        // TX_ABRT
        _i2c->hw->clr_tx_abrt;
    }
    if (_i2c->hw->intr_stat & (1 << 10)) {
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = true;
        _i2c->hw->clr_start_det;
    }
    if (_i2c->hw->intr_stat & (1 << 12)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_restart_det;
    }
    if (_i2c->hw->intr_stat & (1 << 9)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_stop_det;
    }
}

Still seeing errors:

Seconds: 0000309
LoopRate: 0609619
receiveRate: 0000010
drCount: 0005928
dr1Count: 0002964 dr1SuccessCount: 0002964 dr1FailureCount: 0000000
dr2Count: 0002964 dr2SuccessCount: 0002958 dr2FailureCount: 0000006
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.2024291 percent

chrisckc commented 1 year ago

Ok, so I also modified the interrupt handling function inside the i2c library I was using against the Pico-SDK: https://github.com/vmilea/pico_i2c_slave/blob/a25d1040950cdb4d37b50e05886c6195307f245c/i2c_slave/i2c_slave.c#L26

I re-ordered the checks to this so that the checks which call finish_transfer at the end:

static void __not_in_flash_func(i2c_slave_irq_handler)(i2c_slave_t *slave) {
    i2c_inst_t *i2c = slave->i2c;
    i2c_hw_t *hw = i2c_get_hw(i2c);

    uint32_t intr_stat = hw->intr_stat;
    if (intr_stat == 0) {
        return;
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RX_FULL_BITS) {
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_RECEIVE);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RD_REQ_BITS) {
        hw->clr_rd_req;
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_REQUEST);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_TX_ABRT_BITS) {
        hw->clr_tx_abrt;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_START_DET_BITS) {
        hw->clr_start_det;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_STOP_DET_BITS) {
        hw->clr_stop_det;
        finish_transfer(slave);
    }
}

It has not been running long enough to rule out any missing data but so far it is looking good, no errors after having been running for 3 times longer than in my previous 2 posts.

Seconds: 0001681
LoopRate: 0608879
receiveRate: 0000010
drCount: 0033104
dr1Count: 0016552 dr1SuccessCount: 0016552 dr1FailureCount: 0000000
dr2Count: 0016552 dr2SuccessCount: 0016552 dr2FailureCount: 0000000
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.0000000 percent

The only difference I can see from your code is that you handle the restart event which also triggers the _onReceiveCallback as a finished transfer for user code and you take no action for the abort event. The library I am using does not handle the restart event but also treats the abort event as a finish transfer for user code.

I will provide an update after more time has passed:

chrisckc commented 1 year ago

Oh dear, spoke to soon!

Seconds: 0002309
LoopRate: 0608858
receiveRate: 0000010
drCount: 0045533
dr1Count: 0022766 dr1SuccessCount: 0022765 dr1FailureCount: 0000001
dr2Count: 0022767 dr2SuccessCount: 0022766 dr2FailureCount: 0000001
dr1FailureRate:   0.0043925 percent
dr2FailureRate:   0.0043923 percent

chrisckc commented 1 year ago

left it running for a while, still getting errors but its the lowest failure rate I have seen so far:

Seconds: 0011096
LoopRate: 0492087
receiveRate: 0000010
drCount: 0219537
dr1Count: 0109766 dr1SuccessCount: 0109761 dr1FailureCount: 0000005
dr2Count: 0109771 dr2SuccessCount: 0109766 dr2FailureCount: 0000005
dr1FailureRate:   0.0045551 percent
dr2FailureRate:   0.0045549 percent

The boards are running from USB power at an i2c clock of 400KHz. I can try leaving them running overnight on batteries using a linear LDO in place of the SMPS regulators on the boards.

earlephilhower commented 1 year ago

By inspection I see what I've got does need that movement to check for EOT only after everything. I also need to make the check do a single atomic read of the IRQ state (like you did by copying to a local variable) per pass of the function because there's also the race condition of the EOT going high while the IRQ handler is running.

Per the RP2040 datasheet the I2S block is a standard Synopsys part so I am really inclined to think this is either a still-lurking software issue or a timing/signal one in your design. That Synopsys block has been instantiated probably 100s of milllions of times by now in other systems.

Maybe GCC is being "smart" and eliding the local variable and reading the IRQ flags on each if instead of the local atomic copy? You could try bracketing the code with

#pragma GCC push_options
#pragma GCC optimize ("O0")

...the IRQ handler goes here...

#pragma GCC pop_options

(there's probably a more "correct" way of doing this, but the O0 is a simple thing to try).

In any case, I will adjust my existing code to fix the found-by-inspection bits we just discussed.

chrisckc commented 1 year ago

Thanks, ill try that modification.

Btw. It has been running for a few more hours since my last post and I only have 1 extra failure. EDIT: I Also noticed that at least one of the failures was due to one of more bytes missing at the start of the data transfer. This is something I have always seen but was not very common.

Seconds: 0019900
LoopRate: 0492068
receiveRate: 0000010
drCount: 0393874
dr1Count: 0196934 dr1SuccessCount: 0196928 dr1FailureCount: 0000006
dr2Count: 0196940 dr2SuccessCount: 0196934 dr2FailureCount: 0000006
dr1FailureRate:   0.0030467 percent
dr2FailureRate:   0.0030466 percent

What is also different is that the number of failures for each data transfer are the same, which I have never seen before. normally the second one has more significantly failures than the first.

It looks like there are likely multiple causes for the missing data and the single atomic read of the IRQ state along with the re-ordering of the IRQ state checks has fixed one of them.

I agree about the i2c hardware, I have been reluctant to assume it was a hardware issue, but I have done my best to rule it out along the way, even going as far as bypassing the noisy SMPS regulators for a linear reg and running on batteries. Having seen this on both a soldered circuit and on a breadboard, both with short wires and also with different Pico's I hope I have ruled out any issues with my wiring.

I have actually abandoned i2c in favour of SPI for now, although it is overkill for what I need. I don't want to use the UART.

chrisckc commented 1 year ago

Ok, so I have added the gcc options around the interrupt handler, is this the correct way?

#pragma GCC push_options
#pragma GCC optimize ("O0")
static void __not_in_flash_func(i2c_slave_irq_handler)(i2c_slave_t *slave) {
    i2c_inst_t *i2c = slave->i2c;
    i2c_hw_t *hw = i2c_get_hw(i2c);

    uint32_t intr_stat = hw->intr_stat;
    if (intr_stat == 0) {
        return;
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RX_FULL_BITS) {
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_RECEIVE);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RD_REQ_BITS) {
        hw->clr_rd_req;
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_REQUEST);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_TX_ABRT_BITS) {
        hw->clr_tx_abrt;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_START_DET_BITS) {
        hw->clr_start_det;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_STOP_DET_BITS) {
        hw->clr_stop_det;
        finish_transfer(slave);
    }
}
#pragma GCC pop_options

It has been running for about 50 mins on USB power, 4 errors in total so far with at least 1 of the errors being a missing byte at the start of the transfer.

Seconds: 0003031
LoopRate: 0608409
receiveRate: 0000010
drCount: 0059824
dr1Count: 0029911 dr1SuccessCount: 0029910 dr1FailureCount: 0000001
dr2Count: 0029913 dr2SuccessCount: 0029910 dr2FailureCount: 0000003
dr1FailureRate:   0.0033433 percent
dr2FailureRate:   0.0100291 percent

I will need to update the code to capture more information about bytes missing at the start or the end of each transfer.

earlephilhower commented 1 year ago

Yes, that's the right way to disable optimization on that function. I've got a PR in for the obviously sketchy bits. May not be 100% solution, but is definitely closer to correct than what's there now.

chrisckc commented 1 year ago

I know this is closed now, but here is an update: Using the previously posted code with the gcc options around the interrupt handler I have had the test setup running all day on a power bank with no failures reported.

I had previously ran it overnight on batteries but lost the results because they didn't last until the morning due to the increased power consumption of the linear voltage regulator that I wired in.

I will unwind the various things I tried related to my attempts to reduce noise and then check if it is still ok.

earlephilhower commented 1 year ago

Thanks for the update @chrisckc !

Would it be possible to pull the latest Wire.cpp with the volatile (vs. the -O0) that I've got in the core now just so we know exactly what's being run. AIUI that volatile temp variable should disallow any funny optimizations w/o requiring GCC-specific bindings I initially suggested. You've got a great setup for reproducing failures, and I'd love to make use of it. 😆

chrisckc commented 1 year ago

UPDATE: Good news: I have had the previously posted code with the gcc options around the interrupt handler running for 22 hours now with no failures.

Bad news: As requested I switched back over to the i2c-receiver project that uses the Wire API and I updated my Wire.cpp to the latest version which contains the fixes consisting of capturing the intr_stat IRQ status value into a volatile var and checking for data before any of the other checks. I am seeing failures within a few minutes minutes of running the test.

chrisckc commented 1 year ago

UPDATE: Following on from my previous post I left it running all day, using your Wire.ccp from this commit: https://github.com/earlephilhower/arduino-pico/commit/0133ecc887cf586fffdd7911f93d3b5643a95f7e

The results show a failure rate roughly comparable to using the original version of Wire.cpp, the only difference being that now all of the errors are missing bytes at the start of the transfers rather than mostly at the end.

Seconds: 0025097
LoopRate: 0492610
receiveRate: 0000010
drCount: 0496464 
dr1Count: 0248232 dr1SuccessCount: 0248225 dr1FailureCount: 0000002  dr1MissingStartByteCount: 0000002
dr2Count: 0248232 dr2SuccessCount: 0243639 dr2FailureCount: 0004588  dr2MissingStartByteCount: 0004588
dr1FailureRate:   0.0008057 percent
dr2FailureRate:   1.8482710 percent

As a side note, while this has been running I have been working on code to use SPI instead (using Arduino-Pico) and I have found the Pico in SPI Slave mode (by adding spi_set_slave() ) to misbehave in an erratic, intermittent manner as well. Again my use case may not be common. I switched to using pure SDK to create a simplified test harness to rule out the issue and still found problems. I am currently in the process of submitting an issue for that over on the pico-sdk repo.

chrisckc commented 1 year ago

As I reported earlier I managed to get a zero failure rate using the i2c-receive-sdk code inside my test harness on Github: https://github.com/chrisckc/TestHarness-I2C/tree/master/i2c-receiver-sdk that code uses this library: https://github.com/vmilea/pico_i2c_slave which I modified by reordering the IRQ status checks inside i2c_slave_irq_handler and wrapping in the suggested GCC options.

The modified version which works fine is checked into my test harness. https://github.com/chrisckc/TestHarness-I2C/blob/f38402c21cec8481d2552adb973d99f20a7ad615/i2c-receiver-sdk/lib/pico_i2c_slave/src/i2c_slave.c#L28

I have recently attempted again to get it working in Arduino-Pico by modifying the IRQ handler in Wire.cpp as follows: I re-ordered the checks, putting the ones which call the onReceiveCallbacks at the end and replaced the volatile var with the GCC options wrapping.

#pragma GCC push_options
#pragma GCC optimize ("O0")
void TwoWire::onIRQ() {
    // Make a local copy of the IRQ status up front.  If it changes while we're
    // running the IRQ callback will fire again after returning.  Avoids potential
    // race conditions
    uint32_t irqstat = _i2c->hw->intr_stat;

    // First, pull off any data available
    if (irqstat & (1 << 2)) {
        // RX_FULL
        if (_slaveStartDet && (_buffLen < (int)sizeof(_buff))) {
            _buff[_buffLen++] = _i2c->hw->data_cmd & 0xff;
        } else {
            _i2c->hw->data_cmd;
        }
    }
    // RD_REQ
    if (irqstat & (1 << 5)) {
        if (_onRequestCallback) {
            _onRequestCallback();
        }
        _i2c->hw->clr_rd_req;
    }
    // TX_ABRT
    if (irqstat & (1 << 6)) {
        _i2c->hw->clr_tx_abrt;
    }
    // START_DET
    if (irqstat & (1 << 10)) {
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = true;
        _i2c->hw->clr_start_det;
    }
    // RESTART_DET
    if (irqstat & (1 << 12)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_restart_det;
    }
    // STOP_DET
    if (irqstat & (1 << 9)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_stop_det;
    }
}
#pragma GCC pop_options

This made an improvement over the latest version in your latest commit, but still reported many errors, this site output after about 7 and half hours of running:

Seconds: 0027055
LoopRate: 0492224
receiveRate: 0000010
drCount: 0535548
dr1Count: 0267774 dr1SuccessCount: 0267774 dr1FailureCount: 0000000  dr1MissingStartByteCount: 0000000
dr2Count: 0267774 dr2SuccessCount: 0267681 dr2FailureCount: 0000093  dr2MissingStartByteCount: 0000093
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.0347308 percent

again all missing bytes at that start, but the failure rate was much less.

chrisckc commented 1 year ago

FIXED! ? Comparing the zero failure rate IRQ handler used in i2c-receiver-sdk referenced above with the way the IRQ handler in Wire.cpp is operating, I can see that Wire.cpp is using the "START_DET" flag to determine that a transmission has started, reset the buffer position to zero, and allow the buffer to be filled, a completely reasonable thing to do. However the zero failure rate IRQ hander does not use that flag to determine if data can be received and passed on to user code, any data received is passed on as data received, regardless of the "START_DET" flag arriving before.

I then modified Wire.cpp further to make it behave this way by commenting out some the code as shown below. I removed the check for _slaveStartDet so that the buffer is filled regardless. I commented out the buffer position reset in the "RESTART_DET" flag check as this flag is ignored in the zero failure rate IRQ handler. I also did the same with the "START_DET" flag check to make it work in an equivalent manner.

#pragma GCC push_options
#pragma GCC optimize ("O0")
void TwoWire::onIRQ() {
    // Make a local copy of the IRQ status up front.  If it changes while we're
    // running the IRQ callback will fire again after returning.  Avoids potential
    // race conditions
    uint32_t irqstat = _i2c->hw->intr_stat;
    if (irqstat == 0) {
        return;
    }

    // First, pull off any data available
    if (irqstat & (1 << 2)) {
        // RX_FULL
        //if (_slaveStartDet && (_buffLen < (int)sizeof(_buff))) {
        if (_buffLen < (int)sizeof(_buff)) {
            _buff[_buffLen++] = _i2c->hw->data_cmd & 0xff;
        } else {
            _i2c->hw->data_cmd;
        }
    }
    // RD_REQ
    if (irqstat & (1 << 5)) {
        if (_onRequestCallback) {
            _onRequestCallback();
        }
        _i2c->hw->clr_rd_req;
    }
    // TX_ABRT
    if (irqstat & (1 << 6)) {
        _i2c->hw->clr_tx_abrt;
    }
    // START_DET
    if (irqstat & (1 << 10)) {
        //_buffLen = 0;
        //_buffOff = 0;
        _slaveStartDet = true;
        _i2c->hw->clr_start_det;
    }
    // RESTART_DET
    if (irqstat & (1 << 12)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        //_buffLen = 0;
        //_buffOff = 0;
        //_slaveStartDet = false;
        _i2c->hw->clr_restart_det;
    }
    // STOP_DET
    if (irqstat & (1 << 9)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_stop_det;
    }
}
#pragma GCC pop_options

UPDATE: With the above modification, I saw Zero failures over a period of 35 and half hours:

Seconds: 0128582
LoopRate: 0497734
receiveRate: 0000010
drCount: 2546002
dr1Count: 1273001 dr1SuccessCount: 1273001 dr1FailureCount: 0000000  dr1MissingStartByteCount: 0000000
dr2Count: 1273001 dr2SuccessCount: 1273001 dr2FailureCount: 0000000  dr2MissingStartByteCount: 0000000
dr1FailureRate:   0.0000000 percent
dr2FailureRate:   0.0000000 percent

It seems like there were 2 different issues causing either missing data at the end of missing data at the start, the missing data at the end was fixed first with the GCC options and flag check re-ordering, and now that the start flag is being ignored the missing data at the start of transmission has been resolved. This suggests that the Pico could be misbehaving sometimes in relation to the flags it is setting in the IRQ registers when receiving i2c data.

As a side note, I thought that the volatile declaration only made a difference for variables declared outside of an IRQ handler, so would make no difference for the irqstat var declared inside the IRQ handler? I noticed that the class private variables _slaveStartDet, _buffLen and _buffOff, which are also used in other functions, are not marked volatile, I didn't change that and managed to get zero failures so far.

Next: If I can completely undo the low noise mods and run at 1MHz instead of 400KHz without getting failures then I think I can regard the above modification as fixing the issues, although it feels hacky ignoring flags which are there for a reason.

earlephilhower commented 1 year ago

Thanks very much for your efforts here! I've got #1011 in to implement your changes, and it'll be part of the next release.

For the local volatile definition, the idea was to disallow optimization of the use of the variable...namely, GCC couldn't say "I already have this value from some other calculation, use that instead of the raw value read". Evidently it was not successful.

It's still odd, especially the missed start notifications in the HW. In any case, this is definitely not going to be the ugliest hack in the repo. :)

Thanks again!

chrisckc commented 1 year ago

Thanks, I thought that might have been the case regarding the volatile definition.

if there is enough time before the next release I would like to check if the the RESTART_DET has anything to do with the issue or if its just the start detection flag that goes awry sometimes. If it has no effect it would make sense to leave the buffer reset inside the RESTART_DET handling, such that the buffer is reset on both STOP_DET and RESTART_DET. I commented it out just to match the behaviour of the other method, which may or may not be correct in all situations.

Yeah its really odd, I have also logged this as an issue in the Pico-SDK repo: https://github.com/raspberrypi/pico-sdk/issues/1102 I referenced this issue as most of the info is here, no response yet though.

chrisckc commented 1 year ago

ok, after adding the buffer reset back into the restart section of the IRQ handler:

// RESTART_DET
    if (irqstat & (1 << 12)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _i2c->hw->clr_restart_det;
    }

It has been running for 12 hours so far with no failures, suggesting that the issue must be due to the IRQ being raised with the data received flag being set (for 1 or more bytes) before it is raised with the start flag set.

There is also the possibility of both flags being set at the same time in an IRQ, however I think this can be ruled out because that would have been handled correctly by the original Wire.cpp code which checked for the start flag before checking the data received flag.

I will leave the setup running longer and then try a 1MHz i2c clock. It is also running in the original configuration using the Pico's individual onboard SMPS voltage regulators.

20221129_101131

It is left unattended powered by the power bank connected to VSYS to avoid back powering my USB port when I plugin to check the progress. The longer interconnecting wires are the i2c connections, the 1k pull-up resistors are in the red heat shrink tubing connecting across to the 3.3V pin. The shorter wires on the bottom edge are SPI connections which I have been testing and found issues with, again random and intermittent, even at 1MHz.

earlephilhower commented 1 year ago

...IRQ being raised with the data received flag being set (for 1 or more bytes) before it is raised with the start flag set.

That doesn't seem physically possible...the START condition is pretty simple and comes in before the 1st bit of the addressing, so you have 8 bit times, at least, before the data can become available. Very odd...

chrisckc commented 1 year ago

Yeah doesn't make sense but it is what seems to be happening from what I can tell from the observations, testing and code changes. It will require some debug signalling output to the scope from inside Wire.cpp to dig deeper.

Quick update: I ran the test at 1MHz (in reality about 865KHz) and saw errors, bytes missing at the start and end of the transmissions. To be sure I have reverted back to 400KHz to leave running overnight as I am using a different power bank and want to rule that out.

Note that I had previously substituted the receiving Pico on the breadboard for a Teensy4.0 and had no issues receiving data at 1MHz from the transmitting Pico. I used the exact same code based on Wire API, just changed pins and the compilation target in PlatformIO, obviously the Wire API implementation is different for that board.

I have some more Pico's arriving soon as I need another test setup to debug some issues with SPI. Looks like I may have to use the UART's after all or failing that not even use the RP2040 in my design.

chrisckc commented 1 year ago

UPDATE: With the test setup running at 1MHz as specified in the code (actually 965KHz as measured), and left running on a power bank, failure rate was 0.29%, but most of the failures are now missing bytes at the end fo the transfer, using the latest code as described up to now in this thread, code which has achieved a zero failure rate at 400KHz over a period of 35 hours.

This is the output after 10 hours on a power bank, 0.29% is the highest failure rate:

Seconds: 0035656
LoopRate: 0492027
receiveRate: 0000010
drCount: 0705846
dr1Count: 0353423 dr1SuccessCount: 0352393 dr1FailureCount: 0001026  dr1MissingStartByteCount: 0000030
dr2Count: 0352423 dr2SuccessCount: 0352393 dr2FailureCount: 0000030  dr2MissingStartByteCount: 0000000
dr1FailureRate:   0.2903037 percent
dr2FailureRate:   0.0085125 percent

This is the output after 13 hours on a 4 NiMH batteries, 0.032% is the highest failure rate:

Seconds: 0047302
LoopRate: 0491974
receiveRate: 0000010
drCount: 0936392
dr1Count: 0468270 dr1SuccessCount: 0468117 dr1FailureCount: 0000152  dr1MissingStartByteCount: 0000004
dr2Count: 0468122 dr2SuccessCount: 0468118 dr2FailureCount: 0000004  dr2MissingStartByteCount: 0000000
dr1FailureRate:   0.0324599 percent
dr2FailureRate:   0.0008545 percent

Obviously the batteries are going to provide a much cleaner voltage. It looks like we may now be in the realms of hardware limitations running at 1MHz, although the Pico Datasheet does say it supports i2c fast mode plus "the fastest operation being fast mode plus at up to 1000Kb/s."

From the spec: "The Fast-mode plus specification FM+ introduced by Philips Semiconductors (now: NXP) in April 2006 defines such a bus with a maximum speed of 1 Mhz. Unlike the high speed mode there is no additional logic to implement. Therefore, fast-mode plus devices are downward compatible with standard and fast-mode devices. In fact, many I2C slaves on the market like RAMs and EEPROMs will already tolerate higher bus frequencies."

chrisckc commented 1 year ago

I could try eliminating the breadboard which is far from ideal at higher frequencies, I could just solder very short wires between the 2 picos to see If it makes a difference.

However looking at the scope traces: ScreenImg-49

It seems pretty clean though plugged in the breadboard, I could try dropping the pull-up resistors down from 1k to 330 ohm to get a better pulse shape. As mentioned before, I swapped out the receiving pico for a Teensy4.0 at this speed and it had no issues. Also my scope is also able to decode the signals connected via jumper wires, through the probe ends, and down 1m or so of coax to the scope.

chrisckc commented 1 year ago

This is the latest set rig for this, eliminate the breadboard to see if 2 Pico's can reliably communicate using i2c at 1MHz (in reality about 865KHz) 20221206_154513 Using the same code and same 1k pull-ups as the last test posted here, I have even put 0.1uf and 10uF caps on the incoming power, for which in this case I'm using batteries. Ill leave it running and report back

chrisckc commented 1 year ago

Ok, it's been running for 5 hours and reported 54 errors so far using the above soldered version left running on batteries. Either there is still something that needs to be done on the software side or these errors at 1MHz are due to a hardware issue with the RP2040 i2c implementation? despite the datasheet calming that it supports it.

1MHz results:

Seconds: 0018188
LoopRate: 0493251
receiveRate: 0000009
drCount: 0359864
dr1Count: 0179954 dr1SuccessCount: 0179905 dr1FailureCount: 0000049  dr1MissingStartByteCount: 0000005
dr2Count: 0179910 dr2SuccessCount: 0179905 dr2FailureCount: 0000005  dr2MissingStartByteCount: 0000000
dr1FailureRate:   0.0272292 percent
dr2FailureRate:   0.0027792 percent

Note that i am seeing both types of failure at this higher speed, bytes missing at the end and the start.

I previously managed to have it running for much longer than this with no errors at 400KHz using the breadboard setup.

chrisckc commented 1 year ago

This is the Wire.ccp onIRQ() code used in the above test: compared to the checked in changes this has the buffer reset on the re-start condition added back in.

#pragma GCC push_options
#pragma GCC optimize ("O0")
void TwoWire::onIRQ() {
    // Make a local copy of the IRQ status up front.  If it changes while we're
    // running the IRQ callback will fire again after returning.  Avoids potential
    // race conditions
    uint32_t irqstat = _i2c->hw->intr_stat;
    if (irqstat == 0) {
        return;
    }

    // First, pull off any data available
    if (irqstat & (1 << 2)) {
        // RX_FULL
        //if (_slaveStartDet && (_buffLen < (int)sizeof(_buff))) {
        if (_buffLen < (int)sizeof(_buff)) {
            _buff[_buffLen++] = _i2c->hw->data_cmd & 0xff;
        } else {
            _i2c->hw->data_cmd;
        }
    }
    // RD_REQ
    if (irqstat & (1 << 5)) {
        if (_onRequestCallback) {
            _onRequestCallback();
        }
        _i2c->hw->clr_rd_req;
    }
    // TX_ABRT
    if (irqstat & (1 << 6)) {
        _i2c->hw->clr_tx_abrt;
    }
    // START_DET
    if (irqstat & (1 << 10)) {
        //_buffLen = 0;
        //_buffOff = 0;
        _slaveStartDet = true;
        _i2c->hw->clr_start_det;
    }
    // RESTART_DET
    if (irqstat & (1 << 12)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_restart_det;
    }
    // STOP_DET
    if (irqstat & (1 << 9)) {
        if (_onReceiveCallback && _buffLen) {
            _onReceiveCallback(_buffLen);
        }
        _buffLen = 0;
        _buffOff = 0;
        _slaveStartDet = false;
        _i2c->hw->clr_stop_det;
    }
}
#pragma GCC pop_options

chrisckc commented 1 year ago

UPDATE: I have created new test harness projects in the original test harness repo to simplify the testing and also test 2 way communication between 2 Pico's using I2C https://github.com/chrisckc/TestHarness-I2C/tree/master/I2C_receiver-new https://github.com/chrisckc/TestHarness-I2C/tree/master/I2c_sender-new These are based on sending a 255 byte buffer of sequential bytes back and forth between the 2 Pico's

I noticed that the changes made to improve this issue have broken the I2C slave request handling functionality which is used in I2C slave mode to send data back to the master Pico.

As Wire.cpp stands now with commit: https://github.com/earlephilhower/arduino-pico/commit/d94ca66e3eddd1893a450de0cc39de9234979f75 and https://github.com/earlephilhower/arduino-pico/commit/0133ecc887cf586fffdd7911f93d3b5643a95f7e for the original issue logged here (I2C data reception) I am still seeing errors. The error rate reported by I2C_receiver_new is around 0.45%, even when there is no USB serial being used during I2C data reception.

The changes I made to Wire.cpp as described in this new issue: https://github.com/earlephilhower/arduino-pico/issues/1100 Have reduced the error rate to Zero (when no USB serial being used during I2C data reception)

It is not clear why this has an impact on I2C data reception in my tests as the "Repeated Start" condition is not used in that scenario. It is only used during I2C "requests" and to stop another devices from grabbing hold of the I2C bus, which is a requirement when making a register read request to an I2C slave device.

chrisckc commented 1 year ago

Note that when USB serial is outputted during I2C data reception, the error rate is around 5 to 10%, this is using my changes which resulted in a zero error rate with no USB Serial communication during data reception.

To test this change this setting to true inside I2C_receiver-new.cpp #define DEBUG_SERIAL_DURING_I2C_RECEIVE (true) // Set to false to prevent USB Serial debug output during I2C data reception, using USB serial during I2C reception causes data errors.

This issue is also being covered here: https://github.com/raspberrypi/pico-sdk/issues/1102

USB Serial also affects the ability of the slave Pico to respond to requests: Changing this inside I2C_receiver-new.cpp: #define DEBUG_SERIAL_DURING_I2C_RESPONSE (true) // Set to false to prevent USB Serial debug output during I2C response (sending data back to the master), using USB serial while responding causes data errors on the master. results in errors being reported by the master after it receives its response from the slave and verifies the data.

earlephilhower commented 1 year ago

...Note that when USB serial is outputted during I2C data reception, the error rate is around 5 to 10%...

That is expected. It's not legal to do things which may block, like writing to USB or storing to flash, while in an ISR. In fact, the Newlib reent struct is not safe to use so even things like sprintf and some transcendentals are verboten, technically.

chrisckc commented 1 year ago

The writing to USB is not being done inside the receive ISR, yes that would be a no-no for any MCU, rather is being done in the main loop, here: https://github.com/chrisckc/TestHarness-I2C/blob/16646e075904ed074e17a9afb48e9a7e0bf90574/I2C_receiver-new/src/I2C_receiver-new.cpp#L353

I added that code in the main loop to just print dots while it is waiting for I2C data to arrive, this results in a large failure rate at 1MHZ, even with the Wire.cpp changes I posted in the new issue I logged. At 400 KHz it works fine.

I guess having both USB related interrupts and the I2C receive interrupts happening together is too much for the RP2040 to handle at high I2C speeds.

earlephilhower commented 1 year ago

Ah, good to hear. Have you tried doing the I2C slave work only on the 2nd core? The USB IRQ is pretty heavyweight as you're seeing...

chrisckc commented 1 year ago

Ill try the second core next

To put this in perspective, I have also tried to get the same I2C slave code working on other Arduino implementations without success, so its great that it works as it should do in Arduino-Pico.

Since the workaround in Wire.cpp for the IRQ issues that seem to exist in the hardware has been implemented, it is reliable at high speed if care is taken not to mix in other interrupts.

Other attempts with Arduino implementations: For STM32F411, it needed 400KHz and a change in the order of setting up the Wire interface to get the interrupt to fire, but it seems it only uses a 32 byte buffer, after receiving first 32 bytes it just stops receiving, no more interrupts.

ESP8266 was a no go and I2C is done in software.

ESP32 looked like it should work but failed due to crashes caused by interrupt timeouts, presumably due to the Wire API requirement of having to read all of the data buffer inside the receive interrupt, this doesn't seem to bother the RP2040.

Teensy4 worked flawlessly in previous tests even with simultaneous USB, so have re-tested that yet. A bit of an unfair comparison though being an M7 600MHz.

earlephilhower commented 1 year ago

Core1 is completely unused unless you do something, so if it's physically possible then it should be fine. You'll need to be careful, as usual, since now you've got multicore and need to handshake/mutes/etc. properly when communicating between the two.

You can also buy a few %age points of speed by using the __not_in_flash()(sp? double check w/the core) decorator to place the code itself into the dual-port SRAM which is way faster than a cache miss on the XIP cache.

There might also be spots in the library where I myself forgot to decorate functions, so if you find them let me know and we can fix it. XIP(flash) from an interrupt will work but it's definitely less than optimal for HS operation.