adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org

I2C Bus error leaves board unrecoverable without power down #2635

Open Panometric opened 4 years ago

Panometric commented 4 years ago

In my example, a loop reads the device as fast as possible while doing other unrelated things:

while True:
    acceleration = cpx.acceleration
    ...  # other unrelated work

You can get an error:


Traceback (most recent call last):
  File "code.py", line 63, in <module>
  File "adafruit_circuitplayground/circuit_playground_base.py", line 261, in acceleration
  File "adafruit_lis3dh.py", line 159, in acceleration
  File "adafruit_lis3dh.py", line 328, in _read_register
  File "adafruit_lis3dh.py", line 327, in _read_register
  File "adafruit_bus_device/i2c_device.py", line 82, in readinto
OSError: [Errno 5] Input/output error

Press any key to enter the REPL. Use CTRL-D to reload.
soft reboot

Once this occurs, the I2C device bus is stuck because the device is holding the bus. The board never recovers unless you power it down. Every restart just issues a RuntimeError. This could happen on any I2C device.


Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Traceback (most recent call last):
  File "code.py", line 3, in <module>
  File "adafruit_circuitplayground/__init__.py", line 29, in <module>
  File "adafruit_circuitplayground/express.py", line 75, in <module>
  File "adafruit_circuitplayground/express.py", line 72, in __init__
  File "adafruit_circuitplayground/circuit_playground_base.py", line 110, in __init__
RuntimeError: SDA or SCL needs a pull up

The industry-standard response to this is to bit-bang single clock pulses onto SCL until the slave releases the bus. This should be done at startup, and could also be done on error to recover the bus silently. It usually requires temporarily disconnecting the I2C peripheral from the pins, driving SCL as a GPIO, and then reconnecting it.

ladyada commented 4 years ago

yeah this is a thing that happened with CODAL & the LIS3DH. very odd that this chip is particularly weirded out sometimes! @dhalbert see email thread "I2C lockup" with peli and folks

dhalbert commented 4 years ago

Deferring this to after 5.0.0. It's a very good idea, but we want to vet it against a bunch of I2C devices first, since there may be a few that are pathological.

CODAL fix: https://github.com/lancaster-university/codal-samd/blob/cplay_master_i2c_hack/src/ZI2C.cpp#L10

Another discussion: https://www.raspberrypi.org/forums/viewtopic.php?t=241491

creston-bob commented 4 years ago

In a project using the Microchip PIC32MZ microcontroller, I discovered that the first two silicon revisions (v1 and v2) have a hardware bug that causes the I2C to hang occasionally. The errata sheets indicated that v1 and v2 had this flaw and that it was apparently fixed in v3. After trying all of Microchip's suggestions for clearing the hung I2C from within my program code (and failing), I eventually concluded that the ONLY possible fix was a power reset, which works reliably. I have also found that, left alone, the I2C will eventually recover by itself, but it can take many hours.

As a solution, I have an Arduino board listening to the data stream coming from the PIC to detect the problem and rectify it. So how do I detect the I2C malfunction? Simple: I've programmed a PIC32 onboard timer interrupt that checks the SCL and SDA pins after every transaction. If both pins are high (idle), the I2C transaction was successful; if either pin is low, we have a malfunction, and the integer 1 is appended to the data being sent to the server. The integer is sliced off the data after being checked to see if it's a 1 or 0. This scheme works extremely well but has made me wary of I2C malfunctions from any silicon. I suspect that's what is happening in this particular circumstance, and this is how you fix it.
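
The detection scheme described in this comment could be sketched roughly as follows. This is only an illustration in Python, not the PIC32 code from the comment: the function names and the string payload are hypothetical, and the pin levels are assumed to have been sampled elsewhere, between transactions.

```python
def bus_is_idle(scl_high, sda_high):
    """Return True when both I2C lines read high, which is the idle state."""
    return bool(scl_high) and bool(sda_high)


def tag_reading(payload, scl_high, sda_high):
    """Append a one-character fault flag to an outgoing payload.

    "0" means the bus returned to idle after the transaction,
    "1" means at least one line is still held low (bus stuck).
    """
    flag = "0" if bus_is_idle(scl_high, sda_high) else "1"
    return payload + flag
```

For example, `tag_reading("T=23.5;", False, True)` yields `"T=23.5;1"`, which tells the receiver that SCL was still low after the transaction.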

creston-bob commented 4 years ago

I forgot to mention that the Arduino board I mentioned in my previous comment energizes a relay to power reset the PIC32MZ board. The power is connected through the relay's NC contact which is opened for a few seconds to cause the power reset.

kevinjwalters commented 4 years ago

I just got this by pressing reset button on a CLUE (alpha) then trying to load clue object:

Adafruit CircuitPython 5.0.0-rc.0 on 2020-02-26; Adafruit CLUE nRF52840 Express with nRF52840
>>>
>>>
>>>
>>> import board
>>> from adafruit_clue import clue
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "adafruit_clue.py", line 886, in <module>
  File "adafruit_clue.py", line 172, in __init__
RuntimeError: SDA or SCL needs a pull up
>>>
>>>
>>>
soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Traceback (most recent call last):
  File "code.py", line 46, in <module>
  File "adafruit_clue.py", line 886, in <module>
  File "adafruit_clue.py", line 172, in __init__
RuntimeError: SDA or SCL needs a pull up

Press any key to enter the REPL. Use CTRL-D to reload.
Adafruit CircuitPython 5.0.0-rc.0 on 2020-02-26; Adafruit CLUE nRF52840 Express with nRF52840
>>>

kevinjwalters commented 4 years ago

Another example of a CLUE, being discussed in the forums with @caternuson: Adafruit Forums: clue_display_sensor_data.py not working. Reading from the accelerometer (LSM6DS33) could be implicated. Same RuntimeError after the problem:

    Adafruit CircuitPython 5.0.0 on 2020-03-02; Adafruit CLUE nRF52840 Express with nRF52840
    >>> import board
    >>> i2c = board.I2C()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: SDA or SCL needs a pull up

kevinjwalters commented 4 years ago

Why does the stack trace show _read_register twice??

File "adafruit_lis3dh.py", line 328, in _read_register
File "adafruit_lis3dh.py", line 327, in _read_register

If that's from this code,

323    def _read_register(self, register, length):
324        self._buffer[0] = register & 0xFF
325        with self._i2c as i2c:
326            i2c.write(self._buffer, start=0, end=1)
327            i2c.readinto(self._buffer, start=0, end=length)
328            return self._buffer

the return statement is on line 328. Does this result from some sort of optimisation around return statements?

Presumably CP handles return statements within a 'with' block OK?

This does not look like #2056 btw.

jepler commented 4 years ago

@kevinjwalters yes, it seems to be an implementation detail of micropython/circuitpython that a 'with' statement creates an extra line in a traceback. I don't think anything about that repeated line in the traceback is important to the problem at hand.

kevinjwalters commented 4 years ago

Got another case of something similar here. I copied some updated .py files onto a CLUE and it seemed to get stuck running the code. Control-C shows this stack trace (tried it three times):

Adafruit CircuitPython 5.0.0 on 2020-03-02; Adafruit CLUE nRF52840 Express with nRF52840
>>>
>>>
>>>
soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Traceback (most recent call last):
  File "code.py", line 47, in <module>
  File "adafruit_clue.py", line 886, in <module>
  File "adafruit_clue.py", line 207, in __init__
  File "adafruit_lsm6ds.py", line 220, in __init__
  File "adafruit_bus_device/i2c_device.py", line 68, in __init__
  File "adafruit_bus_device/i2c_device.py", line 166, in __probe_for_device
KeyboardInterrupt:

Press any key to enter the REPL. Use CTRL-D to reload.

It recovered after a power-cycle (unplugging USB).

kevinjwalters commented 3 years ago

I've got some code reading the LIS3MDL magnetometer (only) frequently (around 1000 samples a second), and it has been behaving a bit strangely at times, with the value freezing, as noted in https://github.com/adafruit/Adafruit_CircuitPython_LIS3MDL/issues/4

It has now got worse: the code hasn't run despite a few Control-C interrupts and reloads, and it is stuck here each time:

Adafruit CircuitPython 5.3.1 on 2020-07-13; Adafruit CLUE nRF52840 Express with nRF52840
>>>
soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Traceback (most recent call last):
  File "code.py", line 157, in <module>
  File "adafruit_lis3mdl.py", line 229, in __init__
  File "adafruit_bus_device/i2c_device.py", line 68, in __init__
  File "adafruit_bus_device/i2c_device.py", line 170, in __probe_for_device
KeyboardInterrupt:

This has restored it to a working state:

Adafruit CircuitPython 5.3.1 on 2020-07-13; Adafruit CLUE nRF52840 Express with nRF52840
>>> from microcontroller import reset
>>> reset()
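
The manual recovery shown above could be folded into startup code. The following is a minimal sketch, assuming the failure surfaces as a RuntimeError or OSError during initialization. `init_or_reset` is a hypothetical helper, and `init` and `reset` are callables supplied by the caller; on CircuitPython they might be `board.I2C` and `microcontroller.reset`.

```python
def init_or_reset(init, reset, errors=(RuntimeError, OSError)):
    """Call init(); if it raises one of `errors`, call reset().

    On real hardware reset() is not expected to return. The trailing
    `return None` only matters when a stand-in is passed, e.g. in tests.
    """
    try:
        return init()
    except errors:
        reset()
        return None
```

If the hang survives a reset, as in the earlier reports in this thread that needed a power cycle, this would reset in a loop, so a real program would want some limit on repeated resets.
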

creston-bob commented 3 years ago

As I commented a while back, an I2C error is usually caused by a transaction not returning to the IDLE state with BOTH Clock and Data in a high state. If your I2C circuit gets hung or “freezes”, get out your multimeter and see if the two lines are BOTH in a high state (IDLE) or if one is high and other is low (not in IDLE). This is often an error at the silicon level of the microcontroller or sensor component and CANNOT be fixed with any type of soft reset — no Control-C or software twiddling will return the I2C circuit to an IDLE state — ONLY A HARD POWER REBOOT of the microcontroller and/or the sensor will fix this problem because it reinitializes the component at the silicon level.

In a weather station I built a few years ago, a bug in the I2C circuit of a Microchip PIC32 microcontroller randomly caused a couple of sensors not to be read because the PIC I2C circuit would fail to return to IDLE following a transaction. I experimented relentlessly with this until I discovered the above reality which turned out to be a documented silicon bug for that generation of microcontroller which, I believe, is now fixed.

It’s easy to have the microcontroller read the state of the I2C circuit after each transaction to check if the I2C circuit is in an IDLE state and to signal to the upstream destination that the PIC needs a reboot. The upstream device (a single board computer) would then open a relay contact to break the power to the PIC for a couple of seconds — voila! — I2C freeze gone. Still works reliably to this day, a few years later.
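
The upstream side of this arrangement might look something like the sketch below. It assumes each message ends with a one-character flag ("1" for a stuck bus, "0" otherwise); that framing, the function names, and the `power_cycle` callback that opens the relay are all hypothetical.

```python
def split_fault_flag(message):
    """Split a message into (data, fault); the last character is the flag."""
    return message[:-1], message[-1] == "1"


def handle_message(message, power_cycle):
    """Strip the flag from a message and power-cycle the sender on a fault."""
    data, fault = split_fault_flag(message)
    if fault:
        # The sender reported that its I2C lines did not return to idle.
        power_cycle()
    return data
```

Keeping the relay control behind a callback means the parsing can be exercised without any hardware attached.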

Panometric commented 3 years ago

@creston-bob There is a soft solution. An I2C slave is simply a set of shift registers with a state machine. By adding clocks until the SDA goes high, the device can be recovered, and typically no data is lost.

With the pseudocode below I can typically recover the bus in under 1 ms without a reset. It sometimes takes longer, and it depends somewhat on the device and on the size of the transactions it allows. This code is being used for an LSM6DSO XL/Gyro.

  // Set SCL to GPIO out, open collector
  // Set SDA to GPIO in
  solved = (SDA is high);
  for (tries = 0; tries <= 11 && !solved; tries++)
  {
      for (clocks = 1; clocks < 28; clocks++)
      {
          // Pulse SCL low, then high
          WritePin SCL low for 25-100 us;
          WritePin SCL high for 25-100 us;

          solved = Read SDA pin;
          if (solved) break;
      }
      if (solved) break;
  }
  // Set SCL/SDA back to the I2C peripheral
  // Reset the I2C peripheral
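
A hardware-independent Python sketch of the same idea follows. It is an illustration under stated assumptions, not CircuitPython's implementation: `recover_bus`, `set_scl` and `read_sda` are hypothetical names, the pin access is passed in as callables, and the default clock limit (12 x 27) simply mirrors the loop bounds of the pseudocode above.

```python
import time


def recover_bus(set_scl, read_sda, max_clocks=12 * 27,
                half_period=50e-6, sleep=time.sleep):
    """Pulse SCL until SDA reads high.

    set_scl(level) drives SCL low (False) or releases it high (True);
    read_sda() returns True when SDA is high.
    Returns the number of clock pulses sent, or -1 if SDA never went high.
    """
    if read_sda():
        return 0  # bus already idle, nothing to do
    for clock in range(1, max_clocks + 1):
        set_scl(False)
        sleep(half_period)
        set_scl(True)
        sleep(half_period)
        if read_sda():
            return clock
    return -1
```

On CircuitPython, `set_scl` and `read_sda` could wrap `digitalio.DigitalInOut` objects for `board.SCL` and `board.SDA`, created after the `busio.I2C` object has been deinitialized. That wiring is an assumption and is not shown here.
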

creston-bob commented 3 years ago

That’s nice if you have I2C circuitry that is accessible from machine code or higher … no doubt this approach you have will work just fine for some microcontrollers but it won’t work universally with all. In the Microchip case I encountered they had posted several software solutions that might work to overcome the problem but the PIC32 chip I was using did NOT respond to ANY of the proposed solutions. That’s why they eventually posted a silicon error for that chip version that could only be fixed via a power reset. The silicon WAS fixed in the next version of the chip. Microcode within a chip can rarely overcome fundamental malfunctions in the underlying silicon. If the silicon is locked up in an error state it will most likely have to be rebooted (hard reset, not soft reset) to get back to normal functionality.

So the answer is to try everything possible in your program, including the sample code in this thread, to recover but a hard reset might be the only solution. Or changing to a different microcontroller part number that doesn’t have the problem. ;-)

dhalbert commented 3 years ago

We are planning to try to detect I2C bus hangups at a low level and do a toggling forced reset as necessary. Some of this could be done in a port-independent way, but some of the timeouts need to be done in the low-level drivers for each port. In some cases we have to modify the manufacturer-supplied libraries. See https://github.com/adafruit/circuitpython/issues/2635#issuecomment-589076598.

creston-bob commented 3 years ago

You shouldn’t have any problem detecting a bus hangup … as I mentioned, you can do that with a multimeter … but the trick is how to get the I2C port reset back to an IDLE state. Some microcontrollers will likely be easy with a reset signal (or your code-level example), others not so easy. In the mentioned PIC32 case, there simply wasn’t any form of reset signal that overcame the hangup of the I2C silicon circuitry. Obviously, that’s a rare case but it was interesting and instructive. ;-)

Since I2C takes two to tango, the slave device can also cause a hangup in some cases. Recovering an IDLE condition at the microcontroller may, or may not, recover the sensor’s functionality, it depends on its internal design.

Just some thoughts, hope they’ll be helpful. ;-)

kyrreaa commented 1 year ago

Just thought I'd let you know that the LSM6DSO has both I2C and SPI support and I am using it with SPI. I still get the hang. I suspect it happens if my code is at some stage of communicating with the chip while being OTA upgraded causing a reset of the MCU or being reprogrammed with a debugger at the "wrong time". I have not been successful in recovering without interrupting the LSM6DSO power as it does not have a reset pin.

Panometric commented 1 year ago

@kyrreaa This should not be the same with SPI. An SPI device's bus state is reset every time CS transitions, so as long as you are toggling CS, it should be OK. I2C devices get stuck in a state because they only have the two wires. The only way to fix it is to clock them back into idle, with an unknown number of clock cycles.

dhalbert commented 1 year ago

For problematic devices that can hang, it's good to be able to power-cycle them. We have controllable I2C power on a number of boards, for power-saving reasons. Or, if the device has fairly low power consumption, you could power it from a GPIO pin.
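
Powering a device from a pin makes a power-cycle-and-retry pattern possible. The sketch below assumes `pin` is any object with a writable `value` attribute (on CircuitPython, a `digitalio.DigitalInOut` configured as an output would fit), that a hung device shows up as an OSError, and that the off and settle times are placeholders to be tuned per device. Both function names are hypothetical.

```python
import time


def power_cycle(pin, off_seconds=0.5, settle_seconds=0.1, sleep=time.sleep):
    """Remove power from the device, wait, then restore it and let it settle."""
    pin.value = False
    sleep(off_seconds)
    pin.value = True
    sleep(settle_seconds)


def read_with_power_cycle(read, pin, attempts=2, sleep=time.sleep):
    """Call read(); on OSError, power-cycle the device and try again.

    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return read()
        except OSError:
            if attempt == attempts - 1:
                raise
            power_cycle(pin, sleep=sleep)
```

A device that has lost power usually needs its driver re-initialized before the next read, so in practice `read` would have to handle that as well.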

kyrreaa commented 1 year ago

Normally I'd agree with you @Panometric, but real-world experience has taught me differently. This is also why I have a transistor-controlled supply to some SPI or I2C devices that lack a reset pin (@dhalbert). On I2C it is even harder, as the devices can be back-powered by the I2C pull-ups, which makes it very annoying. Feeding extra clock cycles seems to work for some I2C devices, but not all, in my experience. In my case the CS is indeed being controlled, and I have verified this with an oscilloscope. Yet once the device stops responding, it is done.

It would be interesting to narrow down exactly when some devices hang like this, but that requires a lot of time and usually that is not an item available in abundance.

ilikecake commented 7 months ago

Was this ever implemented in CircuitPython? I am seeing a stuck bus issue with I2C using: Adafruit CircuitPython 8.2.8 on 2023-11-16; Adafruit Feather ESP32S3 4MB Flash 2MB PSRAM with ESP32S3

In a very specific situation, the clock line gets stuck low. In my case, I am trying to initialize an SHT40 sensor, but no sensor is present on the bus.

try:
    sht = adafruit_sht4x.SHT4x(i2c)
except:
    print("No SHT40 device detected")

This causes the code to crash the next time I attempt to use the bus:

Traceback (most recent call last):
  File "code.py", line 209, in <module>
  File "i2c_expanders/digital_inout.py", line 55, in switch_to_output
  File "i2c_expanders/digital_inout.py", line 99, in direction
  File "i2c_expanders/PCA9555.py", line 129, in iodir
  File "i2c_expanders/i2c_expander.py", line 80, in _read_u16le
OSError: [Errno 116] ETIMEDOUT

I can fix this specific error by trying to write again to the bus:

while not i2c.try_lock():
    pass
try:
    i2c.writeto(0x00, b"") 
except:
    pass
i2c.unlock()

This code does not actually seem to send a byte of data on the bus as I expected it would, but it does fix the problem in this case. The fix seems very specific, though, and a more general one that detects a stuck bus and corrects it automatically would be great. I am pretty sure that a general fix would have to happen at a level below the Python code.

Good I2C transaction: image

Bad I2C transaction: image
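
The workaround in this comment can be wrapped in a small helper so that the bus lock is always released. This is only a sketch: `kick_bus` is a hypothetical name, `i2c` is assumed to provide `try_lock()`, `unlock()` and `writeto(address, buffer)` as `busio.I2C` does, and the zero-length write to address 0x00 is taken directly from the comment rather than being a general fix.

```python
def kick_bus(i2c, address=0x00):
    """Attempt a zero-length write to nudge the bus, then release the lock.

    Returns True if the write completed, False if it raised OSError.
    """
    while not i2c.try_lock():
        pass
    try:
        i2c.writeto(address, b"")
        return True
    except OSError:
        return False
    finally:
        # Release the lock even if the write failed.
        i2c.unlock()
```

Catching only OSError, instead of using a bare `except`, leaves KeyboardInterrupt and programming errors visible.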

gedeondt commented 4 months ago

I think I have a similar issue. I am trying to use the focaltouch library with my M5Stack CoreS3 device. It works for a few seconds and then:

OSError: [Errno 116] ETIMEDOUT

I have tried modifying the library's original code, and I can improve things a bit by repeating reads and the like, but I can't get it to work properly.

It fails mainly when you try to read more than 6 bytes in a row.

After some consecutive failures the message changes and says that there is a problem with the pull-up resistors.

dhalbert commented 4 months ago

@gedeondt Please try CircuitPython 9.1.0-beta.1 if you have not already. Espressif has fixed some ESP32-S3 I2C bugs. There is another fixed bug in the works but it is not yet backported to any ESP-IDF releases.

gedeondt commented 4 months ago

Thanks @dhalbert. I am using 9.1.0-beta. OK, so I will wait for the fix to be backported. I was wondering if it could be a hardware malfunction, but the demo app that came preinstalled worked perfectly with the touchscreen, so I guess it is not.