adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4k stars 1.18k forks source link

I2C bus error after while - OSError: [Errno 5] Input/output error #3376

Closed eLEcTRiCZiTy closed 3 years ago

eLEcTRiCZiTy commented 3 years ago

Hi, I used Adafruit Feather STM32F405 Express board with CircuitPython 6.0.0-alpha.3.

When I read continually range from the VL53L0X sensor, then after while this error occurs:

Traceback (most recent call last): File "code.py", line 54, in File "/lib/adafruit_vl53l0x.py", line 600, in getRange File "/lib/adafruit_vl53l0x.py", line 324, in _read_u8 File "/lib/adafruit_vl53l0x.py", line 323, in _read_u8 File "adafruit_bus_device/i2c_device.py", line 102, in write OSError: [Errno 5] Input/output error

Continually means the loop contains the starting of range-finding for 6 sensors without delay, then I read the range from them, some delay ~100ms, and the loop starts again. Each sensor have a different address, and they are on the same I2C bus.

Note: I use .py version of the sensor library because of splitting "range" function into "start range-finding" and "readout the range". The same error appears when using a stable version of the VL53L0X library compiled or not.

I think this will happen with any other I2C device when someone tries to read out data quickly.

Simmilar problem: issues 2635 edit: maybe only visually

tannewt commented 3 years ago

Could you please post your code.py so we might be able to reproduce?

I suspect the STM HAL is returning a more useful error code to us. Are you interested in helping debug this? Thanks!

eLEcTRiCZiTy commented 3 years ago

Hi, use the example code:

https://github.com/adafruit/Adafruit_CircuitPython_VL53L0X/blob/master/examples/vl53l0x_simpletest.py

set the delay at the end of loop to 0.1 sec, or lower. This should be enough, and the error should manifest itself in a reasonable time. Try code for a few minutes.

Because of this I2C bug I can not use Circuitpython. I use different development environment now, so I can not test this issue again to provide a small example. If issue will not be reproducible with "simpletest" example with small delay, then I will try to provide the example of this issue.

dhalbert commented 3 years ago

Just to know: have you tried this on any other CircuitPython boards? The #2635 bug is peculiar to the I2C device, as you saw.

eLEcTRiCZiTy commented 3 years ago

@dhalbert No, I do not have any other Circuitpython board.

edit: At first glance, the #2635 problem looked similar, but it probably not.

kevinjwalters commented 3 years ago

I'm not sure if it's the same thing, but my first venture into i2c also involved this error message and sensor but on a different board. In my case I had removed a 100ms pause when translating the code and that seemed to trigger it after a while. Gemma M0 gives OSError: 5 on i2c.writeto to tinyLiDAR. Perhaps it's worth verifying the VL53L0X isn't in one of the slower ranging modes?

eLEcTRiCZiTy commented 3 years ago

@kevinjwalters I used default settings. The distance measurement took approx. 35ms (I can't remember exactly). So, the 100ms delay should be perfectly fine. Wrong assumption. Reading of property sensor.range has all necessary delays inside. So, the user can use any delay, and the program must work.

eLEcTRiCZiTy commented 3 years ago

I found some time. The simplest code is:

import time
import board
import busio
import adafruit_vl53l0x

i2c = busio.I2C(board.SCL, board.SDA)
vl53 = adafruit_vl53l0x.VL53L0X(i2c)

while True:
    print("Range: {0} mm".format(vl53.range))
    time.sleep(0.1)

serial output: log.txt

Sometimes it takes longer. With shorter time.sleep(...), the error pops up in a shorter time. For example, with time.sleep(0.05) it take about 10 seconds for an error to pop up.

mdroberts1243 commented 3 years ago

I can confirm this problem is with the STM32F405 Feather and 6.0 alpha (latest). I can reproduce the problem quite easily with the AS7341 sensor and its simpletest. Reduce the sleep time to have the problem manifest quicker. It is even easier to reproduce the problem using the SH1107 quirk and the SH1107 128x64 OLED FeatherWing. When using the OLED I just move a lable around on the screen pseudo randomly to quicken the

I've seen I/O errors on write and write_then_readinto.

I have not been able to reproduce the problem with the Grand Central M4 Express and CP 6.0 alpha, successfully running the SH1107 display all night long (Playing the game-of-life).

I could not reproduce the problem with "Adafruit CircuitPython 5.3.1 on 2020-07-13; Feather STM32F405 Express with STM32F405RG". Using the AS7341 simpletest running all night long.

So it seems to be specific to the STM Feather and the latest CP 6.0 alpha.

DavePutz commented 3 years ago

I did some testing, and it looks like this error is happening due to some real inconsistencies in the return values from HAL_GetTick(). The errors I am seeing are caused by timeouts in I2C_WaitOnBTFFlagUntilTimeout(). Instrumenting that function in st_driver/STM32F4xx_HAL_Driver/Src/stm32f4xx_hal_i2c.c I see times when the the value from HAL_GetTick() is less than the starting ticks (and it's not due to overflow, as it is a 32-bit int and the values were far below the max). Other times I got values roughly 10,000 ticks larger than the start; but calling HAL_GetTick() again when we returned in common_hal_busio_i2c_write() showed smaller (and valid) values. Right now I do not have a good explanation for this, unless we are affecting the clock somewhere else in the code.

hierophect commented 3 years ago

I don't quite know how I missed this issue, but I have a possible idea of where it could be. We need to override the HAL_GetTick() system functions in order to return time properly with low power code, and this replacement function actually performs slightly differently directly after startup vs after the clocks have been initialized. Immediately after startup, HAL_GetTick must use the SystemTick, or it is impossible to correctly initialize the Low Speed External oscillator (LSE), which is required to actually use the RTC to acquire non-systick system time. After the LSE is initialized, we switch to RTC timekeeping and disable the Systick so that the low power code can take effect. However, at this tradeoff, the numbers returned by these two functions will not be consistent.

It's possible something else in the core is messing with the Systick or some other element of this system. I can hop on a debugger this week and see if I can replicate this issue.

DavePutz commented 3 years ago

I thought that as well, but the instrumentation I've been putting in shows that the issue occurs when we are just using the RTC. I can see that port_get_raw_ticks() is sometimes returning a value that is exactly 1024 less than a prior returned value. Since they are unsigned ints, that means that the timeout tests fail ( i.e. if current is < previous the difference is a very large value). It sounds like the 'seconds' variable is off by 1, but haven't been able to prove that.

On Mon, Oct 5, 2020 at 2:01 PM hierophect notifications@github.com wrote:

I don't quite know how I missed this issue, but I have a good idea of where it could be. We need to override the HAL_GetTick() system functions in order to return time properly with low power code, and this replacement function actually performs slightly differently directly after startup vs after the clocks have been initialized. Immediately after startup, HAL_GetTick must use the SystemTick, or it is impossible to correctly initialize the Low Speed External oscillator (LSE), which is required to actually use the RTC to acquire non-systick system time. After the LSE is initialized, we switch to RTC timekeeping and disable the Systick so that the low power code can take effect. However, at this tradeoff, the numbers returned by these two functions will not be consistent.

It's possible something else in the core is messing with the Systick or some other element of this system. I can hop on a debugger this week and see if I can replicate this issue.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/adafruit/circuitpython/issues/3376#issuecomment-703827261, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFNJKEWAESOPAMHTGZBDQFDSJIJZZANCNFSM4Q4QCQSA .

DavePutz commented 3 years ago

This is not only an I2C problem. The issue is in port_get_raw_ticks(), which can return a value less than the previous value. This simple Python script can demonstrate using monotonic_ns, which directly calls port_get_raw_ticks() :

import time

a=time.monotonic_ns()
while True:
    b=time.monotonic_ns()
    if b<a:
        print(f"Got {b} was {a}")
    a=b

This does not reproduce the issue as quickly as the I2C bus does, but I get output like Got 11268000061036 was 11268976562500 The difference of 976,501,464, when divided by the 976562 factor in common_hal_time_monotonic_ns(), comes out to right about 1 second. This matches the instrumentation I put in HAL_GetTick() that showed values being returned that were 1024 millseconds less than previous values. Reading the ST documentation at https://www.st.com/resource/en/application_note/dm00226326-using-the-hardware-realtime-clock-rtc-and-the-tamper-management-unit-tamp-with-stm32-microcontrollers-stmicroelectronics.pdf I see that it talks on page 55 about needing to clear the RSF bit, and then wait until it is set again before reading the calendar registers. @tannewt - is this perhaps what we are seeing?

jepler commented 3 years ago

The reference manual 26.3.6 has a number of caveats about how the RTC must be read for the content of the 3 registers to be synchronized. Maybe we're "[making] read accesses to the calendar in a time interval smaller than 2 RTCCLK periods"? It does seem that "BYPSHAD control bit is cleared" (it's the power on value and I don't see us changing it)

tannewt commented 3 years ago

@DavePutz @jepler Yup, that seems like it could be the issue.

mdroberts1243 commented 3 years ago

The game of life has gone through several generations on SH1107 OLED with the STM RTC fix. Much much further than I've ever seen before. This fix seems to fix the I2C issues I've been seeing.

veng1 commented 3 years ago

I'm seeing a similar error on a PyPortal where upon start up, an I2C UVC sensor may or may not work. And it appear random. It has worked for hours at a time and then just refuse to start with a "Value Error: No I2C device at address:39"

Which brings up a question: How does one decide if everything is up to date? It seems like a daily reload of CP and all libraries shouldn't be required every time there might be a suspicion that the libraries are out of date. Is there tool that can look at a CP disk and advise if anything needs an update?

tannewt commented 3 years ago

@veng1 I suspect the PyPortal is unrelated to this because it is a SAMD51. Check out circup for updating: https://github.com/adafruit/circup