dotnet / iot

This repo includes .NET Core implementations for various IoT boards, chips, displays and PCBs.
MIT License

Error 121 performing I2C data transfer #832

Closed tshaug closed 3 years ago

tshaug commented 4 years ago

Hi everybody,

I am using an Mlx90614 sensor and sporadically receive an exception when running on a Raspberry Pi. This issue seems to be similar to #163.

My code looks like this:


var settings = new I2cConnectionSettings(busId, i2CAddress);
using (I2cDevice i2c = I2cDevice.Create(settings))
{
   using Mlx90614 sensor = new Mlx90614(i2c);
   logger.Debug(" Got Mlx90614 sensor object");
   Iot.Units.Temperature irtemperature = sensor.ReadObjectTemperature();
   Iot.Units.Temperature ambientTemperature = sensor.ReadAmbientTemperature();

   logger.Debug($"Reading Mlx90614 temperatures done");
}

I am creating the I2C device object every 5 seconds, because I have to create another I2C device object with different settings (for the BME280) and I don't know if it is OK to have multiple at the same time.

Expected behavior: no error.

Actual behavior: sporadically I receive the following exception:

2019-10-30 22:33:14.869 +01:00 [INF] Error while using Mlx90614Reader: Error 121 performing I2C data transfer.
System.IO.IOException: Error 121 performing I2C data transfer.
   at System.Device.I2c.UnixI2cDevice.ReadWriteInterfaceTransfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.Transfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.WriteRead(ReadOnlySpan`1 writeBuffer, Span`1 readBuffer)
   at Iot.Device.Mlx90614.Mlx90614.ReadTemperature(Byte register)
   at Iot.Device.Mlx90614.Mlx90614.ReadAmbientTemperature()
   at Herzonaut.ObservingConditions.Raspi.Sensor.Mlx90614Reader.GetSensorDataInternal(I2cDevice i2cDevice) in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\Mlx90614Reader.cs:line 30
   at Herzonaut.ObservingConditions.Raspi.Sensor.AbstractI2CSensorReader`1.GetSensorData() in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\AbstractI2CSensorReader.cs:line 39

Versions used: System.Device.Gpio 1.0.0, Iot.Device.Bindings 1.0.0

Add following information:

Runtime Environment:
  OS Name: Windows
  OS Version: 10.0.17763
  OS Platform: Windows
  RID: win10-x64
  Base Path: C:\Program Files\dotnet\sdk\3.0.100\

Host (useful for support):
  Version: 3.0.0
  Commit: 7d57652f33

.NET Core SDKs installed:
  2.1.701 [C:\Program Files\dotnet\sdk]
  2.1.801 [C:\Program Files\dotnet\sdk]
  2.1.802 [C:\Program Files\dotnet\sdk]
  2.2.301 [C:\Program Files\dotnet\sdk]
  2.2.401 [C:\Program Files\dotnet\sdk]
  2.2.402 [C:\Program Files\dotnet\sdk]
  3.0.100 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Runtime Environment:
  OS Name: raspbian
  OS Version: 10
  OS Platform: Linux
  RID: linux-arm
  Base Path: /home/pi/astro/dotnet3/sdk/3.0.100/

Host (useful for support):
  Version: 3.0.0
  Commit: 7d57652f33

.NET Core SDKs installed:
  3.0.100 [/home/pi/astro/dotnet3/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 3.0.0 [/home/pi/astro/dotnet3/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.0.0 [/home/pi/astro/dotnet3/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs: https://aka.ms/dotnet-download

pgrawehr commented 4 years ago

Even though I think it shouldn't make a difference, it is clearly not necessary to recreate the device each time. You should be able to talk to different devices simultaneously (at least as long as you are performing everything on the same thread - not sure about thread safety of these classes)

tshaug commented 4 years ago

Hi Patrick,

Thanks for your comment. I will change my code accordingly. BTW: today I tested with a Raspberry Pi 3B; the initial test was with a Raspberry Pi 4. Both produce these exceptions (I did not expect otherwise, but who knows...). I also replaced the sensors.

Cheers Thomas

pgrawehr commented 4 years ago

I don't have this exact sensor module, but I've done quite extensive tests on I2C (e.g. with high-throughput data to LCD displays) and have never seen any exceptions (unless I disconnect the bus). I need to do some more I2C tests soon, so maybe I can reproduce it.

joperezr commented 4 years ago

Hello @tshaug thanks for logging this issue! Your code looks correct to me, and to answer your other question:

I am creating the I2C device object every 5 seconds, because I have to create another I2C device object with different settings (for the BME280) and I don't know if it is OK to have multiple at the same time.

You don't need to create an I2cDevice every time, as you are allowed to have more than one at the same time. Are you sure this device is connected correctly and at the correct address? To quickly test whether this is the problem, run the following command from your terminal window: i2cdetect 1 and see if you can spot your device at i2CAddress. This would be the first step toward diagnosing the problem.

tshaug commented 4 years ago

Hi @joperezr, thank you for your feedback.

I have executed i2cdetect 1:

pi@ThomasRaspPi:~ $ i2cdetect 1
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-1.
I will probe address range 0x03-0x77.
Continue? [Y/n] Y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- 5a -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- 76 --

So in my opinion it is returning the correct values (0x5a = 90 and 0x76 = 118 decimal).

As I wrote, I tested with different sets of sensors and Raspberry Pis and it happens in all combinations. And it only happens sporadically.

Cheers Thomas

joperezr commented 4 years ago

And so I suppose that in your code, the variable i2cAddress is set to either 0x5a or 0x76?

Do you have other i2c devices that you can try on the same bus just to make sure that this is not a problem with your bus? (the fact that you have tested in multiple RPis makes me think this is not the problem)

@ZhangGaoxing would you know what might be going on? This sounds like something specific to the binding; if it were a problem with I2cDevice, I believe we would already have seen it with other devices.

ZhangGaoxing commented 4 years ago

MLX90614 default I2C address is 0x5A. And it seems that i2cdetect has detected the corresponding address. Is your I2cDevice parameter set correctly?

ZhangGaoxing commented 4 years ago

Oh, I see. MLX90614 is an SMBus device. Try this to set up your Raspberry Pi: https://www.raspberrypi.org/forums/viewtopic.php?f=44&t=15840&sid=8c2fff5ce4395676800b4587f2a71b4e&start=25 (I mentioned this in #452).

tshaug commented 4 years ago

Hi @ZhangGaoxing sorry for the late reply. thank you for your explanation. I will check the provided links. And test with my RaspPis

Cheers Thomas

tshaug commented 4 years ago

Hi @ZhangGaoxing,

I looked into the post and also at my Raspberry Pi: [image]

To me it seems that my Raspberry Pi has a bcm2835, which does not understand the mentioned combine file.

Is my understanding of #452 correct that with the bcm2835 I don't need to configure anything specific, and the sensor binding should work out of the box? I ask because I still sporadically receive these errors.

Sorry maybe I am confused by the mentioned tickets.

Thanks, Thomas

pgrawehr commented 4 years ago

I have seen this error now also once, while reading from the I2C bus (from an ADS1115, to be precise) at a high frequency. Looks a bit like this error can happen if reading very quickly from the bus. I don't have a way of forcibly reproducing it yet, but I might try later.

tshaug commented 4 years ago

Hi Patrick,

thanks for the info. at the moment I am reading every 5 seconds. At which frequency are you reading from the bus?

cheers Thomas

pgrawehr commented 4 years ago

When it happened, I was reading a short value at 1 kHz or more. But it happened only once, and I was trying even higher read rates.

tshaug commented 4 years ago

interesting...

krwq commented 4 years ago

@pgrawehr, @tshaug any chance you can try running the reads e.g. in a loop outside/inside of the using statement and see if there is anything you can do to repro this?

Perhaps there are some docs which can explain when can this fail or we could add some diagnostics to help us figure out if this is something we do incorrectly or i.e. driver issue.

Are you guys up to date with kernel updates for your raspbian? (check version before and after you update and try reproing again perhaps)

pgrawehr commented 4 years ago

Will do, but I'll probably only find time on Saturday. It's of course possible that this error 121 just happens occasionally when a transmission error occurs, since the exact same error happens if you just disconnect one of the I2C wires.

krwq commented 4 years ago

@pgrawehr that's possible - no hurry since this doesn't seem to be consistently broken and rather occasionally fails. If you find that this is just transmission error we need to figure out what happens with corrupted frame and if this is something we should be retrying on our side (only if it's safe to retry) or let the error bubble up and let the user decide

tshaug commented 4 years ago

Hi everybody, sorry for the late reply (I have been busy demonstrating my sensor prototype and code at a .NET conference ;-)). I have found one issue in how my software accesses the Mlx90614 temperature values:

protected override Mlx90614SensorData GetSensorDataInternal()
{
    Iot.Units.Temperature irtemperature = sensor.ReadObjectTemperature();
    Iot.Units.Temperature ambientTemperature = sensor.ReadAmbientTemperature();

    logger.Debug($"Reading Mlx90614 temperatures done");
    return new Mlx90614SensorData(SensorDataQuality.Good, irtemperature.Celsius, ambientTemperature.Celsius);
}

I discovered that it was always the ReadAmbientTemperature() call that failed. To me this looks like the same kind of high-frequency access that @pgrawehr described, because it is the second call to the sensor, issued immediately after the first. I have now added a 200 ms delay between the two calls to the sensor.

This at least eased the situation: [image] During the test session an exception occurred only twice. This is not perfect, but it is OK for me.

If you like we can close the topic.

Cheers and thanks again Thomas

krwq commented 4 years ago

@tshaug

I discovered that it was always the ReadAmbientTemperature() call that failed

Do you mean it failed with error 121 on the transmission level? I'm wondering if this is sensor specific (i.e. still processing previous requests and not giving ACK on the line) or something else. Considering the delay is fixing the issue this sounds like sensor specific problem... If that's the case I'm voting this is Mlx90614 bug (vs I2cDevice bug as title suggests)

pgrawehr commented 4 years ago

When I tried again last week, I ran I2C transfer operations for several hours at a very high rate (basically a ReadValue in an infinite, untimed loop). I was not able to reproduce the problem. So it may really be a sensor-specific issue with a missing ACK or something and so is an intermittent issue.

The question is whether we should internally handle this problem with a few retries or let the user handle it?

@tshaug If you are able to reproduce the problem consistently, can you try whether a retry works or whether the bus is in some undefined state after this exception?

krwq commented 4 years ago

@pgrawehr we should start with digging in the spec if there is something we can do to handle this gracefully but if there is nothing in there I suggest we dig more into why the tiny delay is making the reading more reliable - perhaps there is some delay which can get us to close to 100% correctness - if still can't get there then retries are ok I guess...

tshaug commented 4 years ago

@pgrawehr: in my opinion a retry should work (at least I am able to "re-use" the bus/sensor 5 seconds later). I will test with the following code:

protected override Mlx90614SensorData GetSensorDataInternal()
{
    Iot.Units.Temperature irtemperature = sensor.ReadObjectTemperature();

    // I sometimes receive 0 (zero) values, so add some delay before reading the ambient temperature
    Task.Delay(TimeSpan.FromMilliseconds(200)).Wait();

    Iot.Units.Temperature ambientTemperature;
    try
    {
        ambientTemperature = sensor.ReadAmbientTemperature();
    }
    catch (IOException ioException)
    {
        logger.Debug($"Reading Mlx90614 temperatures retry: {ioException.Message}");

        // see: https://github.com/dotnet/iot/issues/832
        // Wait briefly, then retry once; a second failure propagates to the caller.
        Task.Delay(TimeSpan.FromMilliseconds(100)).Wait();
        ambientTemperature = sensor.ReadAmbientTemperature();
    }

    logger.Debug($"Reading Mlx90614 temperatures done");
    return new Mlx90614SensorData(SensorDataQuality.Good, irtemperature.Celsius,
        ambientTemperature.Celsius);
}

But only this evening. I will let you know about the results afterwards

Cheers Thomas

tshaug commented 4 years ago

Yesterday I did some extensive testing. As mentioned, I implemented a simple retry mechanism in my sensor client. I ran the sensor app and its client for more than 8 hours: [image] (object temperature and ambient temperature; the peaks are when I held my hand right in front of the Mlx90614 sensor)

So everything works quite well. Later on I analysed the logs that I write on the Raspberry Pi. There I discovered 3 errors (sorry, the time and date of the Raspberry Pi are wrong; I didn't realize this when I started the test session):

1) Once my retry failed (= 2 consecutive errors):

2019-12-01 20:17:37.839 +01:00 [DBG] Start using Mlx90614Reader
2019-12-01 20:17:39.086 +01:00 [DBG] Reading Mlx90614 ambient temperature retry: Error 110 performing I2C data transfer.
2019-12-01 20:17:39.193 +01:00 [ERR] Reading Mlx90614 ambient temperature retry also failed

2) The Mlx90614 sensor failed once while reading the object temperature (which I had not seen so far; no retry is implemented there at the moment):

2019-12-01 20:17:45.292 +01:00 [INF] Error while using Mlx90614Reader: Error 121 performing I2C data transfer.
System.IO.IOException: Error 121 performing I2C data transfer.
   at System.Device.I2c.UnixI2cDevice.ReadWriteInterfaceTransfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.Transfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.WriteRead(ReadOnlySpan`1 writeBuffer, Span`1 readBuffer)
   at Iot.Device.Mlx90614.Mlx90614.ReadTemperature(Byte register)
   at Iot.Device.Mlx90614.Mlx90614.ReadObjectTemperature()
   at Herzonaut.ObservingConditions.Raspi.Sensor.Mlx90614Reader.GetSensorDataInternal() in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\Mlx90614Reader.cs:line 27
   at Herzonaut.ObservingConditions.Raspi.Sensor.AbstractI2CSensorReader`1.GetSensorData() in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\AbstractI2CSensorReader.cs:line 38

3) The BME280 (which I use to measure temperature (as well), pressure and humidity) also failed once with the same error:

2019-12-01 20:17:40.209 +01:00 [INF] Error while using Bme280Reader: Error 121 performing I2C data transfer.
System.IO.IOException: Error 121 performing I2C data transfer.
   at System.Device.I2c.UnixI2cDevice.ReadWriteInterfaceTransfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.Transfer(Byte* writeBuffer, Byte* readBuffer, Int32 writeBufferLength, Int32 readBufferLength)
   at System.Device.I2c.UnixI2cDevice.WriteByte(Byte value)
   at Iot.Device.Bmxx80.Bmxx80Base.Read8BitsFromRegister(Byte register)
   at Iot.Device.Bmxx80.Bmxx80Base.SetTemperatureSampling(Sampling sampling)
   at Herzonaut.ObservingConditions.Raspi.Sensor.Bme280Reader.GetSensorDataInternal() in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\Bme280Reader.cs:line 23
   at Herzonaut.ObservingConditions.Raspi.Sensor.AbstractI2CSensorReader`1.GetSensorData() in C:\d\dn\Herzonaut\git\master\Herzonaut.ObservingConditions.Raspi.Sensor\AbstractI2CSensorReader.cs:line 38

To me the third error is the most interesting one. It seems that Error 121 sometimes happens with the sensors I use. But the system still works afterwards, so it is not a big deal for me.

krwq commented 4 years ago

@tshaug, the BME280 error is in fact interesting. I have a couple lying around, one of them connected all the time, and I haven't seen any issue so far.

Do you have both sensors connected to the same Pi at the same time? Is it possibly related to them reading/writing at the same time? I'm wondering if this is some I2C threading issue which we should fix on our side.

rhuneai commented 4 years ago

I'd like to preface my comment with the fact that I am very new to i2c, raspbian and the MLX90614.

I have seen this issue as well, exactly as described above. When querying very fast to the sensor, the 121 exception is thrown occasionally. I would like to add, however, that I can repro this in Python using the smbus library. I don't know how similar this is to System.Device.I2c.

This python script can repro the error. With a quick test it failed 43 out of 50 times, 7 times it worked and returned expected data:

import smbus
BUS = smbus.SMBus(1)
DEVICE_ADDRESS = 0x5a
temp = BUS.read_word_data(DEVICE_ADDRESS, 0x07) * 0.02 - 273.15
emis = BUS.read_word_data(DEVICE_ADDRESS, 0x24) / 65535
print(temp)
print(emis)

The error only ever occurs on the second read, and testing with a longer script over 20000 iterations the failure rate was 70 %.

>>> %Run mlx.py
Traceback (most recent call last):
  File "/home/pi/Documents/mlx.py", line 5, in <module>
    emis = BUS.read_word_data(DEVICE_ADDRESS, 0x24) / 65535
OSError: [Errno 121] Remote I/O error

With only a print statement in between the reads I didn't see a failure over many thousands of iterations:

import smbus    
BUS = smbus.SMBus(1)
DEVICE_ADDRESS = 0x5a
i=0

while True:
    temp = BUS.read_word_data(DEVICE_ADDRESS, 0x07) * 0.02 - 273.15
    print(i)
    emis = BUS.read_word_data(DEVICE_ADDRESS, 0x24) / 65535
    i += 1

My setup is:

I am leaning towards the issue being with the sensor not being able to keep up, but my knowledge here is very limited.
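A minimal sketch of the retry idea discussed in this thread, written in Python to match the smbus repro above. `read_word_with_retry`, `raw_to_celsius` and `FlakyBus` are hypothetical names (not part of smbus); the errno values 121 and 110 are simply the ones reported in this thread, and `FlakyBus` only simulates a sensor that fails a few times so the snippet runs without hardware. With a real sensor, an `smbus.SMBus(1)` object would be passed instead.

```python
import time

# Error numbers reported in this thread (Linux): 121 and 110.
TRANSIENT_ERRNOS = {121, 110}


def read_word_with_retry(bus, address, register, retries=3, delay_s=0.1):
    """Call bus.read_word_data, retrying only on the transient errors above.

    Hypothetical helper: `bus` is any object with an smbus-style
    read_word_data(address, register) method.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return bus.read_word_data(address, register)
        except OSError as error:
            if error.errno not in TRANSIENT_ERRNOS:
                raise  # unrelated failure: do not hide it behind retries
            last_error = error
            if attempt < retries:
                time.sleep(delay_s)  # give the sensor time before retrying
    raise last_error


def raw_to_celsius(raw):
    """Same conversion as the script above: 0.02 K per count, then K to Celsius."""
    return raw * 0.02 - 273.15


class FlakyBus:
    """Simulated bus (assumption for the demo): fails `failures` times, then works."""

    def __init__(self, failures, value):
        self.failures = failures
        self.value = value
        self.calls = 0

    def read_word_data(self, address, register):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(121, "Remote I/O error")
        return self.value


if __name__ == "__main__":
    bus = FlakyBus(failures=2, value=15000)
    raw = read_word_with_retry(bus, 0x5A, 0x07, delay_s=0.0)
    print(raw_to_celsius(raw), "after", bus.calls, "attempts")
```

Whether a retry is safe depends on the transaction; for a plain register read like this one it should be harmless, but that is an assumption rather than something confirmed in this thread.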

krwq commented 4 years ago

Ok, We're likely gonna start seeing this on CI so might be worth at least wrap the exception in something more convenient... Perhaps we should at least throw some exception type we could use for retry (i.e. I2cException or something)... other option perhaps could be ProtocolException - or do we leave it as is? cc: @joperezr

joperezr commented 4 years ago

We could probably wrap the exception in order to provide a better message, but usually when you get one of these it means that the communication on the I2C bus can't find the sensor, so retrying won't really help at all. In CI, for example, I expect that once we see this on one of the devices, it will fail the test on every single run on that machine until we go to the lab and reconnect the sensor correctly.

pgrawehr commented 4 years ago

@joperezr Unfortunately, it's not that easy. You are right that error 121 happens when the device does not answer, but as we've seen from several reports now (mine included), it can also happen intermittently. For reasons that are not yet exactly understood, the error sometimes pops up after the system has been running fine for minutes or even hours, and then goes away again just as it came.

pgrawehr commented 4 years ago

Just made an interesting observation. I was seeing quite a lot of these errors with a particular ADS1115. There are a total of 7 chips connected to this bus (two ADS1115, two MCP23017, a BMP280, a BME680 and an LCD display). The problematic ADS, which I'm reading at about 1 Hz, reported failures about every 2nd or 3rd attempt; the second ADS, which I'm reading at 5 Hz, reported failures about every 100th time; all the other sensors very rarely had errors, even though the display in particular is written at high rates. Replacing the presumably broken ADS didn't help, but disabling it in the software (so only using all the other devices) fixed all problems, including the sporadic errors the other sensors had. So I assumed that the device might somehow interfere on the bus with the other devices. And the only way this can normally happen is if the devices don't use the correct device addresses.

The ADS1115 comes on a breakout board with a pull-down resistor on the ADDR line, which defines the address to use (0x48 - 0x4B for this chip). Leaving it externally open normally sets the address to 0x48, but apparently not reliably enough. It seems that noise can interfere with the input (likely because the pin can be connected to SDA or SCL to get a different address). Bottom line: don't leave the address pins of any I2C device open, even if they're equipped with pull-ups or pull-downs on a breakout. Hardwiring the line to ground fixed all the issues, or let's say improved things significantly; I haven't run it for long enough yet.

krwq commented 4 years ago

I'm not sure what to do about this issue other than to perhaps add optional retry logic...

pgrawehr commented 4 years ago

Yea, I guess all one can do is add retries (this works fine for me, even though I still have plenty of these errors happening). We could consider an auto-retry feature.

Ellerbach commented 4 years ago

For all the sensors I have, I always add retry logic, an overall catch mechanism and data cleaning. Nothing is perfect: errors can come "from the wires", from the sensor itself or from software, and no measurement is ever exactly correct anyway. They are all approximations of the real world :-)

pgrawehr commented 4 years ago

True... We could either add auto-retry or at least update the documentation (which one...?) to make clear that these exceptions can and will occasionally happen. It seems that the issue is mostly about getting such an exception after an application has run flawlessly for hours.

Ellerbach commented 4 years ago

We could either add auto-retry

Some bindings where it happens often already have those in place. So I wouldn't overdo it.

at least update the documentation (which one...?)

Yes, this is clearly what we can do. I would say in the main binding page. I would say, right after the Binding Distribution section. something like good practices when working with embedded devices:

krwq commented 4 years ago

perhaps since we consider errors as something normal we might want to consider adding TryWrite/TryRead methods to avoid try catches. I think the right place to put this would be once we have raspi-spi.md file similar to pwm version
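The TryRead idea can be sketched as follows, again in Python to match the smbus repro earlier in the thread. This is only an illustration of the pattern, not an API of System.Device.I2c or smbus: `try_read_word`, `collect_samples` and `StubBus` are hypothetical names, and `StubBus` replays canned results so the snippet runs without a sensor.

```python
def try_read_word(bus, address, register):
    """Return (True, value) on success, or (False, None) if the read fails.

    Hypothetical helper: the caller checks a flag instead of writing try/except.
    """
    try:
        return True, bus.read_word_data(address, register)
    except OSError:
        return False, None


def collect_samples(bus, address, register, count):
    """Attempt `count` reads and keep only the ones that succeeded."""
    samples = []
    for _ in range(count):
        ok, raw = try_read_word(bus, address, register)
        if ok:
            samples.append(raw)
    return samples


class StubBus:
    """Simulated bus (assumption for the demo): replays a list of results."""

    def __init__(self, responses):
        self._responses = list(responses)

    def read_word_data(self, address, register):
        item = self._responses.pop(0)
        if isinstance(item, Exception):
            raise item  # simulate a failed transfer
        return item


if __name__ == "__main__":
    bus = StubBus([100, OSError(121, "Remote I/O error"), 102])
    print(collect_samples(bus, 0x5A, 0x07, 3))
```

The trade-off is that the error detail is dropped; a real design would probably need to keep the exception or an error code available for diagnostics.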

Ellerbach commented 3 years ago

Will close this issue as we've added documentation on this behavior. Also we've added specific documentation on spi and I2C as well to enable them. And mentioned the retry mechanism. Feel free to reopen if needed.