arduino / ArduinoCore-avr

The Official Arduino AVR core
https://www.arduino.cc

I2C/Wire library: Make Wire library non-blocking #42

Closed wmarkow closed 2 years ago

wmarkow commented 5 years ago

This is a placeholder issue to cover an arduino/Arduino/issues/1476 improvement.

It looks like arduino/ArduinoCore-avr repository is the correct one to cover that case, doesn't it?

wmarkow commented 5 years ago

I have just integrated my proposal from arduino/Arduino/issues/1476.

matthijskooijman commented 5 years ago

Possible implementations:

Looking at these, I suspect that the version by @wmarkow and my own might be the best starting points, but neither completely solves the problem AFAICS.

One particular challenge I've found (but I'm not sure if it is really properly documented yet), is that the AVR TWI hardware is particularly sensitive to noise that makes it look like there is a second master on the bus. Since I²C is a multi-master bus, when the hardware detects an external start condition, or an arbitration error, it will assume there is another master active and hold off any bus activity until it sees a stop condition. However, when there is no other master but just noise on the bus, this will probably never happen.

When the hardware ends up in such a state, there is no way to actually detect this state (i.e. no status bit or anything), other than detecting a timeout (i.e. detecting that the hardware hasn't finished in reasonable time, so probably never started in the first place). There is an arbitration-lost interrupt that can detect the start of this state in some cases (but not the end). The only way I've found to recover from this situation is to disable and re-enable the TWI hardware.
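The failure mode described above can be illustrated with a toy model. This is a host-side simulation, not AVR register code: `ToyTwi`, its members, `Result`, and the poll-count "timeout" are all hypothetical names introduced for illustration. It models hardware that treats a spurious START as a second master and then waits for a STOP that never arrives, so that only a timeout can notice the state and only a disable/re-enable clears it.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the behaviour described above (hypothetical, not real AVR code).
struct ToyTwi {
    bool enabled = true;
    bool busBusy = false;  // hardware believes another master owns the bus

    // Noise that looks like a START condition from another master.
    void spuriousStart() { if (enabled) busBusy = true; }

    // A real STOP from another master would clear the state; noise never sends one.
    void stopSeen() { busBusy = false; }

    // Disable and re-enable the peripheral, which also forgets the "busy" state.
    void reset() { enabled = false; busBusy = false; enabled = true; }

    // One polling step of a transfer: it only succeeds if the bus looks free.
    bool tryTransfer() const { return enabled && !busBusy; }
};

enum class Result : uint8_t { Ok, Timeout };

// Poll up to maxPolls times and report a timeout instead of spinning forever.
Result transferWithTimeout(ToyTwi& twi, uint32_t maxPolls) {
    assert(maxPolls > 0);
    for (uint32_t i = 0; i < maxPolls; ++i) {
        if (twi.tryTransfer()) return Result::Ok;
    }
    return Result::Timeout;
}

// Recovery policy: on timeout, reset the peripheral and retry once.
Result transferWithRecovery(ToyTwi& twi, uint32_t maxPolls) {
    if (transferWithTimeout(twi, maxPolls) == Result::Ok) return Result::Ok;
    twi.reset();
    return transferWithTimeout(twi, maxPolls);
}
```

In this model nothing but the elapsed poll count distinguishes "slow" from "stuck", which is why the choice of timeout value matters so much in the rest of the discussion.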

So adding timeouts is probably the only way to fix this. However, just having hardcoded timeouts as most of these implementations (including my own) have is problematic, because:

I suspect the only really correct way to handle this is to let the sketch specify custom timeout values (or perhaps specify the max clock stretching time, whether multi-master should be supported and if so, the max transaction time of other masters). However, this requires API changes that made implementing this a lot more tricky, which is why I didn't submit my fixes for inclusion (and instead also ended up with hardcoded timeouts tailored to my specific case without multi-master and with limited clock stretching).

VladimirAkopyan commented 4 years ago

Custom timeout values are perfectly fine, and I believe an API change is justified. The current situation leads to permanent lock-up of the microcontroller, and is completely unacceptable. An inexperienced developer may know nothing about this issue, and it will manifest itself many months after the hardware has been designed, built and installed. Sometimes people do use Arduino for serious projects because that's all they can do to solve a problem.

greyltc commented 4 years ago

These lockups are murdering me right now! I've just tried https://github.com/IanSC/Arduino-Wire.h and I very much do not recommend it, seems to slow bus comms to a crawl without solving the issue. I'll go on down the list...

greyltc commented 4 years ago

I've just tried https://github.com/3devo/ArduinoLibraryWire on my I2C lockups and it doesn't solve them either :-( Maybe I'm using it wrong though? I didn't see any changes to Wire.h so I didn't do anything different in my interface to the library.

greyltc commented 4 years ago

https://github.com/wmarkow/Arduino/tree/issue_%231476 seems to prevent my firmware from locking up! :tada:

greyltc commented 4 years ago

I've been looking at these lockups in my scope very closely for a few days now. I have an idea what the root cause of them might be in my application (one master = Mega2560 rev3, one slave = TI's ADS122C04).

See https://github.com/3devo/ArduinoLibraryWire/issues/1 for my details with scope traces if you're interested. I have some ideas for changing the Wire library to prevent them in the first place, but I haven't been able to figure out how to actually implement those fix ideas in the code yet.

wmarkow commented 4 years ago

Hello @greyltc,

https://github.com/wmarkow/Arduino/tree/issue_%231476 seems to prevent my firmware from locking up! :tada:

I was just about to suggest that you check out my code. Good that it works for you. However, not everything seems to be covered there. It is nice that you give a few more cases to look into:

* I do a `Wire.setClock(400000);` before I get started, but after the unlock procedure, that's forgotten and the bus runs at 100kHz

Indeed, when a timeout condition is met, I restart the TWI hardware (twi_disable() and twi_init()). In this review it is suggested that it should not restart TWI but return a result code instead (or set some flag indicating a timeout failure). The user can check the flag later in the main loop and reinit TWI in their own way (like setting the clock to 400000). In my case that would help, but I'm not sure it works for you, since you need to recover from a timeout failure very fast so you can make your ADC conversion. Imagine the case: you do not know exactly in which part of your code the timeout will happen. The timeout is set to 10 ms (for example). Let's assume there are something like 20 other TWI operations somewhere in the code between the timeout and your ADC conversion code. I have the feeling that all of those 20 TWI operations may end up timing out (but I'm not 100% sure), so it will take at least 20 x 10 ms = 200 ms before your ADC code is executed. In my solution I can wait those 200 ms and reinit TWI later in loop. For you, the Wire library may go into a "timeout failure" state and may/should not execute any TWI operations (all API methods may/should return immediately with a correct result code) until you reinit it somewhere in the main loop. That's only a proposed solution.
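The "timeout failure" state proposed above can be sketched as a sticky flag. This is a host-side simulation under stated assumptions: `StickyTimeoutBus`, `TwiResult`, `injectLockup()` and `reinit()` are hypothetical names, not part of the Wire library or of any branch discussed here. It shows the arithmetic from the comment: with the flag, only the first of the 20 operations pays the 10 ms timeout, and the rest return immediately until the sketch reinitialises the bus.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result codes; not the Wire library's actual return values.
enum class TwiResult : uint8_t { Ok, Timeout, NotReady };

class StickyTimeoutBus {
public:
    explicit StickyTimeoutBus(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);
    }

    // Simulated bus fault: every real transfer would now run into the timeout.
    void injectLockup() { lockedUp_ = true; }

    // One simulated TWI operation. Returns at once when the failure flag is set.
    TwiResult transfer() {
        if (failed_) return TwiResult::NotReady;  // fail fast, no waiting
        if (lockedUp_) {
            elapsedMs_ += timeoutMs_;             // waited for the full timeout
            failed_ = true;                       // remember the failure
            return TwiResult::Timeout;
        }
        return TwiResult::Ok;
    }

    // Called by the sketch (e.g. from loop()): recover the bus and clear the flag.
    void reinit() { lockedUp_ = false; failed_ = false; }

    bool failed() const { return failed_; }
    uint32_t elapsedMs() const { return elapsedMs_; }

private:
    uint32_t timeoutMs_;
    uint32_t elapsedMs_ = 0;  // total simulated time spent waiting on timeouts
    bool lockedUp_ = false;
    bool failed_ = false;
};
```

Without the flag, 20 locked-up operations at 10 ms each would block for 200 ms in total; with it, the simulated cost stays at a single 10 ms timeout.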

* I'm in a loop fetching values from an ADC with great speed. I'd love to be able to set the timeout in microseconds instead of milliseconds. I can be pretty sure I'm in a lock state in my application when 100us goes by without traffic on the bus and I want to do everything I can to recover from a lock up quickly so I can miss as few ADC conversions as possible!

Yes, my code sets the timeout in milliseconds, but there seems to be no problem reworking it into microseconds.

greyltc commented 4 years ago

@wmarkow, I took your https://github.com/wmarkow/Arduino/tree/issue_%231476 branch, changed the timeout argument to microseconds, made sure that any changes the user might have made to the slave address or bitrate (the only two register values exposed by the Wire library) are re-applied after the reset, and put it in PR arduino/Arduino#107
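The idea of re-applying user settings after a reset can be sketched as a save/restore around the re-initialisation. This is a host-side mock, not the code from the PR: `MockTwiRegs`, `mockReinit()` and `resetPreservingSettings()` are hypothetical names, and the two fields only stand in for the AVR's TWBR (bit rate) and TWAR (own slave address) registers.

```cpp
#include <cassert>
#include <cstdint>

// Mock of the two settings mentioned above. On a real AVR these would be the
// TWBR (bit rate) and TWAR (own slave address) registers.
struct MockTwiRegs {
    uint8_t twbr = 72;  // default here mimics 100 kHz on a 16 MHz board
    uint8_t twar = 0;
};

// Hypothetical re-initialisation: puts the mock back into its default state,
// which is what loses an earlier clock or address change.
void mockReinit(MockTwiRegs& regs) { regs = MockTwiRegs{}; }

// Reset the peripheral but keep what the user configured.
void resetPreservingSettings(MockTwiRegs& regs) {
    const uint8_t savedTwbr = regs.twbr;
    const uint8_t savedTwar = regs.twar;
    mockReinit(regs);
    regs.twbr = savedTwbr;
    regs.twar = savedTwar;
    // Postcondition: the user's settings survived the reset.
    assert(regs.twbr == savedTwbr && regs.twar == savedTwar);
}
```

With a plain `mockReinit()` a 400 kHz setting (TWBR = 12 at 16 MHz) silently falls back to the 100 kHz default, which matches the symptom reported earlier in the thread.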

Hyperdimensionals commented 4 years ago

I'm using a DS3231 external clock and noticed it 'freezes' when I disconnect power to the component. I want the program to continue looping even if the clock is disconnected. I narrowed it down to a Wire.endTransmission() line in the module code, and then found this page.

Forgive my possibly ignorant novice question, but which implementation listed by @matthijskooijman works best for this issue? Or is this a situation where there's some line of code I can add that will check whether the clock device is connected/powered before reaching the Wire.endTransmission() line? A lot of this discussion is over my head so I wasn't sure if this has a simpler solution than the issues discussed above?

Mahdi-Hosseinali commented 4 years ago

I'm using a DS3231 external clock and noticed it 'freezes' when I disconnect power to the component. I want the program to continue looping even if the clock is disconnected. I narrowed it down to a Wire.endTransmission() line in the module code, and then found this page.

Forgive my possibly ignorant novice question, but which implementation listed by @matthijskooijman works best for this issue? Or is this a situation where there's some line of code I can add that will check whether the clock device is connected/powered before reaching the Wire.endTransmission() line? A lot of this discussion is over my head so I wasn't sure if this has a simpler solution than the issues discussed above?

Use Wire.endTransmission(true); it would end the transmission even if there's no response from the other side. Although this still does not act like a TIME_OUT.

Hyperdimensionals commented 4 years ago

Wire.endTransmission(true) worked for the SDA and SCL lines getting disconnected, but unfortunately when the power line is disconnected it still blocks. I'm using a battery powered clock module so the battery could die and I don't want that to become a failure point.

Mahdi-Hosseinali commented 4 years ago

It probably depends on your circuitry, mine works when the other device is off. The reference manual says it sends a stop signal and releases the line, but it apparently doesn't. Or it could be that something else is wrong and we are not adept enough to understand it (electronics can be very complicated)

ermtl commented 4 years ago

It's been 9 FUCKING YEARS since that bug was first discussed (2011). Countless people pulled their hair out trying to understand why their Arduino was freezing, why it would work normally, then suddenly stop. A dozen times it's been raised in here, dozens of times people have been told to get lost, use something else, etc ... arduino/Arduino#7328 arduino/Arduino#2418 arduino/Arduino#1842 arduino/Arduino#5249 arduino/ArduinoCore-avr#42 Google shows page after page of people getting started who get discouraged by this elusive bug they don't understand. This runs totally contrary to what Arduino is supposed to stand for.

If the fix was difficult, if it compromised compatibility or added other problems, it would be understandable, but that's not the case. All that shit is caused by those 2 damn lines of code that obviously can create an infinite loop if for some reason the read operation does not complete:

// wait for read operation to complete
while(TWI_MRX == twi_state){
  continue;
}

Yes, I know, this state is not supposed to happen according to the I2C protocol, but guess what ? electrical glitches didn't get the memo.
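For illustration, a bounded version of that wait loop might look like the following. This is a minimal host-side sketch, not the fix that was eventually merged: `twi_state` and `TWI_MRX` mirror the names in the quoted snippet, while `nowMicros()`, `fakeMicros` and `waitForReadComplete()` are hypothetical stand-ins (on an Arduino the clock role would be played by `micros()`).

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the library internals (assumptions for this sketch only).
enum TwiState : uint8_t { TWI_READY, TWI_MRX };
volatile TwiState twi_state = TWI_READY;

// Simulated microsecond clock: every call advances time by 10 us.
uint32_t fakeMicros = 0;
uint32_t nowMicros() { return fakeMicros += 10; }

// Wait for the read to complete, but give up after timeoutUs microseconds.
// Returns true if the state machine left TWI_MRX, false on timeout.
bool waitForReadComplete(uint32_t timeoutUs) {
    assert(timeoutUs > 0);
    const uint32_t start = nowMicros();
    while (TWI_MRX == twi_state) {
        // Unsigned subtraction stays correct when the counter wraps around.
        if (nowMicros() - start >= timeoutUs) {
            return false;
        }
    }
    return true;
}
```

The caller still has to decide what to do when this returns false (report an error, reset the hardware, or both), which is the part the rest of the thread argues about.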

Countless people, after losing hours or days kind of solved the issue either by making a modified version for themselves, or switching to another library, so there are several implementations of a timeout that are simple and would easily solve the issue.

But arduino developers stubbornly refuse to fix it !

You think I'm rude ? After 9 years of giving the finger on this issue to the whole community for absolutely no reason, I couldn't care less.

How long before you close it again ? Just be honest, and put a "fuck you all, WONTFIX" label on it ...

VladimirAkopyan commented 4 years ago

Basically, @ermtl is right, this is ridiculous, WTF?

cmaglie commented 4 years ago

You think I'm rude?

Nah I don't think it... you are, period.

Also posting the same message 5 times is not nice, especially for the 99.9% of people subscribed to this repo that are not interested in this discussion. Anyway, now you did it, so let's just skip over this.

How long before you close it again ?

We haven't closed anything... the real problem is that there isn't a one-size-fits-all solution. If you read the previous discussions, all the proposed fixes were always barely tested and tailored to the OP's specific use case.

I personally tested two PRs in the past that promised to add a timeout "without breaking anything", but you know what? When I tested them in my setup (an I2C display with a bunch of env sensors), they just broke my previously perfectly working sketch or made it unreasonably slow.

Don't underestimate the complexity of this issue, it's not just "2 damn lines of code".

That said, I ask you: did you read the previous discussions? Which of the previously proposed solutions do you vouch for? Have you already tested them in your setup?

VladimirAkopyan commented 4 years ago

@cmaglie sure, @ermtl is rude, but his assessment is broadly accurate. I am willing to bet my house that someone is going to use Arduino to build a ventilator to deal with COVID-19, and it will hang and kill someone.

I tried to follow the discussion, but it's spread over many threads. Let's consider a strategy here:

  1. What is the central place for this discussion, is it this thread? Is it this PR? https://github.com/arduino/Arduino/pull/2489
  2. What are the acceptance criteria, can we define a list of libraries and sensors that we are determined not to break? Are breaking changes acceptable?
  3. What is the strategy if breaks are unavoidable? For libraries that will break, are we going to do a PR or get in touch with the author?
  4. Given that the I2C protocol is often abused and misused, including by sensor manufacturers, how do we cater to those cases?
  5. Perhaps the alternative is to create a separate method / API and gradually deprecate the old one? Would that be acceptable?

I think the frustration is not just from lack of a solution - that's understandable if the problem is complex. I think it stems from lack of clarity and leadership; there does not appear to be a plan on how to get it fixed. If someone wanted to contribute, it is not clear what they should do, at least to me.

bxparks commented 4 years ago

@VladimirAkopyan: You would win that bet. I was amazed and mortified when one of the authors of this (https://github.com/ermtl/Open-Source-Ventilator) asked me a question about one of my libraries. Fortunately, I just noticed this in their README.md: "0.15 Minor version change. This version allows the replacement of the buggy Wire.h arduino library (can hang the controller) with a correct version known as jm_Wire"

ermtl commented 4 years ago

@bxparks: Well, it's a small world out there and I'm the main author of this Open Source Ventilator controller as part of an international team trying to design and assemble complete open source, easy to replicate emergency ventilators.

A few months ago I had the problem on a commercial product, and I went all the way: adding capacitors to better filter supply electrical noise, lowering the value of the pullup resistors to 1 kΩ, adding small series resistors just in case, redesigning the wiring, thinking that I might have had a bad lot of ICs, ordering new ones, waiting for them to arrive, seeing they were no better, adding a watchdog (but it was just an ugly fix), searching for a new similar sensor, ordering the new sensor, trying it, understanding that it could not possibly be 2 different kinds of sensors that are both bad in the same way, finally suspecting something could be wrong with the library, and finally finding this utterly stupid bug! I then switched to this library that simply includes timeouts: http://dsscircuits.com/articles/arduino-i2c-master-library It was nice, but forced me to rewrite the sensor library. In the meantime, what I billed as a 1 week job had taken me more than 2 months (that's 1 month 3 weeks I could not even imagine billing the customer anything for) and the customer grew impatient. When I finally solved the issue, it was a sigh of relief, but in the process, my customer's trust got broken, and he hired someone else to rebuild it, and it's not with an Arduino!

This is just what happened to me, and the bug was caught fast as the device was in a noisy environment and it would lock up every few hours / minutes, but some are designing products that are not in an electrically noisy environment, and they might never see the bug, start selling their stuff, and then people place them close to motors or old blinking fluorescent tubes, and suddenly the products start to fail!

People use Arduinos as the glue between sensors and actuators, and a large proportion of them are I2C circuits. Having such a vicious bug lurking, ready to lock up the entire board, undermines the whole platform and confines it to "hobby" status. Even then, when newcomers assemble their first gizmos, seeing it die on them for apparently no reason is really frustrating and might drive them away.

In the Open-Source-Ventilator project, many people must cooperate, we're in a hurry, people are trying several sensors, so telling them they can't use any of the most usual libraries without rewriting them first is totally unrealistic.

After spending a day on the problem (again), I found the jm_wire library (available in the library manager) that's compatible with Wire and the author simply made minimal changes to implement a fucking timeout, and, call that rocket science if you will, but yes, the timeout length can be changed, no less ! Crazy ... just what everyone asked for 9 years.

However, it does not entirely solve the issue as I2C sensor libraries will reference the wire library within their code with

#include <Wire.h>

When also including jm_Wire, all the library's functions will be defined twice, and you can bet the compiler will yell and fail.

So the least ugly way I found is to look at the source files of each and every I2C related library, searching for #include <Wire.h> and manually replacing it with:

#if __has_include("jm_Wire.h")
#include <jm_Wire.h>
#else
#include <Wire.h>
#endif

It does the trick ... until you update the library !

Since I'm working on a device that needs utmost reliability, not just a gizmo that blinks, it's ok for me to warn people about the stupid problem, and tell them how to overcome it. Here's my explanation: https://github.com/ermtl/Open-Source-Ventilator/blob/master/OpenSourceVentilator/README.md

There is just a thing I would change in the library, and it's that you have to set jm_Wire.twi_readFrom_wait and jm_Wire.twi_writeTo_wait to true manually. If that was by default, there would be zero impact on the API, the only change would be that in case of electrical error / glitch, the automagic random self hanging of the controller would be gone. And if people want multi master (given the flaws in the current library, I bet not many successfully ever did that, it can barely work with a single sensor) they can add delays by setting twi_readFrom_timeout and twi_writeTo_timeout !


If, after 9 FUCKING YEARS, Arduino developers still refuse to understand the problem and its impact, I don't know what will wake them up. People tried asking, alerting, complaining, begging, and every time, the pull requests have been ignored, the bugs have been closed and replaced with new ones so that newcomers have no idea this is such an old problem, and that countless others before them also complained, here and elsewhere about it.

So yes, I'm rude and if cmaglie and others don't like it, they know where they can shove it up !

per1234 commented 4 years ago

Hi all. If you truly care about getting this issue fixed, please keep the discussion on topic and refrain from the use of profanity or insults. Adding pages of non-productive discussion will only make it more difficult for this to be fixed. If you only want to vent or chat about your projects, there are other places to do that. This is a place for us to be productive by working together to fix problems and make improvements.

Productive, original, on-topic input is always welcome.

If the discussion proceeds in an unproductive manner, the maintainers will be forced to lock this thread, which benefits nobody.

ermtl commented 4 years ago

@per1234: The problem is, 9 years of politely asking led to absolutely nowhere. All those people who took time to patiently, politely explain what the problem was, taking time to propose fixes were all (very politely) ignored or told to get lost and use something else.

Ain't that rude to behave this way for so long? Ain't that rude to close one after another all the bug reports, not by marking the newly added bug reports as duplicate, but by closing the older ones that contain complete threads about the issue in what appears like an attempt to make their presence a bit more difficult to find ? Ain't that rude to dismiss people who explain real issues with an unmotivated "Don't underestimate the complexity of this issue" that clearly means "you're dumb, shut up" without ever explaining what would be so problematic and how it would break anything ?

So you've been rude to the community and it lasted so long, either some people enjoyed it or there is a reason that's not been told, and now that I'm turning the tables on you, you play it like an offended virgin, but you reap what you sow.

"the maintainers will be forced to lock this thread, which benefits nobody." That would only be what ? the fifth time in a row ?

Now you say you want productive, on topic input, I just told you about the jm_Wire library, and I can't wait for a constructive answer. Unfortunately, I only expect a template answer along the lines of "Don't underestimate the complexity of this issue" from someone whose mind is made up before he has even looked at the library and who won't even take the time to motivate his rebuttal.

And if there's something I would really like, it's to be proven wrong.

me21 commented 4 years ago

What about an interim solution: since this bug affects so many people, it should be put to "Known bugs" section of Arduino framework so that it's easily noticed, along with recommendation to try jm_Wire library for now. Then Arduino developers can take as much time as they need to fix this bug properly.

greyltc commented 4 years ago

So, can anyone tell me why my PR arduino/Arduino#107 that I posted in this thread six months ago https://github.com/arduino/ArduinoCore-avr/issues/42#issuecomment-531779880 should not be merged to fix the issue today?

It's been reviewed in that PR and I've made the fixes/changes according to the comments there, but still it sits waiting to be merged. From my testing it fixes the problem and doesn't change the API, and I've never heard anyone say otherwise. I welcome more review for that PR if it needs it.

What more do I need to do to solve this?

greyltc commented 4 years ago

@ermtl out of curiosity, what arduino board were you using when you saw the bug? Have you taken scope traces of the bus when it locks up for you? I have and I'd be interested to compare mine to yours.

FYI I put mine here https://github.com/3devo/ArduinoLibraryWire/issues/1 a while ago.

matthijskooijman commented 4 years ago

I had been planning to review your PR for a long time, but I was constantly too busy with all the other stuff... Anyway, made some time now and reviewed. I think there's still a few things to improve or consider, though.

ermtl commented 4 years ago

@greyltc this was a Nano (ATmega328P), and the slave devices were a pair of MPU6050 (people are having a hard time with this chip!). Nice catch with your scope traces, it's a clear case where either the slave released too early or the master took control too late. Maybe a rare case of incompatibility between Standard, Fast and Fm+ modes; even the complete I2C specs don't clearly sort that out (or it escaped me): https://www.nxp.com/docs/en/user-guide/UM10204.pdf

My case gave similar results (Arduino randomly hanging) but for a different reason. The controller was driving 2 beefy brushed DC motors and the noise from the 2 BTS7960 H bridges was causing unavoidable glitches. Mitigating both issues would be an improvement, but any problem "solved" by hiding it is not truly solved. The I2C bus having a strong state and a weak state, you can mitigate timing/level/glitch violations but you can never prevent them completely. The only proper solution is to add a timeout, and when they decided to use I2C for power management, SMBus did just that, adding a 25 to 35 ms timeout to guarantee reliable operation: http://smbus.org/specs/ The jm_Wire library goes even further by allowing the user to set their own additional timeout if their particular situation requires it. I can't think of a simpler, more elegant, more complete, easier fix, but it probably won't happen any time soon.

As long as the problem is not fixed in the library named Wire, and I2C libraries keep referencing that name, the only solution I see is individual: manually editing the code in every file of every I2C library source that you use (and doing it again after each library update) as I explained above, or renaming the jm_Wire library to Wire and manually replacing the official, buggy Wire library with the modified version in ~/.arduino15/packages/arduino/hardware/avr/1.8.2/libraries/Wire/ (or wherever it's stored). However, that would not solve the issue for the community at large, and people who are not aware will keep being bitten by it.

A partial, painful solution would be to approach as many library developers as possible, creating pull requests so that they replace the #include statements in their code with the #if __has_include("jm_Wire.h") trick I explained above, as this would keep the regular behaviour unless the jm_Wire library is included by the user. However, for that mitigating strategy to at least partially work, the issue needs to be officially acknowledged as a "Known Bug" as @me21 suggested, but I bet they won't even do that as it would be admitting they were wrong from the start, and their attitude clearly shows an unwillingness to do so.

So as of now, and as I said in my first message, I consider it a WONTFIX. I did all I could, I solved the problem for me and gave a way for others to also bypass the problem, and I feel the pain of the countless folks who will suffer from the consequences of that bug. The whole thing is silly and shameful.

wmarkow commented 4 years ago

@matthijskooijman , @greyltc , I have put my comments in your conversation: https://github.com/arduino/ArduinoCore-avr/pull/107/files/3fc5fb88280789753eacc4e099ce1814f34c76d0

greyltc commented 4 years ago

Yep. I saw them, thank you. I'll address them tomorrow.

cmaglie commented 4 years ago

@cmaglie sure, @ermtl is rude, but his assessment is broadly accurate.

Nobody denied that the problem exists, nobody closed any issue as "wontfix", and, yes, this issue is 9 years old. There is a lack of clarity and leadership around this issue? yes, sure, this is a fact.

So, does this give you the right to come here and be rude? Absolutely not.

This is an open software community, social interactions are important.

I am willing to bet my house that someone is going to use Arduino to build a ventilator to deal with COVID-19, and it will hang and kill someone.

Even the Boeing 737 crashed due to hardware/software problems. I think that nobody would ever dream of running a ventilator on any hardware/firmware that has not been widely tested, whoever made it and whatever the hardware used.

@ermtl

Now you say you want productive, on topic input, I just told you about the jm_Wire library, and I can't wait for a constructive answer.

So what's your suggestion, to replace the Wire library with jm_Wire? I've done a quick review and I can see these potential problems:

As I said, there is no simple answer here, this is not about "two damn lines of code".

ermtl commented 4 years ago

@cmaglie I can't believe it ...

from the public API https://github.com/jmparatte/jm_Wire/blob/master/src/jm_Wire.h 

I don't see any function to set the timeout, so I assume that the timeout amount is hardcoded. This is not good as well explained by @matthijskooijman here

Except you're playing with words. There is no function, but there is a variable so it's not hardcoded. Actually, there's even 2 of them: twi_readFrom_timeout and twi_writeTo_timeout.

They are both initialized by twi_init and the declaration for them is in jm_twi.h and they are available for the user as shown in the example jm_LiquidCrystal_I2C_demo.ino that's with the library.

I won't show the code that would be required to initialize them with a function, it's so simple that would be insulting.

Please read the code before you try to lecture me. Ok ?

the setClock function has been disabled I think this is a consequence of the way timeout has been hardcoded, so there is no way from this library to set clock frequency.

Except it's not hardcoded, so tell me again, from a logical point of view, what are the consequences of a false assumption ? nice try. By the way, you saw the attempts at hardcoded delays in commented out code (line 234, 235 and 236 of jm_twi.c) but somehow, you failed to see twi_readFrom_timeout in the final version line 237 ... almost funny.

About the setClock function (more exactly twi_setFrequency), see, I did not take the bait ...

here's the code (line 145 of jm_twi.c):

void twi_setFrequency(uint32_t frequency)
#if 0
{
  TWBR = ((F_CPU / frequency) - 16) / 2;

  /* twi bit rate formula from atmega128 manual pg 204
  SCL Frequency = CPU Clock Frequency / (16 + (2 * TWBR))
  note: TWBR should be 10 or higher for master mode
  It is 72 for a 16mhz Wiring board with 100kHz TWI */
}
#else
{
}
#endif

Obviously the guy posted it without removing the #if 0 he used while debugging his changes, is this your best excuse for refusing to act on a 9 year old bug ?
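To make the bit rate formula in the quoted code concrete, here is a small worked example. It is a host-side sketch: `twbrFor()` and `sclFor()` are hypothetical helper names, and the calculation assumes the TWI prescaler is 1, as the quoted formula does.

```cpp
#include <cassert>
#include <cstdint>

// TWBR value from the formula quoted above:
//   SCL frequency = CPU clock frequency / (16 + 2 * TWBR)
uint32_t twbrFor(uint32_t cpuHz, uint32_t sclHz) {
    // Guard against division by zero and against the subtraction underflowing.
    assert(sclHz > 0 && cpuHz / sclHz >= 16);
    return ((cpuHz / sclHz) - 16) / 2;
}

// Inverse: the SCL frequency that a given TWBR value produces.
uint32_t sclFor(uint32_t cpuHz, uint32_t twbr) {
    return cpuHz / (16 + 2 * twbr);
}

// For a 16 MHz board:
//   twbrFor(16000000, 100000) == 72   (the value named in the quoted comment)
//   twbrFor(16000000, 400000) == 12
//   sclFor(16000000, 12)      == 400000
```

With the body compiled out by `#if 0`, a request for 400 kHz would leave TWBR untouched, so the bus would keep running at whatever rate the initialisation code set.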

the TWI device is not reset when a timeout happens. This may solve some kind of lockups but probably not all of them (there may be some cases where the TWI state machine may be unlocked only via reset of the TWI device).

You're making that up, Mr Maybe ! The TWI state machine doesn't need to be "unlocked" for it's not locked in any way. There is no need to reset anything, this is pure speculation. The problem is caused by glitches that desynchronize the master and the slave. The protocol clearly states that the start condition initiates a new transaction, and the fact that the previous one ended with a timeout does not change that.

As I said, there is no simple answer here, this is not about "two damn lines of code".

There is no simple answer, only plain and simple lies laid bare for all to see ... (did you really expect I would not actually read the code ?)

But I'm not surprised, except the reference to the 737 Max as a comparison for not fixing an obvious bug for 9 years (does that imply you consider it as complicated as an aircraft ?) that's exactly the kind of answer I expected. It's not formally a WONTFIX, except you use bogus excuses as a reason for refusing to fix it, I guess that's entirely different ...

cmaglie commented 4 years ago

Except you're playing with words. There is no function, but there is a variable so it's not hardcoded. Actually, there's even 2 of them: twi_readFrom_timeout twi_writeTo_timeout I won't show the code that would be required to initialize them with a function, it's so simple that would be insulting.

AFAICS twi_writeTo_timeout is never used.

By the way, you saw the attempts at hardcoded delays in commented out code (line 234, 235 and 236 of jm_twi.c) but somehow, you failed to see twi_readFrom_timeout in the final version line 237 ... almost funny.

I was referring to this: https://github.com/jmparatte/jm_Wire/blob/master/src/utility/jm_twi.c#L338

Except it's not hardcoded, so tell me again, from a logical point of view, what are the consequences of a false assumption ? nice try.

So does setClock work? Because I see that twi_setFrequency (and consequently setClock) is disabled. Maybe I'm looking at the wrong place.

You're making that up, Mr Maybe ! The TWI state machine doesn't need to be "unlocked" for it's not locked in any way. There is no need to reset anything, this is pure speculation

Really? so @greyltc @matthijskooijman maybe you are wasting your time on arduino/Arduino#107 trying to reset the TWI device, @ermtl said that.

matthijskooijman commented 4 years ago

@ermtl, Thanks for bringing some attention back to this issue. It is really something we should fix, even though it might not be as trivial as you make it seem. However, I do not quite like your accusatory tone and I think you are not really making any constructive contribution to this discussion and are intent on seeing bad intent where I just see an overloaded developer team that can only invest their time once. So, I will probably stop responding to your comments after this post.

You're making that up, Mr Maybe ! the TWI state machine doesn't need to be "unlocked" for it's not locked in any way. There is no need to reset anything, this is pure speculation. The problem is caused by glitches that desynchronize the master and the slave. The protocol clearly states that the start condition initiates a new transaction, and the fact that the previous one ended with a timeout does not change that.

I've seen in practice that when there is noise, it can look like an arbitration failure or start condition happens, which makes the TWI hardware think there is another master on the bus that is currently in a transaction. This makes the TWI hardware wait for a stop condition from that other master, which of course never happens when there is no such master. This "Waiting" condition is, AFAICT, completely undetectable to the sketch (except using a timeout, since the hardware will simply delay all operations) and the only way I have found to resolve it is to reset the hardware.
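A minimal sketch of the "detect by timeout, recover by reset" idea described above. It is not the core's implementation: `FakeBus`, `transferWithRecovery` and `TwiResult` are hypothetical names, and the bus is simulated so the logic can be followed without hardware. On a real AVR the `reset()` step would correspond to disabling and re-enabling the TWI peripheral.

```cpp
#include <stdint.h>

enum class TwiResult : uint8_t { Ok, Timeout };

// Runs one transfer with a bounded number of polls. Since no status bit
// reveals the "waiting for another master" state, an exhausted poll budget
// is the only evidence, and the bus is reset before reporting the timeout.
template <typename Bus>
TwiResult transferWithRecovery(Bus &bus, uint32_t max_polls) {
  bus.start();
  for (uint32_t i = 0; i < max_polls; ++i) {
    if (bus.done()) {
      return TwiResult::Ok;
    }
  }
  bus.reset();  // hypothetical hook: disable and re-enable the peripheral
  return TwiResult::Timeout;
}

// Simulated peripheral (assumption for illustration only).
struct FakeBus {
  bool stuck;      // true while the hardware waits for a stop condition
  int resets;      // number of times reset() was called
  explicit FakeBus(bool s) : stuck(s), resets(0) {}
  void start() {}
  bool done() const { return !stuck; }
  void reset() { stuck = false; ++resets; }
};
```

With a stuck `FakeBus`, the first call returns `TwiResult::Timeout` after resetting the bus once, and the next call completes normally.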

If the jm_Wire version does not do the reset, just detects timeouts and this does solve problems for them, then there might be another cause for timeouts that is different from the "Waiting for another master" condition that I have observed. If anyone is observing this, it might be useful to figure out exactly what it is, though I guess a timeout with reset would fix it just as well as a timeout without reset would, so the fix will probably stay the same.

In any case, from what @cmaglie said about this jm_Wire library, it does not look like it's in any shape to be included as-is. In that sense, I would suggest that any further time spent on this issue is either further diagnosing the underlying causes (as suggested in the previous paragraph), or getting arduino/Arduino#107 into shape (which I think is the most promising path forward).

ermtl commented 4 years ago

The whole thing just does not make sense.

Every communication protocol must be able to withstand errors, detect them and then keep going. This is especially true of a protocol such as I2C that relies on pullups to create the high transmission level as this is vulnerable to electrical glitches.

When an error occurs, the worst course of action is to enter a deadlock, because then, the processor is unable to do anything until it's reset, not even activate an alarm. I can't think of a single case where hanging the processor is the best course of action, yet that's what happens on the first glitch.

If there are complex issues with multi master configurations, open a bug for it, it's a different problem.

Electrical glitches can and do happen at any time and they are unpredictable and random by nature. That means it's totally impossible to predict what the consequences will be on the current transmission. If the glitch affects SCL, the master and the slave won't even agree about where they are in the transmission process. You can't correct such errors or wait for them to correct themselves, it's hopeless. That's why the I2C state machine implements an electrically different situation for start and stop conditions. Upon receipt of a start condition, the state machine is reset, whatever happened in the last transmission is discarded and a new transmission begins.

When it happens, all you can do is to give up the current transmission, let it go in a timeout (and it's much better if the value of that timeout can be modified by the user) and give it an error code so that the user can decide what is the best course of action. For a sensor, the user's decision will probably be to make a new measurement. If the application is critical, the user can have an error counter and sound an alarm if there are too many. That's how you solve a communication error, but you can do none of that if the processor hangs.
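The error-counter idea can be sketched as follows. `ErrorBudget` is a hypothetical helper, not part of any Arduino library; it only assumes the sketch learns, per transaction, whether the transfer succeeded.

```cpp
#include <stdint.h>

// Counts consecutive failed transactions and signals when a limit is reached.
class ErrorBudget {
 public:
  explicit ErrorBudget(uint8_t max_consecutive)
      : max_(max_consecutive), count_(0) {}

  // Record one transaction outcome. Returns true when the alarm should fire.
  bool record(bool ok) {
    if (ok) {
      count_ = 0;  // any success clears the streak
      return false;
    }
    if (count_ < max_) {
      ++count_;  // saturate instead of overflowing
    }
    return count_ >= max_;
  }

 private:
  uint8_t max_;
  uint8_t count_;
};
```

A sketch would call `record()` after every measurement attempt, retry on failure, and sound its alarm once `record()` returns true.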

All I'm asking in here and what everybody has been asking for 9 years is the ability to get out of the deadlock by adding a simple timeout to lines 236/237 of twi.c. Why find so many excuses to refuse making this simple modification ? It can't have any adverse effect. You can even use -1 as a magic value to make the timeout infinite thus preserving the current behaviour if you think it makes sense.
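A hedged sketch of such a bounded wait, with a magic value that keeps the old infinite behaviour. The names `waitWithTimeout` and `kWaitForever` are made up for illustration, and the clock is passed in as a callable standing in for something like `micros()` so the loop can be exercised off-target.

```cpp
#include <stdint.h>

// Hypothetical sentinel: the unsigned equivalent of -1 means "wait forever".
constexpr uint32_t kWaitForever = UINT32_MAX;

// Spins while busy() returns true. Returns true if the operation finished,
// false if more than timeout_us elapsed first. The unsigned subtraction
// keeps the elapsed-time check correct when the clock wraps around.
template <typename Busy, typename Clock>
bool waitWithTimeout(Busy busy, Clock now_us, uint32_t timeout_us) {
  const uint32_t start = now_us();
  while (busy()) {
    if (timeout_us != kWaitForever &&
        static_cast<uint32_t>(now_us() - start) > timeout_us) {
      return false;  // give up; the caller reports an error code
    }
  }
  return true;
}
```

Passing `kWaitForever` reproduces the current blocking behaviour, so the default can stay unchanged while sketches that care can opt in to a finite value.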

What if it does not solve every possible I2C problem ? open a bug for those problems, at least the most common, the most severe one will be solved.

I would prefer the delay to be what the SMbus association defined (25 to 35ms) by default, but even if you want to make it infinite by default so that it keeps hanging as it does now, so be it, but at least give us the option to get out of the deadlock.

sauttefk commented 4 years ago

I totally support @ermtl's latest post. One of the patterns of software design is "graceful error handling", so I²C should have a timeout to enable this gracefulness. As @ermtl has proposed, we could also implement a configurable timeout and even keep the old deadlock behaviour.

matthijskooijman commented 4 years ago

@sauttefk, I believe nobody is opposing anything said in that last post at all, it has only lacked sufficient manpower and focus to actually implement things properly so far. A configurable timeout is exactly what is needed (and has possibly obstructed earlier attempts at getting this fixed), but also what is being implemented in arduino/Arduino#107 right now.

VladimirAkopyan commented 4 years ago

I believe the first step in addressing this is to put a big warning in official documentation and code comments. The statement about testing by @cmaglie is problematic: sometimes all you have is an Arduino, a serious problem that needs solving, and no actual experts within a ~500-mile radius. Like when I lived in a small town in Russia. Of course chances are you will get something wrong, but the framework shouldn't come with deadly hidden traps.

As an example: I was working on a hydroponic system, and it had some unreliable sensors that would fail periodically and need to be replaced. Their cost is low, they are not critical, and it would all be fine except this bug caused the system to freeze.

If it were to happen while the system was pumping acid or base solution into the main tank, the system would keep pumping indefinitely, until the whole tank is so acidic it will kill all the plants and result in £250k of financial losses.

Let's document it so people can at least know about the issue, instead of tearing their hair out or, worse yet, suffering harm.

Hyperdimensionals commented 4 years ago

In case this is useful testing data, jm_Wire.h fixed my problem mentioned above with my DS3231SN real time clock module, and I can now fully disconnect it from my Uno R3 without it blocking.

At the least, having Wire.h function as it does now but putting the timeout in as an option for those who need to use it seems like a no-brainer to me - at least until the fix can be thoroughly tested.

matthijskooijman commented 4 years ago

@Hyperdimensionals I'm interested in your case and want to know a bit more. To prevent cluttering this issue, I've opened arduino/Arduino#326 for further discussion about your issue (and other cases where the jm_Wire library helped even though it does not reset the hardware).

I believe the first step in addressing this is to put a big warning in official documentation and code comments.

This might be a good step. I welcome anyone who wants to move this forward to open a new issue (to keep discussion separated) and put down a clear proposal: What text should be added and where, with a PR if possible (IIRC there is a git repo for the arduino reference docs).

per1234 commented 4 years ago

IIRC there is a git repo for the arduino reference docs

That repo is only for the Arduino Language Reference content. The library reference content is not hosted in a git repo (or at least not a public one, I don't actually know where it is), so it's not possible to propose changes to it via a PR.

I guess the best way to suggest changes to the Wire library documentation is by providing a clear explanation of the proposed change in an issue. Since the Wire library documentation is for all versions of the Wire library, rather than only for the Arduino AVR Boards platform's Wire library, the arduino/Arduino repo's issue tracker would be the most appropriate location for this issue, since that's our catch-all for issues that don't fit perfectly with any of the other issue trackers.

greyltc commented 4 years ago

Shall we close this?

matthijskooijman commented 4 years ago

I think this is not done yet: Examples and documentation still need to be updated. And ideally, a PR that enables timeouts by default would be prepared to make sure this is not forgotten (and is easy to merge when it is remembered later).

matthijskooijman commented 4 years ago

arduino/Arduino#356 is a start at updating the examples, btw (but more is needed, see my comments there).

bperrybap commented 4 years ago

@ermtl I think the problem is more complex and bigger than many people are appreciating. Yes, the Wire code should not lock up; however, even when a timeout is added and even if there are options to fully reset the i2c bus, it cannot fully solve the issues created by the i2c signal corruption. This is because the root cause is a h/w issue that corrupts the signals to look like legitimate i2c signals/states, which can do strange and arbitrary things to the slave, or even to another slave should the address be corrupted to the address of another slave.

IMO, the Wire timeout stuff is a bit of a band-aid since the real issue is a h/w issue causing clock or data signal issues. Keeping the Wire code from locking up is definitely a good thing, but if there are errors on the i2c bus, things are not working correctly. Addresses and data are being corrupted and/or transactions/transfers are being lost. Eventually there can be an issue on the slave. For example, if a transaction were to be lost or corrupted on a PCF8574 based LCD backpack, the host and the LCD could become out of nibble sync and the LCD display would quickly turn to garbage and never recover until re-initialized or power cycled, and yet the sketch would likely never know since most libraries that use Wire do not bother to check return status.

Ironically, with the timeout, things will keep running but may start to misbehave, because when there are h/w issues on the bus serious enough to trigger the timeout, there are likely many other "silent" errors also occurring that are not being detected at all. I have seen cases where prior to a lockup, the i2c address was corrupted (Wire returned an error if there was no slave at the corrupted i2c address), but also data corruption where incorrect data was successfully written to the slave. If the i2c address is corrupted to an address of another slave, then data would "successfully" be written to the incorrect slave.

This isn't an attempt to justify not properly handling the timeout, but one good thing about the lockup is it immediately points out a h/w issue with the i2c bus.

If you step back and look at this from a larger picture, adding/fixing timeouts in the Wire code can neither solve nor detect all the issues caused by misbehaving i2c signals. Yes, timeouts can prevent the processor from locking up. But there is no way to detect or prevent data from being written to the wrong slave in the case of the slave address being corrupted, and s/w can't detect or prevent incorrect/corrupted data from being written to the correct slave.

This is a h/w issue that the Wire library s/w can sometimes detect through a timeout. However, from a very high level perspective, when you have i2c signal corruption happening, which is really worse? A lockup, which is immediately visible or silently corrupting slaves and or the slave's Arduino library state/status?

matthijskooijman commented 3 years ago

@bperrybap, I agree that there are often underlying hardware issues to solve, but in a lot of cases this is not entirely feasible and/or having recovery from timeouts is sufficient (sometimes, not always). Also, by using timeouts (which are reported back to the sketch), a sketch can take additional action if needed (e.g. powercycle a slave, use a general call reset, or whatever mechanism it has to get things into shape again).

However, from a very high level perspective, when you have i2c signal corruption happening, which is really worse? A lockup, which is immediately visible or silently corrupting slaves and or the slave's Arduino library state/status?

There is sense in this, but in practice people would just see an Arduino that locks up, without any indication that the I2C bus is the culprit. In an ideal world (maybe after examples are updated and sketches in the wild too) sketches would always check the result of I2C transactions and print info about failures. But even if not, in your example of a display: If the sketch keeps running, but the display stops updating, that would be a stronger indication of I2C failure than a full lockup.

So, considering things, I still believe that enabling timeouts by default is a good idea. Unless that's not what you were arguing against?

bperrybap commented 3 years ago

My main comment on this is that no matter which way you go, a lockup or an attempted timeout, the i2c system and all the slaves are pretty much stuffed whenever there are i2c h/w issues, since you can't detect many of the issues and it is unpredictable what is going to happen once you start sending corrupted data to a slave or sending corrupted data to the wrong slaves.

So, considering things, I still believe that enabling timeouts by default is a good idea. Unless that's not what you were arguing against?

I could go either way on if timeouts are enabled by default. One line of thought is that if the low level i2c code is going to have code that attempts to recover from bad h/w, then it seems reasonable to enable it by default. The reason being, is that if the i2c code is "good enough" to recover from these h/w issues and not lockup, why would you ever not want it to be enabled by default?

The other line of thought is that when there are i2c h/w issues it is unpredictable what is going to happen once you start sending corrupted data to a slave or even corrupted data to the wrong slaves. Code that wants to try to recover from these i2c errors will need to basically reset and start over to ensure full operation is restored. So you might as well just lock up and let a watchdog timer trigger to reset everything.

A bigger concern for me is portability and maintenance for the new Wire API timeout functions, given the way it was implemented. The AVR Wire library was updated to extend the existing Wire API. These API extensions are only in the AVR platform Wire library. AND.... only in IDE versions going forward. This is problematic in that many libraries that use the Wire library need to run on many different versions of the IDE and also run on more than just the AVR platform. The update for this timeout should have created a macro like WIRE_HAS_TIMEOUT to indicate the existence of the new Wire API timeout functions. (Same as what was done when end() was introduced to the Wire API by adding the macro WIRE_HAS_END.) This offers a portable way for code that uses the Wire library to know if these new timeout Wire API functions exist. "As is", it creates a never ending maintenance issue for code that uses the Wire library, since it will have to have specific conditionals for IDE versions and h/w platforms in order to help "guess" if the API functions exist.
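A sketch of how a library could use such a feature macro, assuming `WIRE_HAS_TIMEOUT` is defined by Wire implementations that provide the timeout API and that `setWireTimeout()` takes a timeout in microseconds plus a reset-on-timeout flag. The helper name `enableWireTimeoutIfAvailable` is hypothetical, and the Wire object is a template parameter only so the snippet also compiles without Arduino headers.

```cpp
#include <stdint.h>

// Tries to enable a Wire timeout. Returns false where the API is missing,
// so callers need no conditionals on IDE version or hardware platform.
template <typename WireT>
bool enableWireTimeoutIfAvailable(WireT &wire, uint32_t timeout_us) {
#if defined(WIRE_HAS_TIMEOUT)
  // Assumed signature: setWireTimeout(timeout_us, reset_on_timeout).
  wire.setWireTimeout(timeout_us, true);
  return true;
#else
  (void)wire;
  (void)timeout_us;
  return false;  // older core or other platform: timeouts unavailable
#endif
}
```

A library would call `enableWireTimeoutIfAvailable(Wire, 25000)` in its `begin()` and fall back to its previous behaviour when the function returns false.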

matthijskooijman commented 3 years ago

The update for this timeout should have created a macro like WIRE_HAS_TIMEOUT

Yes, I completely agree there. So since it is really trivial, I just bit the bullet and created a PR to add just that. Thanks for the reminder :-)

matthijskooijman commented 3 years ago

And ideally, a PR that enables timeouts by default would be prepared to make sure this is not forgotten (and is easy to merge when it is remembered later).

Done: arduino/Arduino#363. Comments on the default values used are welcome.

matthijskooijman commented 3 years ago

I've written a proposal for reference documentation at arduino/reference-en#895, feedback welcome.

Aleev2007 commented 3 years ago

You made up a problem out of the blue. To keep everything working as before without slowing it down, one line of code in the library is enough. ))) Enjoy.

***
word timeout = 0xFFFF;  // this is just for example
bool timeout_Flag = false;
***
// wait for read operation to complete
while (TWI_MRX == twi_state) {
  if (--timeout == 0) { timeout_Flag = true; break; }  // One line of code :))
}

greyltc commented 3 years ago

@Aleev2007 Hey, cool, that's pretty much what we did to fix this six months ago in deea9293201dbab724b6b0519c35ddba3e6b92d9 except we caught all the other places the state machine can lock up too, made the timeout optional and configurable in engineering units, and added provisions for the user to run arbitrary cleanup code when the bus times out!

Unfortunately we didn't manage to fit that into one line though :-/