Closed carpelux closed 4 years ago
The only problem I see is the long delay due to the 9600 baud serial clock you're using. Boost that to 115.2K and your exceptions may magically disappear. I suspect you're spending too much time with those two slow print statements, and it's crashing because of that. Personally, I'd set a flag in the callback after I'd received the data, and then use the flag to print it out from inside loop() or some other less-critical function. Asynchronous callbacks have looser restrictions that ISRs do, but that depends on WiFi traffic. When you're in the middle of an OTA download, that's a heck of a lot of traffic. BOOM!
edit: even worse is the 12KHz I2C clock, as that's spending a long time inside the Slave section. I don't see a simple way around that in your setup, as it probably won't run above 50KHz. I can't tell from your first post, but I'd make sure the slave side was compiled for 160MHz CPU, as well. The slave code is very time-critical, and even receiving one byte looks like it'd take around 41-83 microseconds inside the onSclChange ISR with each SCL edge. I haven't timed that section yet, so the microseconds are a guess since I see a 'clock stretch' in there.
If it's caused merely because the ISR in the slave code is fired, then maybe the OTA code should disable the other interrupts while critical sections are in process. It's better to lose an I2C transaction than your update payload. The OTA update ends with a reset, so you'd lose it in any event. I don't know if there's a way to un-hook the I2C interrupts from the sketch side once you've recognized an OTA is in process.
The Serial.prints are there only because they are so in the example code I used to create the MVCE. The speed is 9600 as I didnt bother to figure out how to set the Mac terminal serial speed so I could capture the stack trace with a screen command.
in the "Production" code I have the same crash problem and there I only assigns a value to a volatile declared variable in the receiveEvent.
I just tried in the slave code to remove all Serial.print and compile it for 160MHz but the result was the same, crash, so it seems like it is like You print in the last sentence. As You also write it's no problems losing the i2c communication during a OTA as it will reset anyway.
It has been working with 2.5.x so the question is if the problem nows lies in changes in the i2c or the OTA code. Anyway, The i2c seems a bit more stable now. I have a reset in the "production" code that reboots the slave when loosing connection and when on 2.5.x it was rebooting on average once a day, now with 2.6.x i would say it's on average once a week so thats a big step forward.
If it was possible in the OTA to disable other interrupts or maybe have some way to do something like wire.end() I suppose this problem could be solved. Unfortuneately I'm confident enough to write that code.
Edit: typos
@carpelux You're confident enough, as you're already tweaking the I2C core file. :-) I don't know if the following will work, but I suspect it may. Let's try disabling the slave inside the OTA handler. We'll do it the fast and dirty way. In ArduinoOTA.cpp insert the following just before the _runUpdate(); at line 364:
detachInterrupt(4);
detachInterrupt(5);
Here's what the finished insert looks like, in case the line numbers changed between 2.6.2 and the current git:
//this needs to be called in the loop()
void ArduinoOTAClass::handle() {
if (_state == OTA_RUNUPDATE) {
detachInterrupt(4);
detachInterrupt(5);
_runUpdate();
_state = OTA_IDLE;
}
That would be cleaner if you extern'd two variables that match your SCL and SDA pins, but it's a fast test with only one file to tweak. If the OTA fails, you'll need to re-init the I2C slave routine with Wire.begin(SDA_PIN, SCL_PIN, I2C_SLAVE); again as the interrupts are now toast.
Thanks a lot. A quick test showed that it works fine now. I will need to do some more testing in the production code but that needs to wait for a week as I'm on the way to a trip right now.
I will get back with the result as soon I'm back, but it looks really promising.
I made a mistake in the insert above, it should have been detachInterrupt(digitalPinToInterrupt(interruptPin));, so:
detachInterrupt(digitalPinToInterrupt(4));
detachInterrupt(digitalPinToInterrupt(5));
I'm surprised it worked with the original syntax. Chalk it up to a newbie mistake and using Google to find examples. ;-) Google finds a LOT of bad examples.
A cleaner solution would be to replace in your slave code:
extern const int SDA_PIN = 4;
extern const int SCL_PIN = 5;
//#define SDA_PIN 4
//#define SCL_PIN 5
and then change the tweak in ArduinoOTA.cpp (in case you change the port pins):
detachInterrupt(digitalPinToInterrupt(SDA_PIN));
detachInterrupt(digitalPinToInterrupt(SCL_PIN));
I checked out the slave code already and it compiles and runs, just about to check your whole test case.
By the way, the master side can be compiled at 80MHz CPU, it's the slave side that needs 160MHz.
There's a better fix in the works so that you won't have to work around the bug by disabling interrupts. The fix will be out Real Soon Now. I've been testing it most of today with a modified version of your test code (running at 50KHz bus, and packets every 5ms for stress.) I also added a master-transmitter into your slave code so I could beat it up a little more severely. No hiccoughs in the OTA upload with interrupts enabled, although the OTA does frequently trash the I2C slave reception while OTA packets are being received as WiFi is a higher-priority interrupt.
That simpler solution just above worked, but may have blown up in your face down the road. There's a more definitive fix (PR #6898) in place which should be out for the 2.6.3 release. Since there were so many small changes to the code, I want to re-run a bunch of tests this weekend before blessing it.
After a week away I got back home late at night yesterday. It is impressive work that has been done while I was away.
I downloaded the core_esp8266_si2c.cpp from the PR and set it to work in my production code. It has been running fine now for about 10 hours, including a few OTA updates. I will let it run but as far as I can see now my problem with OTA is solved.
A big thanks for all effort @Tech-TX and to all other involved in this great project!
It's definitely a group effort, as I suck at C++ and I'm only barely functional at C. All of the heavy lifting has been done by the other two gentlemen. I'm doing the grunt work (it's what I'm good at) and trying to break things. :innocent: I'm pretty good at breaking things.
edit: which PR are you using? The first one worked, but may still have issues down the road if the compiler optimizes incorrectly. #6898 is the one that's going into the release code, as I haven't been able to break it yet. I should finish my testing within a couple more hours.
Yes, It's #6898 is the one that I'm using and it still works perfect so I am pretty sure my problem is solved.
Once more, big thanks to You all!
Excellent! For my stress test (where I showed the logic analyzer capture above) I was hitting the bus pretty heavily, and I did around 100 OTA uploads with no errors on this latest PR. Before the fix OTA uploads always died at around 5 to 10% completion since parts of the slave code were in Flash. If your slave is controlling something delicate, I'd verify each received slave packet with a CRC or something so that you don't send your (whatever) off the rails when the I2C packet(s) get mangled during OTA updates.
Closing via #6898
@carpelux you might want to play with the latest git from today with the new core_esp8266_si2c.cpp file. We've just tweaked the clock generation to go smoothly down below where you're running it. You won't need to jiggle the core file to get a 12KHz bus clock any longer. :smile: It's not perfect, but it's within 1% of the programmed bus frequency down at the range you're using it. It gets more choppy on frequency accuracy above 100KHz because we're bit-banging the clock from a loop timer, but it's about 15 times more accurate than it was previously at the high end.
For your program just enter your desired speed in Wire.setClock(freq) after Wire.begin. 12000 should mimic what you were doing before.
The 160MHz compile recommendation for the slave hasn't changed, as it's limited by how fast we can respond to interrupts. With a slow bus clock (below ~36KHz) you can compile the slave for 80MHz CPU.
Basic Infos
Platform
Settings in IDE
Problem Description
I am interfacing a ESP8266 as a slave trough I2C with a old legacy system as master in order to read parameters from the system. It is a slow I2C communication, 10 KHz and I have been running with the ESP8266 for about one year with core versions starting from 2.5.0 an up. It has been working resonable good, with some glitches in the communication where it was hanging once in a while. I solved that hanging by simply reboot.
After 2.6.0 the communication seems to be more stable but there is a new issue. When I start to do a OTA update the ESP chrashes as soon there is some data on the SDA line.
In order to do debugging I set up two wemos D1 mini's. They were connected to each other like this:
GPIO 4 (pin D2) <----> GPIO 4 (pin D2) GPIO 5 (pin D1) <----> GPIO 5 (pin D1) Ground <----> Ground
I tried to use the ESP826 Wire examples: master_writer and slave_receiver combined with BasicOTA but couldn't get them running properly. However, If I used 160MHz clock it worked. I then did a change in core_esp8266_si2c.cpp to the Twi::setClock function adding some longer delays and thereby managed to get the slowest clockspeed down to about 12KHz in order to mimic the behaviour of the legacy system. After that I also could run the sketches on 80MHz.
The line 108 in core_esp8266_si2c.cpp that is causing Exception 0: Illegal instruction says:
I hope I supplied all relevant inf and that it wasn't to long-winded.
MCVE Sketch
master_writer.ino :
slave_receiver.ino
core_esp8266_si2c.cpp
Debug Messages
Decoded stack trace: