esp8266 / Arduino

ESP8266 core for Arduino
GNU Lesser General Public License v2.1
16.07k stars 13.33k forks source link

Using i2c slave and OTA together makes ESP8266 crash. #6875

Closed carpelux closed 4 years ago

carpelux commented 4 years ago

Basic Infos

Platform

Settings in IDE

Problem Description

I am interfacing a ESP8266 as a slave trough I2C with a old legacy system as master in order to read parameters from the system. It is a slow I2C communication, 10 KHz and I have been running with the ESP8266 for about one year with core versions starting from 2.5.0 an up. It has been working resonable good, with some glitches in the communication where it was hanging once in a while. I solved that hanging by simply reboot.

After 2.6.0 the communication seems to be more stable but there is a new issue. When I start to do a OTA update the ESP chrashes as soon there is some data on the SDA line.

In order to do debugging I set up two wemos D1 mini's. They were connected to each other like this:

GPIO 4 (pin D2) <----> GPIO 4 (pin D2) GPIO 5 (pin D1) <----> GPIO 5 (pin D1) Ground <----> Ground

I tried to use the ESP826 Wire examples: master_writer and slave_receiver combined with BasicOTA but couldn't get them running properly. However, If I used 160MHz clock it worked. I then did a change in core_esp8266_si2c.cpp to the Twi::setClock function adding some longer delays and thereby managed to get the slowest clockspeed down to about 12KHz in order to mimic the behaviour of the legacy system. After that I also could run the sketches on 80MHz.

The line 108 in core_esp8266_si2c.cpp that is causing Exception 0: Illegal instruction says:

 return (GPI & (1 << twi_sda)) != 0;

I hope I supplied all relevant inf and that it wasn't to long-winded.

MCVE Sketch

master_writer.ino :

#include <Wire.h>
#include <PolledTimeout.h>
#define SDA_PIN 4
#define SCL_PIN 5
const int16_t I2C_MASTER = 0x42;
const int16_t I2C_SLAVE = 0x08;

void setup() {
  Serial.begin(74880);           // start serial for output
  Serial.println();
  Serial.println("Starting...");
  Wire.begin(SDA_PIN, SCL_PIN, I2C_MASTER); // join i2c bus (address optional for master)
  Wire.setClock(10000);
}

byte x = 0;

void loop() {
  using periodic = esp8266::polledTimeout::periodicMs;
  static periodic nextPing(1000);

  if (nextPing) {
    Wire.beginTransmission(I2C_SLAVE); // transmit to device #8
    Wire.write("x is ");        // sends five bytes
    Wire.write(x);              // sends one byte
    Wire.endTransmission();    // stop transmitting

    Serial.println (x);

    x++;
  }
}

slave_receiver.ino

#include <ESP8266WiFi.h>
#include <ESP8266mDNS.h>
#include <WiFiUdp.h>
#include <ArduinoOTA.h>

#ifndef STASSID
#define STASSID "your-ssid"
#define STAPSK  "your-password"
#endif

const char* ssid = STASSID;
const char* password = STAPSK;

#include <Wire.h>
#define SDA_PIN 4
#define SCL_PIN 5
const int16_t I2C_MASTER = 0x42;
const int16_t I2C_SLAVE = 0x08;

void setup() {
  Serial.begin(9600);           // start serial for output
  Serial.println();

  Serial.println("Booting and connecting to WiFi");
  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  while (WiFi.waitForConnectResult() != WL_CONNECTED) {
    Serial.println("Connection Failed! Rebooting...");
    delay(5000);
    ESP.restart();
  }

  // ================ i2c =====================

  Wire.begin(SDA_PIN, SCL_PIN, I2C_SLAVE); // new syntax: join i2c bus (address required for slave)
  Wire.onReceive(receiveEvent); // register event

  // =============== OTA =====================
  ArduinoOTA.setHostname("i2cproblemdemo");
  ArduinoOTA.onStart([]() {
    String type;
    if (ArduinoOTA.getCommand() == U_FLASH) {
      type = "sketch";
    } else { // U_FS
      type = "filesystem";
    }
    // NOTE: if updating FS this would be the place to unmount FS using FS.end()
    Serial.println("Start updating " + type);
  });
  ArduinoOTA.onEnd([]() {
    Serial.println("\nEnd");
  });
  ArduinoOTA.onProgress([](unsigned int progress, unsigned int total) {
    Serial.printf("Progress: %u%%\r", (progress / (total / 100)));
  });
  ArduinoOTA.onError([](ota_error_t error) {
    Serial.printf("Error[%u]: ", error);
    if (error == OTA_AUTH_ERROR) {
      Serial.println("Auth Failed");
    } else if (error == OTA_BEGIN_ERROR) {
      Serial.println("Begin Failed");
    } else if (error == OTA_CONNECT_ERROR) {
      Serial.println("Connect Failed");
    } else if (error == OTA_RECEIVE_ERROR) {
      Serial.println("Receive Failed");
    } else if (error == OTA_END_ERROR) {
      Serial.println("End Failed");
    }
  });
  ArduinoOTA.begin();
  Serial.println("Ready");
  Serial.print("IP address: ");
  Serial.println(WiFi.localIP());
}

void loop() {
  ArduinoOTA.handle();
}

void receiveEvent(size_t howMany) {
  (void) howMany;
  while (1 < Wire.available()) { // loop through all but the last
    char c = Wire.read(); // receive byte as a character
    Serial.print(c);         // print the character
  }
  int x = Wire.read();    // receive byte as an integer
  Serial.println(x);         // print the integer
}

core_esp8266_si2c.cpp

void Twi::setClock(unsigned int freq)
{   
    preferred_si2c_clock = freq;
#if F_CPU == FCPU80
    if (freq <= 10000)
    {   
        twi_dcount = 200;  //about 12KHz
    }
    else if (freq <= 50000)
    {   
        twi_dcount = 38;  //about 50KHz
    }
    else if (freq <= 100000)
    {   
        twi_dcount = 19;  //about 100KHz
    }
    else if (freq <= 200000)
    {   
        twi_dcount = 8;  //about 200KHz
    }
    else if (freq <= 300000)
    {   
        twi_dcount = 3;  //about 300KHz
    }
    else if (freq <= 400000)
    {   
        twi_dcount = 1;  //about 400KHz
    }
    else
    {   
        twi_dcount = 1;  //about 400KHz
    }
#else
    if (freq <= 10000)
    {   
        twi_dcount = 254;  //about 12KHz
    }
    else if (freq <= 50000)
    {   
        twi_dcount = 64;  //about 50KHz
    }
    else if (freq <= 100000)
    {   
        twi_dcount = 32;  //about 100KHz
    }
    else if (freq <= 200000)
    {   
        twi_dcount = 14;  //about 200KHz
    }
    else if (freq <= 300000)
    {   
        twi_dcount = 8;  //about 300KHz
    }
    else if (freq <= 400000)
    {   
        twi_dcount = 5;  //about 400KHz
    }
    else if (freq <= 500000)
    {   
        twi_dcount = 3;  //about 500KHz
    }
    else if (freq <= 600000)
    {   
        twi_dcount = 2;  //about 600KHz
    }
    else
    {   
        twi_dcount = 1;  //about 700KHz
    }
#endif
}

Debug Messages

Decoded stack trace:

Exception 0: Illegal instruction
PC: 0x40206134: Twi::SDA_READ() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_si2c.cpp line 108
EXCVADDR: 0x00000000

Decoding stack results
0x40100b74: interrupt_handler(void*) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/interrupts.h line 29
0x4010022c: ets_post(uint8, ETSSignal, ETSParam) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_main.cpp line 160
0x40100ab0: interrupt_handler(void*) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_wiring_digital.cpp line 135
0x40100ff4: umm_free_core(void*) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/umm_malloc/umm_malloc.cpp line 316
0x4021e12a: ip4_output_if at core/ipv4/ip4.c line 1550
0x4021ecc3: ip_chksum_pseudo at core/inet_chksum.c line 395
0x40219673: tcp_output_control_segment at core/tcp_out.c line 1956
0x40219569: tcp_output_alloc_header at core/tcp_out.c line 1863
0x40219ebc: tcp_send_empty_ack at core/tcp_out.c line 2057
0x40219faa: tcp_output at core/tcp_out.c line 1319
0x4020462d: Print::write(char const*) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/Print.h line 60
0x4020411e: EspClass::flashEraseSector(unsigned int) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/Esp.cpp line 585
0x40205ba6: __yield() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_main.cpp line 100
0x40204e37: UpdaterClass::_writeBuffer() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/Updater.cpp line 323
0x40205b77: __esp_yield() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_main.cpp line 107
0x402064c1: __delay(unsigned long) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_wiring.cpp line 57
0x40203b52: ArduinoOTAClass::_runUpdate() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/Updater.h line 154
0x402089e8: esp8266::MDNSImplementation::MDNSResponder::_updateProbeStatus() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/libraries/ESP8266mDNS/src/LEAmDNS_Control.cpp line 1321
0x402094f9: esp8266::MDNSImplementation::MDNSResponder::_process(bool) at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/libraries/ESP8266mDNS/src/LEAmDNS_Control.cpp line 90
0x40203cb0: ArduinoOTAClass::handle() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/libraries/ArduinoOTA/ArduinoOTA.cpp line 365
0x402013e0: loop() at /Volumes/home/Data/Verksamheter/Elektronik/Arduino/Demo i2c ota error/slave_receiver/slave_receiver/slave_receiver.ino line 118
0x40205c8c: loop_wrapper() at /Users/christer/Library/Arduino15/packages/esp8266/hardware/esp8266/2.6.2/cores/esp8266/core_esp8266_main.cpp line 180
Tech-TX commented 4 years ago

The only problem I see is the long delay due to the 9600 baud serial clock you're using. Boost that to 115.2K and your exceptions may magically disappear. I suspect you're spending too much time with those two slow print statements, and it's crashing because of that. Personally, I'd set a flag in the callback after I'd received the data, and then use the flag to print it out from inside loop() or some other less-critical function. Asynchronous callbacks have looser restrictions that ISRs do, but that depends on WiFi traffic. When you're in the middle of an OTA download, that's a heck of a lot of traffic. BOOM!

edit: even worse is the 12KHz I2C clock, as that's spending a long time inside the Slave section. I don't see a simple way around that in your setup, as it probably won't run above 50KHz. I can't tell from your first post, but I'd make sure the slave side was compiled for 160MHz CPU, as well. The slave code is very time-critical, and even receiving one byte looks like it'd take around 41-83 microseconds inside the onSclChange ISR with each SCL edge. I haven't timed that section yet, so the microseconds are a guess since I see a 'clock stretch' in there.

If it's caused merely because the ISR in the slave code is fired, then maybe the OTA code should disable the other interrupts while critical sections are in process. It's better to lose an I2C transaction than your update payload. The OTA update ends with a reset, so you'd lose it in any event. I don't know if there's a way to un-hook the I2C interrupts from the sketch side once you've recognized an OTA is in process.

carpelux commented 4 years ago

The Serial.prints are there only because they are so in the example code I used to create the MVCE. The speed is 9600 as I didnt bother to figure out how to set the Mac terminal serial speed so I could capture the stack trace with a screen command.

in the "Production" code I have the same crash problem and there I only assigns a value to a volatile declared variable in the receiveEvent.

I just tried in the slave code to remove all Serial.print and compile it for 160MHz but the result was the same, crash, so it seems like it is like You print in the last sentence. As You also write it's no problems losing the i2c communication during a OTA as it will reset anyway.

It has been working with 2.5.x so the question is if the problem nows lies in changes in the i2c or the OTA code. Anyway, The i2c seems a bit more stable now. I have a reset in the "production" code that reboots the slave when loosing connection and when on 2.5.x it was rebooting on average once a day, now with 2.6.x i would say it's on average once a week so thats a big step forward.

If it was possible in the OTA to disable other interrupts or maybe have some way to do something like wire.end() I suppose this problem could be solved. Unfortuneately I'm confident enough to write that code.

Edit: typos

Tech-TX commented 4 years ago

@carpelux You're confident enough, as you're already tweaking the I2C core file. :-) I don't know if the following will work, but I suspect it may. Let's try disabling the slave inside the OTA handler. We'll do it the fast and dirty way. In ArduinoOTA.cpp insert the following just before the _runUpdate(); at line 364:

detachInterrupt(4);
detachInterrupt(5);

Here's what the finished insert looks like, in case the line numbers changed between 2.6.2 and the current git:

//this needs to be called in the loop()
void ArduinoOTAClass::handle() {
  if (_state == OTA_RUNUPDATE) {
    detachInterrupt(4);
    detachInterrupt(5);
    _runUpdate();
    _state = OTA_IDLE;
  }

That would be cleaner if you extern'd two variables that match your SCL and SDA pins, but it's a fast test with only one file to tweak. If the OTA fails, you'll need to re-init the I2C slave routine with Wire.begin(SDA_PIN, SCL_PIN, I2C_SLAVE); again as the interrupts are now toast.

carpelux commented 4 years ago

Thanks a lot. A quick test showed that it works fine now. I will need to do some more testing in the production code but that needs to wait for a week as I'm on the way to a trip right now.

I will get back with the result as soon I'm back, but it looks really promising.

Tech-TX commented 4 years ago

I made a mistake in the insert above, it should have been detachInterrupt(digitalPinToInterrupt(interruptPin));, so:

detachInterrupt(digitalPinToInterrupt(4));
detachInterrupt(digitalPinToInterrupt(5));

I'm surprised it worked with the original syntax. Chalk it up to a newbie mistake and using Google to find examples. ;-) Google finds a LOT of bad examples.

A cleaner solution would be to replace in your slave code:

extern const int SDA_PIN = 4;
extern const int SCL_PIN = 5;
//#define SDA_PIN 4
//#define SCL_PIN 5

and then change the tweak in ArduinoOTA.cpp (in case you change the port pins):

detachInterrupt(digitalPinToInterrupt(SDA_PIN));
detachInterrupt(digitalPinToInterrupt(SCL_PIN));

I checked out the slave code already and it compiles and runs, just about to check your whole test case.

By the way, the master side can be compiled at 80MHz CPU, it's the slave side that needs 160MHz.

Tech-TX commented 4 years ago

There's a better fix in the works so that you won't have to work around the bug by disabling interrupts. The fix will be out Real Soon Now. I've been testing it most of today with a modified version of your test code (running at 50KHz bus, and packets every 5ms for stress.) I also added a master-transmitter into your slave code so I could beat it up a little more severely. No hiccoughs in the OTA upload with interrupts enabled, although the OTA does frequently trash the I2C slave reception while OTA packets are being received as WiFi is a higher-priority interrupt. OTA plus I2C

Tech-TX commented 4 years ago

That simpler solution just above worked, but may have blown up in your face down the road. There's a more definitive fix (PR #6898) in place which should be out for the 2.6.3 release. Since there were so many small changes to the code, I want to re-run a bunch of tests this weekend before blessing it.

carpelux commented 4 years ago

After a week away I got back home late at night yesterday. It is impressive work that has been done while I was away.

I downloaded the core_esp8266_si2c.cpp from the PR and set it to work in my production code. It has been running fine now for about 10 hours, including a few OTA updates. I will let it run but as far as I can see now my problem with OTA is solved.

A big thanks for all effort @Tech-TX and to all other involved in this great project!

Tech-TX commented 4 years ago

It's definitely a group effort, as I suck at C++ and I'm only barely functional at C. All of the heavy lifting has been done by the other two gentlemen. I'm doing the grunt work (it's what I'm good at) and trying to break things. :innocent: I'm pretty good at breaking things.

edit: which PR are you using? The first one worked, but may still have issues down the road if the compiler optimizes incorrectly. #6898 is the one that's going into the release code, as I haven't been able to break it yet. I should finish my testing within a couple more hours.

carpelux commented 4 years ago

Yes, It's #6898 is the one that I'm using and it still works perfect so I am pretty sure my problem is solved.

Once more, big thanks to You all!

Tech-TX commented 4 years ago

Excellent! For my stress test (where I showed the logic analyzer capture above) I was hitting the bus pretty heavily, and I did around 100 OTA uploads with no errors on this latest PR. Before the fix OTA uploads always died at around 5 to 10% completion since parts of the slave code were in Flash. If your slave is controlling something delicate, I'd verify each received slave packet with a CRC or something so that you don't send your (whatever) off the rails when the I2C packet(s) get mangled during OTA updates.

devyte commented 4 years ago

Closing via #6898

Tech-TX commented 4 years ago

@carpelux you might want to play with the latest git from today with the new core_esp8266_si2c.cpp file. We've just tweaked the clock generation to go smoothly down below where you're running it. You won't need to jiggle the core file to get a 12KHz bus clock any longer. :smile: It's not perfect, but it's within 1% of the programmed bus frequency down at the range you're using it. It gets more choppy on frequency accuracy above 100KHz because we're bit-banging the clock from a loop timer, but it's about 15 times more accurate than it was previously at the high end.

For your program just enter your desired speed in Wire.setClock(freq) after Wire.begin. 12000 should mimic what you were doing before.

The 160MHz compile recommendation for the slave hasn't changed, as it's limited by how fast we can respond to interrupts. With a slow bus clock (below ~36KHz) you can compile the slave for 80MHz CPU.