espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.89k stars 7.33k forks source link

[TW#17765] I2C crashing - watchdog timeout (master & v3.0 branch) #1503

Closed DavidAntliff closed 6 years ago

DavidAntliff commented 6 years ago

I have a project that has three I2C slave devices on a single bus (running at 100kHz). For some time I was developing with ESP-IDF 2.1.1 and everything was working pretty well, except for a weird problem where the I2C master would freeze up after a few minutes. I did some research and it looks like this is a problem with the I2C master hardware state machine which has been addressed in more recent commits of ESP-IDF. So to make use of this fix I migrated my project to use master (595688a32ad653d8e6cb1c7682b813f96125853e). I had to make a few changes (remove references to FreeRTOS heap measurement commands, add nvs_flash_init() before initialising WiFi) but then everything seemed to work well. The slave devices are all being polled correctly and everything seems happy.

The project is here: https://github.com/DavidAntliff/esp32-poolmon/tree/ESP-IDF_master

I came back a little while later and the application is crashing over and over with the following console output shortly after boot:

Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Register dump:
PC      : 0x400859f3  PS      : 0x00060034  A0      : 0x80084685  A1      : 0x3ffb0590  
0x400859f3: xQueueGenericSendFromISR at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

A2      : 0x00000001  A3      : 0x00000000  A4      : 0x3ffb05b0  A5      : 0x00000002  
A6      : 0x3ffbc970  A7      : 0x00060021  A8      : 0x800859f3  A9      : 0x3ffb0570  
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x00000002  A13     : 0x3ffbadc0  
A14     : 0x00000000  A15     : 0x400849fc  SAR     : 0x00000012  EXCCAUSE: 0x00000005  
0x400849fc: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Backtrace: 0x400859f3:0x3ffb0590 0x40084682:0x3ffb05b0 0x40084a89:0x3ffb05e0 0x40082ba5:0x3ffb0610 0x4000bfed:0x00000000
0x400859f3: xQueueGenericSendFromISR at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

0x40084682: i2c_master_cmd_begin_static at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40084a89: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40082ba5: _xt_lowint1 at /Users/david/esp32/esp-idf-master/components/freertos/./xtensa_vectors.S:1105

Rebooting...

A software or on-board reset does not stop this endless reset behaviour, however removing power for a short period of time does "fix" the issue. It is strange that a brief ESP32 reset does not clear it. (EDIT: but a long reset press does).

DavidAntliff commented 6 years ago

In addition, I've not seen the problem occur with WiFi disabled, but when I enable WiFi I've now seen it happen about five times.

DavidAntliff commented 6 years ago

Another dump with slightly different info:

Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Register dump:
PC      : 0x4008550d  PS      : 0x00060034  A0      : 0x800859a0  A1      : 0x3ffb0570  
0x4008550d: prvCopyDataToQueue at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

A2      : 0x3ffbc928  A3      : 0x00000000  A4      : 0x00000002  A5      : 0x3ffbadc0  
A6      : 0x00000000  A7      : 0x400849fc  A8      : 0x3ffbc97c  A9      : 0x3ffbc978  
0x400849fc: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

A10     : 0x3ffbc978  A11     : 0x3ffb05b8  A12     : 0x00000004  A13     : 0x3ffbc97c  
A14     : 0x00000001  A15     : 0x00000000  SAR     : 0x00000020  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Backtrace: 0x4008550d:0x3ffb0570 0x4008599d:0x3ffb0590 0x40084682:0x3ffb05b0 0x40084a89:0x3ffb05e0 0x40082ba5:0x3ffb0610 0x4000bfed:0x00000000
0x4008550d: prvCopyDataToQueue at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

0x4008599d: xQueueGenericSendFromISR at /Users/david/esp32/esp-idf-master/components/freertos/./queue.c:2037

0x40084682: i2c_master_cmd_begin_static at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40084a89: i2c_isr_handler_default at /Users/david/esp32/esp-idf-master/components/driver/./i2c.c:1023

0x40082ba5: _xt_lowint1 at /Users/david/esp32/esp-idf-master/components/freertos/./xtensa_vectors.S:1105

Rebooting...
DavidAntliff commented 6 years ago

Also, if I hold down the reset button on my ESP32 board for about a second (rather than just pressing it quickly) then the issue does clear on next boot. Pressing it too quickly doesn't help.

DavidAntliff commented 6 years ago

From what I can tell so far, this only occurs when there's a glitch on the I2C bus and WiFi is enabled. If WiFi is disabled, the glitch may happen but it doesn't cause this cascade-reset issue.

DavidAntliff commented 6 years ago

I've now seen this happen with the v3.0 branch (e7dc749e2f953a976325964b5ac49b53c8bafc42).

DavidAntliff commented 6 years ago

To be clear, I am using a fairly long cable with my I2C light sensor (TSL2561) - this cable has the side effect of causing the occasional I2C error (missing slave ACK or timeout). Normally this doesn't cause a problem (apart from the failed transaction) however it appears to trigger this issue occasionally. If I use a much shorter cable, the occasional I2C error does not occur, and neither does this issue. So the cable is clearly related to the issue, but I believe it is triggering the problem by inducing I2C errors, rather than causing it. I think the I2C master shouldn't fail like this in the event of a slave or comms issue.

DavidAntliff commented 6 years ago

I have now observed this issue occur when WiFi is disabled.

costaud commented 6 years ago

Thanks for the detailed descriptions.

I have a question that how large is the pull-up resister?

We will have some tests under similar condition and get back to you.

panfeng-espressif commented 6 years ago

Thanks for the detailed descriptions. I would like to ask what is esp_mqtt.h? I could not find it in either esp-idf3.0 or esp-idf2.1. If you could share it, that will help us debug this issue.

negativekelvin commented 6 years ago

@panfeng-espressif https://github.com/DavidAntliff/esp-mqtt/

It is a submodule of his other project

DavidAntliff commented 6 years ago

SDA and SCK pull up resistance measured (with ESP32, and AVR, removed) at 7.4 kOhm each.

Not sure if it's relevant but I sometimes get a similar error (and resulting continual crash) involving RMT (this time with ESP IDF v3.0):

Guru Meditation Error: Core  0 panic'ed (LoadProhibited)
. Exception was unhandled.
Register dump:
PC      : 0x40084e99  PS      : 0x00060033  A0      : 0x80081aa4  A1      : 0x3ffb05c0  
0x40084e99: rmt_driver_isr_default at /Users/david/esp32/esp-idf-v3.0/components/driver/./rmt.c:497

A2      : 0x00000006  A3      : 0x00000040  A4      : 0x00000002  A5      : 0x00000000  
A6      : 0x00000000  A7      : 0x00001800  A8      : 0x00000006  A9      : 0x00000001  
A10     : 0x00000001  A11     : 0x3ffb05c0  A12     : 0x80082d8b  A13     : 0x00000001  
A14     : 0x00060021  A15     : 0x00020d90  SAR     : 0x00000014  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000014  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000  

Backtrace: 0x40084e99:0x3ffb05c0 0x40081aa1:0x3ffb05f0 0x40082b25:0x3ffb0610 0x4000bfed:0x00000000
0x40084e99: rmt_driver_isr_default at /Users/david/esp32/esp-idf-v3.0/components/driver/./rmt.c:497

0x40081aa1: shared_intr_isr at /Users/david/esp32/esp-idf-v3.0/components/esp32/./intr_alloc.c:754

0x40082b25: _xt_lowint1 at /Users/david/esp32/esp-idf-v3.0/components/freertos/./xtensa_vectors.S:1105

Rebooting...

Might be a red herring though...

DavidAntliff commented 6 years ago

On several occasions the system has run for 3+ days without me seeing this error. Other times it can happen after just a few minutes since power up. I have noticed that it doesn't seem to crash, or enter this continual crash loop, if I don't have a serial console connected, but I need to do more tests to confirm this for sure.

koobest commented 6 years ago

Hi @DavidAntliff , Thanks for you feedback. we are now want to reproducing your problem. is there any place in your code that I should to modify ?

DavidAntliff commented 6 years ago

There isn't really any modification required - I am currently running from the head of my ESP-IDF_v3.0 branch, using ESP IDF v3.0. I've also seen the same problem with my ESP-IDF_master branch, using ESP IDF master @ 595688a32ad653d8e6cb1c7682b813f96125853e.

Extra info:

I have it in a crash-loop now, and it's failing over and over during or shortly after the call to i2c_driver_install:

W (2142) i2c: about to run i2c_driver_install
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Register dump:
PC      : 0x4000bff0  PS      : 0x00060034  A0      : 0x80086778  A1      : 0x3ffb0560  
A2      : 0x00000003  A3      : 0x00060023  A4      : 0x00060023  A5      : 0x3ffb9e8c  
A6      : 0x00000001  A7      : 0x00000000  A8      : 0x00000000  A9      : 0x3ffb9d5c  
A10     : 0x0000000a  A11     : 0x3ffbaf80  A12     : 0x800d1954  A13     : 0x3ffbaf60  
A14     : 0x00000031  A15     : 0x0000000e  SAR     : 0x0000001f  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Backtrace: 0x4000bff0:0x3ffb0560 0x40086775:0x3ffb0570 0x40085c00:0x3ffb0590 0x40084852:0x3ffb05b0 0x40084c7d:0x3ffb05e0 0x40082bbd:0x3ffb0610 0x4000bfed:0x00000000

Rebooting...

It ran without errors, performing many I2C operations (one or two a second) for 4 hours, 3 minutes and 43 seconds before this occurred.

koobest commented 6 years ago

Hi, @DavidAntliff Can you provide the log when the system power on ? I found that your code does not initialize nvs_flash before wifi start up, but I'm not sure if this is the cause.

DavidAntliff commented 6 years ago

I'm mostly using my ESP-IDF_v3.0 branch (with ESP IDF v3.0) at the moment (which has the same problem as my branch that uses the ESP IDF master branch), and I do call nvs_flash_init() there.

Here's the log when my application (commit 801dca9ce8833707fa6bb44c6b614fc0f24fc7b4) starts:

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x16 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5576
load:0x40078000,len:0
ho 12 tail 0 room 4
load:0x40078000,len:13716
entry 0x40078fb4
I (29) boot: ESP-IDF v3.0-rc1-151-gb6b8af49-dirty 2nd stage bootloader
I (30) boot: compile time 12:35:35
I (31) boot: Enabling RNG early entropy source...
I (36) boot: SPI Speed      : 40MHz
I (40) boot: SPI Mode       : DIO
I (44) boot: SPI Flash Size : 4MB
I (48) boot: Partition Table:
I (52) boot: ## Label            Usage          Type ST Offset   Length
I (59) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (66) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (74) boot:  2 factory          factory app      00 00 00010000 00100000
I (81) boot: End of partition table
I (86) esp_image: segment 0: paddr=0x00010020 vaddr=0x3f400020 size=0x17e40 ( 97856) map
I (129) esp_image: segment 1: paddr=0x00027e68 vaddr=0x3ffb0000 size=0x033a0 ( 13216) load
I (134) esp_image: segment 2: paddr=0x0002b210 vaddr=0x40080000 size=0x00400 (  1024) load
0x40080000: _iram_start at /Users/david/esp32/esp-idf-v3.0/components/freertos/./xtensa_vectors.S:1685

I (136) esp_image: segment 3: paddr=0x0002b618 vaddr=0x40080400 size=0x049f8 ( 18936) load
I (152) esp_image: segment 4: paddr=0x00030018 vaddr=0x400d0018 size=0x64b58 (412504) map
0x400d0018: _stext at ??:?

I (296) esp_image: segment 5: paddr=0x00094b78 vaddr=0x40084df8 size=0x0def0 ( 57072) load
0x40084df8: i2c_isr_handler_default at /Users/david/esp32/esp-idf-v3.0/components/driver/./i2c.c:1023

I (320) esp_image: segment 6: paddr=0x000a2a70 vaddr=0x400c0000 size=0x00000 (     0) load
I (331) boot: Loaded app from partition at offset 0x10000
I (331) boot: Disabling RNG early entropy source...
I (332) cpu_start: Pro cpu up.
I (335) cpu_start: Starting app cpu, entry point is 0x4008102c
0x4008102c: call_start_cpu1 at /Users/david/esp32/esp-idf-v3.0/components/esp32/./cpu_start.c:215

I (0) cpu_start: App cpu up.
I (346) heap_init: Initializing. RAM available for dynamic allocation:
D (353) heap_init: New heap initialised at 0x3ffae6e0
I (358) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
D (364) heap_init: New heap initialised at 0x3ffb9140
I (369) heap_init: At 3FFB9140 len 00026EC0 (155 KiB): DRAM
I (375) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
I (382) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
D (389) heap_init: New heap initialised at 0x40092ce8
I (393) heap_init: At 40092CE8 len 0000D318 (52 KiB): IRAM
I (399) cpu_start: Pro cpu start user code
D (411) clk: RTC_SLOW_CLK calibration value: 3302912
D (85) intr_alloc: Connected src 46 to int 2 (cpu 0)
D (86) intr_alloc: Connected src 57 to int 3 (cpu 0)
D (86) stack_chk: Intialize random stack guard
D (91) intr_alloc: Connected src 24 to int 9 (cpu 0)
I (96) cpu_start: Starting scheduler on PRO CPU.
D (0) intr_alloc: Connected src 25 to int 2 (cpu 1)
I (0) cpu_start: Starting scheduler on APP CPU.
D (111) heap_init: New heap initialised at 0x3ffe0440
D (121) heap_init: New heap initialised at 0x3ffe4350
D (121) intr_alloc: Connected src 16 to int 12 (cpu 0)
W (2141) app_main: about to run i2c_master_init
W (2141) i2c: about to run i2c_param_config
W (2141) i2c: about to run i2c_driver_install
D (2141) intr_alloc: Connected src 49 to int 13 (cpu 0)
W (2141) app_main: about to run i2c_master_scan
W (8181) sensor_temp: about to run owb_rmt_initialize
D (8181) intr_alloc: Connected src 47 to int 17 (cpu 0)
W (8181) sensor_temp: about to run find_owb_rom_codes
W (8261) sensor_temp: about to run associate_ds18b20_devices
D (11291) avr_support: avr_support_init
D (11291) avr_support: request AVR reset
I (11291) avr_support: Core ID 1
D (11291) avr_support: I2C 0, REG 0x01: 0x02
D (11291) avr_support: Publishing all switch states
D (11301) avr_support: Publishing all pump states
D (11301) avr_support: command 0xc7, 0b11000111
I (11301) avr_support: CP off
I (11311) avr_support: PP off
I (11311) avr_support: Alarm off
I (11321) avr_support: AVR reset
D (11521) avr_support: I2C 0, REG 0x00: 0x00
D (11521) avr_support: I2C 0, REG 0x01: 0x00
D (11521) avr_support: last_switch_states 0x00, new_switch_states 0x00, changed 0x00
D (11521) avr_support: last_pump_states 0x02, new_pump_states 0x00, changed 0x02
...
DavidAntliff commented 6 years ago

If I hold down the reset button for a few seconds, I get a slightly different log:

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x17 (SPI_FAST_FLASH_BOOT)
flash read err, 1000
Falling back to built-in command interpreter.
OK
>ets Jun  8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x16 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5576
load:0x40078000,len:0
ho 12 tail 0 room 4
load:0x40078000,len:13716
entry 0x40078fb4
I (30) boot: ESP-IDF v3.0-rc1-151-gb6b8af49-dirty 2nd stage bootloader
I (30) boot: compile time 12:35:35
I (31) boot: Enabling RNG early entropy source...
I (36) boot: SPI Speed      : 40MHz
I (41) boot: SPI Mode       : DIO
I (45) boot: SPI Flash Size : 4MB
I (49) boot: Partition Table:
I (52) boot: ## Label            Usage          Type ST Offset   Length
I (59) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (67) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (74) boot:  2 factory          factory app      00 00 00010000 00100000
I (82) boot: End of partition table
I (86) esp_image: segment 0: paddr=0x00010020 vaddr=0x3f400020 size=0x17e40 ( 97856) map
I (129) esp_image: segment 1: paddr=0x00027e68 vaddr=0x3ffb0000 size=0x033a0 ( 13216) load
I (134) esp_image: segment 2: paddr=0x0002b210 vaddr=0x40080000 size=0x00400 (  1024) load
I (136) esp_image: segment 3: paddr=0x0002b618 vaddr=0x40080400 size=0x049f8 ( 18936) load
I (152) esp_image: segment 4: paddr=0x00030018 vaddr=0x400d0018 size=0x64b58 (412504) map
I (297) esp_image: segment 5: paddr=0x00094b78 vaddr=0x40084df8 size=0x0def0 ( 57072) load
I (320) esp_image: segment 6: paddr=0x000a2a70 vaddr=0x400c0000 size=0x00000 (     0) load
I (332) boot: Loaded app from partition at offset 0x10000
I (332) boot: Disabling RNG early entropy source...
I (332) cpu_start: Pro cpu up.
I (336) cpu_start: Starting app cpu, entry point is 0x4008102c
I (0) cpu_start: App cpu up.
I (346) heap_init: Initializing. RAM available for dynamic allocation:
D (353) heap_init: New heap initialised at 0x3ffae6e0
I (358) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
D (365) heap_init: New heap initialised at 0x3ffb9140
I (369) heap_init: At 3FFB9140 len 00026EC0 (155 KiB): DRAM
I (376) heap_init: At 3FFE0440 len 00003BC0 (14 KiB): D/IRAM
I (382) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
D (390) heap_init: New heap initialised at 0x40092ce8
I (394) heap_init: At 40092CE8 len 0000D318 (52 KiB): IRAM
I (400) cpu_start: Pro cpu start user code
D (412) clk: RTC_SLOW_CLK calibration value: 3297498
D (86) intr_alloc: Connected src 46 to int 2 (cpu 0)
D (87) intr_alloc: Connected src 57 to int 3 (cpu 0)
D (87) stack_chk: Intialize random stack guard
D (91) intr_alloc: Connected src 24 to int 9 (cpu 0)
I (97) cpu_start: Starting scheduler on PRO CPU.
D (0) intr_alloc: Connected src 25 to int 2 (cpu 1)
I (0) cpu_start: Starting scheduler on APP CPU.
D (111) heap_init: New heap initialised at 0x3ffe0440
D (121) heap_init: New heap initialised at 0x3ffe4350
D (121) intr_alloc: Connected src 16 to int 12 (cpu 0)
W (2141) app_main: about to run i2c_master_init
W (2141) i2c: about to run i2c_param_config
W (2141) i2c: about to run i2c_driver_install
D (2141) intr_alloc: Connected src 49 to int 13 (cpu 0)
W (2141) app_main: about to run i2c_master_scan
W (8181) sensor_temp: about to run owb_rmt_initialize
D (8181) intr_alloc: Connected src 47 to int 17 (cpu 0)
W (8181) sensor_temp: about to run find_owb_rom_codes
W (8261) sensor_temp: about to run associate_ds18b20_devices
D (11291) avr_support: avr_support_init
D (11291) avr_support: request AVR reset
I (11291) avr_support: Core ID 1
D (11291) avr_support: I2C 0, REG 0x01: 0x00
D (11291) avr_support: Publishing all switch states
D (11301) avr_support: Publishing all pump states
D (11301) avr_support: command 0xc7, 0b11000111
I (11301) avr_support: CP off
I (11311) avr_support: PP off
I (11311) avr_support: Alarm off
I (11321) avr_support: AVR reset
D (11521) avr_support: I2C 0, REG 0x00: 0x00
D (11521) avr_support: I2C 0, REG 0x01: 0x00
D (11521) avr_support: last_switch_states 0x00, new_switch_states 0x00, changed 0x00
D (11521) avr_support: last_pump_states 0x00, new_pump_states 0x00, changed 0x00
...
DavidAntliff commented 6 years ago

What do these lines mean during boot?

0x40080000: _iram_start at /Users/david/esp32/esp-idf-v3.0/components/freertos/./xtensa_vectors.S:1685
...
0x400d0018: _stext at ??:?
...
0x40084df8: i2c_isr_handler_default at /Users/david/esp32/esp-idf-v3.0/components/driver/./i2c.c:1023
...
0x4008102c: call_start_cpu1 at /Users/david/esp32/esp-idf-v3.0/components/esp32/./cpu_start.c:215
...

They don't appear on a first boot after power cycle or a long board reset, but they do appear on a make monitor reset or a short board reset.

panfeng-espressif commented 6 years ago

Let me configure that you are using three I2C light sensors(TSL2561) as your I2C slave and ESP32 as I2C master. How do they connect with each other? We are going to buy some sensors that you use. Thank you

DavidAntliff commented 6 years ago

Here's my schematic: https://drive.google.com/file/d/1Tl_T7OjuoEaj_AxchCVL4cJy9trVFNh_/view?usp=sharing

Less importantly, it's currently built on vero board: https://drive.google.com/open?id=1ERjawt9woHZCbJvHvDzopnqYay1ns9jU

On the I2C side of things I'm using one TSL2561 sensor, and one ATTiny84A as an I2C slave (see code here), and one LCD 1602 display with I2C adapter.

I'm also using 3 DS18B20 temperature sensors on the OneWire bus, and using RMT to drive this, and I'm sometimes seeing issues with that too - not sure yet if it's related to I2C.

DavidAntliff commented 6 years ago

The strange thing about this problem is that sometimes it can run for several days without error (thousands of successful I2C operations), and other times it encounters an error in just a few minutes. I've been unable to determine any real pattern to this at all. The only thing that seems certain is that once it crashes, it just keeps crashing over-and-over.

I'm planning to run some tests with fewer I2C devices to see if I can work out which combination causes the issue. This seems like an obvious thing to do, and I've already started doing that, but since days can elapse between instances, it's very difficult to know for certain whether a particular combination is OK or not. What I really need to do is find a 100% reliable way to trigger the fault, and then I can eliminate combinations very quickly.

Is there any way that an I2C slave (such as the AVR) can hold the bus in such a way that the ESP32 is unable to recover?

panfeng-espressif commented 6 years ago

Where is the definition of the function 'rmt_get_ringbuf_handler'? When I compile yesterday, there is no error come out. But this time, I was told this function is missing.

igrr commented 6 years ago

@panfeng-espressif the function has been renamed recently to rmt_get_ringbuf_handle, see #274 and https://github.com/espressif/esp-idf/commit/0eb59844bbd72931b3fcf2bb55f6a2bde7c54402

DavidAntliff commented 6 years ago

This thread on the forum is about the same thing: https://esp32.com/viewtopic.php?f=13&t=4275

With help from ESP_Angus I decoded the watchdog timeout address to be at line 1045 in i2c.c.

This section of code deals with ACK and TIMEOUT errors, then tries to push a command to the command event queue. If the queue is full, and the call blocks, would this cause a watchdog event? However xQueueOverwriteFromISR isn't meant to block according to the docs , so why would this call result in a watchdog event?

Is there a hardware state that can happen that would cause many multiple failures (and queue posts) in the same call to the handler? What if the I2C bus was a bit unstable, and errors were not only possible but fairly common? If this situation can persist over a reset, then that suggests it's being left in an unexpected hardware state - should the driver attempt to reset the peripheral (perhaps via i2c_hw_fsm_reset?) in i2c_driver_install?

I notice the loop in the interrupt handler beginning at line 363 only terminates when status == 0. If the FSM is stuck, would this loop ever terminate? Will it yield in all cases? If not, would this cause a watchdog event?

I recall that there is a hardware issue that was worked around in 3.0 with i2c_hw_fsm_reset - is it possible that I've encountered another scenario that triggers this issue but isn't covered by the current workaround? Note that when I was using ESP IDF 2.1, that particular hardware issue was happening a lot, which is why I moved to 3.0rc1.

DavidAntliff commented 6 years ago

Also, not sure if it's useful, but after observing with an oscilloscope I see that there is no edge activity on the I2C bus once this crash (and subsequent rolling crashes) occurs. Both SDA and SCL stay high (not sure if driven by a bad slave or master, or just by pull-ups - I can probably test that with a known resistive load). So the second and subsequent crashes are happening without any real I2C operations taking place.

negativekelvin commented 6 years ago

Can you call xQueueOverwriteFromISR from non-isr context?

DavidAntliff commented 6 years ago

@negativekelvin I was just looking at that too - the first call to i2c_master_cmd_begin_static is made in a non-ISR context, and then the rest of the linked list is picked up by the interrupt handler. I can only assume that it's OK, since it seems to work in almost all cases except this one, but I am not in a position to say really. There is a mutex within xQueueOverwriteFromISR but I can't see how that could get stuck.

negativekelvin commented 6 years ago

Well in normal cases I don't think the first call would reach those code paths. Have you tried removing the attiny device from the equation? Does calling i2c_hw_fsm_reset before starting the first transaction stop the boot loop?

DavidAntliff commented 6 years ago

Could the first non-ISR call reach them if the FSM was in a bad state?

In the backtrace I can see that the watchdog did actually fire with the CPU at this line:

        taskENTER_CRITICAL_ISR(&pxQueue->mux);

(there's a bug with addr2line apparently, so I used gdb's info line *0x4... to get the line numbers instead, which is why they don't match those in the console log).

So it is blocked by the mutex. I think I can imagine a situation where this could happen - if the non-ISR call to i2c_master_cmd_begin_static calls xQueueOverwriteFromISR and takes the mutex, but the ISR interrupts and tries to write to the queue, then the ISR will be blocked.

Yes, I have tried without the ATTiny, and I wasn't able to cause the problem. However I've also tried without the LCD 1602 and without the TSL2561 and I don't see the problem with those absent either (and sometimes it runs for days before the error occurs so it's hard to conclude anything in those cases), so I haven't been able to isolate it to a specific slave unfortunately. I only see this problem when all three are present, although with more time I can do more exhaustive testing. I suspect that it's interaction between transactions to multiple I2C slaves that might be causing the problem, but I do need to prove that.

I will try a few things:

  1. your suggestion of an early call to i2c_hw_fsm_reset to try and stop the boot loop.
  2. an experimental rewrite of i2c_master_cmd_begin to do the same operations but without the ISR-specific calls.
  3. try to eliminate I2C slave devices by removing them and wait for the error to occur.
  4. determine whether the SDA & SCL lines are pulled up or driven high by a slave (or the master) when the fault occurs.
DavidAntliff commented 6 years ago

In answer to 1, yes, calling i2c_hw_fsm_reset at the beginning of i2c_driver_install does stop the crash-boot-crash loop.

(Although it's uncovered another crash-loop involving RMT and rmt_driver_isr_default, which I hinted at earlier but now seems much clearer).

ginodecock commented 6 years ago

Hi,

I have a simular problem since IDF3.0rc1. With IDF2.1.1 there was no crash. When a tmp102 is disconnected I got the following error until it's reconnected: ESP_ERROR_CHECK failed: esp_err_t 0xfffffffff at 0x401068f1 0x401068f1: tmp102_init1 at C:/msys32/home/projects/ulp_cc_templogger/main/main.c:164

file: "C:/msys32/home/projects/ulp_cc_templogger/main/tmp102.c" line 199 func: tmp102_init1 expression: i2c_master_ccmd_begin(I2C_NUM_0, cmdconfig, 100 / portTICK_PERIOD_MS)

Backtrace: 0x400883bb:0x3ffc1f60 0x400887f1:0x3ffc1f80 0x401068f1:0x3ffc1fa0 0x40107261:0x3ffc1fc0 0x400d13f8:0x3ffc1ff0 0x400883bb: invoke_abort at Y:/VM/Share/GitHub/esp-idf-v3.0-rc1/components/esp32/panic.c:572

0x400887f1: _esp_error_check_failed at Y:/VM/Share/GitHub/esp-idf-v3.0-rc1/components/esp32/panic.c:584

0x401068f1: tmp102_init1 at C:/msys32/home/projects/ulp_cc_templogger/main/main.c:164

0x40107261: app_main at C:/msys32/home/projects/ulp_cc_templogger/main/main.c:460

0x400d13f8: main_task at Y:/VM/Share/GitHub/esp-idf-v3.0-rc1/components/esp32/cpu_start.c:456

Rebooting... ets Jun 8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x17 (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv::0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:1 load:0x3fff0018,len:4

Code:

void setup_i2c() { //Configure as I2C master i2c_config_t conf; conf.mode = I2C_MODE_MASTER; conf.sda_io_num = SDA_PIN; conf.scl_io_num = SCL_PIN; conf.sda_pullup_en = GPIO_PULLUP_ENABLE; conf.scl_pullup_en = GPIO_PULLUP_ENABLE; conf.master.clk_speed = 100000; ESP_ERROR_CHECK(i2c_param_config(I2C_NUM_0, &conf));

ESP_ERROR_CHECK(i2c_driver_install(I2C_NUM_0, I2C_MODE_MASTER, 0, 0, 0));

}

void tmp102_init1() { i2c_cmd_handle_t cmdconfig = i2c_cmd_link_create();

ESP_ERROR_CHECK(i2c_master_start(cmdconfig)); ESP_ERROR_CHECK(i2c_master_write_byte(cmdconfig, (TMP102_ADDR1 << 1 | I2C_MASTER_WRITE), ACK_CHECK_ON)); ESP_ERROR_CHECK(i2c_master_write_byte(cmdconfig, 0x01, ACK)); //configuration register ESP_ERROR_CHECK(i2c_master_write_byte(cmdconfig, 0xE1, ACK)); //powersave + trigger oneshot ESP_ERROR_CHECK(i2c_master_write_byte(cmdconfig, 0xA0, ACK)); // ESP_ERROR_CHECK(i2c_master_stop(cmdconfig)); ESP_ERROR_CHECK(i2c_master_cmd_begin(I2C_NUM_0, cmdconfig, 100 / portTICK_PERIOD_MS));// Crash when tmp102 is not connected. i2c_cmd_link_delete(cmdconfig);

i2c_cmd_handle_t cmdreadreg = i2c_cmd_link_create(); ESP_ERROR_CHECK(i2c_master_start(cmdreadreg)); ESP_ERROR_CHECK(i2c_master_write_byte(cmdreadreg, (TMP102_ADDR1 << 1 | I2C_MASTER_WRITE), ACK_CHECK_ON)); ESP_ERROR_CHECK(i2c_master_write_byte(cmdreadreg, 0x00, ACK)); //Set to read temperature ESP_ERROR_CHECK(i2c_master_stop(cmdreadreg)); ESP_ERROR_CHECK(i2c_master_cmd_begin(I2C_NUM_0, cmdreadreg, 100 / portTICK_PERIOD_MS)); i2c_cmd_link_delete(cmdreadreg); }

My problem is always reproducable. If there is a new test i2c driver, I can test it.

DavidAntliff commented 6 years ago

@ginodecock your ESP32 reboots because a call to ESP_ERROR_CHECK has failed - the function within has returned a non-ESP_OK code. As written, if any of your i2c function calls hit an error, your ESP32 will reboot. What you'll need to do is remove all the ESP_ERROR_CHECK wrappers and see if your code triggers a watchdog panic in the i2c interrupt handler.

And/or, modify the source for ESP_ERROR_CHECK and have it print out the error code rather than assert - then you can at least see what the bad error code is, which will be useful to know.

DavidAntliff commented 6 years ago

@igrr does the ESP-IDF FreeRTOS port support "interrupt nesting"? Apparently calling a FromISR function outside of an ISR context is only "probably" safe if interrupt nesting is supported. I'm wondering if the problem I'm seeing is caused by an I2C hardware FSM failure in the non-ISR context of i2c_master_cmd_begin_static that causes it to call xQueueOverwriteFromISR, taking the internal mutex, only to be interrupted by the ISR, which then blocks itself on xQueueOverwriteFromISR and fires the watchdog, causing the ESP32 to reboot. Then, because the FSM freeze isn't addressed by i2c_hw_fsm_reset until after a subsequent transaction has started, this never gets to happen because any subsequent I2C transaction fails immediately. I'm not sure why this always results in a watchdog+reset at that point, but an early preventative call to i2c_hw_fsm_reset seems to avoid it.

There's still some gaps in my understanding. Any comments from Espressif would be helpful please. I'm going to try some tests this weekend to develop this hypothesis and then attempt to prove or disprove it.

ginodecock commented 6 years ago

Hi,

I don't see a watchdog panic in the i2c interrupt handler. When modifying the ESP_ERROR_CHECK I got -1 as reported error.

ammaree commented 6 years ago

HI, Sorry I am coming late into this discussion but some of the problems look familiar. Some background.

We have custom designed and built ESP32 controller with 8x I2C/1-Wire (DS2482-800 bridge) inputs and 16x I2C/GPIO (PCA9555) outputs with various other inputs. The 1-Wire inputs are used to read iButtons (room access and bathroom water access) and the 16x GPIO's outputs are used to drive 8x LEDs (in the iButton readers) and 8x Relays (switching door strikers and water solenoid valves).

In a test environment with short leads all worked perfectly but in the live environment with longer UTP cabling driving the LED & Striker and reading the 1-Wire we had catastrophic results. Similarly on the bathrooms where the DC solenoid and LED/1Wire cabling was separated.

The solution (so far) has been very simple. We simply soldered an 1N4007 diode with reverse bias as close as possible to the door striker and the water solenoid valve. This has the effect of absorbing (shorting) the reverse EMF introduced when power is removed from the striker or solenoid valve.

I have not been able to look at your schematic since the link is not working. Hope this helps a bit.

Andre

DavidAntliff commented 6 years ago

@ammaree Hi André, thank you for your comments. I'll fix the link shortly.

It's fairly clear to me that I have some electrical issues on one of my I2C sensors, and your suggestion may well help with that. However, I tried an experiment: I set up four I2C slave devices (AVR ATtiny84 and 85s) on very short cables, and wrote an ESP32 program to fire off I2C transactions to each in parallel. The bus looks good and it runs well at 400 kHz. However, in the middle of the night, there was a single incident which triggered this bug, causing a watchdog timeout and then a subsequent crash loop.

So, my concern is that although I can make my own electrical side more reliable (and I will), there's a serious bug in ESP-IDF at the moment which causes a catastrophic and persistent failure once triggered. I'd like to track this down and solve it, because it seems that even with a good I2C bus, there's always the risk of an error that could cause this problem.

The challenge at the moment is finding a way to trigger this bug reliably. It's easy to create a noisy bus, but it seems that this bug is related to a possible race in the driver that is only apparent on a noisy bus, but is still relatively rare. Once I can get a simple circuit working that reliably triggers the bug, it should become a bit easier to track it down.

DavidAntliff commented 6 years ago

I fixed the schematic link earlier. Here it is too.

DavidAntliff commented 6 years ago

I've found a reliable way to cause the crash involving 4 x AVRs acting as I2C slaves, and a long cable to a TSL2561, however I will work on simplifying the set-up so that it can be reproduced by others - more on this soon.

Note: I disabled the crash loop by calling i2c_hw_fsm_reset at the beginning of i2c_driver_install. This helps because now it only crashes due to the "initial" fault.

It now crashes pretty reliably every 30 seconds or so. Using the gdb info line *0xaddress technique, I've inspected the stack traces and although the reported error is always "Core 0 panic'ed (Interrupt wdt timeout on CPU0)", and it always happens within the i2c driver, the exact location of each panic varies a little. Here's a selection of the places where it has failed - each one is the first frame reported in the backtrace, over a series of independent crashes in the same build:

(gdb) info line *0x400866ed
Line 157 of "/Users/david/esp32/esp-idf-v3.0/components/freertos/./portmux_impl.h"
   starts at address 0x400866ed <vTaskExitCritical+53> and ends at 0x40086705 <vTaskExitCritical+77>.

       assert(mux->count < 0x100); // Indicates memory corruption
(gdb) info line *0x40086604
Line 4203 of "/Users/david/esp32/esp-idf-v3.0/components/freertos/./tasks.c"
   starts at address 0x40086604 <vTaskEnterCritical> and ends at 0x40086607 <vTaskEnterCritical+3>.

#ifdef CONFIG_FREERTOS_PORTMUX_DEBUG
    void vTaskEnterCritical( portMUX_TYPE *mux, const char *function, int line )
#else
    void vTaskEnterCritical( portMUX_TYPE *mux )
#endif
    {    // <------ Line 4203
        BaseType_t oldInterruptLevel=0;
(gdb) info line *0x400849d9
Line 376 of "/Users/david/esp32/esp-idf-v3.0/components/driver/./i2c.c"
   starts at address 0x400849cf <i2c_isr_handler_default+111> and ends at 0x400849db <i2c_isr_handler_default+123>.

                p_i2c_obj[i2c_num]->status = I2C_STATUS_ACK_ERROR;
(gdb) info line *0x400845e0
Line 1168 of "/Users/david/esp32/esp-idf-v3.0/components/driver/./i2c.c"
   starts at address 0x400845dc <i2c_master_cmd_begin_static+124>
   and ends at 0x400845e9 <i2c_master_cmd_begin_static+137>.

        xQueueOverwriteFromISR(p_i2c->cmd_evt_queue, &evt, &HPTaskAwoken);
(gdb) info line *0x4008560b
Line 2065 of "/Users/david/esp32/esp-idf-v3.0/components/freertos/./queue.c"
   starts at address 0x40085606 <prvCopyDataToQueue+106> and ends at 0x40085612 <prvCopyDataToQueue+118>.

                --( pxQueue->uxMessagesWaiting );

Note that it stops in a different place each time - could this be due to memory corruption as one of the code comments above suggests?

DavidAntliff commented 6 years ago

Also, I think I have disproved my earlier hypothesis that the queue mutex is being held by the non-ISR context call to xQueueOverwriteFromISR, resulting in a watchdog timeout. I added some logging around the mutex take and give (for the non-ISR context only) and after many, many crashes I haven't seen any evidence that the ISR is blocked by the non-ISR's hold on the mutex. So I think we can discount this theory.

DavidAntliff commented 6 years ago

@panfeng-espressif @igrr I have found a way to reproduce this issue with a much simpler program and hardware setup.

Please see my new repository here: https://github.com/DavidAntliff/esp32-i2c-error-1503

With my two "DOIT" boards I'm able to reproduce simply by hooking up two wires to two GPIOs (as I2C SDA and SCL) and repeatedly touching them together. Not a true representation of any properly configured I2C bus, of course, but it does trigger the exact same issue I see in my "complete" system, without all the overhead of extra I2C devices, use of RMT, WiFi, etc.

With my "LoLin32 Lite" board it's harder to reproduce this way (they must be more electrically stable, perhaps), but with a bit of perseverance I'm able to get this board to show the same error. I've included notes in the project's README to explain more.

This should hopefully make it much easier for the Espressif team to reproduce and debug. Please let me know if this helps.

negativekelvin commented 6 years ago

I would suggest you add some logging to the isr and see if there are multiple interrupts being generated at the same time. If so then instead of looping certain code paths in the isr should probably clear ALL interrupts before calling other functions.

DavidAntliff commented 6 years ago

@negativekelvin does ESP-IDF support any logging from an ISR? I assume printf isn't going to work as it will be too slow. I can log simple values to an ad-hoc timestamped array but if something already exists I'll use that.

When you say "multiple interrupts being generated at the same time" do you mean that all-but-one interrupt is being ignored, or that they are being resolved in an incorrect order? What sort of bug do you anticipate here? The hardware FSM is getting "stuck" in this case, so I'm keen to better understand how this might relate to interrupts.

Or, are you suggesting that the ISR is spinning (looping) too much due to too many interrupts (in too short a time) and that causes the watchdog to fire? That seems plausible and fast logging might show that up.

negativekelvin commented 6 years ago

ESP_EARLY_LOGI

Yes too many interrupts or some other race condition between two interrupts

kylehendricks commented 6 years ago

I haven't gotten nearly as deep into this as @DavidAntliff has but I believe I am also having the same issues. It behaves exactly like he described. It will work reliably for a day straight then suddenly go into an endless bootloop with the same Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0) error until I disconnect power completely.

Here are the complete logs right before the reboot:

Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x400845fd  PS      : 0x00060034  A0      : 0x80084a7c  A1      : 0x3ffb05b0  
0x400845fd: i2c_master_cmd_begin_static at /home/khendricks/dev/esp32/tools/esp-idf/components/driver/./i2c.c:947

A2      : 0x00000001  A3      : 0x00000001  A4      : 0x800d17de  A5      : 0x3ffbb880  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x3ffbd0d8  A9      : 0x00000001  
A10     : 0x00000001  A11     : 0x00000000  A12     : 0x3ffb05b0  A13     : 0x00000002  
A14     : 0x3ffbd1e4  A15     : 0x00060021  SAR     : 0x00000011  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  
Core 0 was running in ISR context:
EPC1    : 0x400d1b9a  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x400845fd
0x400d1b9a: esp_vApplicationIdleHook at /home/khendricks/dev/esp32/tools/esp-idf/components/esp32/./freertos_hooks.c:85

0x400845fd: i2c_master_cmd_begin_static at /home/khendricks/dev/esp32/tools/esp-idf/components/driver/./i2c.c:947

Backtrace: 0x400845fd:0x3ffb05b0 0x40084a79:0x3ffb05e0 0x40082b95:0x3ffb0610 0x400d1b97:0x00000000
0x400845fd: i2c_master_cmd_begin_static at /home/khendricks/dev/esp32/tools/esp-idf/components/driver/./i2c.c:947

0x40084a79: i2c_isr_handler_default at /home/khendricks/dev/esp32/tools/esp-idf/components/driver/./i2c.c:947

0x40082b95: _xt_lowint1 at /home/khendricks/dev/esp32/tools/esp-idf/components/freertos/./xtensa_vectors.S:1105

0x400d1b97: esp_vApplicationIdleHook at /home/khendricks/dev/esp32/tools/esp-idf/components/esp32/./freertos_hooks.c:85

Core 1 register dump:
PC      : 0x400d1b9a  PS      : 0x00060034  A0      : 0x80086748  A1      : 0x3ffbbe40  
0x400d1b9a: esp_vApplicationIdleHook at /home/khendricks/dev/esp32/tools/esp-idf/components/esp32/./freertos_hooks.c:85

A2      : 0x00000008  A3      : 0x00000001  A4      : 0x00000001  A5      : 0x3ffbb984  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x3ffb371c  A9      : 0x3ffb36e0  
A10     : 0x00000000  A11     : 0x80000001  A12     : 0x00000000  A13     : 0x00000001  
A14     : 0x00060021  A15     : 0x00000000  SAR     : 0x00000000  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace: 0x400d1b9a:0x3ffbbe40 0x40086745:0x3ffbbe60
0x400d1b9a: esp_vApplicationIdleHook at /home/khendricks/dev/esp32/tools/esp-idf/components/esp32/./freertos_hooks.c:85

0x40086745: prvIdleTask at /home/khendricks/dev/esp32/tools/esp-idf/components/freertos/./tasks.c:4571

My current setup is a single ST VL53L0X Time-Of-Flight sensor hooked up to the ESP32 via a 56cm cable and 4.7kohm pull-up resistors. My esp-idf is on master at commit https://github.com/espressif/esp-idf/commit/daa8cfa800677a71a3447eba418a4336979a3cb8 .

TimHughey commented 6 years ago

Appears I too have experienced this same issue. One SHT-31 on the i2c bus and seven devices on a 1-wire bus (via RMT). Log message immediately proceeding the WDT timeout indicated the read of the SHT-31 failed with esp_rc=-1

I (1265887) mcrTimestamp:  Sat Feb  3 23:45:50 2018 heap=166k first=259k min=163k max=170k delta=120
W (1265917) mcrI2c readSHT31: read failed for i2cDev(mcrDev(mcrDevAddr(0x44) id=i2c/542419a4ae30.01.sht31 desc=sht31 rus=20374 wus=0 reading=0x3ffd5cb4) bus=1 use_mplex=true) esp_rc=-1
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x400845ff  PS      : 0x00060034  A0      : 0x40082560  A1      : 0x3ffb05e0  
0x400845ff: i2c_isr_handler_default at /Users/thughey/devel/esp-idf/components/driver/./i2c.c:1023

0x40082560: _xt_lowint1 at /Users/thughey/devel/esp-idf/components/freertos/./xtensa_vectors.S:1105

A2      : 0x3ffbee50  A3      : 0x3ff53000  A4      : 0x00000400  A5      : 0x00000000  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x00000001  A9      : 0x00000000  
A10     : 0xabc4f586  A11     : 0x00000001  A12     : 0x800d3132  A13     : 0x3ffbdb20  
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000015  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  
Core 0 was running in ISR context:
EPC1    : 0x400d3402  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x400845ff
0x400d3402: esp_vApplicationIdleHook at /Users/thughey/devel/esp-idf/components/esp32/./freertos_hooks.c:85

0x400845ff: i2c_isr_handler_default at /Users/thughey/devel/esp-idf/components/driver/./i2c.c:1023

Backtrace: 0x400845ff:0x3ffb05e0 0x4008255d:0x3ffb0610 0x400d33ff:0x00000000
0x400845ff: i2c_isr_handler_default at /Users/thughey/devel/esp-idf/components/driver/./i2c.c:1023

0x4008255d: _xt_lowint1 at /Users/thughey/devel/esp-idf/components/freertos/./xtensa_vectors.S:1105

0x400d33ff: esp_vApplicationIdleHook at /Users/thughey/devel/esp-idf/components/esp32/./freertos_hooks.c:85

Core 1 register dump:
PC      : 0x400d3402  PS      : 0x00060334  A0      : 0x80086d08  A1      : 0x3ffbe0e0  
0x400d3402: esp_vApplicationIdleHook at /Users/thughey/devel/esp-idf/components/esp32/./freertos_hooks.c:85

A2      : 0x00000008  A3      : 0x00000001  A4      : 0x00000001  A5      : 0x3ffbdc18  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x3ffb34fc  A9      : 0x3ffb34c0  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x8008571f  A13     : 0x3ffc6080  
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x00000000  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace: 0x400d3402:0x3ffbe0e0 0x40086d05:0x3ffbe100
0x400d3402: esp_vApplicationIdleHook at /Users/thughey/devel/esp-idf/components/esp32/./freertos_hooks.c:85

0x40086d05: prvIdleTask at /Users/thughey/devel/esp-idf/components/freertos/./tasks.c:4554

CPU halted.
DavidAntliff commented 6 years ago

Unfortunately I haven't had a chance to investigate this further - in particular the possibility of a flood of interrupts starving the watchdog. I hope to revisit at some stage but I'm not sure when, as I've already spent three weeks chasing this and I need to move on with other tasks. Of course eventually I'll be forced to revisit if it isn't fixed in the meantime since my project is pretty much dead with this issue present.

In the meantime I believe I've demonstrated that the problem is real, is not related to my project specifically, and that there should be enough info here for Espressif to take over, reproduce and find the root cause.

On that note, has anyone had a chance to follow my reproduction instructions and reproduce the issue by simply brushing two wires together while running my example code? I've noticed it's more difficult to reproduce on certain boards (newer LoLin32 Lite boards seem much more resilient than my older "DOIT" boards) , however I've been able to eventually reproduce it on all of mine, so it would be interesting to see how easily anyone can reproduce it on other boards, since I think there's some sort of electrical component to this too (perhaps in how bus noise propagates onto the board and triggers interrupts).

TimHughey commented 6 years ago

I've not attempted to recreate the issue with your code or by brushing wires together. At this point my code is stable after I cleaned up my code to avoid hammering away at the i2c bus. I'll let me board soak for awhile to see if the issue resurfaces.

P.S. I'm using the Adafruit Feather ESP32 module.

PerMalmberg commented 6 years ago

Seem's I'm having the same (or very similar) issue as the rest here. In my case I always get timeouts reported by the I2C driver (the actual text is from my own I2CMasterDevice class) just before the timeout.

I'm communicating on two separate buses with two ADS1115 and one MCP23017. They are both read based on a timer tick() as well as a "ready interrupt" from the respective devices.

It happens very rarely (many days in between) so I don't have more than a stack trace.

E (4356362) I2CMasterDevice: Error during read - Send command error, no ACK from slave
E (4356362) I2CMasterDevice: Error during write - Send command error, no ACK from slave
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x40083f3c  PS      : 0x00060034  A0      : 0x400824f4  A1      : 0x3ffb09f0  
0x40083f3c: i2c_master_cmd_begin_static at /home/permal/esp/esp-idf/components/driver/./i2c.c:778

0x400824f4: _xt_lowint1 at /home/permal/esp/esp-idf/components/freertos/./xtensa_vectors.S:1105

A2      : 0x3ffd9848  A3      : 0x3ff67000  A4      : 0x00000400  A5      : 0x00000001  
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x800843cc  A9      : 0x00000000  
A10     : 0x00000001  A11     : 0x00000001  A12     : 0x800d350a  A13     : 0x3ffc0160  
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000012  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  
Core 0 was running in ISR context:
EPC1    : 0x400d2242  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40083f3c
0x400d2242: esp_vApplicationIdleHook at /home/permal/esp/esp-idf/components/esp32/./freertos_hooks.c:85

0x40083f3c: i2c_master_cmd_begin_static at /home/permal/esp/esp-idf/components/driver/./i2c.c:778

Backtrace: 0x40083f3c:0x3ffb09f0 0x400824f1:0x3ffb0a20 0x400d223f:0x00000000
0x40083f3c: i2c_master_cmd_begin_static at /home/permal/esp/esp-idf/components/driver/./i2c.c:778

0x400824f1: _xt_lowint1 at /home/permal/esp/esp-idf/components/freertos/./xtensa_vectors.S:1105

0x400d223f: esp_vApplicationIdleHook at /home/permal/esp/esp-idf/components/esp32/./freertos_hooks.c:85

Core 1 register dump:
PC      : 0x400d2242  PS      : 0x00060234  A0      : 0x8008b330  A1      : 0x3ffc0740  
0x400d2242: esp_vApplicationIdleHook at /home/permal/esp/esp-idf/components/esp32/./freertos_hooks.c:85

A2      : 0x00000008  A3      : 0x00000001  A4      : 0x00000001  A5      : 0x3ffc0270  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x3ffb34b0  A9      : 0x3ffb3474  
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x8008a578  A13     : 0x3ffdf6b0  
A14     : 0x00000000  A15     : 0x3ffc0400  SAR     : 0x00000000  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace: 0x400d2242:0x3ffc0740 0x4008b32d:0x3ffc0760
0x400d2242: esp_vApplicationIdleHook at /home/permal/esp/esp-idf/components/esp32/./freertos_hooks.c:85

0x4008b32d: prvIdleTask at /home/permal/esp/esp-idf/components/freertos/./tasks.c:4702
panfeng-espressif commented 6 years ago

Hello, I found that sometimes there was a strange low level on SCL when SDA was prepared to start a I2C transmission. This made I2C transmission failed sometimes. Have you met such kind of condition before?

vonnieda commented 6 years ago

Edit: After more investigation I don't think this is related, but I'll leave it here just in case.

Pretty sure I am seeing this now as well. I've just updated IDF to a5eb369126e5cef31715d2755931cf933bb98bee and now am getting random watchdog crashes after a failed I2C read:

E (105377) FB_HAL_V1R (0107) fb_hal_battery_task fb_hal_adc_read_channels
ets Jun  8 2016 00:22:57

rst:0x7 (TG0WDT_SYS_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5648
ho 0 tail 12 room 4
load:0x40078000,len:0
load:0x40078000,len:13912
entry 0x40078fd8
W (91) boot: PRO CPU has been reset by WDT.
W (91) boot: WDT reset info: PRO CPU PC=0x40085652
0x40085652: spi_flash_op_block_func at /Users/jason/esp/esp-idf/components/spi_flash/./cache_utils.c:82 (discriminator 1)

W (91) boot: WDT reset info: APP CPU PC=0x40093ebf
0x40093ebf: is_free at /Users/jason/esp/esp-idf/components/heap/./multi_heap.c:339
 (inlined by) multi_heap_malloc_impl at /Users/jason/esp/esp-idf/components/heap/./multi_heap.c:370

For what it's worth, before upgrading this was perfectly stable. I was previously using:

esp-idf: 914ab075672216c15a5564d9ae54fc257b237c13 toolchain: 1.22.0-75-gbaf03c2