espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.19k stars 7.17k forks source link

i2c_master: hangs when SCL is randomly pulled low during communication (IDFGH-12655) #13647

Open KJ7LNW opened 4 months ago

KJ7LNW commented 4 months ago

Hi @mythbuster5,

Thank you for your recent commits that help with I2C stability. At this point the only way that I can lock up/crash the chip (in a way that it cannot do error handling) is by randomly grounding the SCL line by hand with a pushbutton.

Eventually I can get it to hang with this WDT warning/trace:

E (60864) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (60864) task_wdt:  - IDLE (CPU 0)
E (60864) task_wdt: Tasks currently running:
E (60864) task_wdt: CPU 0: i2c_task
E (60864) task_wdt: Print CPU 0 (current core) backtrace
Print CPU 0 (current core) registers
Core  0 register dump:
MEPC    : 0x4200fb30  RA      : 0x4200fb5a  SP      : 0x40812800  GP      : 0x4080b8a4  
0x4200fb30: i2c_ll_master_clr_bus at /home/ewheeler/src/esp32/esp-idf/components/hal/esp32c6/include/hal/i2c_ll.h:767 (discriminator 1)
 (inlined by) s_i2c_master_clear_bus at /home/ewheeler/src/esp32/esp-idf/components/esp_driver_i2c/i2c_master.c:77 (discriminator 1)
0x4200fb5a: s_i2c_hw_fsm_reset at /home/ewheeler/src/esp32/esp-idf/components/esp_driver_i2c/i2c_master.c:115

TP      : 0x40812910  T0      : 0x40028192  T1      : 0x80000000  T2      : 0x00000000  
0x40028192: __retarget_lock_release_recursive in ROM

S0/FP   : 0x40812a6c  S1      : 0x00000103  A0      : 0x40811788  A1      : 0x40812428  
A2      : 0x00000000  A3      : 0x00000800  A4      : 0x60004000  A5      : 0x00000013  
A6      : 0x00000001  A7      : 0x0000000a  S2      : 0x00000006  S3      : 0x40812830  
S4      : 0x408128b7  S5      : 0x00000001  S6      : 0x00000000  S7      : 0x00000000  
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000  
T3      : 0x00000000  T4      : 0x4080d808  T5      : 0x00001800  T6      : 0x00000000  
MSTATUS : 0x4201edbc  MTVEC   : 0x420229b8  MCAUSE  : 0x00000366  MTVAL   : 0x42010c8a  
0x42010c8a: s_i2c_synchronous_transaction at /home/ewheeler/src/esp32/esp-idf/components/esp_driver_i2c/i2c_master.c:877

Which causes it to spin here:

https://github.com/espressif/esp-idf/blob/636ff35b52f10e1a804a3760a5bd94e68f4b1b71/components/hal/esp32c6/include/hal/i2c_ll.h#L767

Note that pushing the reset button always "fixes" it and communication resumes after the reset.

Question

Based on the comment at the top of the function, perhaps nothing else that can be done to prevent this... but before I acquiesce to this limitation, I have a question:

Is there a way to know if the 9x SCL pulses been sent by the hardware have completed, so that it can stop spinning and return control to the program, even if the bus really is locked up?

For example, can we just read the GPIO line level and return from the spinning while loop if the line is high?

Here is the entire function for your reference:

https://github.com/espressif/esp-idf/blob/636ff35b52f10e1a804a3760a5bd94e68f4b1b71/components/hal/esp32c6/include/hal/i2c_ll.h#L749-L769

KJ7LNW commented 4 months ago

As a proof of concept, this is one way to prevent the hang, by waiting only while the SCL line is low.

To prevent the spinning on hw->scl_sp_conf.scl_rst_slv_en I had to modify a chip specific header (esp32c6 in my case) and move while (hw->scl_sp_conf.scl_rst_slv_en) outside the low level function. Thus, adding something like a nowait parameter to i2c_ll_master_clr_bus() across chip implementations could be a solution.

With the diff below I can no longer hang the bus. You know the code best, so maybe there is an even better way to solve this:

diff --git a/components/esp_driver_i2c/i2c_master.c b/components/esp_driver_i2c/i2c_master.c
index 4ee7c2b409..ef56bb01c5 100644
--- a/components/esp_driver_i2c/i2c_master.c
+++ b/components/esp_driver_i2c/i2c_master.c
@@ -75,6 +75,15 @@ static esp_err_t s_i2c_master_clear_bus(i2c_bus_handle_t handle)
 #else
     i2c_hal_context_t *hal = &handle->hal;
     i2c_ll_master_clr_bus(hal->dev, I2C_LL_RESET_SLV_SCL_PULSE_NUM_DEFAULT);
+
+    // It would be a good idea to add timout logic here and return an error,
+    // so the program can proceed without hanging if SCL is stuck low:
+    i2c_dev_t *hw = hal->dev;
+    while (hw->scl_sp_conf.scl_rst_slv_en && !gpio_get_level(handle->scl_num));
+    hw->ctr.conf_upgate = 1;
+
+    // Log if it would have hung, this is only for debug:
+    if (hw->scl_sp_conf.scl_rst_slv_en)
+        ESP_LOGE(TAG, "********************* I2C transaction would have hung!");
+
 #endif
     return ESP_OK;
 }
diff --git a/components/hal/esp32c6/include/hal/i2c_ll.h b/components/hal/esp32c6/include/hal/i2c_ll.h
index 5d0ce895d3..340c4f8496 100644
--- a/components/hal/esp32c6/include/hal/i2c_ll.h
+++ b/components/hal/esp32c6/include/hal/i2c_ll.h
@@ -764,8 +764,8 @@ static inline void i2c_ll_master_clr_bus(i2c_dev_t *hw, uint32_t slave_pulses)
     hw->ctr.conf_upgate = 1;
     // hardware will clear scl_rst_slv_en after sending SCL pulses,
     // and we should set conf_upgate bit to synchronize register value.
-    while (hw->scl_sp_conf.scl_rst_slv_en);
-    hw->ctr.conf_upgate = 1;
 }

 /**
mythbuster5 commented 4 months ago

This is a good question. Yes. Usually I2C fsm got something wrong, we reset the bus, that is send 9 SCL pulses. However, during sending SCL pulses, there is another disturbance work on I2C bus, I2C hardware will completely confused.

So, in this part, I will have a talk with our hardware design colleagues to see if there is any better solution.

AxelLin commented 3 months ago

Seems I also hit the issue with v5.2.1-443-g2ba5320112 and v5.3-dev-3675-ge486f3b944.

E (8324) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (8324) task_wdt:  - IDLE (CPU 0)
E (8324) task_wdt: Tasks currently running:
E (8324) task_wdt: CPU 0: main
E (8324) task_wdt: Aborting.

Stack dump detected
Core  0 register dump:
MEPC    : 0x4204d196  RA      : 0x4204e4c0  SP      : 0x4082ce80  GP      : 0x4081a974
0x4204d196: i2c_ll_master_clr_bus at /home/axel/esp/esp-idf/components/hal/esp32c6/include/hal/i2c_ll.h:732 (discriminator 1)
 (inlined by) i2c_master_clear_bus at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:635 (discriminator 1)
 (inlined by) i2c_hw_fsm_reset at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:667 (discriminator 1)
0x4204e4c0: i2c_master_cmd_begin at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:1549

TP      : 0x407e9bf0  T0      : 0x00000000  T1      : 0x40826000  T2      : 0xffffffff
S0/FP   : 0x40836dd8  S1      : 0x00000000  A0      : 0x00000000  A1      : 0x00000000
A2      : 0x4081e464  A3      : 0x00000800  A4      : 0x00000013  A5      : 0x60004000
A6      : 0x00000004  A7      : 0x00000014  S2      : 0x00000001  S3      : 0x000003e8
S4      : 0x000003e8  S5      : 0x00000622  S6      : 0x00000a0a  S7      : 0x00000007
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
T3      : 0x40826000  T4      : 0x00000001  T5      : 0x40826000  T6      : 0x00000001
MSTATUS : 0x00000000  MTVEC   : 0x40836cac  MCAUSE  : 0x00000000  MTVAL   : 0x00000000
MHARTID : 0x00000006  

Backtrace:

0x4204d196 in i2c_ll_master_clr_bus (slave_pulses=9, hw=0x60004000) at /home/axel/esp/esp-idf/components/hal/esp32c6/include/hal/i2c_ll.h:732
732         while (hw->scl_sp_conf.scl_rst_slv_en);
#0  0x4204d196 in i2c_ll_master_clr_bus (slave_pulses=9, hw=0x60004000) at /home/axel/esp/esp-idf/components/hal/esp32c6/include/hal/i2c_ll.h:732
#1  i2c_master_clear_bus (i2c_num=I2C_NUM_0) at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:635
#2  i2c_hw_fsm_reset (i2c_num=i2c_num@entry=I2C_NUM_0) at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:667
#3  0x4204e4c0 in i2c_master_cmd_begin (i2c_num=I2C_NUM_0, cmd_handle=cmd_handle@entry=0x40836fc0, ticks_to_wait=ticks_to_wait@entry=1000) at /home/axel/esp/esp-idf/components/driver/i2c/i2c.c:1548
#4  0x4201c246 in i2c_dev_probe (dev=dev@entry=0x40836cac, operation_type=operation_type@entry=I2C_DEV_WRITE) at /home/axel/esp/esp-idf-dev/esp-idf-lib/components/i2cdev/i2cdev.c:264
eriksl commented 2 months ago

This is a good question. Yes. Usually I2C fsm got something wrong, we reset the bus, that is send 9 SCL pulses. However, during sending SCL pulses, there is another disturbance work on I2C bus, I2C hardware will completely confused.

So, in this part, I will have a talk with our hardware design colleagues to see if there is any better solution.

Just my $0.02. I have been working with I2C for 30 years now, have also written an extensive driver for ESP8266. The recommended (by Philips/NXP) way to resolve a "stuck" I2C bus (SDA), is to cycle the SCL line at least 8 times (and leave SDA floating high). This will make most state machines in the slaves reset to their initial state. And stop outputting to SDA (data + ack). If a slave hogs SCL (clock stretching) you're on your own. There is no way whatsoever to resolve a bus with a stuck SCL.

So I guess Espressif is doing the right thing here.

KJ7LNW commented 2 months ago

So I guess Espressif is doing the right thing here.

That may be true in terms of how the bus reset is performed, but it is the current error handling implementation which is problematic:

This issue is that when the bus enters a stuck state, the driver hangs in an interrupt handler and crashes the software via WDT.

The correct software fix would return an error (or a timeout) from s_i2c_master_clear_bus() instead of crashing the entire device.

eriksl commented 2 months ago

Ah yes ok, clear!