Open dek-RVB opened 11 months ago
Q1: From your description, do you mean that the data on the oscilloscope is correct, but the data the ESP receives is wrong? Q2: What if you disable UDP and only initialize the timer and I2C? Is it still wrong?
@mythbuster5 Answer to Q1: The data on the oscilloscope is correct, but the data reported by the esp is sometimes wrong. Answer to Q2: If the UDP server is not initialized in combination with the timer and I2C, the error will not occur.
Hi, We also currently have a similar problem. We are running an i2c sensor with a UDP server and the BLE scan running. About once every hour the application crashes because of a memory corruption of the heap. It is not possible for us to compare the values from the oscilloscope and the data that the i2c stack returns like you did (there is too much data).
When disabling the UDP server or the BLE scanning, the problem seems to occur much less often. Our guess is that the problem comes from the i2c driver (or how we use it). The UDP server and BLE scanning just use the heap a lot, which creates an environment where memory corruption is much more likely.
We have tested and reproduced the issue on v5.0 and v5.1.
Seems to be related to #7781
Some more tests have been performed, which gave new insights into the errors.
In the original code a current, voltage and temperature measurement (using I2C as described in the issue) are sequentially performed in a FreeRTOS task. This task then waits 10 ms (using vTaskDelay) and performs the 3 measurements again and again...
Each measurement consists of 2 bytes of data. A diagram is shown below. I1 is the most significant byte of the current measurement and I0 the least significant byte. The same convention is used for the voltage and temperature.
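For readers following along, the byte convention above can be sketched in plain C. This is only an illustration: the function names are hypothetical, and it assumes the two bytes form a 16-bit two's complement value (the thread only states that a set MSB in the first byte means a negative reading).

```c
#include <stdint.h>

/* Combine the most significant byte (e.g. I1) and the least significant
 * byte (e.g. I0) of one measurement into a signed 16-bit raw value.
 * Assumption: the sensor reports two's complement data. */
static int16_t combine_bytes(uint8_t msb, uint8_t lsb)
{
    uint16_t raw = (uint16_t)(((uint16_t)msb << 8) | lsb);
    /* Convert explicitly so the result does not rely on
     * implementation-defined narrowing behaviour. */
    return (raw & 0x8000u) ? (int16_t)((int32_t)raw - 65536) : (int16_t)raw;
}

/* A reading is negative exactly when bit 7 of its first byte is set. */
static int first_byte_is_negative(uint8_t msb)
{
    return (msb & 0x80u) != 0;
}
```

Because the sign lives in the first byte, replacing only that byte is enough to turn a small positive reading into a large negative spike.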
A pattern was discovered in the errors. These errors can be separated into multiple cases. Keep in mind that the I2C bus was monitored and that this bus contains no errors. All the errors happen internally in the ESP32.
The temperature errors are always the same. The first byte of the temperature, T1, is replaced by the last byte of the voltage measurement, V0, as shown below.
There are two types of current errors. The first is similar to the temperature error: the first byte of the current, I1, is replaced by the last byte of the previous measurement, T0. This is shown below.
The second error is more complicated as I have no idea where the erroneous byte comes from. The last byte of the current I0 is replaced by a value that is constant within one run of the code. Restarting the ESP32 (without rebuilding/reflashing) can change that value, but not always. The monitored values are 0x04 and 0x8D, but I have no idea what causes these bytes to appear there.
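The error cases described above could be told apart automatically when both the bus capture and the values reported by the ESP32 are available. The following is a minimal sketch, not the code used in this issue. The enum, the function name and the argument layout are all hypothetical.

```c
#include <stdint.h>

typedef enum {
    ERR_NONE,               /* reported bytes match the bus */
    ERR_FIRST_BYTE_STALE,   /* first byte equals the previous measurement's last byte */
    ERR_LAST_BYTE_REPLACED, /* first byte intact, last byte differs from the bus */
    ERR_OTHER               /* anything that fits neither pattern */
} err_kind_t;

/* Compare one measurement as seen on the bus with the value the driver
 * returned. prev_bus_lsb is the last byte of the measurement that was
 * read just before this one (V0 for a temperature read, T0 for a
 * current read in the sequence described in the thread). */
static err_kind_t classify_error(uint8_t bus_msb, uint8_t bus_lsb,
                                 uint8_t rep_msb, uint8_t rep_lsb,
                                 uint8_t prev_bus_lsb)
{
    if (rep_msb == bus_msb && rep_lsb == bus_lsb) {
        return ERR_NONE;
    }
    if (rep_lsb == bus_lsb && rep_msb == prev_bus_lsb) {
        return ERR_FIRST_BYTE_STALE;
    }
    if (rep_msb == bus_msb) {
        return ERR_LAST_BYTE_REPLACED;
    }
    return ERR_OTHER;
}
```

Note that a stale first byte that happens to equal the correct first byte is indistinguishable from a good read, so a classifier like this can only undercount errors.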
While I was monitoring the bus with an oscilloscope and logging the measured data with its timestamp using a logging script, a weird timing behavior was discovered. The FreeRTOS tickrate is set to 100 ticks per second, which gives a minimal interval of 10ms between two timestamps of measurements. An example of the normal behavior is shown below.
Whenever a temperature error occurs, the OS misses its 10 ms mark and instead logs a timestamp at the 15 ms mark. It then waits another 15 ms and then proceeds with the expected 10 ms between two logged timestamps. I would expect that whenever the OS cannot keep up with its set tick rate (due to, for example, CPU overload), it would skip a tick, resulting in a delta of 20 ms between two timestamps. An example of the error and expected behavior is shown below.
The current error again is more complex. Whenever the erroneous current is negative (MSB of I1 being 1) the OS misses its 10 ms mark and instead logs a timestamp at the 12 ms mark. It then waits 18 ms before proceeding with the expected 10 ms delta between two timestamps. Whenever the erroneous current is positive (MSB of I1 being 0) the 10 ms mark is reached and no timing problems are visible in the timestamp logging.
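One way to spot these timing anomalies in a log is to flag every interval that is not a whole multiple of the tick period. The sketch below assumes the 10 ms tick period stated above. The timestamp arrays are made-up illustrations that reproduce the deltas described in the text (10/15/15/10 and 10/12/18/10), not actual log data, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TICK_PERIOD_MS 10u /* 100 ticks per second, as stated in the thread */

/* Illustrative timestamps in milliseconds, constructed from the description. */
static const uint32_t normal_ms[]        = {0, 10, 20, 30};
static const uint32_t temp_error_ms[]    = {0, 10, 25, 40, 50}; /* deltas 10, 15, 15, 10 */
static const uint32_t current_error_ms[] = {0, 10, 22, 40, 50}; /* deltas 10, 12, 18, 10 */
static const uint32_t skipped_tick_ms[]  = {0, 10, 30, 40};     /* deltas 10, 20, 10 */

/* Count intervals between consecutive timestamps that are not a whole
 * multiple of the tick period. A skipped tick (20 ms) is not counted,
 * because it is still aligned to the tick grid. */
static size_t count_off_tick_deltas(const uint32_t *ts_ms, size_t n)
{
    size_t count = 0;
    for (size_t i = 1; i < n; i++) {
        uint32_t delta = ts_ms[i] - ts_ms[i - 1];
        if (delta % TICK_PERIOD_MS != 0) {
            count++;
        }
    }
    return count;
}
```

With this check, both error patterns show two off-grid intervals in a row, while a plain skipped tick shows none.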
I provided example code in the original issue to easily reproduce the errors. The new insights into the replaced bytes explain the error pattern of the example code. As only temperature measurements are performed, the first byte of the newly measured temperature is replaced by the last byte of the previously measured temperature. The temperature is relatively stable, which results in two identical bytes in the reported erroneous temperature measurement.
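The "two identical bytes" observation follows directly from the replacement pattern. The sketch below models only the observed symptom, not the underlying hardware cause. The function names and the raw values used in the note underneath are hypothetical.

```c
#include <stdint.h>

/* Model of the observed symptom: the first byte of the new reading is
 * replaced by the last byte of the previous reading. */
static uint16_t corrupt_first_byte(uint16_t prev_raw, uint16_t new_raw)
{
    uint8_t prev_lsb = (uint8_t)(prev_raw & 0xFFu);
    uint8_t new_lsb  = (uint8_t)(new_raw & 0xFFu);
    return (uint16_t)(((uint16_t)prev_lsb << 8) | new_lsb);
}

/* True when both bytes of a 16-bit raw reading are equal. */
static int has_identical_bytes(uint16_t raw)
{
    return (raw >> 8) == (raw & 0xFFu);
}
```

If two consecutive readings are both 0x1980, the corrupted result is 0x8080, which has identical bytes. If the temperature changed slightly between the reads (for example 0x1980 followed by 0x19A0), the result is 0x80A0 and looks random, which matches the two kinds of spikes described in the original report.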
@mythbuster5 Would you have any insights on this weird behavior?
Thanks in advance.
RVB
I can totally reproduce this issue on the Espressif devkit
We have been able to make the problematic code a lot smaller: it's only 170 lines of C now, in a single file, based on the 'i2c simple example' https://github.com/espressif/esp-idf/tree/v5.1.2/examples/peripherals/i2c/i2c_simple. We have determined that the network stack is not involved with this bug, so we've eliminated that from the code. I've attached the zip.
As for other suggestions:
When calling the “i2c_driver_install()” API to register I2C, please set the last parameter to "ESP_INTR_FLAG_IRAM" for testing.
This does not fix the problem, corruption still happens (but less often)
Increase the Timer period for testing.
Increased the timer from 1 to 10 us, corruption still happens
You can try to set the esp_timer task core affinity to CPU1 for testing.
However, with the affinity set to CPU1, the corruption does not seem to happen anymore.
We're not satisfied with this solution yet, for the following reasons:
We have also reduced the hardware needed to reproduce this bug. We are able to reproduce this on just an official ESP32-Ethernet-Kit_A_V1.2 with a PCT2075 sensor module from Adafruit (https://www.adafruit.com/product/4369). This sensor is likely also available from other vendors, in case Adafruit does not ship to your region.
We received support from Espressif. There was indeed an issue with the I2C FIFO, the following patch given by one of their employees fixes it:
From fb0c921cc6c93a755f3f39f472fc88b59d130dad Mon Sep 17 00:00:00 2001
From: Jacques_Zhao <redacted@espressif.com>
Date: Fri, 30 Aug 2024 19:23:45 +0800
Subject: [PATCH] i2c: fix i2c read error
---
components/hal/esp32/include/hal/i2c_ll.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/components/hal/esp32/include/hal/i2c_ll.h b/components/hal/esp32/include/hal/i2c_ll.h
index f2903de44a..f1aa6aacf9 100644
--- a/components/hal/esp32/include/hal/i2c_ll.h
+++ b/components/hal/esp32/include/hal/i2c_ll.h
@@ -518,6 +518,7 @@ static inline void i2c_ll_get_scl_timing(i2c_dev_t *hw, int *high_period, int *l
 __attribute__((always_inline))
 static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 0;
     uint32_t fifo_addr = (hw == &I2C0) ? 0x6001301c : 0x6002701c;
     for(int i = 0; i < len; i++) {
         WRITE_PERI_REG(fifo_addr, ptr[i]);
@@ -536,9 +537,14 @@ static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_
 __attribute__((always_inline))
 static inline void i2c_ll_read_rxfifo(i2c_dev_t *hw, uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 1;
     for(int i = 0; i < len; i++) {
-        ptr[i] = HAL_FORCE_READ_U32_REG_FIELD(hw->fifo_data, data);
+        ptr[i] = hw->ram_data[i];
     }
+    hw->fifo_conf.nonfifo_en = 0;
+
+    hw->fifo_conf.rx_fifo_rst = 1;
+    hw->fifo_conf.rx_fifo_rst = 0;
 }
 
 /**
--
2.34.1
@mythbuster5 Could you review https://github.com/espressif/esp-idf/issues/12860#issuecomment-2413975957 ?
We have tested this on 10 boards for two weeks and haven't had a single error anymore (we used to have multiple per 10 minutes). The patch above was sent by an Espressif employee, thanks for the support!
@mythbuster5 Could you review #12860 (comment) ?
Is there anything wrong with the above fix? (I'm wondering why it's still not fixed on GitHub.)
@AxelLin This was actually provided by me. I'm still running internal tests. This is really hard to debug because it is related to Wi-Fi/BT etc. Sorry for the long wait.
Answers checklist.
IDF version.
v5.1.2 (also tested on master)
Espressif SoC revision.
ESP32 (revision v3.1)
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-Ethernet-Kit-V1.2 and custom board
Power Supply used.
External 5V
What is the expected behavior?
The temperature is monitored using a PCT2075 over an I2C bus, while an auto-reload esp timer that triggers a level 3 interrupt is running. A UDP server is also set up. It is expected that the temperature is output without corrupted data.
What is the actual behavior?
The temperature that is monitored by the setup described above gives reasonable data most of the time, but randomly logs temperature spikes. These spikes seem to happen at random moments. Sometimes the corrupted data is two of the same bytes after each other and other times it looks random. No pattern is seen yet. The data on the I2C bus has been checked and does not show any of those temperature spikes. The device does not crash.
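To make the "temperature spike" concrete, here is a minimal sketch of how two raw bytes could be turned into degrees Celsius and checked against a plausibility window. It assumes the PCT2075 temperature register layout as I understand it from the datasheet (11-bit two's complement, left-justified in 16 bits, 0.125 °C per step), which should be verified against the datasheet. The function names are hypothetical. The 20 °C and 60 °C limits are the logging thresholds mentioned in the reproduction steps.

```c
#include <stdint.h>

/* Convert the two bytes of a temperature read into degrees Celsius.
 * Assumption: 11-bit two's complement value in bits 15..5,
 * 0.125 degC per least significant step. */
static float pct2075_raw_to_celsius(uint8_t msb, uint8_t lsb)
{
    uint16_t raw = (uint16_t)(((uint16_t)msb << 8) | lsb);
    raw &= 0xFFE0u; /* ignore the 5 unused low bits */
    int32_t signed_raw = (raw & 0x8000u) ? (int32_t)raw - 65536 : (int32_t)raw;
    return (float)(signed_raw / 32) * 0.125f;
}

/* Flag readings outside the window used for logging in this issue. */
static int temperature_is_suspicious(float celsius)
{
    return celsius < 20.0f || celsius > 60.0f;
}
```

Under these assumptions, the bytes 0x19 0x00 decode to 25 °C, while a corrupted pair such as 0x80 0x80 decodes to -127.5 °C and would be flagged as a spike.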
Steps to reproduce.
Connect the SDA and SCL pin of the Adafruit PCT2075 to IO2 and IO4 of the ESP32-Ethernet-Kit V1.2, respectively.
Connect the address pins of the PCT2075 to ground or 3V3 (make sure to change to the appropriate address in the code: ec_control.c --> PCT2075_I2C_ADDR).
Connect the PCT2075 to GND and 3V3.
Connect the ESP32-Ethernet-Kit V1.2 to a PoE capable device.
Make sure the interrupt level of the 'High resolution timer (esp_timer)' is set to '3' and the 'Support ISR dispatch method' checkbox is active in the sdkconfig.
Build and flash the project found in the attached files.
Open the monitor; an IP address will be assigned to the device and temperatures below 20°C and above 60°C will be logged. The measurements immediately before and after the erroneous data are logged as well.
The occurrence of errors can be significantly increased by flooding the device with ARP messages. This can be done by:
EC_controller_test.zip
Debug Logs.
More Information.
Initial Setup
Custom PCB
The custom design is a PCB containing:
Errors on the custom PCB
I2C
The custom board sporadically reported current and temperature spikes (both positive and negative) at random moments. Those spikes do not happen at the same time. The I2C bus was monitored with an oscilloscope and did not show any sign of corrupted data sent over the bus. The time between two spikes ranges from a few seconds to a couple of hours.
We discovered later that the rate of erroneous values is increased by flooding the network with ARP messages. Disabling the initialization of the UDP server removed the current and temperature spikes.
It was also discovered that disabling the esp timer callback removes the current and temperature spikes. However, enabling the callback with an empty function still gives erroneous data. Increasing the timer's frequency increases the error rate. The frequency cannot be too high, as that introduces watchdog timeouts.
The increase in timer frequency and the ARP flooding consistently reduce the time between two spikes to a couple of spikes per 10 minutes.
SPI
The SPI bus reads from the DAC are randomly converted to writes, which gives unwanted values at the output of the DAC (confirmed by monitoring the SPI bus with an oscilloscope). ARP flooding has no impact on the rate of these read-to-write conversions. However, increasing the timer's frequency increases their number. It is still unclear whether the SPI and I2C errors are related to each other.
Tests
The system has been tested for stack overflows, task sizes, memory leaks... The power supply is stable.
Also tested:
but none of the above helped to resolve the weird behavior of the system.
ESP32-Ethernet-Kit V1.2
First, the code has been reduced to its minimum, while still showing erroneous data on the custom board. Therefore only the UDP server initialization (no active task), the auto-reload timer with an empty callback and a task that reads the temperature sensor using I2C have been preserved. This reduces the errors to only temperature spikes. This code has been ported to be used on the ESP32-Ethernet-Kit V1.2 in combination with an Adafruit PCT2075.
To increase the number of errors, the timer auto-reload value has been set to 1 µs and the number of I2C reads has been increased. To be clear, the errors still occur without those changes, but they can take hours to happen.
Does anyone know what is going on with this specific combination of UDP server, auto-reload timer and I2C bus?
Thanks in advance
RVB