espressif / esp-mesh-lite

A lite version Wi-Fi Mesh, each node can access the network over the IP layer.

ESP full reset during sending when root node is lost (without any info/warning or error message) (AEGHB-370) #25

Open HWuest opened 10 months ago

HWuest commented 10 months ago

I have a simple parent node connected to a root node, sending a message at regular intervals using esp_mesh_lite_try_sending_msg((char *)"rec", (char *)"rec_ack", 1, item, &esp_mesh_lite_send_msg_to_parent);

When the root node is switched off or reset, or the WiFi connection is lost or unstable, the parent node sometimes (after some seconds) does a full reset without any info, warning or error message, instead of the normal disconnect behaviour: I (130435) wifi:bcn_timeout,ap_probe_send_start ....

Sometimes it still reports wifi:bcn_timeout,ap_probe_send_start before the reset, sometimes shortly after...

esp_reset_reason() simply returns ESP_RST_POWERON, but power is stable and the parent node is not touched at all.
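For anyone trying to capture the same symptom, a minimal sketch of logging the reset reason as the first line after boot could look like the following. esp_reset_reason() and the ESP_RST_* names come from esp_system.h; reset_reason_name() and log_reset_reason() are hypothetical helper names, and the host-only stand-in enum exists just so the snippet can be compiled off-target.

```c
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_system.h"            /* esp_reset_reason(), esp_reset_reason_t */
#else
/* Host-only stand-in so the sketch compiles off-target. The names mirror
 * esp_system.h; the numeric values here are arbitrary. */
typedef enum {
    ESP_RST_UNKNOWN, ESP_RST_POWERON, ESP_RST_SW, ESP_RST_PANIC,
    ESP_RST_INT_WDT, ESP_RST_TASK_WDT, ESP_RST_WDT, ESP_RST_BROWNOUT
} esp_reset_reason_t;
static esp_reset_reason_t esp_reset_reason(void) { return ESP_RST_POWERON; }
#endif

/* Map the reset reason to a short label for the boot log. */
static const char *reset_reason_name(esp_reset_reason_t reason)
{
    switch (reason) {
    case ESP_RST_POWERON:  return "POWERON";
    case ESP_RST_SW:       return "SW";
    case ESP_RST_PANIC:    return "PANIC";
    case ESP_RST_INT_WDT:  return "INT_WDT";
    case ESP_RST_TASK_WDT: return "TASK_WDT";
    case ESP_RST_WDT:      return "WDT";
    case ESP_RST_BROWNOUT: return "BROWNOUT";
    default:               return "OTHER";   /* reasons not listed above */
    }
}

/* Hypothetical helper: call once at the top of app_main(). */
static void log_reset_reason(void)
{
    printf("reset reason: %s\n", reset_reason_name(esp_reset_reason()));
}
```

Printing the label before any WiFi or mesh initialization keeps the information visible even when the device resets again shortly after start-up.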

When sending a message every 250 ms, it happens every time the root connection is lost. But even without actively sending anything to the mesh network I was able to produce a reset (seldom, 1 out of >20). Maybe when the connection loss occurs during internal send actions?!?

The root cause seems to be somewhere around the info message wifi:bcn_timeout,ap_probe_send_start, but I was not able to find this in the available source code to analyze the behaviour further...

(In the application this unstable behaviour leads to sudden device resets every 1 to 4 hours, with loss of stored information and without a chance to react prior to the reset.)

Available heap, stack sizes and watchdogs have been checked and look quite OK...

HWuest commented 10 months ago

I ran another test: one child node connected to a root node, with the child node sitting idle without any activity (no mesh data sending). After 2.5 hours the child node reset, with esp_rom_get_reset_reason(PRO_CPU_NUM) reporting RESET_REASON_SYS_RTC_WDT.

What could cause a sys_rtc_wdt timeout and how can it be avoided? In the configuration I didn't find any settings to adapt the sys_rtc_wdt.

Without a Mesh-Lite network connection I didn't see any resets, but I will run another test overnight...

tswen commented 10 months ago

Please provide the following information so that I can analyze it better:

HWuest commented 10 months ago

(IDF 5.0.1, ESP32S2 Mini, different USB power sources)

I analyzed the error with different logging and configuration settings. The processor stops working without any log info (even with verbose logging).

Sometimes it already stops during initialization, so to be sure that there is no power source problem I lowered the brown-out detection voltage to 2.67 V and got problems starting the device (resets from the brown-out detector?!?).

So the error behaviour seems to be related to a combination of WiFi peak current, USB load and USB cable type.

I changed the WiFi power to 10 dBm and could not reproduce the error for a while. Going back to 20 dBm, the problem reoccurs.
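The post does not say how the power was changed. Assuming it was done with esp_wifi_set_max_tx_power(), note that this call takes the power in steps of 0.25 dBm and accepts values from 8 to 84, so 10 dBm corresponds to 40 and 20 dBm to 80. A minimal sketch of the conversion follows; tx_power_dbm_to_units() and limit_tx_power_dbm() are hypothetical helper names.

```c
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_wifi.h"              /* esp_wifi_set_max_tx_power() */
#endif

/* Convert whole dBm to the 0.25 dBm units expected by
 * esp_wifi_set_max_tx_power(), clamped to the accepted range 8..84. */
static int8_t tx_power_dbm_to_units(int dbm)
{
    int units = dbm * 4;
    if (units < 8) {
        units = 8;
    }
    if (units > 84) {
        units = 84;
    }
    return (int8_t)units;
}

#ifdef ESP_PLATFORM
/* Hypothetical wrapper: call only after esp_wifi_start() has succeeded. */
static esp_err_t limit_tx_power_dbm(int dbm)
{
    return esp_wifi_set_max_tx_power(tx_power_dbm_to_units(dbm));
}
#endif
```

On the target, limit_tx_power_dbm(10) would request the reduced setting described above and limit_tx_power_dbm(20) the full-power one.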

Adding a 100 nF capacitor to 3.3 V and 5 V doesn't help much for stable 20 dBm performance. Stability depends highly on the USB port setup used (one port on the PC seems to be worse).

The new ESP32S2 Mini boards from China seem to have more problems here (I didn't face that problem before)...

Long-term test still running...

Hopefully these findings help others facing similar unstable behaviour ;-)

HWuest commented 10 months ago

After a while the error came up again, so the root cause is not found... (Reducing the processor clock to 80 MHz and switching PSRAM off to reduce the chance of HW quality problems had no effect. I had one board which seems to have an additional HW problem.)

I can't analyze it further; no logging is possible, the USB COM port is closed immediately...

To reproduce the problem:

  1. Build a Mesh-Lite network with a parent (connected to the router) and one child; the child sends a simple message to the parent every 200 ms with esp_mesh_lite_try_sending_msg((char *)"rec", (char *)"rec_ack", 1, item, &esp_mesh_lite_send_msg_to_parent);
  2. Power off the root node
  3. During the connection-loss recognition process (up to ~5 seconds, message sending continues), the child sometimes closes the USB connection and reboots without any message...

Normal log on correct power-off / connection-loss detection:

....
W (18486) DHS: boot 228 <-- my message
W (18686) DHS: boot 228 <-- my message
I (18686) wifi:bcn_timeout,ap_probe_send_start
W (18886) DHS: boot 228 <-- my message
I (21196) wifi:state: run -> init (c800)
I (21196) wifi:pm stop, total sleep time: lu us / lu us

I (21206) wifi:idx
I (21206) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,2>, prof:1

Reset behaviour on root power-off:

....
W (18137) DHS: boot 229 <-- my message
W (18337) DHS: boot 229 <-- my message
ClearCommError failed (PermissionError(13, 'Das Gerät erkennt den Befehl nicht.', None, 22))
Waiting for the device to reconnect..
E (4447) DHS: Start V0.2... (234 / 1) <-- boot count increased and reset reason is power-on!

tswen commented 10 months ago

Do you have any other S2 modules or chips for testing, or does this issue only occur on a specific S2 module? From the current perspective, it seems that this issue is not related to Mesh-Lite because many other customers are using the Mesh-Lite solution without encountering this problem. It appears that this issue is more likely to be caused by hardware problems.

HWuest commented 10 months ago

Yes, I tried with 4 different modules; all show the same behaviour. I have an older one which I have used for a long time without problems; I will try again with this one...

To me it also looks as if it is related to a HW or configuration problem, therefore I ran a lot of tests reducing frequencies, disabling PSRAM, etc. without any effect. I will also try a simple WiFi server project to check if the problem is a general WiFi HW problem of the modules, but I need some time for it. I will come back later with results...

HWuest commented 9 months ago

After trying different S2 modules, which resulted in slightly different but mostly unstable behaviour, it seems to be a HW problem with the voltage level of the on-board 3.3 V voltage regulators.

The measured voltage on many of my new boards is below 3 V and drops further during WiFi startup to about 2.8 V. The older boards, which work without problems, have a voltage above 3.2 V.

I added a diode in the ground connection of the voltage regulator to raise the output voltage to about 3.5 V (tiny SMD soldering, so currently only done on one board) on one board which was worse (having massive boot and early startup problems). It has now been working stably for some time.

I will modify other boards and test them for a longer time...

It seems that my cheap Wemos S2 Mini boards bought from China have this issue, and also a bunch of extra voltage regulators I bought for repair/replacement.

HWuest commented 7 months ago

After some time I could now continue testing. The problem still exists, and I found out that it is related to the ADC reading.

Whenever I use adc_oneshot_read the ESP32S2 crashes after some minutes/hours of normal work.

The problem seems to be related to the ADC_CALIBRATION_V1 part in file adc_oneshot.c, function adc_oneshot_read:

#if SOC_ADC_CALIBRATION_V1_SUPPORTED
adc_atten_t atten = adc_ll_get_atten(handle->unit_id, chan);
adc_hal_calibration_init(handle->unit_id);
adc_set_hw_calibration_code(handle->unit_id, atten);
#endif

If I comment this part out (changing the define would have the same effect) the problem disappears so far (I'm not 100% sure, because it sometimes takes a long time before the error occurs).

Could this be related to Mesh-Lite, or is it a more general problem of the ADC functionality? (I haven't seen this without running Mesh-Lite so far and found no information concerning such ADC problems on the internet.)

I will update my project accordingly and do a long-term test over the next days....

tswen commented 7 months ago

"Whenever I use adc_oneshot_read the ESP32S2 crashes after some minutes/hours of normal work."

Is there a corresponding crash log?

HWuest commented 7 months ago

No, the processor simply stops working without any log info (even with verbose logging), resets and restarts the program.

Without SOC_ADC_CALIBRATION_V1_SUPPORTED it has been running stably for 2 days now...

HWuest commented 7 months ago

Status update: after 3 days a reset occurred again (it seems to be rarer).

The ADC functions use an I2C bus; could there be some interference with Mesh-Lite?

If ADC_CALIBRATION is not active there is less I2C communication during the ADC read, which might lead to the more stable behaviour...

tswen commented 7 months ago

What is the reason for reset? Can you provide the context log file where the problem occurred?

HWuest commented 7 months ago

What log file do you mean? As I stated, the processor restarts without any log messages, even when I set the log level to verbose and enable various debug logging options in sdkconfig. esp_reset_reason() gives ESP_RST_POWERON as the reason (but power was not interrupted).

Even with the brown-out detector at 2.98 V, no brown-out reset is given as the reason (at 3.3 V/3.19 V brown-out resets occur, which fits the supply voltage of the voltage regulator)...

Sometimes the USB CDC logging stops some time before the reset (so something gets corrupted internally?!?).

I know that it is difficult to help without any logging information. I will run a long-term test without any ADC read to be sure that this and nothing else causes the reset behaviour...

HWuest commented 7 months ago

Without the ADC read, the application (including normal Mesh-Lite network communication) has now been running stably for about 2 days without any reset. The same application with the ADC read but without sending the data to the Mesh-Lite network also works stably. When I activate both together (ADC read and data sending to the Mesh-Lite network), instability occurs. The time between resets is random (minutes to several hours) but seems to depend on the frequency of the ADC read operation or the Mesh-Lite communication. Without SOC_ADC_CALIBRATION_V1_SUPPORTED the time between resets is much longer, but resets still occur (without SOC_ADC_CALIBRATION_V1_SUPPORTED there is less I2C communication involved in the ADC read). Doing the ADC read with interrupts disabled did not change the situation.

Digging deeper into the adc_oneshot_hal_convert function, I found the following section in adc_hal_onetime_start:

#if SOC_ADC_DIG_CTRL_SUPPORTED && !SOC_ADC_RTC_CTRL_SUPPORTED
(void)unit;
/**
 * There is a hardware limitation. If the APB clock frequency is high, the step of this reg signal: ``onetime_start`` may not be captured by the
 * ADC digital controller (when its clock frequency is too slow). A rough estimate for this step should be at least 3 ADC digital controller
 * clock cycle.
 *
 * This limitation will be removed in hardware future versions.
 *
 */
uint32_t digi_clk = APB_CLK_FREQ / (ADC_LL_CLKM_DIV_NUM_DEFAULT + ADC_LL_CLKM_DIV_A_DEFAULT / ADC_LL_CLKM_DIV_B_DEFAULT + 1);
//Convert frequency to time (us). Since decimals are removed by this division operation. Add 1 here in case of the fact that delay is not enough.
uint32_t delay = (1000 * 1000) / digi_clk + 1;
//3 ADC digital controller clock cycle
delay = delay * 3;
//This coefficient (8) is got from test. When digi_clk is not smaller than ``APB_CLK_FREQ/8``, no delay is needed.
if (digi_clk >= APB_CLK_FREQ/8) {
    delay = 0;
}

adc_oneshot_ll_start(false);
esp_rom_delay_us(delay);
adc_oneshot_ll_start(true);

//No need to delay here. Because if the start signal is not seen, there won't be a done intr.
#else
// adc_oneshot_ll_start(unit);
#endif

On my ESP32S2 both SOC_ADC_DIG_CTRL_SUPPORTED and SOC_ADC_RTC_CTRL_SUPPORTED are 1, so the else branch is the one that gets compiled!
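Why the delayed path is skipped with these settings can be checked in isolation. The following is a minimal sketch, not IDF code: the two DEMO_ macros are hypothetical stand-ins for the capability macros in soc_caps.h, set to 1 as reported in this post, and uses_delayed_start() is an illustrative helper.

```c
/* Hypothetical stand-ins for the two capability macros from soc_caps.h;
 * both are set to 1 to mirror what the post reports for the ESP32S2. */
#define DEMO_ADC_DIG_CTRL_SUPPORTED 1
#define DEMO_ADC_RTC_CTRL_SUPPORTED 1

/* Returns 1 if the branch with the delayed start would be compiled,
 * 0 if the plain single-call start would be compiled. */
static int uses_delayed_start(void)
{
#if DEMO_ADC_DIG_CTRL_SUPPORTED && !DEMO_ADC_RTC_CTRL_SUPPORTED
    return 1;   /* two-step start with esp_rom_delay_us() in between */
#else
    return 0;   /* single adc_oneshot_ll_start(unit) call */
#endif
}
```

With both macros at 1 the condition evaluates to 1 && !1, which is false, so the helper returns 0; setting DEMO_ADC_RTC_CTRL_SUPPORTED to 0 would select the delayed branch instead.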

Changing the code to use the version with the additional delay seems to fix the problem (long-term test still running)!!! (Setting SOC_ADC_RTC_CTRL_SUPPORTED = 0 in soc_caps.h leads to compile errors in other code parts, so I made the change directly in adc_hal_onetime_start(...) in adc_oneshot_hal.c.)
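To get a feel for how long that additional delay is, the arithmetic from the quoted section can be reproduced as a plain function. This is a sketch only: onetime_start_delay_us() is a hypothetical name, and the inputs used in the note below (80 MHz APB clock, dividers 15 / 0 / 1) are assumed illustrative values that have not been checked against the ESP32S2 headers.

```c
#include <stdint.h>

/* Same integer arithmetic as the quoted section, with the clock and divider
 * constants passed in as parameters instead of being taken from macros. */
static uint32_t onetime_start_delay_us(uint32_t apb_hz, uint32_t div_num,
                                       uint32_t div_a, uint32_t div_b)
{
    /* ADC digital controller clock derived from the APB clock */
    uint32_t digi_clk = apb_hz / (div_num + div_a / div_b + 1);

    /* One controller clock period in microseconds, rounded up by adding 1 */
    uint32_t delay = (1000 * 1000) / digi_clk + 1;

    /* Wait for at least three controller clock cycles */
    delay = delay * 3;

    /* No delay is needed when the controller clock is at least APB / 8 */
    if (digi_clk >= apb_hz / 8) {
        delay = 0;
    }
    return delay;
}
```

With the assumed inputs, digi_clk is 80 MHz / 16 = 5 MHz, the integer division 1000000 / 5000000 gives 0, adding 1 gives 1, and multiplying by 3 gives a delay of 3 µs. With a divider of 3 instead of 15, digi_clk is 20 MHz, which is above APB / 8 = 10 MHz, so the delay becomes 0.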

I couldn't find additional information on this HW limitation and why it applies to my ESP32S2 chips; do you have any sources of additional information on this???

The link / reason for the interference with Mesh-Lite is still not clear to me; any suggestions?

I will report whether the long-term test now runs without errors, so that others can make use of this solution...

tswen commented 7 months ago

Hello, which module or chip are you using? You can perform tests using multiple devices simultaneously. Set different maximum transmission power (esp_wifi_set_max_tx_power) for each device's firmware and observe if there are any effects after reducing the transmission power. Additionally, are there any other interferences with the Wi-Fi antenna? If possible, please take a photo of the location where the Wi-Fi module is in the test product and send it to tiansenwen@espressif.com.

HWuest commented 7 months ago

The module is a Wemos ESP32S2 Mini.

The transmission power setting makes no difference; I tried that already.

I use a bare module lying on my desk, without interference on the Wi-Fi antenna and without any additional components/electronics connected.

After the above changes in the ADC read I'm back to the situation from the start of this issue: within 3 days I got 3 resets (often overnight). So this was no solution.

I've set up a completely new and clean IDF project (no changes in sdkconfig besides an increased TIMER_TASK_STACK_DEPTH of 4096) running a simple Mesh-Lite network that sends a message (content: boot count and boot reason) every 5 seconds to the parent (root) node using esp_mesh_lite_try_sending_msg, and I still get a reset around every 24 hours.

I ran a test with an increased communication rate of one message every 100 ms and without any USB log messages, to see if the time between resets changes with a higher communication rate. It started directly with some resets within the first minutes, but has now been running for more than 10 minutes without one (behaviour is extremely variable). I tried blocking the communication and releasing it (shielding the antenna of the module) to simulate communication interruptions; no effect...

I will wait some more time and try to find out which situation causes the behaviour (without a way to force the error regularly I think we have no chance of finding the cause).