espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.91k stars 7.33k forks source link

ESP-IDF esp_modem.c data corruption (IDFGH-5435) #7176

Closed zvasicek56 closed 3 years ago

zvasicek56 commented 3 years ago

Environment

Problem Description

There is a bug in the recent version of examples\protocols\pppos_client\components\modem\src\esp_modem.c introduced probably in 55d41. The data from the UART's RX buffer are consumed in two events: UART_DATA and UART_PATTERN_DET. This causes duplicate events which manifests in data corruption. Data received in UART_DATA event handler can be overwritten in UART_PATTERN_DET handler under some circumstances. The issue is that both handlers wrongly supposes the esp_dte->buffer is empty despite of the fact that it can contain data written in a previous event.

Debug Log

This issue is hard to debug, so please see the isolated sequence of events (bold), states and actions causing the data corruption:

image

Data (two characters 'OK' added in the fourth event) are overwritten in the last event.

Solution

It is not clear why the data are processed in UART_DATA handler in command mode.

  1. Is it safe to modify the handler as follows?
    
    static void esp_handle_uart_data(esp_modem_dte_t *esp_dte)
    {
    if (esp_dte->parent.dce->mode != MODEM_PPP_MODE) return;
    ...
    }


2. Should we rewrite both handlers, i.e. esp_handle_uart_data and esp_handle_uart_pattern, and read data from UART RX buffer as `uart_read_bytes(...,esp_dte->buffer + strlen(esp_dte->buffer),...)` instead of `uart_read_bytes(...,esp_dte->buffer,...)`?
david-cermak commented 3 years ago

Hi @zvasicek56

How often do you see this happening, does it only happen in the COMMAND_MODE? I've quickly checked the commit you referenced and it works as expected. Getting only UART_PATTERN_DET events in the command mode, since the rx events (UART_DATA events) should be disabled:

https://github.com/espressif/esp-idf/blob/1d7068e4be/examples/protocols/pppos_client/components/modem/src/esp_modem.c#L410-410

It might be possible, however, that we could get these two events in the transition mode, but again in the transition DATA_MODE -> COMMAND_MODE, the pattern detection is disabled, so we should get only data events. Perhaps if some older event was triggered before the transition and processed after it, but doesn't look like this from your log, as you're getting multiple different events.

zvasicek56 commented 3 years ago

Hi @david-cermak

thank you for your response. The issue is observed in the COMMAND_MODE. According to your description I traced the root cause of that behavior. It is caused by the function uart_flush which is called on line 242. This part of the code is activated if the modem is powering up and emits RDY lines. But this is not the only place where uart_flush or uart_flush_input is called in the esp_modem.c code so there may be different situations when the observed issue happens.

According to the source codes, the function uart_flush is an alias for uart_flush_input. The implementation of uart_flush_input is probably buggy as its execution causes side effects that are not mentioned in the documentation. The function internally calls uart_enable_rx_intr independently on the previous state of the interrupt mask.

It looks that also uart_read_bytes called within esp_handle_uart_pattern can activate rx interrupt as a side effect of uart_check_buf_full call.

david-cermak commented 3 years ago

@zvasicek56 Thanks for the investigation and sharing more details. I still wasn't able to reproduce the issue, but I think I understand the potential cause of the trouble.

It might be possible, that the pattern detection in the uart driver generates more events before the modem thread gets to receive them and thus reading data behind the currently processing event, causing the potential corruption you described above. If he assumption is correct, then this WIP fix below should help: PPPoS-Client-Fix-race-pattern-event-race-condition.patch.txt

Could you please apply the patch and see if this fixes the issue?

zvasicek56 commented 3 years ago

@david-cermak, the patch actually does not address the root of the problem. Data loss still persists and moreover, the patch leads to new errors: a) partial line is given to handle_line, b) "uart read bytes failed" error occurs. This is because the value of event.size is invalid at the time of the execution of the corresponding event handler.

The issue is that both UART and PATTERN interrupts are enabled at the same time as a consequence of the uart_flush call. This causes duplicate events that interfere together.

Please have look at the latest log after patch. The first issue is here: Event I (18244) UART: [UART_PATTERN_DET DATA] ... should be ignored, there is no 0a character in the buffer. This character was already processed in the previous handler, i.e. I (18114) UART: [UART DATA] ... . The previous implementation (before patching) could handle that because the handlers ignored the value of event.size.

Could you please comment the first solution proposed in the initial submission. Is it safe to just ignore all the UART DATA events if the modem driver is in command mode? Anyway, UART driver needs to be patched to fix uart_flush which enables interrupt even if it was disabled before.

E (17004) bg96: bg96_init(185): sync failed D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17604) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17614) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17614) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17624) esp-modem: Ignore UART event 7 for DTE with no DCE attached D (17634) esp-modem: Ignore UART event 7 for DTE with no DCE attached I (18004) pppos_example: Init modem I (18004) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0 D (18014) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18024) UART: [UART_PATTERN_DET DATA]: event.size: 4 ; modem mode: 0 D (18044) esp-modem: handle_line : 0x3ffba458 4f 4b 0d 0a |OK..| D (18054) dce_service: sync ok

I (18054) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0 D (18074) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18084) UART: [UART_PATTERN_DET DATA]: event.size: 4 ; modem mode: 0 D (18104) esp-modem: handle_line : 0x3ffba458 4f 4b 0d 0a |OK..| D (18114) dce_service: disable echo ok

I (18114) UART: [UART DATA]: event.size: 1 ; modem mode: 0 V (18124) esp-modem:: esp_handle_uart_data, length :1 V (18134) esp-modem: debug_data: uart_pattern_get_pos :-1 V (18224) esp-modem: debug_data: Continuous read in non-data mode : length: 2 char: 0xa D (18234) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18244) UART: [UART_PATTERN_DET DATA]: event.size: 1 ; modem mode: 0 D (18244) esp-modem esp_handle_uart_pattern: debug_data: 0x3ffba458 30 |0| D (18254) esp-modem: handle_line : 0x3ffba458 30 |0|

I (18264) UART: [UART DATA]: event.size: 15 ; modem mode: 0 V (18274) esp-modem:: esp_handle_uart_data, length :14 V (18294) esp-modem: debug_data: uart_pattern_get_pos :-1 V (18334) esp-modem: debug_data: Continuous read in non-data mode : length: 15 char: 0xa D (18344) esp-modem: handle_line : 0x3ffba458 47 20 44 75 6d 6d 79 20 4d 6f 64 65 6c 0d 0a |G Dummy Model..|

I (18354) UART: [UART_PATTERN_DET DATA]: event.size: 1 ; modem mode: 0 D (18374) esp-modem: handle_line : 0x3ffba458 4f |O|

I (18384) UART: [UART DATA]: event.size: 2 ; modem mode: 0 V (18384) esp-modem:: esp_handle_uart_data, length :1 V (18404) esp-modem: debug_data: uart_pattern_get_pos :-1 V (18444) esp-modem: debug_data: Continuous read in non-data mode : length: 2 char: 0xd V (18444) esp-modem: debug_data: Continuous read in non-data mode : length: 3 char: 0xa D (18454) esp-modem: handle_line : 0x3ffba458 4b 0d 0a |K..|

I (18464) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0 E (18574) esp-modem: uart read bytes failed E (18614) esp-modem: esp_modem_dte_send_cmd(322): process command timeout E (18614) dce_service: esp_modem_dce_get_module_name(311): send command failed E (18614) bg96: bg96_init(189): get module name failed I (19624) pppos_example: Init modem

david-cermak commented 3 years ago

@zvasicek56 Thanks for testing the patch. It looks like my assumption was wrong. In this case, I'm afraid, I'm running out of ideas. I've been testing this exact scenario quite extensively, simulating both events at the same time, but no luck in understanding what was happening in your case. Even if I made both interrupts enabled and both events fired, the data got processed correctly. Are you running only one client or multiple? I this happening after startup, or have you been switching between modes?

Could you please comment the first solution proposed in the initial submission.

I don't like the idea of skipping the data processing based on the operational mode. Both handlers must work in both modes, as they seem to work in my case. Another reason is support for the transition mode, where we could receive both data and commands at once. Have seen issues with processing commands (already in the command mode), just after switching from PPP mode. therefore the data event has to work correctly in a plain command mode.

Have you tried disabling the pattern detection interrupts altogether? Could you please check once? This should work the same way.

david-cermak commented 3 years ago

Hi @zvasicek56

I wasn't able to reproduce the issue, but I believe there is a potential problem of these two events competing and causing data races.

I was thinking of processing the pattern detection manually for some time already (I've done so in the esp_modem component to easily switch the underlying DTE: uart -> usb...), but haven't really gotten to use only the UART_DATA event, for both PPP data and AT commands. Here a WIP patch removing the uart pattern detection. Could you please check Examples-pppos_client-Remove-uart-pattern-detection.patch.txt if this solves the problems you report? Thanks.

Alvin1Zhang commented 3 years ago

Thanks for reporting, would you please help share if any updates for the issue? Thanks.

zvasicek56 commented 3 years ago

Thank you very much @david-cermak. I can confirm that your patch, which only uses the uart_data event, works correctly, at least for SIM7600. The issue with two competing events has been partly solved by fixing uart_flush_input, but your solution is safer.

david-cermak commented 3 years ago

@zvasicek56 Thanks for the feedback! I also think this is a good change. I will clean the patch up and get it merged. This issue is to be closed once merged and synced.

AxelLin commented 3 years ago

@david-cermak

Just curious why do you use a union keep the "line_buffer_size" since it's no longer used? https://github.com/espressif/esp-idf/blob/master/examples/protocols/pppos_client/components/modem/include/esp_modem.h#L62-L65

david-cermak commented 3 years ago

@AxelLin

why do you use a union keep the "line_buffer_size" since it's no longer used?

Not used, but still a kind of public API, so wanted to keep it compatible so that both the new option and the old one is accepted.

AxelLin commented 3 years ago

@david-cermak

I think I hit a bug with the change in a90817c.

In esp_handle_uart_data() esp_dte_handle_line()

I check the uart data (It's actually multiple lines): I got below, the "+COPS: 0" is unexpected data so esp_dte_handle_line() return error. +COPS: 0

OK

However, it then stop working.

Remove the MODEM_CHECK from this line, then it works. https://github.com/espressif/esp-idf/blob/master/examples/protocols/pppos_client/components/modem/src/esp_modem.c#L124

Looks like the MODEM_CHECK causes the second line "OK" never received.

david-cermak commented 3 years ago

Thanks for the early feedback, I'll look into it and fix it (I've tested this multiline scenario, expecting that the line handlers return ESP_OK to indicate also read more, but ESP_FAIL to stop parsing immediately. Probably missed the fact that this unexpected error is reported but reading/parsing might still continue)

xdevelnet commented 1 year ago

Hey! I'm sorry to rise up this issue, but can you tell me in which version of esp-idf this issue has been resolved?

I suppose it would be enough to use

platform = https://github.com/platformio/platform-espressif32.git

right?

Thank you!