Modbus unexpected error flow (IDFGH-3829)

simon-thiebaut commented 4 years ago

Environment

Development Kit: NA
Kit version NA
Module or chip used: ESP32-WROOM-32D
IDF version (run git describe --tags to find it): v4.1-rc
Build System: CMake
Compiler version xtensa-esp32-elf-gcc (crosstool-NG esp-2020r2) 8.2.0
Operating System: Windows
(Windows only) environment type: MSYS2 mingw32
Using an IDE?: Yes (eclipse)
Power Supply: NA

Problem Description

After upgrade to tag v4.1-rc, Modbus communication sporadically encounters errors that modifies SW flow. Although this is not confirmed yet, this new error seems to cause desynchronization among exchanged Modbus frames. Indeed, we've encountered multiple "resource release failure" or "Take resource failure" messages that causes SW fails in our application. So we've tried to investigate by adding more log, and we've found a weird behavior on error management, please read below.

In function eMBMasterPoll, checking EV_MASTER_EXECUTE: if eException is raised, xMBMasterPortEventPost modifies code execution strangely, cf. log sample below and instrumented function eMBMasterPoll in attachement.

Expected Behavior

Expected log in case of exception would be the following, based on instrumented file (this "good" error management behavior has been observed somewhere else during our logs):

MB_PORT_COMMON: ERR 3
MB_PORT_COMMON: ERR 4
MB_PORT_COMMON: error type = 2
MB_PORT_COMMON: error step 1
MB_PORT_COMMON: error step 2

Actual Behavior

Actual log in case of exception is the following, based on instrumented file:

(13164) MB_PORT_COMMON: ERR 3
(13186) MB_PORT_COMMON: error type = 1 // Error type gets modified here: otherwise, we should have had type 2
(13198) MB_PORT_COMMON: error step 1
(13206) MB_PORT_COMMON: error step 2
(13385) MB_PORT_COMMON: eMBMasterPoll: Unprocessed event 4

ERR 4 gets never displayed at all, and unprocessed event makes me think that we jump somewhere back to the function start when doing xMBMasterPortEventPost( EV_MASTER_ERROR_PROCESS ), without displaying ERR 4 and clearing event 4. After checking the code, I would say eMBMasterPoll gets interrupted by xMBMasterRTUTimerExpired between display of ERR3 and 4, and that eMBMasterCurErrorType gets modified to 1 by vMBMasterSetErrorType(EV_ERROR_RECEIVE_DATA) in xMBMasterRTUTimerExpired.

I don't know whether it is expected or not, but I'm expecting maybe a real time scheduling issue... You will find our SDKconfig attached. Especially, do you have a requirement on FreeRTOS timer daemon task priority to handle Modbus event groups ? So far it has the lowest priority in our project.

Steps to reproduce

Our setup is the following:

board 1 (master) sends periodic requests to board 2 (slave)
board 1 and board 2 are under OTA process, and can reboot asynchronously (I mean in the middle of a Modbus request/response for example).

With the previous version, we did not have any issue. But now communication seems to be messed up after multiple OTA cycles. The problem does not appear systematically though? Although, we can't log the Modbus exchanged frames. So far, we don't have any ideas on how to reproduce this easily. But we always end up there after roughly 15/20 OTAs...

What I want to point out is the error message: we've never have it before, that's why I suspect something new that could desynchronize.

Thank you for your help.

// If possible, attach a picture of your setup/wiring here.

Code to reproduce this issue

NA

Debug Logs

Cf. Actual Behavior

Other items if possible

SDKconfig attached as JPG otherwise GitHub rejects it...

sdkconfig eMBMasterPoll.txt

Alvin1Zhang commented 4 years ago

Thanks for the detailed report and letting us know, we would look into.

alisitsyn commented 4 years ago

@simon-thiebaut,

I am sorry to hear about these issues in v4.0-rc. It is true, there are several issues exist in this release and fixes were not back ported correctly. The most important things are: Wi-Fi coexist, NVS read/write, UART fifo read issues. I am working on it and already ported fixes of UART driver as well as Modbus and example fixes. Currently, my project is able to pass most of tests and I will be able to provide fixes for you as soon as possible. Preliminary results can be ready for test at the end of this week or start of next week.

ghost commented 4 years ago

Hello @alisitsyn ,

I'm working with @simon-thiebaut we would like to know where your progress is on these issues? Moreover, you talking about on the tag v4.0-rc, but we are on the tag v4.1-rc, it's the same fixes?

alisitsyn commented 4 years ago

Hello @Zodiac-Onnillon, I apologize for the delays. Have some unexpected issues. The fixes will be ported to v4.1 as well. I had hard time to reproduce your issues and will continue with this as soon as possible.

ghost commented 4 years ago

Hello @alisitsyn, Are you about to fix this issue? Do you know how long you need to finish and merge the correction on the mainline? Please keep me posted.

alisitsyn commented 4 years ago

Hello @Zodiac-Onnillon,

Could you answer my questions below:

Do you have tasks with a priority higher than CONFIG_FMB_SERIAL_TASK_PRIO and what is the priority compare to Modbus tasks priority?
Do you have NVS R/W tasks that have a priority higher than Modbus tasks?

The priority of Modbus tasks is important for v4.1-rc and should be configured to be higher than other tasks priority. The Modbus stack uses polling mode to read/send data instead of custom interrupts and other tasks can delay the processing of the events by the Modbus FSM. Then the T35 timer is triggered that causes the timer expired function to send EV_MASTER_FRAME_RECEIVED, but the input buffer is not processed correctly due to processing delays. This behavior depends on the activation timing of the higher priority tasks and the receive state machine can go to STATE_M_RX_ERROR and modify the set the error code to EV_ERROR_RECEIVE_DATA. The above can cause the FSM may not be released from the previous transactions correctly. In later releases, the T35 feature is modified to monitor the timeout using UART timeout feature instead. There are some other reasons for this behavior.

My proposals:

Increase the priority of CONFIG_FMB_SERIAL_TASK_PRIO in your application to be higher than the highest priority of your other tasks.
I am going to send you the patch for v4.1-rc to solve known issues up to Friday this week.
If the fixes do not solve this behavior I will follow with investigation and need more information from you about your application.

ghost commented 4 years ago

Hello @alisitsyn, Thank you for your reply,

The task CONFIG_FMB_SERIAL_TASK_PRIO is configured with priority 10 and no application task has a higher priority. Our Modbus application task has a priority configured at 5.
Yes, we have NVS R/W used on the application tasks. All the application tasks not exceed priority 7. So we had some tasks set as the same priority as Modbus task and a few higher than Modbus like the Wifi task.

Therefore we waiting for your patch then we will check and change all the priority for the Modbus task like your proposal. Then we can pass some test and inform you about the correction.

alisitsyn commented 4 years ago

Hello @Zodiac-Onnillon,

Please apply the patch below to v4.1-rc. patch_drv_modbus_v41_rc_1.zip

Please perform your tests and let me know the result and send me the logs. Also make sure the RO, DI of your RS485 driver are pulled up and RE/DE pin is pulled down.

Thanks.

alisitsyn commented 4 years ago

Hello @Zodiac-Onnillon,

If you have the test results for the patch could you share them here? On my side the Modbus with the patch applied was able to work when tasks with even significantly higher priority than Modbus tasks exist as well as with NVS access.

Thanks.

alisitsyn commented 4 years ago

Dear @Zodiac-Onnillon,

Please let me know the results of your testing with the applied patch. Thank you.

simon-thiebaut commented 4 years ago

Hi @alisitsyn ,

Sorry for the delay, we were going on production on a project.

So we've finally tested your patch, and we still have issues. We need to do more tests, but please find below some information:

when our product is in normal mode, issues do not seem to happen
when the device is in a special mode, we encounter weird behavior (we can't tell yet for sure that are the source of the problem we have, but we wanted to share that with you).

This special mode is an over the air FW update of the product. We are performing stress loops of OTA to validate our FW, and we encounter issues with both the tag v4.1-rc and your patch. However we do not have such issues when we are on our commit (that is based on fork of Espressif official repo, somewhere before v4.1-rc).

What happens is that during OTA, we perform an erase of the Espressif partition that will receive the new FW. This erasing is made in SDK API esp_ota_begin, and is blocking the Espressif for the duration of the erase. However, we've observed that some Modbus frames are still being issued during this operation (though most of the time there is no Modbus activity at all during it). It is not the case on our previous "custom" version. Also, these Modbus requests are weird: indeed, they are split in 2 or 3 packets, separated by a period between roughly 100 and 200 ms, it varies.

Could this cause a Modbus stack desynchro ? We've taken a logic analyzer (Saleae) capture of this issue, however the capture is around 130 MB, I don't know how to share it with you... I've attached a picture of the issue as a starting point. In the orange box, 2 frames are sent although they shall not. Requests are on the bottom graph, and responses on the top one. The first request is split in 2, and the 2nd in 3. And obviously there is no response as the slave do not understand these split frames. Modbus fragmented request

Please let us know. On our side, we'll force loops of blocking flash erase, so that hopefully the problem is reproduced more easily. But in the meantime, any advice would be appreciated.

Thank you for your feedback.

Best regards,

alisitsyn commented 4 years ago

Hi @Zodiac-Onnillon,

Thank you for the update. I did not complete detailed investigation of the behavior during FW update yet and below I will try to share some thoughts related to showed behavior.

The processing of incoming Modbus frame is performed in the functions port/portserial_m.c::usMBMasterPortSerialRxPoll(), xMBMasterPortSerialTxPoll(). In this release, these functions are used to read/write the incoming data byte by byte from/to UART FIFO buffer. The current approach was selected intentionally to use existing freemodbus architecture for RTU, ASCII, and allow to define custom buffer size for Modbus stack. Unfortunately, this does not work reliably when the processing task is suspended periodically by higher priority tasks. During transmission, it can make gaps when feeding the UART FIFO in xMBMasterPortSerialTxPoll() function. The same behavior applies to usMBMasterPortSerialRxPoll() and when the FMB_TIMER_PORT_ENABLED is disabled it helps to process incoming frames correctly but increases frame processing time. The Flash write/erase during OTA update also causes IRQ processing delays because it disables CPU cache and the FMB_TIMER_ISR_IN_IRAM option allows to mitigate the delays. It is possible the next portion of data in the ring buffer be transferred into FIFO during OTA update that explains the packets splitting. In the previous release, the read of the full FIFO is performed as one function call using a temporary buffer to keep data then the poll function was performing processing of the buffer byte by byte. This explains why the previous version of Modbus was working for you better. The above could make Modbus FSM be unreliable and cause desynchronization of the stack due to fragmented packets. The stack was not designed to be functional in such a case. What I can propose is to destroy Modbus before the start of OTA update and start it again after reset. This is a really unusual case from my point of view. Do you really need the Modbus to be functional during OTA update?

simon-thiebaut commented 4 years ago

Hello @alisitsyn , Again sorry for delay. Finally we have decided to not upgrade to this tag at this stage of the project, due to the uncertainty we sill have around Modbus. So we have kept the previous version for THIS project

BUT: we're starting a new project based on the previous one, and as we have a bit more time we'd like to push the investigations to decide whether or not migrating.

Just to give you the info: we wanted to upgrade the SDK because we had a Wifi memory leak, that could only be fixed by upgrading the SDK (based on what we've been told in another tick). But by doing this, we a have to take the full SDK, including this Modbus upgrade (though we were pretty much satisfied with the previous version).

Also, to clarify our projects:

the system is based on 2 boards, communicating over Modbus (Espressif board being the master)
the firmwares of the 2 boards are downloaded through Wifi
when time comes to update the slave board, the Espressif sends the slave board FW chunkwise over Modbus (write multiple registers, 64 bytes/32 registers; so yes we need Modbus during OTA...)
also, one more requirement is that the OTA is transparent to the user, as one board is an HMI with displayed data read from the 2nd board. So again, this means that during OTA, Modbus comm shall be up and running.

So basically during OTA, we have :

sequences of wifi download, partition write, partition read, and "big" transfers over Modbus using multiple registers
in parallel, our Modbus table application keeps being updated periodically by sending Modbus requests

As far as I understand from our discussion, the new Modbus implementation is not compliant with these requirements, so here are my questions:

do you think the new Modbus stack could be compliant to this requirements in a near future ? Or absolutely not ?
would there be a "softer" way to reinit the stack states machines without destroying the stack ? we're a bit afraid with malloc/frees, we dontt want a memory leak so we avoid them as much as possible
if we want to stick to the previous implementation of Modbus stack, can we "simply" keep the whole SDK corresponding to the v4.1-rc, but replacing the folder freemodbus by the previous version that we use and that we have "qualified" on our previous project ? Shall we expect any side effects ?

Thank you for your feedback.

Best regards,

simon-thiebaut commented 4 years ago

Hi @alisitsyn ,

May I update my previous comment: actually I think I've pointed out the issue... Some of our Modbus requests can take longer to be processed. I've checked the timings: we can take 337 ms before replying to a Modbus request. But the timeout is set to 400 ms on Espressif.

I think that the issue comes from the fact that the first request gets a timeout, but the reply is always sent, even after time out. I think we are borderline on timings, that's why we do not have the problem all the time. However, I can reproduce it almost always bby adding a delay before replying to Modbus requests on the slave side.

Question is: what is supposed to happen if bytes are received AFTER request is considered as timeout by the stack ? Is it safely managed ? From our observations, it seems that we are desynchronized, and never be good again. Also, we have the following error logs:

[0;31mE (23403) MB_PORT_COMMON: xMBMasterRunResTake(159): xMBMasterRunResTake:Take resource failure.[0m [0;31mE (23405) MB_CONTROLLER_MASTER: mbc_master_get_parameter(111): SERIAL master get parameter failure error=(0x103) (ESP_ERR_INVALID_STATE).[0m [0;31mE (23418) HMI_MB: hmi_modbus_read_value CID = 7 error=(0x103) (ESP_ERR_INVALID_STATE)[0m [0;31mE (23426) MODBUS: Read error on CID 7[0m [0;31mE (23427) MB_PORT_COMMON: vMBMasterRunResRelease(172): vMBMasterRunResRelease: resource release failure.[0m [0;31mE (24456) MB_PORT_COMMON: 0[0m [0;31mE (24457) MB_PORT_COMMON: xMBMasterRunResTake(159): xMBMasterRunResTake:Take resource failure.[0m [0;31mE (24459) MB_CONTROLLER_MASTER: mbc_master_get_parameter(111): SERIAL master get parameter failure error=(0x103) (ESP_ERR_INVALID_STATE).[0m [0;31mE (24472) HMI_MB: hmi_modbus_read_value CID = 8 error=(0x103) (ESP_ERR_INVALID_STATE)[0m [0;31mE (24480) MODBUS: Read error on CID 8[0m [0;31mE (24481) MB_PORT_COMMON: vMBMasterRunResRelease(172): vMBMasterRunResRelease: resource release failure.[0m [0;32mI (24730) SHADOW_Z: Shadow buffer: {"state":{"reported":{"equipment":{"hp_0":{"cl":0}}}}}[0m [0;31mE (25516) MB_PORT_COMMON: xMBMasterRunResTake(159): xMBMasterRunResTake:Take resource failure.[0m [0;31mE (25516) MB_PORT_COMMON: 0[0m [0;31mE (25517) MB_CONTROLLER_MASTER: mbc_master_get_parameter(111): SERIAL master get parameter failure error=(0x103) (ESP_ERR_INVALID_STATE).[0m [0;31mE (25531) HMI_MB: hmi_modbus_read_value CID = 9 error=(0x103) (ESP_ERR_INVALID_STATE)[0m [0;31mE (25540) MODBUS: Read error on CID 9[0m [0;31mE (25540) MB_PORT_COMMON: vMBMasterRunResRelease(172): vMBMasterRunResRelease: resource release failure.[0m [0;32mI (25995) SHADOW_Z: Shadow buffer: {"state":{"reported":{"equipment":{"hp_0":{"cl":1}}}}}[0m [0;31mE (26569) MB_PORT_COMMON: xMBMasterRunResTake(159): xMBMasterRunResTake:Take resource failure.[0m [0;31mE (26570) MB_PORT_COMMON: 0[0m [0;31mE (26570) MB_CONTROLLER_MASTER: mbc_master_get_parameter(111): SERIAL master get parameter failure error=(0x103) (ESP_ERR_INVALID_STATE).[0m [0;31mE (26584) HMI_MB: hmi_modbus_read_value CID = 11 error=(0x103) (ESP_ERR_INVALID_STATE)[0m [0;31mE (26593) MODBUS: Read error on CID 11[0m [0;31mE (26593) MB_PORT_COMMON: vMBMasterRunResRelease(172): vMBMasterRunResRelease: resource release failure.[0m [0;31mE (27630) MB_PORT_COMMON: 0[0m [0;31mE (27631) MB_PORT_COMMON: xMBMasterRunResTake(159): xMBMasterRunResTake:Take resource failure.[0m [0;31mE (27633) MB_CONTROLLER_MASTER: mbc_master_get_parameter(111): SERIAL master get parameter failure error=(0x103) (ESP_ERR_INVALID_STATE).[0m [0;31mE (27646) HMI_MB: hmi_modbus_read_value CID = 46 error=(0x103) (ESP_ERR_INVALID_STATE)[0m [0;31mE (27653) MB_PORT_COMMON: vMBMasterRunResRelease(172): vMBMasterRunResRelease: resource release failure.[0m

Thank you for your help,

Best regards,

alisitsyn commented 4 years ago

Hi @simon-thiebaut,

Thank you for your update. Sorry for the delays with answer. I am pretty busy with other things after my vacation and need more time to analyse the results. This issue the next in my queue and I will respond ASAP.

Hope for understanding. Thank you.

simon-thiebaut commented 4 years ago

Hi @alisitsyn ,

No problem, I understand, don't worry.

Just a new info: I think I understand why the previous stack was "more stable" for us... actually I've found an old patch from us on the stack that workarounds an unknown problem we faced in the past. I don't think this was notified to Espressif at that time, so please find the diff below:

I will apply it on the new stack to see if we're still getting stuck, but I'm pretty sure we'll not be. So I think the real questions in all my comments is the one from the previous comment: "what is supposed to happen if bytes are received AFTER request is considered as timeout by the stack ? Is it safely managed ?" (cf. log errors we have).

Thank you.

alisitsyn commented 4 years ago

Hi @simon-thiebaut,

In normal processing the release of semaphore should be performed in one place - in the modbus event handler: https://github.com/espressif/esp-idf/blob/357a2776032299b8bc4044900a8f1d6950d7ce89/components/freemodbus/modbus/mb_m.c#L454 In cases when the event handler and serial processing task are often suspended by other high priority tasks the release can be delayed. In this case the release is delayed but response timeout received and the stack starts new transaction after timeout. This patch can be useful in this case only and can prevent blocking of stack when incorrect packet received. But blocking is fixed in later modbus updates, in worst case the release of resource must called after several transactions. Your patch does not allow to avoid errors when timeout expired but packet received and the same way it will release sema after 5 continuous failed transactions in this case . It could be just part of solution and not complete. From my point of view there is other thing that explains why the v4.0 release was working better for you: https://github.com/espressif/esp-idf/blob/v4.0/components/freemodbus/port/portserial_m.c#L109 In this code we read the complete message just once into temporary buffer using blocking function. The packet processing of the incoming buffer faster without blocking in this case. The same is for xMBMasterPortSerialTxPoll(). I think you need to port this part from your previous code. In the v4.1 I updated this part to do blocking call to read each byte from buffer and this is required to support custom modbus buffer size and due to UART read/write and ring buffer behavior this switching context often. This old code will allow to decrease polling time significantly in case when your OTA task and other high priority tasks do actively their work . I think you need to port just this serial code from v4.0.

"what is supposed to happen if bytes are received AFTER request is considered as timeout by the stack ? Is it safely managed ?"

I would say it is not safety managed. The received packet will cause incorrect processing.

simon-thiebaut commented 4 years ago

Thank you for your analysis @alisitsyn .

So to sum up: 1) semaphore management:

I should remove our patch with the 5 retries
I should replace and adapt vMBMasterPortSerialRxPoll and xMBMasterPortSerialTxPoll functions with the one in provided in your link.

2) task desynchronization risk:

so we need to absolutely respect timeouts value, even if we need to increase it, as the stack might not support to receive too late responses.

Is that correct ?

I will do the modifications and trigger tests probably on monday now. I'll keep you updated.

Thank you again, and have a nice week-end.

alisitsyn commented 4 years ago

@simon-thiebaut,

I should remove our patch with the 5 retries

You could verify this but from my point of view it useless and will not solve the issue but can have side effects;

I should replace and adapt vMBMasterPortSerialRxPoll and xMBMasterPortSerialTxPoll functions with the one in provided in your link.

Yes, I think this might help in your particular case of using Modbus with OTA update but has to be checked carefully.

so we need to absolutely respect timeouts value, even if we need to increase it, as the stack might not support to receive too late responses.

When the response timeout is expired the master starts the next transaction. Once it is sent the master does not have any way to determine if the received frame in the fuffer is response for expired transaction or for the new one (as it is for Modbus TCP with MBAP TID field). I think what master can do in this case is to flush received buffer right before sending a new request:

void vMBMasterPortSerialEnable(BOOL bRxEnable, BOOL bTxEnable)
{
    // This function can be called from xMBRTUTransmitFSM() of different task
    if (bTxEnable) {
        bTxStateEnabled = TRUE;
        uart_flush_input(ucUartNumber); // <<< flush input buffer with expired response before send the new master transaction
    } else {
        bTxStateEnabled = FALSE;
    }
    if (bRxEnable) {
        bRxStateEnabled = TRUE;
        vMBMasterRxSemaRelease();
        vTaskResume(xMbTaskHandle); // Resume receiver task
    } else {
        vTaskSuspend(xMbTaskHandle); // Block receiver task
        bRxStateEnabled = FALSE;
    }
}

Master also is able to drop fragmented frames. If the response is received but during processing the additional frame is received it will process just first frame and drop second one:

// UART receive event task
static void vUartTask(void* pvParameters)
{
    uart_event_t xEvent;
    USHORT usResult = 0;
    for(;;) {
        if (xQueueReceive(xMbUartQueue, (void*)&xEvent, portMAX_DELAY) == pdTRUE) {
            ESP_LOGD(TAG, "MB_uart[%d] event:", ucUartNumber);
            switch(xEvent.type) {
                //Event of UART receiving data
                case UART_DATA:
                    ESP_LOGD(TAG,"Data event, len: %d.", xEvent.size);
                    // This flag set in the event means that no more
                    // data received during configured timeout and UART TOUT feature is triggered
                    if (xEvent.timeout_flag) {
                        // Workaround: The UART driver can cut mb frame
                        // (generate timeouts for incomplete frame) when wifi is active
                        if (xEvent.size <= MB_SER_PDU_SIZE_MIN) {
                            break;
                        }
                        // Response is received but previous packet processing is pending
                        // Do not wait completion of processing and just discard received data
                        // as incorrect (fragmentation of response).
                        if (vMBMasterRxSemaIsBusy()) {
                            uart_flush_input(ucUartNumber);
                            break;
                        }
                        // Get buffered data length
                        ESP_ERROR_CHECK(uart_get_buffered_data_len(ucUartNumber, &xEvent.size));
                        // Read received data and send it to modbus stack
                        usResult = usMBMasterPortSerialRxPoll(xEvent.size);
                        ESP_LOGD(TAG,"Timeout occurred, processed: %d bytes", usResult);
                        // Block receiver task until data is not processed
                        vTaskSuspend(NULL);
                    }
                    break;

Please try to test the changes and let me know if you need help with this. Thanks.

simon-thiebaut commented 4 years ago

Hi @alisitsyn ,

So I've tried 2 solutions today: 1) with vMBMasterPortSerialRxPoll and xMBMasterPortSerialTxPoll "reverted" to previous version => in that case, it seems the Modbus gets locked. I've added log in xMBMasterPortSerialTxPoll, I only pass once inside it 2) with only vMBMasterPortSerialRxPoll "reverted": in that case, communication is good, until I start file transfer... Each chunk transfer request takes 337 ms to respond, so that causes issues. What I wanted to check is if the stack manages to goes back to normal state even if the communication fails (as before the patch so). Behavior is slightly different: if I abort the transfer (by stopping then rebooting the slave), communication errors stop, but now it seems that the read data are no more aligned with the physical slave board that has been rebooted (read SW version of the slave does not correspond, LEDs do not match the salve state, ...)

I have attached:

the patch diff (only difference between points 1 et 2 is that xMBMasterPortSerialTxPoll is not reverted)
the log where I have troubles: the transfer starts, but after a while I have invalid responses (not only timeouts). Also, I have resource take/give issues (vMBMasterRunResRelease), though it does not seem to be blocked by that as before (our 5 retries patch has been removed).

Do you have any idea ? Maybe it is due to the missing xMBMasterPortSerialTxPoll revert that I've not succeeded yet... Can you help please ? 0001-test.log log issue.txt

Thank you.

Best regards,

alisitsyn commented 4 years ago

Hi @simon-thiebaut,

Sorry for the delays. The reason for the errors is that the xMBMasterPortSerialTxPoll() should be reverted to previous version and both polling functions TX/RX should be reverted in your slave code as well.

I try to check the approach with reverted poll functions as well as other idea. Unfortunately I don't have your code to check. I will try other way to check.

simon-thiebaut commented 4 years ago

Hello @alisitsyn No problem.

Note for your investigations, thank you.

Please note: on our slave, we are not using an Espressif chip, but we have integrated FreeModbus as a standalone librarie (quite old version, taken directly from https://www.embedded-solutions.at/en/freemodbus/). It has been kept unmodified for months now).

However for our Espressif master: I'd like to share new findings regarding my previous comment. As I said, with the partially patched polling functions (RX patched, but not TX), I had after some request timeouts a misalignment between requests and responses. I've added more log, and I've manage to point out where the shift occurs (cf. attached "Timeout shift issue.txt"). For a request, the log sequence is the following:

HMI_MB: CID R1 XXX => XXX is CID value; request start at our application level MB_MASTER_SERIAL: uart rx break. => UART task event (please note I have not enabled all logs in this task, just unexpected/failure cases) MB_PORT_COMMON: RX 0 => FreeModbus RX FSM has processed first byte of response MB_PORT_COMMON: RX E => FreeModbus RX FSM has ended processing of full response frame MB_PORT_COMMON: eMBMasterPoll: Packet data received successfully (0). POLL RCV buffer: . => response frame data bytes HMI_MB: CID R2 XXX => XXX is CID value; request end at our application level

What we can see at t = 68423 is that CID 72 is requested by our app. Then OTA stuff occurs, probably causing then the subsequent timeout in the log at t = 68578. Due to this error, our app considers the request ended and failed at t = 68588. However, we can see that we then receive the response frame, before our app performs a new request to the Modbus stack (at t = 68966). This causes the mutex give failure at t = 69022, as it has already been released when timeout occured before. Finally, later we can see our app requests CID 73 at t = 69325. And request is considered over right after, before anything gets log from Modbus stack, at t = 69336. Following that, all requests and responses are misaligned: request N receives response N-1. So it seems that the Modbus stack does not clear/ignores properly received bytes outside request context (as we mentioned in an earlier post).

I think this was probably hidden by our 5 retries fix in previous version. However following this log I've tried to patch the library to recover from this shift. Things seems better now, but I need to do more tests. You will find attached the 2 Git patches I have applied for my tests (sorry I had to rename .patch extension to .log for attaching files):

0001-bugfix-Espressif-workaround-proposal-for-Modbus-late.patch.log => RX/TX poll functions rework for latency improvements
0002-bugfix-workaround-to-clear-Modbus-stack-misalignment.patch.log => Modbus stack resources/variables clearing force between each requests

Very probably my patch does not cover all Modbus use cases and is not the most optimized, but as it gives results I will test it "as it is". Please let me know if you see better ways to handle that, and also if I must be aware of possible dangerous side effects of it.

I'll keep you updated of my stress test results, but they are quite long (OTA loops). So if good, it can take 4/5 days. If not, it will be shorter.

Thank you for your feedback on this. Best regards, 0001-bugfix-Espressif-workaround-proposal-for-Modbus-late.patch.log 0002-bugfix-workaround-to-clear-Modbus-stack-misalignment.patch.log Timeout shift issue.txt

alisitsyn commented 4 years ago

Hello @simon-thiebaut,

Thank you for extended information and patches. The expired response can be received at any time and really can cause an issue. I verified your logs and checked the patch. I was able to reproduce the issues with my environment. The patch works good according to my tests and is able to repair FSM from failure. The same approach should be applied for the mb_ascii_m.c. For now I think it does its work good but still have some stress tests to check.

Thank you for your contribution.

DaniusKalv commented 3 years ago

Hello @simon-thiebaut

I just tried the fix that you committed to BadrBouaddi repository and it fixes the issue that I've been battling with for a week.

Could you create a PR on esp-idf repository so that everyone can enjoy the fix?

Thanks

alisitsyn commented 3 years ago

Hello @simon-thiebaut,

Do you have any new results of the testing? Could you create the PR for misalignment fix to merge it to master and backport to other branches?

Thanks.

simon-thiebaut commented 3 years ago

Hello @DaniusKalv and @alisitsyn ,

Sorry for delay, last week was a delivery week...

So for the test results:

we made 200 OTA loops, that causes the exchange of roughly 567000 chunks of 64 data bytes over Modbus with blocking delay close to the Modbus timeout => device is no more stuck
normal long term running: we have currently 14 devices running for more than a week without troubles. Still to monitor over time, but encouraging.

FYI, before OTA loops were failing after ~15 loops, and devices stopped to run after ~3 days.

OK I will do the PR, but probably tomorrow or on wednesday.

Thank you.

Best regards,

simon-thiebaut commented 3 years ago

Hi @DaniusKalv and @alisitsyn ,

Ok so actually I've not created the PR yet, because we still have an issue regarding the 2nd point:

"normal long term running: we have currently 14 devices running for more than a week without troubles. Still to monitor over time, but encouraging."

Devices were not monitored properly. After a new diagnostic, it appears that we have roughly 1 communication loss per day with the new stack. It is not the case with the previous stack (we compare also devices with this old stack). After analysis, it appears there is one scenario in the new stack where the client is not notified of an error, and can be stuck in the request finish waiting API - considered as a "communication freeze". I've patched it last week, then we've restarted the tests. We have a meeting this afternoon to check the test results.

If they are OK, I will add this patch before creating the PR.

Sorry again for the delay, I'll keep you updated.

Thank you,

Best regards,

atanisoft commented 2 years ago

@simon-thiebaut Was a PR ever raised for your fixes?

simon-thiebaut commented 2 years ago

@atanisoft thanks for digging this out, it was completely out of my mind, too many things to handle this year, I apologize for such a delay.

So to answer your question, no I haven't created a PR on that. Due to our project timeline we decided to keep our "patched but qualified" ESP SDK to go to production.

Anyway I need to take time to gather the last statuses on these points before creating the PR. Especially I need to go back in history to check if we did not add more patches.

I'm creating a ticket right now in our system to not forget. But unfortunately I can't tell you when I'll be able to handle that, we are very busy until march 22... I'll keep you informed ASAP.

Thanks for your understanding.

atanisoft commented 2 years ago

Thanks @simon-thiebaut for the info. The project I'm working on currently is facing the same sort of resource release / acquire failures that you reported. I've added retries on parameter retrieval and that helps for a while but I have seen a few cases where reading two float registers (four total registers) back to back can result in the same value for both under some to-be-determined conditions. I've also added a bus restart (mbc_master_destroy() followed by mbc_master_init()....), it isn't perfect but it does help improve stability.

MichaelUray commented 1 year ago

It looks to me that the issue is still present. I am getting the following errors after some time of Modbus reading with the master.

I (30925454) count_task: RAM left 128216
E (30925994) MB_PORT_COMMON: xMBMasterRunResTake(183): Take resource failure.
E (30925994) MB_CONTROLLER_MASTER: mbc_master_get_parameter(85): Master get parameter failure, error=(0x103) (ESP_ERR_INVALID_STATE).    
E (30925994) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x103 (ESP_ERR_INVALID_STATE).

alisitsyn commented 1 year ago

This issue have been fixed in the esp-modbus component which can be used from esp-idf v4.1. In order to use the new component please follow the requirements below:

Include the line below to the root CMakeLists.txt file of your project folder:
```
set(EXCLUDE_COMPONENTS freemodbus)
```
The main component folder of the project shall include the packet manager manifest file as in example.
The slave response time CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND shall be configured to be higher than the maximum slave response time for slaves in your network.

The component freemodbus provided in old revisions of esp-idf will not be updated. Please use the esp-modbus component. This ticket is going to be closed. Feel free to reopen new issue if you have one.

Thank you.

espressif / esp-idf