espressif / esp-modbus

ESP-Modbus - the officially supported library for the Modbus protocol (serial RS-485 and TCP over Wi-Fi or Ethernet).
Apache License 2.0

The Modbus TCP master can restart with a WDT timeout after ~10 hours of continuous work (IDFGH-7204) #4

Closed alisitsyn closed 2 years ago

alisitsyn commented 2 years ago

This issue is about a crash of the Modbus TCP master after a long run time. It corresponds to the related discussion in ESP-IDF.

Environment:

Module or chip used: ESP32-WROOM-32
IDF version (run git describe --tags to find it): v4.4-dev-1404-gc13afea635
Build System: idf.py / Eclipse IDE
Compiler version (run xtensa-esp32-elf-gcc --version to find it): xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
Operating System: Windows 10
Using an IDE?: [No|Yes (please give details)]
Power Supply: USB

The artifacts from that issue are included below:

  1. The complete code + binaries of the example to reproduce the issue (120422.ZIP)
  2. 20220413093143.log
  3. 20220414172430.log
  4. Backtrace of the calls from the log file above

alisitsyn commented 2 years ago

Hi @pr200sd,

I had a hard time reproducing the issue and was not able to trigger the WDT reset. I was able to get a similar log, however, in my case the master is able to detect the error and restart the connection cycle:

E (209000) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (209010) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (211520) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (211520) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (213530) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (213540) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (216050) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (216050) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (218070) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (218070) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
I (218620) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), slave is down, off_time[r][w](us) = [190804081][41950].
I (218620) MB_TCP_MASTER_PORT: Connecting to slaves...
-.I (218650) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (220580) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (220580) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (226710) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (226710) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (226720) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (226720) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (228700) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (228700) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (228720) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (228730) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (228730) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (228740) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (230720) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (230720) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (231230) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (231230) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (231240) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (231250) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (233230) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (233230) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (233250) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (233260) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (233260) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (233270) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (235240) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (235250) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (235760) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (235760) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (235770) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (235770) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (237760) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (237760) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (237780) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (237780) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (237790) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (237800) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (239780) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (239780) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (240290) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (240290) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11

So, my guess about the reason:

  1. The Modbus TCP slave device has completed its Modbus loop and its Modbus controller is destroyed, while the Wi-Fi interface may still be working.
  2. The Modbus TCP master still tries to read from the slave and even tries to reconnect; ERR_CONN (-11) means that it cannot connect to the slave. In some cases the slave may still accept the connection.
  3. The long parameter read cycle before the line I (218620) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), slave is down, off_time[r][w](us) = [190804081][41950]. causes the WDT to trigger. This happens because the master tries to read data from the slave, the slave has no data, and the non-blocking recv() immediately returns EAGAIN (read timeout). The loop then repeats while the slave is unable to return data, which can turn into an endless busy loop. The code is here.

Although I was not able to reproduce this situation, I think this is the only possible reason for the restart behavior. It can happen since the recv() mode was changed to non-blocking. To check it on your side, please change the code as below and rerun your example.

static int xMBTCPPortMasterGetBuf(MbSlaveInfo_t* pxInfo, UCHAR* pucDstBuf, USHORT usLength)
{
    int xLength = 0;
    UCHAR* pucBuf = pucDstBuf;
    USHORT usBytesLeft = usLength;

    MB_PORT_CHECK((pxInfo && pxInfo->xSockId > -1), -1, "Try to read incorrect socket = #%d.", pxInfo->xSockId);

    // Receive data from connected client
    while (usBytesLeft > 0) {
        xMBTCPPortMasterCheckShutdown();
        // non-blocking read from socket with timeout
        xLength = recv(pxInfo->xSockId, pucBuf, usBytesLeft, MSG_DONTWAIT);
        if (xLength < 0) {
            if (errno == EAGAIN) {
                // Read timeout occurred, continue reading
                /// change the lines as below: <<<<<<<<<<<<<<<<<<<<<<<<<<<<
                ESP_LOGE(TAG, "Socket (IP:%s) read timeout.", pxInfo->pcIpAddr);
                vTaskDelay(10);
                continue;
                // ... (rest of the function unchanged; see the full version below)

Let me know the result with log.

pr200sd commented 2 years ago

Thanks for creating the task. All this time I have been testing the example, and from the logs I also noticed that the module is not rebooted, so the behavior is the same for me as for you. I will make the corrections in the code and report the results. Thanks for the detailed analysis of the problem.

pr200sd commented 2 years ago

20220420145351.log Here is the log after editing.

alisitsyn commented 2 years ago

@pr200sd, thank you for the update. Unfortunately, I cannot get a backtrace from your log because I do not have your binary. Possible options to get a backtrace:

  1. Start idf.py monitor in the command line and get the parsed backtrace of the functions in the log.
  2. Set up the core dump and provide the log.
  3. Provide the binary together with the log so I can get the backtrace manually (not a very good option).

pr200sd commented 2 years ago

After fixing the file, I can't get the module to reboot anymore; the WDT also stopped firing, but now there are sometimes timeout messages. In principle, I can already use the TCP master in this version. If you plan to modify something else, I think you will get more information if you run my project with the fix and a Modbus slave on a PC.

alisitsyn commented 2 years ago

@pr200sd,

Thank you for the information.

After fixing the file, I can't get the module to reboot anymore; the WDT also stopped firing. If you plan to modify something else, I think you will get more information if you run my project with the fix and a Modbus slave on a PC.

According to your log above, I can see that the watchdog is still triggered, and it seems that the reason is different; it could also happen in the connection loop. Unfortunately, this does not happen on my device and I do not have the opportunity, for now, to test it for a long time. However, I will check this as soon as I can in order to fix it finally.
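Just to illustrate the mechanism (a minimal sketch, not the actual library code): any tight retry loop running in a FreeRTOS task has to yield, otherwise the IDLE task that feeds the task watchdog is starved and the chip resets.

#include <stdbool.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// Sketch only: retry_until_ok() is a hypothetical helper, not part of esp-modbus.
static void retry_until_ok(bool (*try_once)(void))
{
    while (!try_once()) {
        // Without this delay the loop never blocks, the IDLE task is starved,
        // and the task WDT resets the chip after CONFIG_ESP_TASK_WDT_TIMEOUT_S seconds.
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}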

but now there are sometimes timeout messages.

It is still possible that the timeout messages appear, especially if you use many other high-priority tasks in your application. In this case, you can retry the transaction after a timeout, as in the sketch below.
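A simple retry wrapper around the example's read call could look like this (a sketch only: read_param_with_retry and READ_RETRIES are hypothetical names, mbc_master_get_parameter is the call used in the master example):

#include <stdint.h>
#include "esp_err.h"
#include "esp_log.h"
#include "mbcontroller.h"

#define READ_RETRIES 3   // hypothetical retry count

static esp_err_t read_param_with_retry(uint16_t cid, char* key, uint8_t* value, uint8_t* type)
{
    esp_err_t err = ESP_ERR_TIMEOUT;
    for (int attempt = 0; attempt < READ_RETRIES; attempt++) {
        err = mbc_master_get_parameter(cid, key, value, type);
        if (err == ESP_OK) {
            break;   // got a valid response, stop retrying
        }
        ESP_LOGW("MASTER_TEST", "cid #%u read attempt %d failed, err = 0x%x",
                 (unsigned)cid, attempt + 1, (int)err);
    }
    return err;
}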

I can already use the TCP master in this version. If you plan to modify something else, I think you will ...

The change I described above is not a complete fix; it is a way to check for read timeouts during master-slave communication on your side. To prevent the watchdog from triggering, the code can be modified as below, but later I plan to return to a blocking read in recv() and avoid the delay in the loop. Note: the updated code uses recv() in blocking mode to read for up to the maximum response time.

#define MB_TCP_PORT_RESPOND_TIMEOUT     (CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND * 1000)

static int xMBTCPPortMasterGetBuf(MbSlaveInfo_t* pxInfo, UCHAR* pucDstBuf, USHORT usLength)
{
    int xLength = 0;
    UCHAR* pucBuf = pucDstBuf;
    USHORT usBytesLeft = usLength;
    struct timeval xTime;

    MB_PORT_CHECK((pxInfo && pxInfo->xSockId > -1), -1, "Try to read incorrect socket = #%d.", pxInfo->xSockId);

    // Set receive timeout for socket <= slave respond time
    xTime.tv_sec = 0;
    xTime.tv_usec = MB_TCP_PORT_RESPOND_TIMEOUT;
    setsockopt(pxInfo->xSockId, SOL_SOCKET, SO_RCVTIMEO, &xTime, sizeof(xTime));

    // Receive data from connected client
    while (usBytesLeft > 0) {
        xMBTCPPortMasterCheckShutdown();
        xLength = recv(pxInfo->xSockId, pucBuf, usBytesLeft, 0); 
        if (xLength < 0) {
            if (errno == EAGAIN) {
                // Read timeout occurred, continue reading
            } else if (errno == ENOTCONN) {
                // Socket connection closed
                ESP_LOGE(TAG, "Socket(#%d)(%s) connection closed.",
                                            pxInfo->xSockId, pxInfo->pcIpAddr);
                return ERR_CONN;
            } else {
                // Other error occurred during receiving
                ESP_LOGE(TAG, "Socket(#%d)(%s) receive error, length=%d, errno=%d",
                                            pxInfo->xSockId, pxInfo->pcIpAddr, xLength, errno);
                return -1;
            }
        } else if (xLength) {
            pucBuf += xLength;
            usBytesLeft -= xLength;
        }
        if (xMBTCPPortMasterGetRespTimeLeft(pxInfo) == 0) {
            return ERR_TIMEOUT;
        }
    }
    return usLength;
}

I will keep you posted about the results as soon I return to it. Your result is also useful to me.

Thanks.

alisitsyn commented 2 years ago

Hello @pr200sd,

Could you please try and confirm that the fix above solves your problem? Thank you.

pr200sd commented 2 years ago

Hello @pr200sd,

Could you please try and confirm that the fix above solves your problem? Thank you.

I only just now saw that a change has been added to the code; I will try to apply it on my side in the near future and check it.

pr200sd commented 2 years ago

(CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND * 1000) Isn't the value 1000 in this line an error? The value is configured in ms.

alisitsyn commented 2 years ago

The tv.tv_usec field is defined in microseconds. The code above uses a blocking call to read the answer for up to CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND milliseconds, so the millisecond value is multiplied by 1000.
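Just as a side note (an illustrative sketch, not a required change): strictly speaking, tv_usec is expected to stay below 1000000, so if the respond timeout is ever configured above 1000 ms, a portable way to fill the structure is to split the milliseconds (set_recv_timeout_ms is a hypothetical helper):

#include <stdint.h>
#include "lwip/sockets.h"

// Hypothetical helper: keep tv_usec below 1000000 even for timeouts longer than one second.
static void set_recv_timeout_ms(int sock, uint32_t timeout_ms)
{
    struct timeval tv = {
        .tv_sec  = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}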

pr200sd commented 2 years ago

Good afternoon. I managed to test the library with the updated code. Now there are no reboots and the uptime was 300000 s, but there are messages about a timeout and an unrecognized response in the logs; I'll try to increase the time in the settings. I also noticed one peculiarity: if the master starts up without an available slave, then after the slave is turned on, the master cannot establish a connection. Whereas if the master starts up while the slave is present and the slave then turns off, the connection is resumed afterwards. What can cause this?

alisitsyn commented 2 years ago

I managed to test the library with the updated code. Now there are no reboots and the uptime was 300000 s, but there are messages about a timeout and an unrecognized response in the logs,

Good. The timeout errors are possible when the master cannot get the response within the response timeout. There may be several reasons for this, and you can retry the transaction in this case.

and an unrecognized response in the logs,

If response timeouts happened for some slaves, the master still needs to read the buffer of each slave, check the TID field, and drop the buffer if it is a response to a previous transaction (the one for which the response timeout occurred). The messages that you see are probably caused by this behavior.
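To illustrate what the TID check means (a sketch following the Modbus TCP MBAP header layout, not the internal esp-modbus routine; frame_matches_transaction is a hypothetical name):

#include <stdbool.h>
#include <stdint.h>

// MBAP header: bytes 0-1 carry the transaction identifier (TID).
static bool frame_matches_transaction(const uint8_t* mbap_frame, uint16_t expected_tid)
{
    uint16_t tid = ((uint16_t)mbap_frame[0] << 8) | mbap_frame[1];
    // A mismatch means this is a late answer to a previous (timed out) request,
    // so the buffer is dropped instead of being parsed as the current response.
    return tid == expected_tid;
}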

I also noticed one peculiarity: if the master starts up without an available slave, then after the slave is turned on, the master cannot establish a connection,

So, the slave becomes active during the master's connection cycle (a repeated '.' is shown in the console), but the master cannot connect to it? I need to check this.

The log of your communication could be helpful. Thank you.

pr200sd commented 2 years ago

I found a solution to my question: I increased the timeout to 3600 seconds. Now, even if the TCP master starts without a TCP slave on the network, the connection is restored within that time.

alisitsyn commented 2 years ago

This issue has been fixed with commit: 0b1f0d41d976e3ba38ceca246eced7890ecf7748