Closed alisitsyn closed 2 years ago
Hi @pr200sd,
I had a hard time reproducing the issue with the WDT reset and was not able to trigger the reset. I was able to get a similar log, but in my case the master is able to detect the error and restart the connection cycle.
E (209000) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (209010) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (211520) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (211520) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (213530) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (213540) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (216050) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (216050) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (218070) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (218070) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
I (218620) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), slave is down, off_time[r][w](us) = [190804081][41950].
I (218620) MB_TCP_MASTER_PORT: Connecting to slaves...
-.I (218650) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (220580) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (220580) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (226710) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (226710) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (226720) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (226720) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (228700) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (228700) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (228720) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (228730) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (228730) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (228740) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (230720) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (230720) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (231230) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (231230) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (231240) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (231250) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (233230) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (233230) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (233250) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (233260) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (233260) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (233270) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (235240) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (235250) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (235760) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (235760) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (235770) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (235770) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (237760) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (237760) MASTER_TEST: Characteristic #0 (Data_channel_0) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (237780) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (237780) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11.
I (237790) MB_TCP_MASTER_PORT: Connecting to slaves...
-I (237800) MB_TCP_MASTER_PORT: Connected 1 slaves, start polling...
E (239780) MB_CONTROLLER_MASTER: mbc_master_get_parameter(76): Master get parameter failure, error=(0x108) (ESP_ERR_INVALID_RESPONSE).
E (239780) MASTER_TEST: Characteristic #1 (Humidity_1) read fail, err = 0x108 (ESP_ERR_INVALID_RESPONSE).
E (240290) MB_TCP_MASTER_PORT: Socket(#54)(192.168.1.2) connection closed.
E (240290) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), critical error=-11
So, my guess at the reason:
ERR_CONN (-11)
means that the master cannot connect to the slave. In some cases the slave may still allow the connection:
I (218620) MB_TCP_MASTER_PORT: Slave #0, Socket(#54)(192.168.1.2), slave is down, off_time[r][w](us) = [190804081][41950].
This causes the WDT to trigger. It happens because the master tries to read data from the slave, but the slave has no data and the non-blocking recv()
immediately returns EAGAIN
which means a read timeout.
The loop then repeats while the slave is unable to return data, which can become an endless loop. The code is here. Although I was not able to reproduce this situation, I think it is the only possible reason for this restart behavior, and it became possible after recv() was switched to non-blocking mode. To check it on your side, please change the code as below and re-run your example.
static int xMBTCPPortMasterGetBuf(MbSlaveInfo_t* pxInfo, UCHAR* pucDstBuf, USHORT usLength)
{
    int xLength = 0;
    UCHAR* pucBuf = pucDstBuf;
    USHORT usBytesLeft = usLength;
    MB_PORT_CHECK((pxInfo && pxInfo->xSockId > -1), -1, "Try to read incorrect socket = #%d.", pxInfo->xSockId);
    // Receive data from connected client
    while (usBytesLeft > 0) {
        xMBTCPPortMasterCheckShutdown();
        // non-blocking read from socket with timeout
        xLength = recv(pxInfo->xSockId, pucBuf, usBytesLeft, MSG_DONTWAIT);
        if (xLength < 0) {
            if (errno == EAGAIN) {
                // Read timeout occurred, continue reading
                /// change the lines as below: <<<<<<<<<<<<<<<<<<<<<<<<<<<<
                ESP_LOGE(TAG, "Socket (IP:%s) read timeout.", pxInfo->pcIpAddr);
                vTaskDelay(10);
                continue;
            }
            // ... the rest of the error handling and the loop stay unchanged
        }
    }
Please let me know the result together with the log.
Thanks for creating the task. All this time I have been testing the example, and from the logs I also noticed that the module is not reloaded, so the behavior is the same for me as for you. I will make the corrections in the code and report the results. Thanks for the detailed analysis of the problem.
20220420145351.log Here is the log after editing.
@pr200sd , thank you for the update. Unfortunately, I cannot get a backtrace from your log because I do not have your binary. A possible option for the backtrace: run
idf.py monitor
in the command line and get the parsed backtrace of functions in the log.
After fixing the file, I can't get the module to reload anymore, and the WDT also stopped triggering, but now there are sometimes timeout messages. In principle, I can already use the TCP master in this version. If you plan to modify anything else, I think you will get more information if you run my project with the fix and a Modbus slave on a PC.
@pr200sd,
Thank you for the information.
After fixing the file, I can't get the module to reload, the WDT also stopped triggering; if you plan to modify something else, then I think you will have more information if you run my project with the fix and a Modbus slave on a PC.
According to your log above, I can see that the watchdog is still triggered, and it seems the reason is different; it could happen in the connection loop too. Unfortunately, this does not happen on my device and I do not have the opportunity, for now, to test it for a long time. However, I will check this as soon as I can in order to fix it for good.
but now sometimes there are timeout messages.
It is still possible that the timeout messages happen, especially if you use many other high-priority tasks in your application. In this case, you can retry the transaction after a timeout.
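The retry suggested above can be expressed as a small wrapper. This is a hypothetical sketch, not code from the Modbus stack: the names retry_transaction, transaction_fn, and the stub are all assumptions used only to illustrate the idea of re-running a failed transaction a bounded number of times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: a transaction is any function returning 0 on
 * success and a non-zero error code (e.g. a timeout) on failure. */
typedef int (*transaction_fn)(void *ctx);

static int retry_transaction(transaction_fn fn, void *ctx, int max_retries)
{
    int err = -1;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        err = fn(ctx);
        if (err == 0) {
            return 0;   /* success, stop retrying */
        }
        /* In a real application a short delay (e.g. vTaskDelay())
         * would go here before the next attempt. */
    }
    return err;         /* still failing after all retries */
}

/* Example stub to illustrate the wrapper: fails twice, then succeeds. */
static int stub_calls = 0;
static int stub_fail_twice(void *ctx)
{
    (void)ctx;
    return (++stub_calls < 3) ? -1 : 0;
}
```

In an ESP-IDF application the transaction body would typically wrap a call such as mbc_master_get_parameter(), with the bound on retries chosen so the total blocking time stays acceptable for the rest of the system.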
I can already use master TCP, if you plan to modify something else, then I think you will ...
The change I described above is actually not a complete fix, but a way to check for read timeouts during master-slave communication on your side. To prevent the watchdog from triggering, the code can be modified as below, but later I plan to return to a blocking read in recv() and avoid the delay in the loop. Note: the updated code uses recv() in blocking mode to read for up to the maximum response time.
#define MB_TCP_PORT_RESPOND_TIMEOUT (CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND * 1000)

static int xMBTCPPortMasterGetBuf(MbSlaveInfo_t* pxInfo, UCHAR* pucDstBuf, USHORT usLength)
{
    int xLength = 0;
    UCHAR* pucBuf = pucDstBuf;
    USHORT usBytesLeft = usLength;
    struct timeval xTime;
    MB_PORT_CHECK((pxInfo && pxInfo->xSockId > -1), -1, "Try to read incorrect socket = #%d.", pxInfo->xSockId);
    // Set receive timeout for socket <= slave respond time
    xTime.tv_sec = 0;
    xTime.tv_usec = MB_TCP_PORT_RESPOND_TIMEOUT;
    setsockopt(pxInfo->xSockId, SOL_SOCKET, SO_RCVTIMEO, &xTime, sizeof(xTime));
    // Receive data from connected client
    while (usBytesLeft > 0) {
        xMBTCPPortMasterCheckShutdown();
        xLength = recv(pxInfo->xSockId, pucBuf, usBytesLeft, 0);
        if (xLength < 0) {
            if (errno == EAGAIN) {
                // Read timeout occurred, continue reading
            } else if (errno == ENOTCONN) {
                // Socket connection closed
                ESP_LOGE(TAG, "Socket(#%d)(%s) connection closed.",
                         pxInfo->xSockId, pxInfo->pcIpAddr);
                return ERR_CONN;
            } else {
                // Other error occurred during receiving
                ESP_LOGE(TAG, "Socket(#%d)(%s) receive error, length=%d, errno=%d",
                         pxInfo->xSockId, pxInfo->pcIpAddr, xLength, errno);
                return -1;
            }
        } else if (xLength) {
            pucBuf += xLength;
            usBytesLeft -= xLength;
        }
        if (xMBTCPPortMasterGetRespTimeLeft(pxInfo) == 0) {
            return ERR_TIMEOUT;
        }
    }
    return usLength;
}
I will keep you posted about the results as soon as I return to this. Your results are also useful to me.
Thanks.
Hello @pr200sd,
Could you please try and confirm that the fix above solves your problem? Thank you.
I just now saw that a change has been added to the code; I will try to apply it on my side in the near future and check it out.
Isn't the value 1000 in the line (CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND * 1000) an error? The value is now set in ms.
The tv.tv_usec
field is defined in microseconds. The code above uses a blocking call to read the answer for up to CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND milliseconds, so the millisecond value is multiplied by 1000 to convert it to microseconds.
Good afternoon. I managed to test the library with the updated code. There are no reboots now; the uptime was 300000 s. However, there are messages about a timeout and an unrecognized response in the logs, so I'll try to increase the time in the settings. I also noticed one peculiarity: if the master starts up without an available slave, then after the slave is turned on the master cannot establish a connection, while if the master starts up with the slave present and the slave then turns off, the connection is resumed. What can cause this?
I managed to test the library with the updated code. There are no reboots now, the uptime was 300000 s, but there are messages about a timeout and an unrecognized response in the logs,
Good. The timeout errors are possible when the master cannot get the response within the response timeout. There may be several reasons for this, and you can retry the transaction in this case.
and an unrecognized response in the logs,
If a response timeout happened for some slave, the master still needs to read the buffer of each slave, check the TID field, and drop the buffer if it is a response to a previous transaction (the one for which the response timeout error occurred). The messages you see are probably shown because of this behavior.
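For reference, the transaction identifier (TID) that this check relies on occupies the first two bytes of the MBAP header of a Modbus TCP frame, transmitted big-endian. A minimal sketch of the stale-response check follows; the helper names are hypothetical and not the actual stack code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch: extract the TID from the MBAP header of a
 * Modbus TCP frame (bytes 0..1, big-endian) and compare it with the
 * TID of the last request to detect a stale response. */
static uint16_t mbap_get_tid(const uint8_t *frame)
{
    return (uint16_t)((frame[0] << 8) | frame[1]);
}

static bool is_stale_response(const uint8_t *frame, uint16_t expected_tid)
{
    /* A mismatched TID means this frame answers a previous request
     * (e.g. one that already timed out) and should be dropped. */
    return mbap_get_tid(frame) != expected_tid;
}
```

A master that drops such frames silently (or with a log message, as seen here) keeps the request/response pairing consistent after a timeout.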
I also noticed one feature, if the master starts up without an available slave, then after the slave is turned on, the master cannot establish a connection,
So, the slave becomes active during the master connection cycle (the repeated '-' is shown in the console), but the master cannot connect to it? I need to check this.
The log of your communication could be helpful. Thank you.
I found a solution to my question: I increased the timeout to 3600 seconds. Now, even if the TCP master starts without a TCP slave on the network, the connection is established within this time.
This issue has been fixed with commit: 0b1f0d41d976e3ba38ceca246eced7890ecf7748
This issue is about the crash of the Modbus master TCP after a long run time. It corresponds to the discussion in ESP_IDF.
Environment:
- Module or chip used: ESP32-WROOM-32
- IDF version (run git describe --tags to find it): v4.4-dev-1404-gc13afea635
- Build System: idf.py / Eclipse IDE
- Compiler version (run xtensa-esp32-elf-gcc --version to find it): xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
- Operating System: W10
- Using an IDE?: [No|Yes (please give details)]
- Power Supply: USB
The artifacts from that issue are included below:
- The complete code + binaries of the example to reproduce the issue (120422.ZIP)
- 20220413093143.log
- 20220414172430.log
- backtrace of the calls from the log file above