espressif / arduino-esp32

Arduino core for the ESP32
GNU Lesser General Public License v2.1
13.73k stars 7.43k forks source link

Truncated UART packets when BREAK_ERROR occours #10578

Open hitecSmartHome opened 2 weeks ago

hitecSmartHome commented 2 weeks ago

Board

esp32 wrover

Device Description

psram, eth, flash etc..

Hardware Configuration

psram, eth, flash etc..

Version

latest master (checkout manually)

IDE Name

PlatformIO

Operating System

Windows 10

Flash frequency

80

PSRAM enabled

yes

Upload speed

115200

Description

I often get truncated packets on uart when accessing either the flash with LittleFS wrapper or when making an http call from client to esp32. The way I test this is that I get an interrupt from Serial1.onReceive() with 0x00 data in it when I save something to flash or when making an http request. http server uses the IDF server implementation as well as the arduino Serial wrapper.

I expect the onReceive function to call my cb only when the set timeout is triggered. It is UART_SYMBOL_TIMEOUT 1 in this case. The problem is triggered every time when I write to flash. I can access menuconfig so I made sure that the UART functions and variables are in IRAM and interrupts too.

Sketch

void Modbus::init() {
    Serial1.setRxBufferSize(MAX_RX_BUFFER_SIZE);
    Serial1.setTxBufferSize(MAX_MBUS_DATA_LENGTH);
    Serial1.begin(MBUS_BAUD, SERIAL_8N1, MBUS_RX, MBUS_TX);
    Serial1.setPins(-1, -1, -1, MBUS_RTS);
    Serial1.setMode(UART_MODE_RS485_HALF_DUPLEX);
    Serial1.setRxTimeout(MBUS_RX_TIMEOUT);

    Serial1.onReceive(
        std::bind(&Modbus::handlePacket, this),
        PACKET_TRIGGER_ONLY_ON_TIMEOUT
    );
    Serial1.onReceiveError(
        std::bind(&Modbus::handleReceiveError, this, std::placeholders::_1)
    );
}

Debug Message

Debug messages coming from my API

E (39603) Modbus: CRC error in response packet: 0x00
E (39603) Modbus: Invalid packet. Can't process it.
Raw Packet: 00
E (39608) HardwareHandler: Packet error code: 1

Other Steps to Reproduce

Give repeated modbus packets to your esp, wait for them and access the flash meanwhile.

I have checked existing issues, online documentation and the Troubleshooting Guide

SuGlider commented 2 weeks ago

onReceive() will be called only when: 1- The UART FIFO is full. 2- When RX timeout occurs.

From the comments in https://github.com/espressif/arduino-esp32/blob/master/cores/esp32/HardwareSerial.h

  // onReceive will setup a callback that will be called whenever an UART interruption occurs (UART_INTR_RXFIFO_FULL or UART_INTR_RXFIFO_TOUT)
  // UART_INTR_RXFIFO_FULL interrupt triggers at UART_FULL_THRESH_DEFAULT bytes received (defined as 120 bytes by default in IDF)
  // UART_INTR_RXFIFO_TOUT interrupt triggers at UART_TOUT_THRESH_DEFAULT symbols passed without any reception (defined as 10 symbols by default in IDF)
  // onlyOnTimeout parameter will define how onReceive will behave:
  // Default: true -- The callback will only be called when RX Timeout happens.
  //                  Whole stream of bytes will be ready for being read on the callback function at once.
  //                  This option may lead to Rx Overflow depending on the Rx Buffer Size and number of bytes received in the streaming
  //         false -- The callback will be called when FIFO reaches 120 bytes and also on RX Timeout.
  //                  The stream of incommig bytes will be "split" into blocks of 120 bytes on each callback.
  //                  This option avoid any sort of Rx Overflow, but leaves the UART packet reassembling work to the Application.
  void onReceive(OnReceiveCb function, bool onlyOnTimeout = false);

What is the value of PACKET_TRIGGER_ONLY_ON_TIMEOUT?

SuGlider commented 2 weeks ago

Could it be a memory allocation issue? Please provide a minimum sketch that can be used to confirm the issue and to debug it.

hitecSmartHome commented 2 weeks ago

PACKET_TRIGGER_ONLY_ON_TIMEOUT is true in this case. The fifo can not be full since I got messages only when asked and no message is bigger than the fifo

hitecSmartHome commented 2 weeks ago

These are standard modbus messages

SuGlider commented 2 weeks ago

OK. I can't relate UART interrupts to flash accessing / httpd task. From your description, it sounds like this is an Arduino as IDF component project with specific IDF sdkconfig settings that may be very particular to this project.

Could you please provide a minimum sketch that can be built using Arduino Core standard sdkconfig that demonstrates such issue?

SuGlider commented 2 weeks ago

What is the value of MAX_RX_BUFFER_SIZE, MAX_MBUS_DATA_LENGTH and MBUS_BAUD?

SuGlider commented 2 weeks ago

OK. I can't relate UART interrupts to flash accessing / httpd task. From your description, it sounds like this is an Arduino as IDF component project with specific IDF sdkconfig settings that may be very particular to this project.

Could you please provide a minimum sketch that can be built using Arduino Core standard sdkconfig that demonstrates such issue?

Could it be an issue with flash access / httpd tasks? Arduino setup()/loop() will run under the lowest FreeRTOS priority. By other hand onReceive() related task will run under priority ARDUINO_SERIAL_EVENT_TASK_PRIORITY. This can be set as a global #define for the whole project.

#ifndef ARDUINO_SERIAL_EVENT_TASK_PRIORITY
#define ARDUINO_SERIAL_EVENT_TASK_PRIORITY (configMAX_PRIORITIES - 1)
#endif

You can also change the ESP32-WROVER CPU that is running the UART event Loop with ARDUINO_SERIAL_EVENT_TASK_RUNNING_CORE

#ifndef ARDUINO_SERIAL_EVENT_TASK_RUNNING_CORE
#define ARDUINO_SERIAL_EVENT_TASK_RUNNING_CORE -1
#endif

Or change the Task Stack size with ARDUINO_SERIAL_EVENT_TASK_STACK_SIZE

#ifndef ARDUINO_SERIAL_EVENT_TASK_STACK_SIZE
#define ARDUINO_SERIAL_EVENT_TASK_STACK_SIZE 2048
#endif
SuGlider commented 2 weeks ago

About stack size... onReceive() callback will be called from a separated UART Event Task. If the callback function overflows the Task stack, it can cause a similar issue.

hitecSmartHome commented 2 weeks ago
#define MBUS_BAUD 115200
#define PACKET_TRIGGER_ONLY_ON_TIMEOUT true
#define MBUS_RX_TIMEOUT 1
#define MAX_MBUS_DATA_LENGTH 256
#define MAX_RX_BUFFER_SIZE 80

This is from my sdkconfig.

CONFIG_ARDUINO_SERIAL_EVENT_RUN_NO_AFFINITY=y
CONFIG_ARDUINO_SERIAL_EVENT_TASK_RUNNING_CORE=-1
CONFIG_ARDUINO_SERIAL_EVENT_TASK_STACK_SIZE=4048
CONFIG_ARDUINO_SERIAL_EVENT_TASK_PRIORITY=24

I can increase the event stack size for testing and also the rx buffer size.

Could you please provide a minimum sketch that can be built using Arduino Core standard sdkconfig that demonstrates such issue?

I will create one for sure because I want to get to the bottom of this.

hitecSmartHome commented 2 weeks ago

Here is my whole sdkconfig: https://pastebin.com/w4wPRxDK

SuGlider commented 2 weeks ago

define MAX_RX_BUFFER_SIZE 80

Serial1.setRxBufferSize(MAX_RX_BUFFER_SIZE);

It can't be lower than 129. Please enable DEBUG level at least to INFO in order to check error and warning messages.

From HardwareSerial Source Code:

// minimum total RX Buffer size is the UART FIFO space (128 bytes for most SoC) + 1. IDF imposition.
size_t HardwareSerial::setRxBufferSize(size_t new_size) {

  if (_uart) {
    log_e("RX Buffer can't be resized when Serial is already running. Set it before calling begin().");
    return 0;
  }

  if (new_size <= SOC_UART_FIFO_LEN) {
    log_w("RX Buffer set to minimum value: %d.", SOC_UART_FIFO_LEN + 1);  // ESP32, S2, S3 and C3 means higher than 128
    new_size = SOC_UART_FIFO_LEN + 1;
  }

  _rxBufferSize = new_size;
  return _rxBufferSize;
}
hitecSmartHome commented 2 weeks ago

Well, I see that it changes the value back.

hitecSmartHome commented 2 weeks ago

Firstly

I couldn't see shit so decreased debug back to info. I could reproduce it without flash calls.

I (117321) System: Getting task info
E (117329) Modbus: CRC error in response packet: 0x00
E (117329) Modbus: Invalid packet. Can't process it.
Raw Packet: 00
E (117334) HardwareHandler: Packet error code: 1

E (117339) Modbus: CRC error in response packet: 0x14
E (117345) Modbus: Invalid packet. Can't process it.
Raw Packet: 14 00 00 ed 85
E (117353) HardwareHandler: Packet error code: 1

The packet also has missing bytes and it is splitted. The task priority is default 19 and it is amongst the biggest priority tasks the system has.

uart_event_task is priority 24 wifi is 23 esp_timer is 22 sys_evt is 20 and arduino_events is 19

I can increase the arduino_events task further hovewer. ( I want to try these things before I go write a reproduction example because it will probably involve http requests. )

hitecSmartHome commented 2 weeks ago

For some strange reason, in the menuconfig under Arduino configuration the Serial Event task priority shows up as 24 but when I do a check with uxTaskGetSystemState(taskStatusArray, uxArraySize, &ulTotalRunTime); it reports the arduino_events priority as 19. Strange...

hitecSmartHome commented 2 weeks ago

Does Arduino Serial Event task the same as arduino_events ?

SuGlider commented 2 weeks ago

Does Arduino Serial Event task the same as arduino_events ?

No. It is a separated task and settings: https://github.com/espressif/arduino-esp32/blob/0d5d50eb4186cb5ef0b7c108f87fae1cb688b84f/cores/esp32/HardwareSerial.cpp#L14-L24

https://github.com/espressif/arduino-esp32/blob/0d5d50eb4186cb5ef0b7c108f87fae1cb688b84f/cores/esp32/HardwareSerial.cpp#L123-L132

hitecSmartHome commented 2 weeks ago

So the ARDUINO_SERIAL_EVENT_TASK name is uart_event_task. In this case, the uart_event_task priority is 24 for me...

What does arduino_events task do? Like OTA and things like that?

SuGlider commented 2 weeks ago

What does arduino_events task do? Like OTA and things like that?

It is a general Event Loop that catches most of internal Arduino Core events. Most Arduino Event Callback are based on it.

SuGlider commented 2 weeks ago

@hitecSmartHome - I have marked this issue as awaiting response because we need a minimum sketch that demonstrates the issue in order to move forward.

hitecSmartHome commented 2 weeks ago

Yeah, sorry Iam thinking and testing what could cause this. At first I tought that it is general flash access but it is triggered on other http requests too. I also put a lot of critical things into iram like flash functions and rtos functions as well as uart interrupts. I can definitely trigger it with http requests but iam not sure about what operation does it exactly. Sometimes it is easily triggered with some request other times the same request does nothing even with multiple tries.

SuGlider commented 2 weeks ago

We can keep this issue open for longer. Take your time. Let us know about your progress. Thanks.

hitecSmartHome commented 2 weeks ago

Thank you very much

hitecSmartHome commented 2 weeks ago

There is definietly something to do with the rx buffer size. If I read a lot of bytes from the slaves, let's say 32x2 bytes + header and crc so like max 80 bytes it seems to overflow something. I can see the packet second half as a separate pacage and it blows my data validator since the CRC is wrong for both half.

It doesn't matter how much I increase the RX_BUFFER_SIZE. I have tried to call uart_flush_input(UART_NUM_1); on every packet so it should empty the fifo or reset the ring buffer so it can start over after every packet but it did not help.

hitecSmartHome commented 2 weeks ago

Can this be? I added 1ms wait after my callback called but before I read the serial data and it seems to work reliably. I can read huge amounts of data and I also uncommented uart_flush_input

Serial1.onReceive(
    std::bind(&Modbus::handlePacket, this),
    PACKET_TRIGGER_ONLY_ON_TIMEOUT
);

void Modbus::handlePacket() {
    delayMicroseconds(1000); // Added 1ms wait
    int available = Serial1.available();
    uint8_t rawPacket[available] = {0};
    int readBytes = Serial1.readBytes(rawPacket, available);
    //uart_flush_input(UART_NUM_1);
    if (!isPacketValid(rawPacket, readBytes)) {

        return;
    }
}

Further testing required...

SuGlider commented 2 weeks ago

The data packet is sent in a single flow? What if the slave sends a "first half" of data, delays a bit and then sends the remaining "second half". It would generate an "RX Timeout" after the first half, hence it would call onReceive for this "first half".

What if the RX Timeout is set to the maximum allowed by the protocol? It could force the "onReceive callback" to be called only after the system is sure that no more "extra" data could arrive.

SuGlider commented 2 weeks ago

#define MBUS_RX_TIMEOUT 1 is used in Serial1.setRxTimeout(MBUS_RX_TIMEOUT); This means that onReceive callback occurs after 1 symbol timeout. This is about 10bits (8 data bits+ 1 stopbit + 1 parity bit) on 115200 bps => 87 microseconds (short time). Make it 1 millisecond at 115200 ==> #define MBUS_RX_TIMEOUT 12 12 symbols of 10 bits at 115200 ==> (1/115200)*10*12 = 0,00104166 seconds --> about 1ms.

hitecSmartHome commented 1 week ago

I can't make MBUS_RX_TIMEOUT more than 1 because the idf instantly prints out an error and sets it back to 1. But I will try it again.

hitecSmartHome commented 1 week ago

E (7572) uart: tout_thresh = 12 > maximum value = 1

hitecSmartHome commented 1 week ago

What if the slave sends a "first half" of data, delays a bit and then sends the remaining "second half".

No. We measured with an oscilloscope and the slaves are sending the packets at once with consistent timings. We also in controll of the slaves firmware and we are sure that they are good. They use simple uart interrupts and hardware timers to send out the data.

Also sometimes there is a 0x00 at the first byte of the data which we do not see with an oscilloscope. That 0x00 sometimes there even with this 1ms delay after an onRecive callback.

00 0e 03 01 00 00 1c 45 should be 0e 03 01 00 00 1c 45 ( address, func code, reg num, data first, data second, crc, crc ) registers are 16 bit and sending two 8 bit data. But that 0x00 is mistery still.

If I have this 1ms delay the rest of the data is good, I don't see any error just sometimes this 0x00 byte. uart_flush_input(UART_NUM_1); is no help again in this situation

hitecSmartHome commented 1 week ago

It is a little bit modified modbus because the slaves are not waiting for the standard response timeout but sending the response back as soon as they can in order to speed up the communication. Our system can handle 254 expander modules and there are some type of slaves which can only handle 115200 baud so we need to cut timings everywhere we can. Sometimes I see 0x00 alone without any indication about other data.

E (18958) Modbus: CRC error in response packet: 0x00
CRC error raw packet: : 00
E (18958) Modbus: Invalid packet. Can't process it.
MODBUS INVALID PACKET: 00
E (18965) HardwareHandler: Packet error code: 1

This particular packet was a single byte and no previous or next packet received in half. Interesting...

hitecSmartHome commented 1 week ago

Otherwise with 1ms delay the system seems stable but slow compared to without 1ms delay. The communication is continous and we can't afford to wait on anything. The last slave should send their data as fast as it can so we don't lose packets. There can be critical data for the system to work reliable. There are short and long button push timings which can be missed if there are lot of slaves on the bus

hitecSmartHome commented 1 week ago

I think I have to switch to raw uart interrupt handling and rewrite my implementation because either arduino Serial1 is acting crazy or this whole event driven approach laggs behind.

hitecSmartHome commented 1 week ago

Currently the system has only the default tasks and the uart task has the highest priority in the system. I think I can't fine tune this further. It just doesn't work properly

hitecSmartHome commented 1 week ago

I got one again

E (601441) Modbus: CRC error in response packet: 0x00
CRC error raw packet: : 00 0a 03 01 00 00 ed 85
E (601442) Modbus: Invalid packet. Can't process it.
MODBUS INVALID PACKET: 00 0a 03 01 00 00 ed 85
E (601450) HardwareHandler: Packet error code: 1

The first 0x00 can't be the last packet last byte since the last byte must be a crc and if it was 0x00 it would catch it with a CRC error. So I don't know where that 0x00 coming from..

If it would have been this 0a 03 01 00 00 edBRAKE00 0a 03 01 00 00 ed 85 The system would have reported two packet errors in this case with CRC error but it was just one so this 0x00 is a magic byte coming from deep of the uart implementation or somewhere :D

TD-er commented 1 week ago

It is a little bit modified modbus because the slaves are not waiting for the standard response timeout but sending the response back as soon as they can in order to speed up the communication. Our system can handle 254 expander modules and there are some type of slaves which can only handle 115200 baud so we need to cut timings everywhere we can. Sometimes I see 0x00 alone without any indication about other data.

E (18958) Modbus: CRC error in response packet: 0x00
CRC error raw packet: : 00
E (18958) Modbus: Invalid packet. Can't process it.
MODBUS INVALID PACKET: 00
E (18965) HardwareHandler: Packet error code: 1

This particular packet was a single byte and no previous or next packet received in half. Interesting...

Could it be the DE/RE pin is pulled at the wrong moment? On ESP32's you can enable collision detection for RS485. For this you need to only operate the DE pin and have the /RE pin pulled to GND all the time. Have to check the souce to see how it is called, but I think the function to enable this was almost literally called RS485CollisionDetection.

hitecSmartHome commented 1 week ago

We have an rs485 chip and one controll pin. Serial is configured to use this pin Serial1.setPins(-1, -1, -1, MBUS_RTS); the MBUS_RTS pin is the transmit pin which is raised to high on transmit to indicate it to the rs485 chip

TD-er commented 1 week ago

So DE and /RE are not separated on the RS485 chip?

Maybe you could check with a scope to see if the timings of switching this are a bit too tight?

hitecSmartHome commented 1 week ago

The chip is ST485EXD

https://4donline.ihs.com/images/VipMasterIC/IC/SGST/SGSTS31136/SGSTS31136-1.pdf?hkey=6D3A4C79FDBF58556ACFDE234799DDF0

hitecSmartHome commented 1 week ago

Messenger_creation_7CF85FB8-3074-40E5-BB01-F422F6A9387D.jpeg

It is pulled high with a 2.2kohm

TD-er commented 1 week ago

So DE and /RE are separated on the chip. image

Anyway, it should be possible to see any 'collision' on the scope.

hitecSmartHome commented 1 week ago

Will investigate... i dont know why my engineer pulled it to high since the pdf also says that it should be pulled to gnd

hitecSmartHome commented 1 week ago

Anyway, it should be possible to see any 'collision' on the scope.

The problem is that these magic 0 bytes is so rare and it does not easily caught

TD-er commented 1 week ago

Will investigate... i dont know why my engineer pulled it to high since the pdf also says that it should be pulled to gnd

Well it is the 'standard' way of connecting DE and /RE together. However this makes it impossible to use 'collision detection', which is a feature the ESP32-xx have but not all devices have.

hitecSmartHome commented 1 week ago

We have an esp32 wrover - e rev 3 in theory

TD-er commented 1 week ago

ESP32-classic (as I tend to call them) also has RS485 'collision detection' available as a feature. I have used it on a few of my own boards too.

hitecSmartHome commented 1 week ago

Do you think this 0 byte is a collision? How? When the esp sent out all data, it pulls the pin low and waiting for response. What data collides?

hitecSmartHome commented 1 week ago

Also this collision does not answers to other things like truncated packets. Sometimes I have a packet in half and no extra or missing bytes. I have observed that sometimes only the last byte is truncated only. The second half of the CRC. Will read about this collision detection feature, thank you very much.

TD-er commented 1 week ago

It is just a theory as adding a delay does seem to solve your issue. So maybe the /RE pin is not held low long enough to capture all the data. With RS485 you have the added complexity of the DE/RE signal which essentially blocks any traffic in a specific direction.

I assume you let the ESP handle the DE/RE ? I don't know if this is done in hardware or in software using interrupts (could be in the SDK so you don't see it) But this could explain why underload there might be bytes missing, especially at the beginning or end of a packet.

hitecSmartHome commented 1 week ago
/**
 * @brief UART mode selection
 */
typedef enum {
    UART_MODE_UART = 0x00,                      /*!< mode: regular UART mode*/
    UART_MODE_RS485_HALF_DUPLEX = 0x01,         /*!< mode: half duplex RS485 UART mode control by RTS pin */
    UART_MODE_IRDA = 0x02,                      /*!< mode: IRDA  UART mode*/
    UART_MODE_RS485_COLLISION_DETECT = 0x03,    /*!< mode: RS485 collision detection UART mode (used for test purposes)*/
    UART_MODE_RS485_APP_CTRL = 0x04,            /*!< mode: application control RS485 UART mode (used for test purposes)*/
} uart_mode_t;

Well ..... (used for test purposes)

I assume you let the ESP handle the DE/RE ?

Yes, I let the esp do it. That was the logical way since I also can't see the buffers

TD-er commented 1 week ago

Do you think this 0 byte is a collision? How? When the esp sent out all data, it pulls the pin low and waiting for response. What data collides?

It is not really about data colliding, but more like the /RE pin not being pulled down quickly enough to read the incomming data.

Remember this is just my theory....