EdgeTX / edgetx

EdgeTX is the cutting edge open source firmware for your R/C radio
https://edgetx.org
GNU General Public License v2.0
1.61k stars 340 forks source link

Infinite loop in crossfireProcessFrame #5238

Closed LupusTheCanine closed 3 months ago

LupusTheCanine commented 4 months ago

Is there an existing issue for this problem?

What part of EdgeTX is the focus of this bug?

Transmitter firmware

Current Behavior

Packets that fail address or CRC checks are not correctly discarded. pkt_len is threated as valid even though it isn't. If Pkt_len is set to 0x100 (truncated to 0x00) the function goes into an infinite loop,

Expected Behavior

  1. trailing packet beggining not to be discarded
  2. invalid packets be correctly discarded ex. setting pkt_len to 1
  3. len field set to 0xFE not to cause infinite loop

Steps To Reproduce

  1. send a lot of packets filled with 0xFE bytes. Packets must be sent in bursts that will not be read in one step so trailing packet head drop will happen.
  2. Observe GUI lockup

Version

2.10.1

Transmitter

RadioMaster TX16S / TX16SMK2

Operating System (OS)

Windows

OS Version

No response

Anything else?

image Issue was first observed on EdgeTX2.10 then extensive testing of the issue was done on EdgeTX2.11 with expanded tracing. Snipped Debug log with marked issue bytes EdgeTXDebugCommnet.txt

raphaelcoeffic commented 4 months ago

If Pkt_len is set to 0x100 (truncated to 0x00) the function goes into an infinite loop

I'm wondering how this is supposed to happen, considering that we uint8_t as the type. That being said, we should check for 0 length packets anyway, that's right.

Regarding "Dropped start of the frame", the current implementation assumes that we don't receive such frames which would require a continuation packet where the beginning of the frame was at the end of another packet. We could however add support for these if necessary.

What bugs me however is: do we really have such jumbo frames (length = 0x1F)? As of now the maximum frame length supported is based on the maximum buffer size, which is currently TELEMETRY_RX_PACKET_SIZE = 128. If you want to support frames up to the full 256 bytes (254 + 2), we need to change the types from uint8_t to uint16_t at some other places.

Also, which kind of receiver / module supports these jumbo frames? What did you use for your tests?

raphaelcoeffic commented 4 months ago

As of now, and to make things more stable, I'll focus on unit tests with different types of frames / potential errors.

LupusTheCanine commented 4 months ago

If Pkt_len is set to 0x100 (truncated to 0x00) the function goes into an infinite loop

I'm wondering how this is supposed to happen, considering that we uint8_t as the type. That being said, we should check for 0 length packets anyway, that's right.

The issue is that even though 0xFE(254) is not a valid packet length it is treated as such leading to total packet length being set to 0 ((254+2)%256). 0xFE may appear in p_buf[1] aka [len] if begining of the packet was dropped, previous packet was malformed, current packet is malformed, or non CRSF data is received on the port.

Regarding "Dropped start of the frame", the current implementation assumes that we don't receive such frames which would require a continuation packet where the beginning of the frame was at the end of another packet. We could however add support for these if necessary.

IMHO there is no guarantee that is the case so it would be best to assume that serial reads do not align with CRSF frames.

What bugs me however is: do we really have such jumbo frames (length = 0x1F)? As of now the maximum frame length supported is based on the maximum buffer size, which is currently TELEMETRY_RX_PACKET_SIZE = 128. If you want to support frames up to the full 256 bytes (254 + 2), we need to change the types from uint8_t to uint16_t at some other places.

I use 63 byte frames for Mavlink passthrough. As far as I know CRSF is limited to 64 byte frames (source) so I adhered to that.

Also, which kind of receiver / module supports these jumbo frames? What did you use for your tests?

I used Radiomaster Ranger Micro (ESP32) with my expansion of ExpressLRS (which now supports Mavlink in master) that supports sending mavlink frames to and from the handset through CRSF passthrough [sync/address][len][type(0xFF)][p_len][mavlink][crc].

I wouldn't call 31 bytes a jumbo frame when mavlink packets can be 280 bytes 😅. I use modified ExpressLRS (derivative of the current work to support mavlink) that sends mavlink to the handset through CRSF passthrough. I have a working script that can send and receive mavlink (right now it is "Hello world" microcontroller world aka blinking diode). I plan to add further support for mavlink in lua when I have a working proof of concept written in lua.

raphaelcoeffic commented 4 months ago

0xFE may appear in p_buf[1] aka [len] if the beginning of the packet was dropped

Thx for the clarification.

IMHO there is no guarantee that is the case so it would be best to assume that serial reads do not align with CRSF frames.

Can be dealt with, I have no issue with that.

I use 63 byte frames for Mavlink passthrough. As far as I know CRSF is limited to 64 byte frames.

Good, so let's keep the max buffer size at 128 bytes.

I use modified ExpressLRS (derivative of the current work to support mavlink) that sends mavlink to the handset through CRSF passthrough.

Ok, cool, then let's make it work.

LupusTheCanine commented 4 months ago

I would check if the new frame is guaranteed to be no larger than half the buffer length, especially when the buffer is limited to 19 bytes.

LupusTheCanine commented 4 months ago

In the worst case it is possible to have a read with 63 trailing bytes followed by 128 bytes read, we should either limit read to one packet or increase buffer, I would also apply the buffer/read size logic to small packet radio since they appear to be potentially affected though FrSky appears to use some different process as I don't see the parser for it when looking for "processFrame".

ExpressLRS appears to behave well and only send chunks small enough to be handled with 128 byte buffer I don't have other hardware to test that except XJT though given how S.Port works we are unlikely to run into saturation issues with it.

If I set too serial baud to 921k I get quick trip to Emergency Mode.

raphaelcoeffic commented 4 months ago

@LupusTheCanine I've created #5243 with unit tests so we can verify if things are the way we think they are. Let me know if I should add more tests.

raphaelcoeffic commented 4 months ago

In the worst case it is possible to have a read with 63 trailing bytes followed by 128 bytes read, we should either limit read to one packet or increase buffer,

I'm really not sure I understand properly what you mean. What do you mean with "read" in this context? The serial receiver is handling this either based on IRQ (for each received byte) or DMA based, where the received payload is read directly into a circular buffer. Most radios use IRQ-based. IRQ-based receiver will stop writing into the RX buffer when full.

I would also apply the buffer/read size logic to small packet radio since they appear to be potentially affected though FrSky appears to use some different process as I don't see the parser for it when looking for "processFrame".

processFrame API is used only for CRSF for now. We might apply the same logic to other protocols later if we see fit. The main difference is that we make use of the IDLE IRQ to trigger frame processing. The IDLE IRQ is triggered when the RX line is idle for at least a byte length (idle time >= (1 / baudrate) * 10).

When the IDLE IRQ triggers, we schedule the processing function asynchronously so that FreeRTOS will execute that function out of the IRQ context. The latency between reception of the frame and processing is much shorter than when period polling of the buffers is used.

LupusTheCanine commented 4 months ago

I'm really not sure I understand properly what you mean. What do you mean with "read" in this context? The serial receiver is handling this either based on IRQ (for each received byte) or DMA based, where the received payload is read directly into a circular buffer. Most radios use IRQ-based. IRQ-based receiver will stop writing into the RX buffer when full.

I mean the situation where we end up with a rxBuffer containing 63 bytes (worst case incomplete frame) and inconsiderate module sends 128 byte burst that we can't handle because it won't fit in the rxBuffer along with the old frame.

I extended the tests to include following situations

  1. continuation frame contains start of another frame (module cutting frames as it wants instead of on packet boundary)
  2. invalid frame recovery on length errors
  3. length error f the frame is 65 bytes long (len=63)
  4. various invalid frames
    • Random bytes
    • A frame with invalid CRC
  5. Jumbo frames

I also wrote convenience macros to make writing frameParser tests a bit less tedious. https://github.com/LupusTheCanine/edgetx/tree/fix-5238

raphaelcoeffic commented 4 months ago

length error f the frame is 65 bytes long (len=63)

Should we really consider frames of 65 bytes an issue? IMHO, we should only care about our buffer size, which is 128 bytes right now. Testing for a maximum frame length of 64 bytes would be something new that might confuse other modules (it's not just ELRS and original Crossfire using CRSF as a proto).

Regarding one of the last tests you added (badFrames): you cannot expect to add some random bytes between 2 frames sent together at the same time and expect we could/should sort these out. What happens here is that we just throw away the rest of the buffer, in case that happens.

IMHO, it makes no sense to consider the case where a module would send a random stream of bytes sent in batches that have nothing to do with the underlying framing. That module won't behave like a TCP stack, on top of which you're sending some payload chunks. This would be much better comparable to UDP datagrams, where you have to consider the MTU to avoid causing some IP fragmentation.

Note: your tests did not compile as-is, I had to fix a few things, so I guess you did not try them out, right?

LupusTheCanine commented 4 months ago

Should we really consider frames of 65 bytes an issue?

IMHO yes because it is above what CRSF allows according to the working group so it is unlikely to be a valid frame. AFAIK there are no frame format with total length larger than 64.

Regarding one of the last tests you added (badFrames): you cannot expect to add some random bytes between 2 frames sent together at the same time and expect we could/should sort these out.

These could as well be a frame that got a single bit flipped as is the case with the second invalid frame 😁 (I accidentally nailed CRC).

IMHO, it makes no sense to consider the case where a module would send a random stream of bytes sent in batches that have nothing to do with the underlying framing.

I would consider module that received a burst of data and utilizes the entire slots without attempting to align packet starts with the slot though we would have to increase buffer size to 192 to fully handle that >(63+128-3).

Note: your tests did not compile as-is, I had to fix a few things, so I guess you did not try them out, right?

I did.

My bad I didn't push the last commit.

raphaelcoeffic commented 4 months ago

I ended up adding some simple mechanism to skip invalid start bytes at the beginning of a frame, so that all the tests you proposed are now running more less as-is.

Also, the new version is able to process frames directly out of the RX buffer, using the telemetry buffer only for re-assembly and only if necessary. That saves a few memcpy.

IMHO yes because it is above what CRSF allows according to the working group so it is unlikely to be a valid frame.

I'll check with the others whether or not this is really set in stone.

LupusTheCanine commented 4 months ago

I checked your fix with my setup and I can't see any signs of dropped or discarded frames in the debug.

LupusTheCanine commented 4 months ago

Update, few dropped frames.

LupusTheCanine commented 4 months ago

Did a longer 10+ hour test, unfortunately Debug output quit (before showing any bad frames).

LupusTheCanine commented 4 months ago

Another test, this time debug cooperated. 14 hours, about 400 fails to parse but no lockups

EdgeTXDebug.txt