arduino-libraries / ArduinoBLE

ArduinoBLE library for Arduino
GNU Lesser General Public License v2.1
307 stars 205 forks source link

Possible signal corruption from other code/modules + filter code for RX #213

Open mattleesmi opened 2 years ago

mattleesmi commented 2 years ago

Hello all,

I have been working on trying to stop the BLE on my IoT's from crashing (#73). Originally I focused on catching the _recvIndex value so it didn't go too high (similar to this #202 and possibly #187). However, as I mentioned in #73, I continued to look at specific causes of the runaway _recvIndex and I found that what was happening was that the BLE was somehow receiving packets of data that would break the if statements meant to handle it.

The Filter Fix

Anyway, I set about writing some simple filters to catch these issues (again similar to #202) these were as follows:

In HCI.cpp I changed the while loop in Poll to this:

while (HCITransport.available()) {
        byte b = HCITransport.read();
        _recvBuffer[_recvIndex++] = b;
        //Legacy index catch left as backup, hopefully redundant
        if (_recvIndex > sizeof(_recvBuffer) - 1) {
            if (_debug) _debug->println("Index catch");
            recvIndex_catch = true;
            _recvIndex = 0;
        }
        if (_recvBuffer[0] == HCI_ACLDATA_PKT) {
            if ((_recvIndex > 5) && _recvIndex >= (5 + (_recvBuffer[3] + (_recvBuffer[4] << 8)))) {
                if (_debug) {
                    dumpPkt("HCI ACLDATA RX <- ", _recvIndex, _recvBuffer);
                }
#ifdef ARDUINO_AVR_UNO_WIFI_REV2
                digitalWrite(NINA_RTS, HIGH);
#endif
                int pktLen = _recvIndex - 1;
                _recvIndex = 0;

                handleAclDataPkt(pktLen, &_recvBuffer[1]);

#ifdef ARDUINO_AVR_UNO_WIFI_REV2
                digitalWrite(NINA_RTS, LOW);
#endif
            } else if (_recvIndex > 5 && ((5 + (_recvBuffer[3] + (_recvBuffer[4] << 8))) > (sizeof(_recvBuffer) - 1))) {
                if (_debug) {
                    _debug->print("ACL too big: "), _debug->println((5 + (_recvBuffer[3] + (_recvBuffer[4] << 8))));
                    dumpPkt("HCI ACLDATA RX <- ", _recvIndex, _recvBuffer);
                    _debug->println();
                }
                _recvIndex = 0;
            }
        } else if (_recvBuffer[0] == HCI_EVENT_PKT) {
            if (_recvIndex > 3 && _recvIndex >= (3 + _recvBuffer[2])) {
                bool ok_to_handle = false;
                //Incoming data quality filter
                switch (_recvBuffer[1]) {
                    case EVT_DISCONN_COMPLETE:
                    case EVT_CMD_COMPLETE:
                    case EVT_CMD_STATUS:
                    case EVT_NUM_COMP_PKTS:
                        ok_to_handle = true;
                        break;
                    case EVT_LE_META_EVENT:
                        //Advertisement quality filter, advert payloads are max 37 bytes, eirLength is on 14th byte eirData is 32 bytes
                        if (_recvBuffer[3] == 0 || _recvBuffer[3] > EVT_LE_ADVERTISING_REPORT) {
                            if (_debug) {
                                _debug->println("Bad Subevent Read");
                                dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
                                _debug->println();
                            }
                        } else if ((3 + _recvBuffer[2]) > 48) {
                            if (_debug) {
                                _debug->println("Advert too big");
                                dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
                                _debug->println();
                            }
                        } else if (_recvBuffer[13] > 32) {
                            if (_debug) {
                                _debug->print("eirLength too big: "), _debug->println(_recvBuffer[13]);
                                dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
                                _debug->println();
                            }
                        } else {
                            ok_to_handle = true;
                        }
                        break;
                    default:
                        if (_debug) {
                            _debug->println("Bad RX Event");
                            dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
                            _debug->println();
                        }
                        break;
                }
                if (ok_to_handle) {
                    if (_debug) dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
#ifdef ARDUINO_AVR_UNO_WIFI_REV2
                    digitalWrite(NINA_RTS, HIGH);
#endif

                    // received full event
                    int pktLen = _recvIndex - 1;
                    _recvIndex = 0;
                    handleEventPkt(pktLen, &_recvBuffer[1]);

#ifdef ARDUINO_AVR_UNO_WIFI_REV2
                    digitalWrite(NINA_RTS, LOW);
#endif
                } else {
                    _recvIndex = 0;
                }
            } else if ((3 + _recvBuffer[2]) > sizeof(_recvBuffer) - 1) {
                if (_debug) {
                    _debug->print("Event too big: "), _debug->println(3 + _recvBuffer[2]);
                    dumpPkt("HCI EVENT RX <- ", _recvIndex, _recvBuffer);
                    _debug->println();
                }
                _recvIndex = 0;
            }
        } else {
            _recvIndex = 0;
            // if (_debug) _debug->println(b, HEX);
        }
    }

I read on https://microchipdeveloper.com/wireless:ble-link-layer-packet-types that the advertisement packets should only be up to 37 bytes but I manage to make valid one with my phone that was bigger.

This catches a lot of errors, however, I noticed a similar issue to #202 where the scan would stop taking new devices after a while. I managed to pin that down to GAP.cpp, in particular, the part if (_discoveredDevices.size() >= GAP_MAX_DISCOVERED_QUEUE_SIZE) was getting stuck and permanently remaining true (and therefore dropping new devices), so I did this:

        if (_discoveredDevices.size() >= GAP_MAX_DISCOVERED_QUEUE_SIZE) {
            if (_debug) _debug->println("Queue full");
            queue_Stuck++;
            if (queue_Stuck == 5) {
                if (_debug) _debug->println("Queue clear");
                queue_Stuck = 0;
                for (unsigned int i = 0; i < _discoveredDevices.size(); i++) {
                    _discoveredDevices.remove(i);
                }
            }
            // drop
            return;
        }

which is a little crude but seems to work.

Here are some logging extracts of these filters at work:

Bad Subevent Read
HCI EVENT RX <- 04|3E|BD|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|67|1E|6D|68|31|99|BD|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|BD|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|67|1E|6D|68|31|99|B1|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B2|04|3E|BC|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|67|1E|6D|68|31|99|B9|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B9|04|3E|14|02|01|03|01|F1|BB|C7|67|BC|D2|08|07|FF|4C|00|12|02|00|00|A2|04|3E|21|02|01|03|00|AE|D8|97|7A|FA|28|15|14|2B|00|A8|01| Size:192

HCI ACLDATA RX <- 02|03|00|00|00|A2| Size:6
HCI EVENT RX <- 04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|B8|04|3E|A2|04|3E|1E|02|01|00|01|F9|ED|A8| Size:33
SUB_EVENT: 2
type: 0 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 18 eirData: 02|01|1A|02|0A|B8|04|3E|A2|04|3E|1E|02|01|00|01|F9|ED| Size:18
 rssi: -88
GAP Handled

ACL too big: 2567
HCI ACLDATA RX <- 02|01|1A|02|0A|08| Size:6

HCI EVENT RX <- 04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|BE| Size:15
SUB_EVENT: 2
type: 4 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 0 eirData:  Size:0
 rssi: -66
Discovered!
Deleted!
GAP Handled

HCI EVENT RX <- 04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|B6| Size:33
SUB_EVENT: 2
type: 0 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 18 eirData: 02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99| Size:18
 rssi: -74
Queue full
GAP Handled

HCI EVENT RX <- 04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B6| Size:15
SUB_EVENT: 2
type: 4 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 0 eirData:  Size:0
 rssi: -74
Queue full
GAP Handled

HCI EVENT RX <- 04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|BA| Size:33
SUB_EVENT: 2
type: 0 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 18 eirData: 02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99| Size:18
 rssi: -70
Queue full
GAP Handled

HCI EVENT RX <- 04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|BA| Size:15
SUB_EVENT: 2
type: 4 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 0 eirData:  Size:0
 rssi: -70
Queue full
GAP Handled

22:49:58

Bad Subevent Read
HCI EVENT RX <- 04|3E|B0|04|3E|B9|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|BA|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|BB|04|3E|A2|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|B6|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B7|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|B6|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B5|04|3E|A3|04|3E|B9|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08| Size:179

HCI EVENT RX <- 04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B9| Size:15
SUB_EVENT: 2
type: 4 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 0 eirData:  Size:0
 rssi: -71
Queue full
Queue clear
GAP Handled

Bad Subevent Read
HCI EVENT RX <- 04|3E|B8|04|3E|21|02|01|03|00|AE|D8|97|7A|FA|28|15|14|2B|00|A8|01|51|A9|0E|56|00|AE|D8|97|7A|FA|28|02|03|00|00|00|A5|04|3E|B1|04|3E|21|02|01|03|00|AE|D8|97|7A|FA|28|15|14|2B|00|A8|01|51|A9|0E|56|00|AE|D8|97|7A|FA|28|02|03|00|00|00|A0|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|B5|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B4|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59|12|02|01|1A|02|0A|08|0B|FF|4C|00|10|06|2F|1E|6D|68|31|99|B6|04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|B4|04|3E|1E|02|01|00|01|F9|ED|A8|56|DF|59| Size:187

ACL too big: 2567
HCI ACLDATA RX <- 02|01|1A|02|0A|08| Size:6

HCI EVENT RX <- 04|3E|0C|02|01|04|01|F9|ED|A8|56|DF|59|00|BC| Size:15
SUB_EVENT: 2
type: 4 peerBdaddrType: 1 peerBdaddr: F9|ED|A8|56|DF|59| Size:6
 eirLength: 0 eirData:  Size:0
 rssi: -68
Discovered!
Deleted!
GAP Handled

In this log, you can see instances where the RX data is mashed up together causing the code to think it is getting a really long RX or giving the wrong subevent type. Also, there is a case of the queue getting stuck. But you can also see that the code rectifies itself and is still able to detect new devices afterwards.

Although not the most elegant of solutions, these fixes seem to work and I can leave the BLE running for a long time with lots of other stuff going on, my current project has neopixels, WIFI, Sevros, and motors, and it can still detect/connect to a new device.

However!

So this is where it gets weird, I can't replicate the errors without lots of other code on the go. In fact, I have run the following code with my adjusted HCI.cpp and GAP.cpp with intervals of around 5 - 10 seconds and it still gives clean RX data but will eventually crash. A 1-second interval works fine. So far there hasn't been a single error. It does still crash with these long intervals but this somewhat contradicts the general advice that it is the time between BLE.Poll() that causes the issue (e.g., #130). Also, oddly if you just call HCITransport.available() it never seems to crash. I feel like I am missing something here.

void setup() {
    Serial.begin(9600);
    while (!Serial)
        ;

    // begin initialization
    if (!BLE.begin()) {
        Serial.println("starting BLE failed!");

        while (1)
            ;
    }

    Serial.println("BLE Central scan callback");
    HCI.debug(Serial);
    GAP.debug(Serial);
    // set the discovered event handle
    BLE.setEventHandler(BLEDiscovered, bleCentralDiscoverHandler);

    // start scanning for peripherals with duplicates
    BLE.scan(true);
}

void loop() {
    unsigned long currentMillis = millis();

    if (currentMillis - previousMillis >= interval) {
        // poll the central for events
        BLE.poll();
        previousMillis = currentMillis;
    }
}

So the only conclusion I can draw is that some other piece of code or something is introducing some sort of noise or clash meaning that the BLE is reading incorrectly or the signals are overlapping. I am aware of #96, but I don't think this is just slow, this data seems corrupted somehow. Am I barking up the wrong tree here, should it try adding other libraries until I get the corruption? Are the filters introducing the errors I am catching? I can propose this as a pull request or remove it if it is too similar to the old stuff.

Matt

DaveWare commented 2 years ago

I am using the Redboard Artemis and experiencing numerous problems with using it as a central. The first I encountered were these random faults and hangs. The memory overruns and packet corruption is cause by a thread race condition in HCITransport.cpp. HCICordioTransportClass uses a circular buffer to store incoming bytes. Depending on how long your event handler takes to process a scan will determine how long it takes the reader to start throwing away data and hence corrupting packets. After that the outcome is unpredicatable. I solved this one with putting a yield() in handleRxData until the read() makes enough room in the circular buffer.

void HCICordioTransportClass::handleRxData(uint8_t* data, uint8_t len)
{
  while (_rxBuf.availableForStore() < len)
    rtos::ThisThread::yield();

  for (int i = 0; i < len; i++) {
    _rxBuf.store_char(data[i]);
  }

  bleEventFlags.set(0x01);
}

I also took a FIFO approach to the discoveredDevice list in GAPClass::handleLeAdvertisingReport. When it fills I remove the one that has been in the longest and free it. If you don't free the device there will be a memory leak that will eventually cause you to run out of heap.

 if (discoveredDevice == NULL) {
    if (_discoveredDevices.size() >= GAP_MAX_DISCOVERED_QUEUE_SIZE) {
      // Pop the top
      BLEDevice* device = _discoveredDevices.remove(0);
      if (device != NULL)
        delete device;
    }

    discoveredDevice = new BLEDevice(addressType, address);

    _discoveredDevices.add(discoveredDevice);
    discoveredIndex = _discoveredDevices.size() - 1;
  }

I still experience the scan stopping after a couple of seconds but it doesn't hardfault anymore.