don / cordova-plugin-ble-central

Bluetooth Low Energy (BLE) Central plugin for Apache Cordova (aka PhoneGap)
Apache License 2.0
951 stars 606 forks source link

Notifications can be received out of order on Android #625

Closed timburke closed 4 years ago

timburke commented 6 years ago

TLDR: We think there is an internal race in the way Android invokes the onCharacteristicChanged callback which can very occasionally cause notifications to be received out of order with a provided workaround at API level 26 and above that should be included in the cordova-plugin-ble-central Android handler code.

Do you agree with this analysis and would you be open to accepting a PR implementing the fix as described in Proposed Fix?

Description

We have a bluetooth peripheral that can occasionally be triggered to stream high speed packets over a BLE characteristic using notifications. We have, say, a 2 kb file that the peripheral chunks into 20-byte chunks and sends serially over notifications of 20-bytes each. We have a SHA-256 checksum at the end to make sure there were no issues receiving the file.

We observe that, only on Android, we occasionally have invalid checksums and have tracked the issue down to the notification packets being received out of order very occasionally. We specifically see if the original order of the notifications was N, N+1, N+2, we receive N+2, N, N+1 for example.

We have confirmed that:

Investigation

We suspect that there may be a race condition inside the Android bluetooth stack related to the calls to onCharacteristicChanged. In particular at API level 26 and above Android introduced the ability to force asynchronous callbacks from the BLE stack to occur on a specific thread (rather than on an unspecified background thread as stated in the public API documentation).

Previously the Android code that invoked the onCharacteristicChanged() callback was: https://android.googlesource.com/platform/frameworks/base/+/8e970d6ab4850ca23163bad5773cf5cc2e7ff896/core/java/android/bluetooth/BluetoothGatt.java#348

/**
 * Remote characteristic has been updated.
 * Updates the internal value.
 * @hide
 */
public void onNotify(String address, int handle, byte[] value) {
    if (VDBG) Log.d(TAG, "onNotify() - Device=" + address + " handle=" + handle);
    if (!address.equals(mDevice.getAddress())) {
        return;
    }
    BluetoothGattCharacteristic characteristic = getCharacteristicById(mDevice, handle);
    if (characteristic == null) return;
    characteristic.setValue(value);
    try {
        mCallback.onCharacteristicChanged(BluetoothGatt.this, characteristic);
    } catch (Exception ex) {
        Log.w(TAG, "Unhandled exception in callback", ex);
    }
}

Current Android master changed this to:

https://github.com/aosp-mirror/platform_frameworks_base/blob/e7b05b87979dcaebcb22bda18e44f7bd37cddfc2/core/java/android/bluetooth/BluetoothGatt.java#L451

 * Remote characteristic has been updated.
 * Updates the internal value.
 * @hide
 */
@Override
public void onNotify(String address, int handle, byte[] value) {
    if (VDBG) Log.d(TAG, "onNotify() - Device=" + address + " handle=" + handle);

    if (!address.equals(mDevice.getAddress())) {
        return;
    }

    BluetoothGattCharacteristic characteristic = getCharacteristicById(mDevice,
            handle);
    if (characteristic == null) return;

    runOrQueueCallback(new Runnable() {
        @Override
        public void run() {
            final BluetoothGattCallback callback = mCallback;
            if (callback != null) {
                characteristic.setValue(value);
                callback.onCharacteristicChanged(BluetoothGatt.this,
                        characteristic);
            }
        }
    });
}

The key change is that now they update the characteristic value and call the callback usingrunOrQueueCallback which defaults to the old behavior unless you call connectGatt and pass a Handler object, in which case the callable is queued in the handler thread you pass.

See: https://github.com/aosp-mirror/platform_frameworks_base/blob/e7b05b87979dcaebcb22bda18e44f7bd37cddfc2/core/java/android/bluetooth/BluetoothGatt.java#L767 https://github.com/aosp-mirror/platform_frameworks_base/commit/b0f647492f6d262d13825b14bbada140effe8394#diff-abc02213082938a215d4eb5c60add4afR1679

There is a new connectGatt override that allows you to trigger this behavior: https://developer.android.com/reference/android/bluetooth/BluetoothDevice#connectGatt(android.content.Context,%20boolean,%20android.bluetooth.BluetoothGattCallback,%20int,%20int,%20android.os.Handler)

Proposed Fix

It seems like adding another check for Android API level >= 26 in the Peripheral constructor and passing in a Handler object to serialize callbacks to a single thread should resolve the issue:

https://github.com/don/cordova-plugin-ble-central/blob/master/src/android/Peripheral.java#L100

OS and Version Distribution Data

Here is a capture of our Sentry logging data showing the error occurring only on Android and more likely on Android 8.

image

Similar Issues

This is similar to what was reported in #515 but we have new information and the issue does not appear to be resolved.

don commented 6 years ago

Do you agree with this analysis and would you be open to accepting a PR implementing the fix as described in Proposed Fix?

This seems reasonable. A pull request would be appreciated.

timburke commented 6 years ago

@don

Thanks for the quick response. We will start working on the PR internally and open when we have it ready. We're preparing a simplified test case on an Android 8.0 device to confirm the issue and fix and will report back once its ready.

kabaehr commented 5 years ago

Did you fixed the issue and is the pull request merged already?

timburke commented 5 years ago

Hi @kabaehr,

Unfortunately we were not able to fix this issue. I spent a lot of time looking over the public android source code and how it interacts with cordova-plugin-ble-central and ultimately decided that I could not see a race condition inside cordova-plug-ble-central itself so we didn't go with the proposed fix above since it seemed like it would have a low probability of actually fixing the issue.

My recollection from what we concluded and where that would point for the issue are that either:

From our look at the code, we concluded that there was a single synchronous chain of function calls with no inter-thread handoffs between onNotify being called and the notification being handed to cordova to queue the callback on the javascript side. This implies that the onNotify function does not return until the callback is queued in cordova. Android's binder IPC mechanism is supposed to guarantee that there can be no concurrent invocations of a callback so it should block all future notifications until onNotify finishes, guaranteeing in-order delivery.

Since in-order delivery is definitely not actually observed, that lead us to suspect that either the data was already out-of-order by the time onNotify was called or that once it (in-order) queued into the cordova webview as a PluginResult, that the order was not preserved or that there was an issue with the binder IPC mechanism that was allowing concurrent callbacks. We did not succeed in conclusively proving which of these things was the culprit though.

In our case we wrote an application level notification packet re-orderer that was able to use embedded sequence numbers and rules in how we construct our notifications to discover and automatically fix the out-of-order packets.

I hope this helps.

kabaehr commented 5 years ago

@timburke thanks for this detailed answer.

I observed the same but with another cordova BLE plugin. See this issue: https://github.com/randdusing/cordova-plugin-bluetoothle/issues/419 and so I was wondering if you were able to fix it within this plugin.

I will go for a similar solution as you mentioned. What a pity that we all have to hack our way around it...

timburke commented 5 years ago

We have an update on this issue. This started happening to us much more frequently after we upgraded the cordova-ble-central plugin to the latest 1.2.2 and started testing more on Android 8.0 and 8.1. On some phones we would get out of order notifications ~5% of the time, which was bad enough that we invested a lot in figuring out where the issue was.

Based on our tests, I can now conclusively say that the issue is not the android bluetooth stack but rather something inside either cordova-ble-central or cordova itself, which is great because it means that it's fixable once we find out exactly what it happening.

Conclusive Test Description

At a high level, since we had such a consisted reproduction of the issue, we edited the java code of cordova-ble-central to directly log the binary notifications it was receiving in the onCharacteristicChanged handler. We also logged the binary notifications we received inside our javascript app (which should correspond 1-1 with what was given to onCharacteristicChanged.

We also directly extracted and parsed the binary data dumped from the new log message in onCharacteristicChanged to verify that it was received properly in order.

Data As Received in onCharacteristicChanged

1 - 15:13:13.155: 00000F10A50800008948000089480000AE6D0000
2 - 15:13:13.157: 00001610A50800008948000089480000568B0100
3 - 15:13:13.160: 00001910A50800008948000089480000EB080000
4 - 15:13:13.235: 01AC0100A5080000781F0000894800000000FFD7
5 - 15:13:13.236: 225000004F1F0000B42D0000EB69000021500000
6 - 15:13:13.238: 501F0000B42D0000AF8B010023500000511F0000
7 - 15:13:13.241: B42D0000D508000022500000521F000038310000
8 - 15:13:13.243: DB69000021500000531F000038310000A78B0100
9 - 15:13:13.245: 23500000541F000038310000DE08000022500000

Data As Received in Javsacript

1 - 15:13:13.186: 00000f10a50800008948000089480000ae6d0000
2 - 15:13:13.193: 00001610a50800008948000089480000568b0100
3 - 15:13:13.199: 00001910a50800008948000089480000eb080000
6 - 15:13:13.247: 501f0000b42d0000af8b010023500000511f0000
7 - 15:13:13.258: b42d0000d508000022500000521f000038310000
8 - 15:13:13.269: db69000021500000531f000038310000a78b0100
4 - 15:13:13.279: 01ac0100a5080000781f0000894800000000ffd7
5 - 15:13:13.282: 225000004f1f0000b42d0000eb69000021500000
9 - 15:13:13.287: 23500000541f000038310000de08000022500000

I included 3 lines before and 1 line after the issue for context.

Note how lines 1-3 and 9 are exactly the same but lines 4-8 are mixed up in the javascript side: They are received in order 6, 7, 8, 4, 5 rather than 4, 5, 6, 7, 8. We have verified explicitly as well that the 1-9 ordering is the correct order, as sent from the ble device.

We don't yet know exactly where the issue is on the cordova side but this does show that the issue is somewhere after Android hands the notification over to cordova-plugin-ble-central and before it gets to our javscript onNotification callback.

We'll update this issue as we identify where the actual problem is.

For reference, the code we inserted to log the notifications on the java side of the plugin consisted of:

  1. Adding this logging method into Peripheral from StackOverflow to convert bytes to hex

    private final static char[] hexArray = "0123456789ABCDEF".toCharArray();
    
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for ( int j = 0; j < bytes.length; j++ ) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
    
        return new String(hexChars);
    }
  2. Patching the LOG.d statement in onCharacteristicChanged to read:

    LOG.d(TAG, "onCharacteristicChanged " + characteristic + ": " + this.bytesToHex(characteristic.getValue()));
  3. Dumping the log after seeing the issue using:

    adb logcat -d | grep onCharacteristicChanged
timburke commented 5 years ago

Digging into this issue in more detail, it seems like what is happening is that there is a race condition if onCharacteristicWrite is fired at the same time that onCharacteristicChanged. It's not clear yet exactly how this works but the empirical result is that the notification value received in onCharacteristicChanged is moved to the top of the list of plugin results sent back to javscript rather than being sent in order.

To demonstrate this, I patched the onCharacteristicChanged PluginResult to be a multipart result with an incrementing counter that increments every time a new notification is received. You can clearly see in the below trace how notification 94 is received on the javscript side out of order. The Cordova callback line is triggered every time a ble notification is received (inside the javascript callback that receives it). It should be called in order. However, it goes 87, 88, 94, 89. It's likely not a coincidence that notification 94 comes at the same time as the onCharacteristicWrite ["OK"] result :

Cordova callback: BLE677958192 ["OK"]
14:45:03.567 vendor.js:104658 Cordova callback: BLE677958193 ["OK"]
14:45:03.585 vendor.js:104658 Cordova callback: BLE677958161 (2) [ArrayBuffer(4), 84]
14:45:03.589 vendor.js:104658 Cordova callback: BLE677958162 (2) [ArrayBuffer(20), 85]
14:45:03.604 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 86]
14:45:03.608 vendor.js:104658 [StreamingData] 00001210a5080000f4e40200f4e402000e000000
14:45:03.636 vendor.js:104658 Cordova callback: BLE677958194 ["OK"]
14:45:03.652 vendor.js:104658 Cordova callback: BLE677958195 ["OK"]
14:45:03.674 vendor.js:104658 Cordova callback: BLE677958161 (2) [ArrayBuffer(4), 87]
14:45:03.678 vendor.js:104658 Cordova callback: BLE677958162 (2) [ArrayBuffer(20), 88]
14:45:03.694 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 94]
14:45:03.706 vendor.js:104658 [StreamingData] 23500000541f000038310000de08000022500000
14:45:03.719 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 89]
14:45:03.723 vendor.js:104658 [StreamingData] 014c2500a5080000f6210000f4e402000000ffd7
14:45:03.725 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 90]
14:45:03.728 vendor.js:104658 [StreamingData] 225000004f1f0000b42d0000eb69000021500000
14:45:03.730 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 91]
14:45:03.733 vendor.js:104658 [StreamingData] 501f0000b42d0000af8b010023500000511f0000
14:45:03.735 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 92]
14:45:03.738 vendor.js:104658 [StreamingData] b42d0000d508000022500000521f000038310000
14:45:03.741 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 93]
14:45:03.745 vendor.js:104658 [StreamingData] db69000021500000531f000038310000a78b0100
14:45:03.747 vendor.js:104658 Cordova callback: BLE677958196 ["OK"]
14:45:03.759 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 95]
14:45:03.762 vendor.js:104658 [StreamingData] 561f0000bc340000716a000021500000571f0000
14:45:03.764 vendor.js:104658 Cordova callback: BLE677958188 (2) [ArrayBuffer(20), 96]
14:45:03.768 vendor.js:104658 [StreamingData] bc340000a28b010023500000581f0000bc340000

So, the issue on Android is evidently some sort of ordering issue with the Cordova native to javascript bridge that does not in all cases guarantee callback ordering relative to when they are queued.

However, all of the logic inside of cordova NativeToJsMessageQueue seem to be properly synchronized and correct in enforcing FIFO behavior on callback messages so I am starting to suspect that the actual Chrome Webview method evaluateJavscript used to pass the result back to javsacript inside the webview is not guaranteed to execute in order of invocation.

timburke commented 5 years ago

Root Cause Determined

I have finally tracked down what is going on. The short story is that Cordova does not actually guarantee that plugin results will be returned to javascript in order. As I read the code, there are 3 ways that a plugin result can come back from Native to Javascript (in android at least):

  1. It can be directly returned as the result of an exec call. So, when you call out to a cordova plugin, it can synchronously return a result which will be given to javascript synchronously as the return value of the exec function. Importantly, any callback results that have been queued but not yet dispatched via method 2 below will be appended to this result (if they fit in the max message size). [1, 2, 3]
  2. It can be returned asynchronously by being appended to NativeToJSQueue and sent asynchronously using evaluateJavascript which invokes cordova.callbackFromNative at an unspecified future time. [4]
  3. It can be returned by javascript calling pollOnce to get all currently pending messages. This appears to only be used if there are too many messages in method 2 to fit in a single call to evaluateJavascript. [5]

Most ble notifications are returned to javascript using method 2 because they are multiple callbacks on a single callback object. However, if an intervening call on the ble plugin (or probably any other plugin) happens then its result will be returned immediately using method 1. If there happens to be a ble notification that has been queued on the native to js message queue for async delivery but not yet popped and handed to evaluateJavascript, which happens asynchonously via a Runnable in the UI thread, it will be appended and sent back immediately with the exec result. This causes the ble notification to be received out of order in javascript since it gets synchronously sent to javascript whereas previous notifications are waiting in a queue somewhere inside evaluateJavascript waiting for an idle moment to be dispatched.

Potential Fixes

  1. Looking at the code, it seems like there's a Cordova flag DISABLE_EXEC_CHAINING that prevents this behavior: https://github.com/apache/cordova-android/blob/master/framework/src/org/apache/cordova/CordovaBridge.java#L61

  2. We can add a sequence number to all of the notifications and keep a sliding window on the javascript side of the in-flight notifications that are out of order to reorder them before dispatching to the actual callback registered by user code. This will guarantee that ble notifications remain in order without needing to make any changes to Cordova itself.

@don I recommend option 2 above to fix this since it can be done entirely within this plugin and I think the cordova exec behavior is there for performance reasons so it probably won't be easily fixable otherwise. I'll open a PR with this fix but please let me know if you would prefer an alternate method to resolve the issue.

Smoking Gun Log

In the log file below, every time a callback is invoked on a BLE notification results in a Cordova callback entry with the ArrayBuffer containing the results and a sequence number that is monotonically increasing and set when the notification is received in Java. There is a second [Cordova] exec entry every time Cordova directly invokes a plugin result as the result of an exec call.

Look at the line 15:39:02.679 where notification 155 is received out of order. It is because it is appended to a Cordova exec response, which skips the plugin result queue.

Cordova callback: BLE331831983 ["OK"]
15:39:02.620 vendor.js:104658 Cordova callback: BLE331831974 (2) [ArrayBuffer(20), 111]
15:39:02.622 vendor.js:104658 [StreamingData] e45700001b8b010023500000891f0000e4570000

...

15:39:02.656 vendor.js:104658 Cordova callback: BLE331831974 (2) [ArrayBuffer(20), 120]
15:39:02.658 vendor.js:104658 [StreamingData] 4b73000021500000941f0000f46500000e8b0100
15:39:02.659 vendor.js:104658 Cordova callback: BLE331831947 (2) [ArrayBuffer(4), 121]
15:39:02.662 vendor.js:104658 Cordova callback: BLE331831948 (2) [ArrayBuffer(20), 122]
15:39:02.672 vendor.js:104658 [Cordova] exec returned 56 S11 BLE331831974 M29 AIlAAANcfAAAslwAAP3oAACFQAAA=4 n155
15:39:02.679 vendor.js:104658 Cordova callback: BLE331831974 (2) [ArrayBuffer(20), 155]
15:39:02.687 vendor.js:104658 [StreamingData] 22500000d71f00002c9700003f7a000021500000
15:39:02.695 vendor.js:104658 Cordova callback: BLE331831974 (2) [ArrayBuffer(20), 123]
15:39:02.697 vendor.js:104658 [StreamingData] 23500000951f0000f46500009308000022500000

[1] https://github.com/apache/cordova-android/blob/master/cordova-js-src/exec.js#L97 [2] https://github.com/apache/cordova-android/blob/master/framework/src/org/apache/cordova/CordovaBridge.java#L62 [3] https://github.com/apache/cordova-android/blob/master/framework/src/org/apache/cordova/NativeToJsMessageQueue.java#L168 [4] https://github.com/apache/cordova-android/blob/master/framework/src/org/apache/cordova/NativeToJsMessageQueue.java#L363 [5] https://github.com/apache/cordova-android/blob/master/framework/src/org/apache/cordova/NativeToJsMessageQueue.java#L199

mingdee commented 5 years ago

So is this issue affect both iOS and android?

timburke commented 5 years ago

@mingdee I believe that it only occurs on android.

The ordering issue is introduced by the way cordova bridges events from native code into javascript land. That implementation is platform dependent based on the hooks available in the browser process hosting the javascript code. I have not specifically evaluated that ios implementation but it is very different than the android one that has issues and we personally have never seen this problem on ios whereas we saw it very often on Android.