chipweinberger / flutter_blue_plus

Flutter plugin for connecting and communicationg with Bluetooth Low Energy devices, on Android, iOS, macOS
Other
788 stars 478 forks source link

android: stay disconnected #935

Closed jasaw closed 4 months ago

jasaw commented 4 months ago

Addresses issue #934

chipweinberger commented 4 months ago
  1. needs more comments

  2. we already have this

   private final Map<String, BluetoothGatt> mConnectedDevices = new ConcurrentHashMap<>();
    private final Map<String, BluetoothGatt> mCurrentlyConnectingDevices = new ConcurrentHashMap<>();

I dont think we need another map. We can just check !mCurrentlyConnectingDevices && !mAutoConnect

  1. im not confident this code is compatible with autoconnect. please check closely
jasaw commented 4 months ago

@chipweinberger Thank you for taking time to review this. Your comments are very helpful.

why not just have the 2 second delay?

Just for clarification (for other developers as well), a 2 second delay won't solve this race condition. This is the sequence of events of the race condition with 2s delay or any delay:

  1. App calls FBP connect, which calls connectGatt.
  2. App calls FBP disconnect, which delays 2s then calls disconnect then close. While calling disconnect, the Android Bluetooth stack is in the middle of establishing Bluetooth connection. FBP thinks that connect call has been cancelled, so notifies the app that it's disconnected.
  3. FBP gatt callback function gets called with a successful Bluetooth connection, FBP stores the gatt info away and notifies the app about connection state change, but the app isn't paying attention anymore with the assumption that it had already cancelled the connection.

I dont think we need another map. We can just check !mCurrentlyConnectingDevices && !mAutoConnect

Thank you for your input. That's very helpful. I've updated the code as suggested. This should also address the autoconnect scenario right?

chipweinberger commented 4 months ago

your update looks better.

i'm still trying to think if this is the best solution

in the dart code, there are already mutexes to help connect and disconnect interact properly

i assume this issue only happens with device.disconnect(queue: false), right?

chipweinberger commented 4 months ago

also trying to think if this issue would happen on other platforms, and would then make more sense in Dart

chipweinberger commented 4 months ago

and lastly thinking about if there are still race conditions if connect and disconnect are called a lot

i.e. does your fix introduce other race conditions

chipweinberger commented 4 months ago

also, this is going to result in device.connectionState to be called many times (1. disconnect, 2. connect, 3. disconnect)

please think more about these issues

jasaw commented 4 months ago

i assume this issue only happens with device.disconnect(queue: false), right?

Yes, you are absolutely right.

also trying to think if this issue would happen on other platforms

That is a good question. I have been focusing on Android so far because it is way more flaky than iOS. It would be good to do some extensive testing on iOS platform as well.

also, this is going to result in device.connectionState to be called many times (1. disconnect, 2. connect, 3. disconnect)

Yes, I have thought about this issue and thought it's safer to notify the app about the connection state rather than hiding it. If the app is listening to the connection state change event, the app can call disconnect if it wants to (it's safe to call disconnect multiple times).

jasaw commented 4 months ago

if connect and disconnect are called a lot

This is just a brain dump of what I think will happen when connect and disconnect are called continuously quickly.

Scenario 1:

  1. Connect gets called, Android schedules to establish connection later.
  2. Disconnect gets called. FBP calls gatt disconnect and close while Android is establishing connection, so both disconnect and close are silently ignored by Android.
  3. Android successfully established the connection, but FBP disconnects it immediately as a result of this patch.

Scenario 2:

  1. Connect gets called, Android schedules to establish connection later.
  2. Disconnect gets called. FBP calls gatt disconnect and close while Android is establishing connection, so both disconnect and close are silently ignored by Android.
  3. Connect gets called straight away. FBP puts the same remoteId back onto connecting map. Immediately after that, Android successfully established the previous connection. All good.

Scenario 3:

  1. Connect gets called, Android schedules to establish connection later.
  2. Disconnect gets called. FBP calls gatt disconnect and close while Android is establishing connection, so both disconnect and close are silently ignored by Android.
  3. Connect gets called straight away. FBP puts the same remoteId back onto connecting map.
  4. Disconnect gets called right after Android successfully established connection. This patch would have disconnected the connection in the gatt callback, so FBP will say already disconnected. All good.

Scenario 4:

  1. Connect gets called, Android schedules to establish connection later.
  2. Disconnect gets called. FBP calls gatt disconnect and close while Android is establishing connection, so both disconnect and close are silently ignored by Android.
  3. Connect gets called straight away. FBP puts the same remoteId back onto connecting map.
  4. Disconnect gets called straight away. FBP repeats step 2. This will keep cycling through steps 2, 3, 4 until scenario 2 or 3 is met.

Scenario 5:

  1. Connect gets called, Android schedules to establish connection later.
  2. Disconnect gets called. FBP calls gatt disconnect and close while Android is establishing connection, so both disconnect and close are silently ignored by Android.
  3. Android successfully establish connection but FBP disconnects it immediately because of this patch, but app wasn't listening to connection state change event.
  4. App calls connect before gatt disconnect event comes through. FBP adds the remoteId onto connecting map and calls gatt.connect. Android either schedules the connect or ignores the connect. If Android ignores the connect, FBP connect timeout should catch it. If Android schedules the connect, then we go to step 5.
  5. Gatt disconnect event eventually comes through. App waiting for connect event gets a disconnect event instead. App can retry the connect if it wants to. Not a catastrophe.

Those are the scenarios I can think of.

chipweinberger commented 4 months ago

appreciate your thoughts

on your analysis, i think there is an assumption that the "handle method call" thread and the gatt connection callback thread are mutually exclusive.

i'm not sure if that is true

we might want to add a mutex so that connect/disconnect is not called while the gatt connection callback is still processing

we can call it the "mConnectionCallbackMutex" or something

chipweinberger commented 4 months ago

also I think we should hide the extra connection/disconnection events

connection event: only invoked if mCurrentlyConnectingDevices or mAutoConnect

disconnect event: only invoked if mConectedDevices

ignore / log other connection & disconnection events

i think this is what we would want

jasaw commented 4 months ago

@chipweinberger

assumption that the "handle method call" thread and the gatt connection callback thread are mutually exclusive

I can't find any information on handle-method-call thread and gatt connection callback thread on Android documentation. Maybe I didn't look hard enough. I agree it's safer to add mutex to protect the critical sections. I've added mutex.

also I think we should hide the extra connection/disconnection events

Done.

chipweinberger commented 4 months ago

thanks

Is there a mutex that can't fail? Acquiring a mutex should not fail.

if we are interrupted we should just try again.

perhaps define a new mutex object that does not throw.

i really hate how complicated handling this issue is. such ugly complicated code (not your fault).

chipweinberger commented 4 months ago

also, i prefer not to use try/finally

just release the mutex before every "return"

chipweinberger commented 4 months ago

thanks btw

chipweinberger commented 4 months ago

also i'd make the comment this

"Android has an annoying edge case. If disconnect is called right when the connection is being established, android sometimes ignores the request to disconnect and completes the connection anyway. To handle this case, we make sure the device is still in our currently connecting devices, otherwise kill the connection, since the user was not expecting it to connect"

something like this

jasaw commented 4 months ago

also, i prefer not to use try/finally just release the mutex before every "return"

The code does look ugly with the try blocks around but I'm not convinced that it's easier to read or more robust removing the try/finally. I have a few concerns about removing try/finally:

  1. I am not sure whether any of the API calls can throw exceptions or not, like connectGatt and getRemoteDevice. If they can throw exceptions, then FBP will end up in a dead lock because mutex is not released. What is the current behaviour if those API calls throw exception?
  2. By removing try/finally, we need to release the mutex before every single return. I think that makes the code more prone to error because if we make changes to the code, we have to remember to release the mutex before a return. Sprinkling mutex release everywhere also makes the code harder to read because it's visual noise.

Is there a mutex that can't fail? Acquiring a mutex should not fail. if we are interrupted we should just try again.

I have to admit that I do not know all the conditions that could interrupt the acquiring of mutex. Would suspending the app interrupt it? Would putting the app into the background interrupt it? If we are going to try again, how many times do we try before we give up? If we keep trying, never give up, what is the likelihood of FBP locking up?

Would the code be easier to read if we combine the try/finally and try/catch into try/catch/finally and add a boolean to track whether we have acquired the mutex so we know to release it in the finally?

chipweinberger commented 4 months ago

I'm not sure what the solution is, but right now the diff is too ugly / hard to follow to be merged.

I am not sure whether any of the API calls can throw exceptions or not, l

good point.

If we are going to try again, how many times do we try before we give up? I

never give up. just log and retry. If we can't ever acquire the mutex that is a very serious problem and the program should not continue.

jasaw commented 4 months ago

never give up. just log and retry. If we can't ever acquire the mutex that is a very serious problem and the program should not continue.

Yes, I can see the logic here. Happy to retry forever, so we remove 1 try/catch indentation.

Regarding the try/finally, I think it's safer to leave it in there. Can we hide the mutex acquire/release try/catch/finally in a wrapper function? What's the best Java implementation for a wrapper function?

jasaw commented 4 months ago

never give up. just log and retry. If we can't ever acquire the mutex

Done. It keeps retrying until mutex is acquired.

I've also hidden the try/catch/finally and acquire/release mutex in runCriticalSection wrapper that calls the closure. The code looks more readable to me. Thoughts?

chipweinberger commented 4 months ago

ya its an improvement. I still hate all of it though (not meant to be an insult)

it seems so overly complicated.

chipweinberger commented 4 months ago

having these critical sections is not ideal.

are they necessary? I proposed them, but they're a hammer that I wasn't sure if was needed.

I do know we have a lot of state that both the callback & method call handler are changing at the same time.

and now that we are now disconnecting in the callback handler, adding complexity.

But I don't like this. overall this code just seems bad.

Maybe we need a better way to update the state in a more atomic fashion. I'm not sure.

chipweinberger commented 4 months ago

I know I dont want to break a bunch of stuff and cause more work for myself.

I have a lot things to focus on right now.

So unless we can come up with a solution that is cleaner, I think I may not merge this.

chipweinberger commented 4 months ago

as you know, adding complexity to code has a permanent maintenance cost.

a cost that I am the one paying, as the maintainer.

jasaw commented 4 months ago

Yes the code does look more complex now. The problem here is we are trying to fix 2 problems with this PR.

  1. Workaround Android ignoring disconnect call and establish the connection anyway.
  2. Potential concurrency issue when updating our maps.

I'm going to peel this PR back to only address issue 1. Issue 2 can be addressed by a separate PR if there is evidence that the callback and method call are concurrent.

chipweinberger commented 4 months ago

I agree. But I don't want to change #1 without addressing #2.

they can be separate PR. but #2 is more important.

jasaw commented 4 months ago

@chipweinberger Would the diff look better if I open a new PR just for the addition of critical section wrapper? Would you accept that PR?

If yes, then the diff for this PR will look cleaner.

jasaw commented 4 months ago

Just FYI, I found the answer to this question.

I am not sure whether any of the API calls can throw exceptions or not

I found this example code in Android documentation.

        try {
            final BluetoothDevice device = bluetoothAdapter.getRemoteDevice(address);
            // connect to the GATT server on the device
            bluetoothGatt = device.connectGatt(this, false, bluetoothGattCallback);
            return true;
        } catch (IllegalArgumentException exception) {
            Log.w(TAG, "Device not found with provided address.  Unable to connect.");
            return false;
        }
jasaw commented 4 months ago

if there is evidence that the callback and method call are concurrent.

I ran a quick test and can confirm that the gatt callback and method-call are concurrent.

chipweinberger commented 4 months ago

please rebase on master :)

jasaw commented 4 months ago

Done. Rebased.

chipweinberger commented 4 months ago

thanks it's looking good

i think we should put this new code in its own function

bool handleAccidentalConnectionEvents(event)

returns true if it was an accidental connection / disconnection

jasaw commented 4 months ago

I've move the code to handleUnexpectedConnectionEvents function. Let me know if you prefer handleAccidentalConnectionEvents and I'll change it.

chipweinberger commented 4 months ago

good name!

chipweinberger commented 4 months ago

if unexpected event, we can just return early

chipweinberger commented 4 months ago

thanks for you being receptive to me micromanaging your code change :) haha

chipweinberger commented 4 months ago

i'll merge it soon and make any other small formatting changes myself

i will not test it myself, so please be confident it solves your issue :)

thanks!!

jasaw commented 4 months ago

if unexpected event, we can just return early

Done.

thanks for you being receptive to me micromanaging your code change

No worries. Your review and comments are high appreciated because you are most familiar with the code base. Thank you for reviewing my PRs.

chipweinberger commented 4 months ago

merged