Jakeler / ble-serial

"RFCOMM for BLE" a UART over Bluetooth low energy (4+) bridge for Linux, Mac and Windows
https://blog.ja-ke.tech/tags/#bluetooth
MIT License
264 stars 37 forks source link

ble-scan switch -t not working #30

Closed mzdaniel closed 3 years ago

mzdaniel commented 3 years ago

Thank you @Jakeler for this great project. For years I was trying to do something very similar like this following a endless rabbit hole I am hopping to streamline.

Describe the bug If my understanding is correct the semantics of -t should be a timeout of how much time is allowed for a possible advertisement to be listened and logged. That is not the case on a couple of systems I tried.

Log messages ble-scan -d 01:02:03:04:05:06 -t 30 --help usage: ble-scan [-h] [-t SEC] [-d ADDR]

time ble-scan -d 01:02:03:04:05:06 -t 30 Started deep scan of 01:02:03:04:05:06 Traceback (most recent call last): ... bleak.exc.BleakError: Device with address 01:02:03:04:05:06 was not found. real 0m10.621s

ble-serial -v -d 01:02:03:04:05:06 08:13:28.674 | DEBUG | main.py: Running: Namespace(verbose=True, device='01:02:03:04:05:06', addr_type='public', adapter='hci0', mtu=20, write_uuid=None, filename=None, binlog=False, port='/tmp/ttyBLE', read_uuid=None) 08:13:28.675 | DEBUG | selector_events.py: Using selector: EpollSelector 08:13:28.675 | INFO | linux_pty.py: Slave created on /tmp/ttyBLE -> /dev/pts/3 08:13:28.676 | INFO | ble_interface.py: Receiver set up 08:13:28.679 | INFO | ble_interface.py: Trying to connect with 01:02:03:04:05:06 08:13:39.020 | ERROR | main.py: Bluetooth connection failed: Device with address 01:02:03:04:05:06 was not found. 08:13:39.020 | WARNING | main.py: Shutdown initiated 08:13:39.020 | INFO | linux_pty.py: Serial reader and symlink removed 08:13:39.020 | INFO | main.py: Shutdown complete.

Setup (please complete the following information):

Jakeler commented 3 years ago

Hi @mzdaniel, that is indeed a bug, thanks for the report. Currently the custom timeout is only applied to the normal scan function (ble-scan without -d), not the deep scan that actually connects to the device. Then it just uses the default timeout from bleak, which is 10 seconds long. Same for ble-serial.

I will fix this soon and make a new release.

mzdaniel commented 3 years ago

Fantastic. Thank you for looking into this issue.

Jakeler commented 3 years ago

Version 2.2.0 with the fix is released now. You can update to it with pip install -U ble-serial.

mzdaniel commented 3 years ago

Thank you again @Jakeler for looking into this issue

Unfortunately my systems are still experiencing instability. The timeout is not truly respected by bleak, as it just relays a kryptic dbus message: bleak.exc.BleakDBusError: org.bluez.Error.Failed expecting the higher levels to address lower level issues or even worse, the user doing manual retries before an explicit timeout. In this sense, bluepy seem to work better though it uses C code plus all the other important considerations in https://github.com/Jakeler/ble-serial/issues/5

Temporarely removing the try block from the latest ble-serial release gives this traceback

time ble-scan -d 01:02:03:04:05:06 -t 30 Started deep scan of 01:02:03:04:05:06

Traceback (most recent call last): File "/home/admin/.local/bin/ble-scan", line 8, in sys.exit(main()) File "/home/admin/.local/lib/python3.9/site-packages/ble_serial/scan/init.py", line 55, in main asyncio.run(scan(args)) File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete return future.result() File "/home/admin/.local/lib/python3.9/site-packages/ble_serial/scan/init.py", line 9, in scan await deep_scan(args.addr, args.sec) File "/home/admin/.local/lib/python3.9/site-packages/ble_serial/scan/init.py", line 28, in deep_scan async with BleakClient(dev, timeout=time) as client: File "/home/admin/.local/lib/python3.9/site-packages/bleak/backends/client.py", line 61, in aenter await self.connect() File "/home/admin/.local/lib/python3.9/site-packages/bleak/backends/bluezdbus/client.py", line 296, in connect assert_reply(reply) File "/home/admin/.local/lib/python3.9/site-packages/bleak/backends/bluezdbus/utils.py", line 23, in assert_reply raise BleakDBusError(reply.error_name) bleak.exc.BleakDBusError: org.bluez.Error.Failed

real 0m5.314s

Jakeler commented 3 years ago

Hmm, I can't reproduce this, it works as expected here:

$ ble-scan -d 01:02:03:04:05:06 -t 30
Started deep scan of 01:02:03:04:05:06

ERROR: Device with address 01:02:03:04:05:06 was not found.
                                                                 [ 0 ] [ 30s091 | 21:15:54 ]

$ ble-scan -d 01:02:03:04:05:06 -t 1
Started deep scan of 01:02:03:04:05:06

ERROR: Device with address 01:02:03:04:05:06 was not found.
                                                                 [ 0 ] [ 1s088 | 21:20:27 ]

I have tested it with two adapters (Intel AC 7260, Realtek RTL8761B) on latest Arch Linux, BlueZ 5.58. Maybe you can experiment with other hardware/software and find out what causes this? Does it fail with all timeout values (including the default 5s)?

mzdaniel commented 3 years ago

What I'm saying is if 01:02:03:04:05:06 is a valid peripheral, after some tries you can get something like:

time ble-scan -d FC:69:47:01:75:D0 -t 30 Started deep scan of FC:69:47:01:75:D0

ERROR: org.bluez.Error.Failed

real 0m11.055s

Placing a breakpoint when the exception originates in bleak uncovers some more cues about the error: 'Software caused connection abort' which other people had also raised in bleak https://github.com/hbldh/bleak/issues/239 and most likely other projects https://github.com/chrvadala/node-ble/issues/13

If the timeout is shorten, it increases the likelihood of reaching it and failing with Device not found, which is uninteresting for this bug report.

Unfortunately, at the moment I don't have more hardware available to try other software configurations beyond python wheels or what is available on debian repos.

On Thu, 27 May 2021 12:27:35 -0700 Jake @.***> wrote:

Hmm, I can't reproduce this, it works as expected for here:


$ ble-scan -d 01:02:03:04:05:06 -t 30
Started deep scan of 01:02:03:04:05:06

ERROR: Device with address 01:02:03:04:05:06 was not found.
                                                                 [ 0 ]
[ 30s091 | 21:15:54 ]

$ ble-scan -d 01:02:03:04:05:06 -t 1
Started deep scan of 01:02:03:04:05:06

ERROR: Device with address 01:02:03:04:05:06 was not found.
                                                                 [ 0 ]
[ 1s088 | 21:20:27 ] ```
I have tested it with two adapters (Intel AC 7260, Realtek RTL8761B)
on latest Arch Linux, BlueZ 5.58. Maybe you can experiment with other
hardware/software and find out what causes this? Does it fail with
all timeout values (including the default 5s)?
Jakeler commented 3 years ago

Okay I see, sorry that I misunderstood you. So it has not much to do with the timeout (at least not the scan or advertisement timeout), it is a general connection problem with a valid device address.

I made a another test script now: https://github.com/Jakeler/ble-serial/blob/master/tests/test_conn_reliability.py Let it run for 100 attempts, but zero errors again... About 4s average runtime. The test device was on my desk, only a a few cm away from the Bluetooth adapter on the PC, which results in a strong signal (RSSI about -35). Then I moved the device to another room and tried again with the weak signal (RSSI about -85). Now the avg. runtime increased to 8s and sure enough 5% result in org.bluez.Error.Failed! Close at the range limit I get the error always in 5-10 tries.

Looking at what happens in Wireshark shows that after this request:

Bluetooth HCI Event - Command Status
    Event Code: Command Status (0x0f)
    Parameter Total Length: 4
    Status: Pending (0x00)
    Number of Allowed Command Packets: 2
    Command Opcode: LE Read Remote Features (0x2016)
    [Command in frame: 603]
    [Command-Pending Delta: 1,773ms]

It fails exactly 225 ms later:

Bluetooth HCI Event - Disconnect Complete
    Event Code: Disconnect Complete (0x05)
    Parameter Total Length: 4
    Status: Success (0x00)
    Connection Handle: 0x0010
    Reason: Connection Failed to be Established (0x3e)

So the host receives an advertisment, opens a connection, tries to read remote features and never gets a response. The 225 ms (7.5ms*30) could be the connection interval, but I am not sure about that. That means the root cause is very low level, probably even the physical layer, which causes packet loss. I agree that the BlueZ errors are pretty cryptic and unhelpful, but it might not be a bug.

Side note, one time I got this unusual error message:

ERROR: Device with address 20:91:48:DF:76:D9 could not be found. Try increasing `timeout` value or moving the device closer.

A valid attempt would instead follow up with some data:

Bluetooth HCI Event - LE Meta
    Event Code: LE Meta (0x3e)
    Parameter Total Length: 12
    Sub Event: LE Read Remote Features Complete (0x04)
    Status: Success (0x00)
    Connection Handle: 0x0010
    Supported LE Features: 0x0000000000000001, LE Encryption
    [Command in frame: 22]
    [Pending in frame: 23]
    [Pending-Response Delta: 300,165ms]
    [Command-Response Delta: 301,935ms]

(+ many other packets) After all it normally disconnects with Reason: Connection Terminated by Local Host (0x16) and then Reason: Remote User Terminated Connection (0x13).


Can you confirm the correlation between RSSI and connection failures? If so, how is the success rate? If I understand your last comment correctly it works most of the time? Because the bleak issue mentions <10% working. Wireshark captures could be also interesting.