balena-os / wifi-connect

Easy WiFi setup for Linux devices from your mobile phone or laptop
Apache License 2.0
1.29k stars 361 forks source link

list of detected wifi access points is empty #206

Open bbinet opened 6 years ago

bbinet commented 6 years ago

We often have the following case: when the rpi starts, it correctly connects to the wifi access point (that was previously configured using wifi-connect captive portal), but after a few minutes or a few hours, the wifi-connect is starting (probably because the connection was temporarily lost), and the captive portal shows an empty list of detected wifis.

But we have a lot of wifi access points around, so there must be an issue here as the list should not be empty...

The rpi never connects back to our initial wifi access point, until we manually reboot the device.

Here is my start.sh script:

$ cat /usr/src/app/start.sh
#!/usr/bin/env bash
export DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
while true
do
    iwgetid -r
    if [ $? -ne 0 ]; then
        printf 'Starting WiFi Connect\n'
        ./wifi-connect --portal-ssid helioslite-tracker --portal-passphrase XXX --activity-timeout 600
    fi
    sleep 10
done

Here are the logs of NetworkManager service: https://gist.github.com/bbinet/d0ed2e5465824ef2594589b383cc5f93#file-networkmanager-log

Here are the logs of the container that runs wifi-connect: https://gist.github.com/bbinet/d0ed2e5465824ef2594589b383cc5f93#file-wifi-connect-log

Front logo Front conversations

bbinet commented 6 years ago

I'm happy to provide more details as I can connect to the rpi through the wifi-connect access point, so if think some other logs might be helpful to investigate this issue, please tell me.

Also do you know a temporary workaround that I can run to force network manager to reconnect to my initial wifi access point, so that I don't need to hard reboot the device?

majorz commented 6 years ago

Hi, I will check the logs now and let you know if I find anything.

majorz commented 6 years ago

The wifi-connect logs definitely reveal an issue: it should not start the access point when no SSIDs are available.

The empty SSIDs list is another issue that we need to look into. I am thinking about elevating the log level of NetworkManager and wpa_supplicant, and running some diagnostic code that we can try a few things with it. I will do some research tomorrow on this and let you know how that goes.

I had a non-related to WiFi Connect report about RPi 3 and NetworkManager not reporting any networks, so I am definitely interested in figuring out how to workaround this issue without reboots or driver reloads.

bbinet commented 6 years ago

Ok, thanks

majorz commented 6 years ago

@bbinet I started implementing this here: https://github.com/resin-io-playground/networkmanager-empty-ssids

It elevates the log levels of NetworkManager and wpa_supplicant, and then starts to check whether the SSID list is empty each 30 seconds. If the list is empty it triggers a rescan and waits 10 seconds to check whether they are available.

I am going to do local tests with it tomorrow and hopefully I will be able to reproduce an empty list. I am thinking to try it using a router and also a hotspot on my phone.

If I am not able to reproduce, I may ask you to run the script on a device that has this issue. But let's see first what will happen with the local testing. I will keep you posted.

bbinet commented 6 years ago

Sure, I'm also running your ssids.py script on my rpi zero w since this morning, so I let you know the results too.

majorz commented 6 years ago

I was able to reproduce zero access points available locally. The RequestScan method brought back the access point list, which is good news. I will check the logs now and do more testing.

bbinet commented 6 years ago

I've also been able to reproduce the issue.

Here are the logs of the ssids.py script: https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-ssids-log

Here are the logs of NetworkManager service: https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-nm-log

Here are the logs of the container that runs wifi-connect: https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-container-log

majorz commented 6 years ago

@bbinet that is really useful! In your case the RequestScan call did not bring back the list with SSIDs, which is more like the actual problem, than what I was able to reproduce here.

I still have not got to chance to check the NM logs you attached for the specifics, but I will do that first thing tomorrow and will follow up with you. Thanks again!

majorz commented 6 years ago

@bbinet I did different can on the instrumentation script: if the RequestScan does not work, it now tells NM to unmanage and then manage the device again. If that does not work, it then restarts the NM service. I also added timestamped logging.

Can you please run this again, but this time do not start WiFi Connect together with the script (assuming that the connection profile is defined)? Also you may reboot the device after the application update, so that we get only fresher logs.

If you reproduce something, please attach the logs of wpa_supplicant as well. Thanks!

bbinet commented 6 years ago

@majorz but if I don't run WiFi Connect together with the script, I won't be able to log in to get the logs... Why not run WiFi Connect together with the script?

majorz commented 6 years ago

Actually now on second thought, you may leave it running, it should not be a problem.

bbinet commented 6 years ago

@majorz Let's run it

majorz commented 6 years ago

@bbinet one thing I forgot to mention is that I changed the implementation to use libnm as a higher level alternative to using dbus directly, so the Dockerfile has changed.

majorz commented 6 years ago

@bbinet I found an issue with this version and applied a fix. Can you please run the latest version instead?

I also made the project to use docker-compose for easier local testing, but the old structure is preserved in the ssids directory.

bbinet commented 6 years ago
Traceback (most recent call last):
  File "/ssids.py", line 8, in <module>
    gi.require_version("NM", "1.0")
  File "/usr/lib/python3/dist-packages/gi/__init__.py", line 118, in require_version
    raise ValueError('Namespace %s not available' % namespace)
ValueError: Namespace NM not available
bbinet commented 6 years ago

(I've installed python3-gi, but it may not be sufficient?)

majorz commented 6 years ago

It will need libnm-dev as well for this to work. I am using a Debian Buster base image, so that it is more recent and compatible with the one running on the host OS.

bbinet commented 6 years ago

@majorz the new ssids.py script is now running (still on debian jessie image though).

I hope I'll be able to reproduce the issue this afternoon, as I'm leaving tonight for 5 days off.

bbinet commented 6 years ago

Sorry, I meant Debian Stretch (not jessie)

majorz commented 6 years ago

Thanks! I think it should be fine. I am running it locally as well, but not sure whether I will be able to reproduce. I will run it on a few other devices as well later on. Good luck!

majorz commented 6 years ago

I made some adjustments on the diagnostic script, as there is a caching bug with python-gir, which led to incorrect reporting of access point count.

bbinet commented 6 years ago

I'm back at work today, so I've updated the ssids.py script, but I have some encoding issues:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 128, in main
    set_nm_log_level("trace")
  File "/ssids.py", line 29, in set_nm_log_level
    debug("Setting NetworkManager log level to", level)
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Setting NetworkManager log level to trace'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 129, in main
    set_wpa_log_level("msgdump")
  File "/ssids.py", line 34, in set_wpa_log_level
    debug("Setting wpa_supplicant log level to", level)
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Setting wpa_supplicant log level to msgdump'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 131, in main
    print_device_info()
  File "/ssids.py", line 86, in print_device_info
    debug("Device:", device.get_iface())
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Device: wlan0'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 131, in main
    print_device_info()
  File "/ssids.py", line 87, in print_device_info
    debug("Driver:", device.get_driver())
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Driver: brcmfmac'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 131, in main
    print_device_info()
  File "/ssids.py", line 88, in print_device_info
    debug("Driver version:", device.get_driver_version())
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Driver version: 7.45.41.46'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 131, in main
    print_device_info()
  File "/ssids.py", line 89, in print_device_info
    debug("Firmware version:", device.get_firmware_version())
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Firmware version: 01-f8a78378'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
  File "/ssids.py", line 191, in <module>
    main()
  File "/ssids.py", line 134, in main
    wait(10)
  File "/ssids.py", line 119, in wait
    debug("Sleeping", seconds, "seconds...")
  File "/ssids.py", line 24, in debug
    LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Sleeping 10 seconds...'
Arguments: ()
bbinet commented 6 years ago

It seems to be fixed if I set the env variable: PYTHONIOENCODING=utf-8

majorz commented 6 years ago

In my testing here with RPi 3 I also reproduced multiple times a scenario where 0 access points were reported by NetworkManager. However NM detected this and triggered a RequestScan which brought back the list with access points.

I will test today with a RPi 1, since I do not have a zero w, but it has the same wifi chip, so let's see what happens.

majorz commented 6 years ago

I got confused, the RPi 1 does not have a built-in wireless chip. RPi 3 B has the same wireless chip as Zero W.

bbinet commented 6 years ago

@majorz my rpi zero w is currently running ssids.py and I'll drop you a message as soon as I've reproduced the issue.

majorz commented 6 years ago

Fingers crossed!

majorz commented 6 years ago

I am thinking about the following modifications in WiFi Connect:

  1. Always trigger RequestScan before retreiving the list of access points. Since with the current implementation there is not a way to subscribe for the event where the scan has completed, I will do some experimentation and see how to do it best with sleeping. Later on when we use libnm this will change. I did a few days ago the first steps towards this: https://github.com/resin-io-playground/libnm-rs. libnm is the NetworkManager official library. nmcli is using it and it handles all D-Bus communication itself + adds useful utility functions on top. Another nice thing about it is that it provides GObject introspection, so the Rust bindings will be autogenerated for the most part. The Rust autogeneration is not perfect, so I will have to do manual stuff as well unfortunately, but hopefully that will not be much effort.

  2. If a list with access points is not available then restart NetworkManager from inside WiFi Connect (if our findings confirm that this will fix the issue).

bbinet commented 6 years ago

@majorz this sounds good.

Also, I was able to reproduce the issue this night, but sadly I had used journalctl --vacuum-size and --vacuum-time option to try to purge old logs yesterday, and it seems journalctl also purged the logs from the night, so you might not find the relevant logs in the files attached below.

I will restart completely my rpi, and run ssids.py script again...

Here are the logs of the ssids.py script: https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-ssids-log

Here are the logs of NetworkManager service: https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-nm-log

Here are the logs of wpa_supplicant service: https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-wpa_supplicant-log

Here are the logs of the container that runs wifi-connect: https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-container-log

majorz commented 6 years ago

Ooops, I left a cleanup() call in the middle and forgot to remove it. The logs revealed that. I will fix it now.

majorz commented 6 years ago

@bbinet Done! Please run the new version as the previous one did not attempt to unmanaged the device or restart NM. I also added PYTHONIOENCODING in the Dockerfile. And also something in the compose since wifi-connect's dnsmasq needs it (although the last one should not have affected you).

bbinet commented 6 years ago

@majorz Thanks, I've just restarted my rpi with the new ssids.py script. Let's wait for the issue to show up again.

bbinet commented 6 years ago

Sorry I was not able to reproduce the issue for 10 days...

majorz commented 6 years ago

@bbinet alright, I will include the changes we talked above with invoking RequestScan on start and also restarting NetworkManager once if access points are not available.

bbinet commented 6 years ago

@majorz ok.

bbinet commented 6 years ago

@majorz Any news on this issue? I'm still getting empty list of wifi access points from time to time: if I manually restart NetworkManager from the hostos then it works again.

majorz commented 6 years ago

No much progress on this issue yet. I am planning to address it till the end of the month.

bbinet commented 6 years ago

Ok, thank you @majorz

bbinet commented 6 years ago

I'm still facing this issue from time to time: this is quite painful as we cannot connect to the device until the next reboot (though we have configured a force reboot every night to workaround this issue). @majorz Any news on this issue and the wifi-connect project in general, as the development seems to have stalled these days?

majorz commented 6 years ago

There is a new ui branch that should be integrated as soon as I can. And also we will be basing the project on this new Rust binding library which I worked on: https://github.com/resin-io-modules/libnm-rs.

I am going to work on this once we release this bigger update, since I would like to depend on the new library for the scanning notification mechanisms.

Also we are going to integrate WiFi Connect in Resin OS itself. We are figuring the exact details currently.

bbinet commented 6 years ago

Ok, thanks for the update, and for your efforts in this project.

andrewjaykeller commented 5 years ago

I'm still seeing this issue, any idea for a fix?

majorz commented 5 years ago

@andrewjaykeller I left you a comment in the other thread here (about the same issue): https://github.com/balena-io/wifi-connect/issues/232#issuecomment-455504349

maggie44 commented 2 years ago

If you are still experiencing this issue can you try this branch instead and report back on whether it helps: https://github.com/balena-os/wifi-connect/pull/450