balena-os / wifi-connect

Easy WiFi setup for Linux devices from your mobile phone or laptop
Apache License 2.0

Captive portal does not work on BalenaOS 2.46.1 (Supervisor 10.x) #328

Open marclennox opened 4 years ago

marclennox commented 4 years ago

I had this project working nicely with my other projects, but recently the captive portal no longer comes up. When my phone connects to the WiFi portal, it shows "connected", then "obtaining IP address", then "no internet", but never brings up the captive portal. This was working a few weeks ago. I am trying to debug it but can't figure out what has changed to stop it working.

Any hints how to debug this?

marclennox commented 4 years ago

So it looks like wifi-connect does not work properly on Supervisor versions 10.x, whereas it works fine with Supervisor 9.x versions of BalenaOS.

majorz commented 4 years ago

@marclennox sorry for not responding earlier, had to catch up on a lot of other fronts. I am going to test this tomorrow. This does not sound good.

marclennox commented 4 years ago

Thanks @majorz, I can definitely confirm that wifi-connect works fine on BalenaOS 2.38.0 (both Raspberry Pi 3 and Balena Fin 1.0), but does not work on BalenaOS 2.46.1 (Raspberry Pi 3). The failure mode is that the captive portal just doesn't come up after connecting and obtaining an IP address.

majorz commented 4 years ago

That's bad. We test WiFi Connect before each balenaOS release as part of our stability tests, but probably this slipped through the cracks somehow. I will test this first thing tomorrow morning in a few hours.

marclennox commented 4 years ago

Thanks @majorz, look forward to hearing what you find.

majorz commented 4 years ago

@marclennox I was not able to reproduce.

I followed minimal steps:

  1. Created a new empty application wifi-connect on the dashboard
  2. Flashed a balenaOS 2.46.1+rev1 RPi 3 image (the default 32-bit version, not the 64-bit beta one)
  3. Cloned the WiFi Connect repo
  4. Logged in with our CLI: balena login
  5. From the root of the repo I did balena push wifi-connect to push the code to the newly created application
  6. Waited for the image to be downloaded and started testing

I did numerous tests with both RPi 3 B and B+, but the captive portal always showed correctly.

Can you please repeat the above minimal steps and let me know how that goes for you? This would help in narrowing down the issue on your side.
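
Steps 3 to 5 above can be sketched as a small shell function. This is only an illustration, not a script from the maintainers: it assumes the wifi-connect application already exists on the dashboard (step 1), that the device is already flashed (step 2), and that the repository lives at the URL used by the issue links in this thread. The function and variable names are hypothetical.

```shell
#!/bin/sh
# Assumption: repository URL taken from the issue links in this thread.
REPO_URL="https://github.com/balena-io/wifi-connect"
# Application name from step 1.
APP_NAME="wifi-connect"

# deploy_stock_wifi_connect: clone the stock project and push it unmodified.
deploy_stock_wifi_connect() {
    # Step 3: clone the WiFi Connect repo into ./wifi-connect
    git clone "$REPO_URL" wifi-connect || return 1
    # Step 4: log in with the balena CLI (interactive)
    balena login || return 1
    # Step 5: push from the root of the repo to the new application
    ( cd wifi-connect && balena push "$APP_NAME" )
}
```

Calling deploy_stock_wifi_connect from an empty working directory runs the three commands in order and stops at the first failure.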

marclennox commented 4 years ago

I will. Note, however, that I'm using it in a multi-container application, with privileged: true

majorz commented 4 years ago

Internally, all single-container applications are processed as multi-container ones: a docker-compose.yml exists with privileged: true, network_mode: host, etc. So it is probably not related, but let's see where it starts to break.

marclennox commented 4 years ago

OK @majorz, I figured out what the issue is.

In my wifi-connect Dockerfile, I've added network-manager to the list of installed packages so that I can use the nmcli command to check for an active network.

If I take the stock wifi-connect project, it works fine for me on 2.46. If I simply add network-manager to the package list, the captive portal no longer comes up on my phone after connecting to the WiFi Connect SSID.

Note that on 2.38, WiFi Connect works properly whether or not network-manager is in the installed package list.
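
The comment above does not show the actual nmcli command used to check for an active network. One possible form, offered only as a sketch, asks NetworkManager for its connectivity state with nmcli networking connectivity, which reports one of none, portal, limited, full, or unknown. The function name is hypothetical, and the sketch assumes nmcli inside the container can reach the host's NetworkManager over D-Bus.

```shell
#!/bin/sh
# has_active_network: exit 0 only when NetworkManager reports full connectivity.
# Assumption: "full" is the state that should count as an active network.
has_active_network() {
    # "check" asks NetworkManager to re-check connectivity before answering.
    state=$(nmcli networking connectivity check 2>/dev/null) || return 1
    [ "$state" = "full" ]
}
```

A caller could then start wifi-connect only when has_active_network fails.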

majorz commented 4 years ago

Thanks, I will try that. Just to be sure: are you using the newer balenalib images, or the older resin ones? As in FROM balenalib/%%RESIN_MACHINE_NAME%%-debian.

marclennox commented 4 years ago

FROM balenalib/%%BALENA_MACHINE_NAME%%-debian:latest

marclennox commented 4 years ago

Same result if I use

FROM balenalib/%%RESIN_MACHINE_NAME%%-debian

Works fine without network-manager, fails to bring up the captive portal with network-manager installed

majorz commented 4 years ago

@marclennox I cannot reproduce that; it works for me. The two should also be unrelated: installing network-manager in a container built on the balenalib base images should have no effect on wifi-connect, because wifi-connect communicates over D-Bus with the NetworkManager service running on the host OS. It does not depend on the libraries installed by the network-manager package.

marclennox commented 4 years ago

Very strange. It is 100% reproducible for me. I'm using a multi-container deployment for my testing. Will try with single container just in case.

Is there a way I can turn on debugging logs to get more info from wifi-connect to see what's failing?

majorz commented 4 years ago

For multi-container make sure it has privileged: true and network_mode: host.
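
A quick way to confirm both settings are present is a text check on the compose file. The sketch below is hypothetical and deliberately simple: it is a line match, not a YAML parser, so it does not verify that the keys belong to the wifi-connect service. Note that the key it looks for is network_mode, not network.

```shell
#!/bin/sh
# check_compose FILE: succeed only if FILE contains both required settings.
check_compose() {
    grep -Eq '^[[:space:]]*privileged:[[:space:]]*true[[:space:]]*$' "$1" || {
        echo "missing 'privileged: true' in $1" >&2
        return 1
    }
    grep -Eq '^[[:space:]]*network_mode:[[:space:]]*"?host"?[[:space:]]*$' "$1" || {
        echo "missing 'network_mode: host' in $1" >&2
        return 1
    }
}
```

For example, check_compose docker-compose.yml prints the missing setting to stderr and returns non-zero if either line is absent.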

marclennox commented 4 years ago

Yep it does

majorz commented 4 years ago

I see. Please go with the precise steps I provided above, nothing more and nothing less. Then you can start modifying from that state to see where it breaks, so that I can reproduce it on my side as well. Unfortunately, for the problem you describe there are currently no logs to enable. Logs for NetworkManager can usually be retrieved from the host OS with journalctl, but they will look fine on that side, since the problem occurs at a later stage. Debugging the kind of problem you describe would require capturing packets with tcpdump, and that will be rather hard for you. It will be best if I can reproduce it on my side.
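
For reference, the two tools mentioned above could be wrapped as follows. This is a sketch under assumptions: the NetworkManager unit name, the wlan0 interface (taken from the logs later in this thread), and the choice of DHCP, DNS, and HTTP ports are guesses at what is relevant to a captive portal, and tcpdump may not be installed on the host OS. The function names and the default capture file name are hypothetical.

```shell
#!/bin/sh
# nm_host_logs [LINES]: print recent NetworkManager log lines from the journal.
nm_host_logs() {
    journalctl -u NetworkManager --no-pager -n "${1:-200}"
}

# capture_portal_traffic [IFACE] [FILE]: write DHCP (67/68), DNS (53) and
# HTTP (80) traffic seen on the portal interface to a pcap file.
capture_portal_traffic() {
    tcpdump -i "${1:-wlan0}" -n -w "${2:-portal.pcap}" \
        'port 67 or port 68 or port 53 or port 80'
}
```

Both functions need to run on the host OS (or with equivalent access), and the capture runs until interrupted.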

marclennox commented 4 years ago

Well, I now was able to reproduce the issue without having network-manager package installed. Rebooting the device then made it work properly. In looking at the logs, I see the following logs when it works properly.

10.01.20 08:32:23 (-0500)  main  Starting WiFi Connect
10.01.20 08:32:23 (-0500)  main  Deleting already created by WiFi Connect access point connection profile: "WiFi Connect"
10.01.20 08:32:23 (-0500)  main  WiFi device: wlan0
10.01.20 08:32:24 (-0500)  main  Access points: ["HUAWEI-3991", "Home Guest", "Home", "Home Guest", "Home", "NETGEAR58", "Home", "Home Guest", "TrackYourAssets!", "Home Guest", "Home", "Home", "Home Guest", "Home Guest", "Home", "Home", "Home Guest"]
10.01.20 08:32:24 (-0500)  main  Starting access point...
23.01.20 13:54:20 (-0500)  main  Access point 'WiFi Connect' created
23.01.20 13:54:20 (-0500)  main  Starting HTTP server on 192.168.42.1:80
23.01.20 13:54:58 (-0500)  main  User connected to the captive portal

And the following logs when it doesn't work

23.01.20 13:57:57 (-0500)  main  Starting WiFi Connect
23.01.20 13:57:57 (-0500)  main  Deleting already created by WiFi Connect access point connection profile: "WiFi Connect"
23.01.20 13:57:57 (-0500)  main  WiFi device: wlan0
23.01.20 13:57:57 (-0500)  main  Access points: ["WiFi Connect"]
23.01.20 13:57:57 (-0500)  main  Starting access point...
23.01.20 13:58:00 (-0500)  main  Access point 'WiFi Connect' created
23.01.20 13:58:00 (-0500)  main  Starting HTTP server on 192.168.42.1:80

So it feels like this might be related to https://github.com/balena-io/wifi-connect/issues/327

It seems that the device gets "stuck" in a state where the portal is activated, so if the process restarts, it only sees the portal SSID, and that's when things go bad.
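
One way to test for that "stuck" state before starting wifi-connect is to ask NetworkManager whether a profile named "WiFi Connect" still exists or is still active. The sketch below uses nmcli's terse output, one profile name per line; the function names are hypothetical and the profile name is taken from the log lines above.

```shell
#!/bin/sh
# Profile name as it appears in the wifi-connect logs.
PORTAL_PROFILE="WiFi Connect"

# portal_profile_exists: exit 0 if a saved profile with that name exists.
portal_profile_exists() {
    nmcli -t -f NAME connection show | grep -Fxq "$PORTAL_PROFILE"
}

# portal_profile_active: exit 0 if that profile is currently activated.
portal_profile_active() {
    nmcli -t -f NAME connection show --active | grep -Fxq "$PORTAL_PROFILE"
}
```

If portal_profile_active succeeds before wifi-connect has been launched, the access point from a previous run is probably still up.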

marclennox commented 4 years ago

Adding the following before calling wifi-connect seems to make the problem go away

nmcli connection down id "WiFi Connect" || true
nmcli connection delete id "WiFi Connect" || true

majorz commented 4 years ago

I see, the problem occurs when a "WiFi Connect" profile already exists, e.g. because of a power cycle. I will test this out.

marclennox commented 4 years ago

Correct. I think what might also exacerbate the problem in my particular setup is that I use the timeout option, then (in a loop) restart wifi-connect.

I have been able to build a fairly bullet-proof script using nmcli that all but eliminates this problem for me, so for now I have a very viable workaround.
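
The actual script is not shown in this thread. A minimal sketch of the pattern described, cleaning up the stale profile before every launch inside a restart loop, might look like this. The function name and the MAX_RUNS bound are hypothetical (added so the loop terminates), and the wifi-connect command line, including its timeout option, is left to the caller because the exact flag is not given here.

```shell
#!/bin/sh
# run_portal_loop MAX_RUNS CMD [ARGS...]
# Runs CMD up to MAX_RUNS times, removing any leftover "WiFi Connect"
# profile before each run. Failures of the cleanup calls are ignored,
# as in the two-line workaround above.
run_portal_loop() {
    max_runs=$1
    shift
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        nmcli connection down id "WiFi Connect" >/dev/null 2>&1 || true
        nmcli connection delete id "WiFi Connect" >/dev/null 2>&1 || true
        "$@" || true   # e.g. wifi-connect with its timeout option
        run=$((run + 1))
    done
}
```

A real script would probably also leave the loop once the device has joined a network; that condition is omitted here.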

meech-ward commented 4 years ago

@marclennox have you added any more code to your script, or does just running the following before wifi-connect fix the issue for you?

nmcli connection down id "WiFi Connect" || true
nmcli connection delete id "WiFi Connect" || true

marclennox commented 4 years ago

@meech-ward I've made the script a little more robust (dealing with a possible failure of each nmcli call), but yes, that's basically all I'm doing before launching wifi-connect, and it has solved the issue for me.

matteopeluso commented 4 years ago

Hi, it seems I am having the same problem on balenaOS 2.44.0+rev3.

I have cloned this repo, built it for the RPi 3 with balena, and deployed it in a multi-container application with network_mode: host and privileged: true.

It generates the AP but never opens the portal. For testing I am using an RPi 3 as the hardware, and a Xiaomi device and a MacBook Pro to connect to the portal.