MiSTer-devel / Linux-Kernel_MiSTer

Wifi randomly pulls a second 169 IP, causing all internet connections to fail post-boot #29

Open Drakonas opened 2 years ago

Drakonas commented 2 years ago

Over the last week a few of us have been hammering at what is causing these weird wifi-related failures on boot at random. When they occur, the MiSTer gets an IP, even gets the time when you don't have an RTC board, and SAMBA/SSH connections work, but anything attempting to get an internet connection after that fails. Even nslookup google.com 8.8.8.8 fails with a timeout to the DNS.

What we have found are the following:

Possible methods to fix (doing multiple is not a bad idea):

Any thoughts as to what may be directly causing this are welcome. I'd like to get to the bottom of this, as various users besides me have reported this happening at random with their MiSTer. We are using the latest Mr. Fusion images as far as I know.

I have attached my syslog for review.

/var/log/messages

sorgelig commented 2 years ago

To fix the problem, I need to be able to reproduce it. Generally speaking, 99% of the time I use a wired connection, so it's hard for me to work on this issue. I'm not very deep into Linux specifics, so don't treat me as a master here :) If you can offer a working solution, then go ahead.

sorgelig commented 2 years ago

There is a kind of race condition in the boot sequence. I was trying to fix it when I was working on Bluetooth, but it seems impossible to fix, especially when more USB devices are connected. You may try to play with the order of the /etc/init.d/* scripts; maybe it will help.

Drakonas commented 2 years ago

There is a kind of race condition in the boot sequence. I was trying to fix it when I was working on Bluetooth, but it seems impossible to fix, especially when more USB devices are connected. You may try to play with the order of the /etc/init.d/* scripts; maybe it will help.

It's hard to really see what's going on without access to the buildroot scripts. These don't seem to be public. Do you know where these are located for the project?

birdybro commented 2 years ago

It occurs for me too with this adapter --> https://www.amazon.com/gp/product/B08D72GSMS/ combined with this router+ap --> https://www.amazon.com/dp/B08DTF7KGC/ref=twister_B09P4Q7JK4

I just have to run ip addr flush dev wlan0 and it comes back with one address. The problem also doesn't come back between boots, and it is definitely related to the DHCP lease time: I've noticed it only comes back after my DHCP lease has expired (or after I've joined a different network). It also occurred at my parents' house with a totally different router+AP.

zakk4223 commented 2 years ago

There's multiple things going on:

1) the root filesystem is read-only so dhcpcd can't write a lease or duid file to /var/db/dhcpcd

1a) for some users the wlan0 device is available when dhcpcd first starts. It negotiates a lease but can't write any state to disk. Then for some reason it also receives a udev 'add' event for wlan0. Due to the fact there's no lease state written it tries to refresh/rediscover a dhcp lease. For some users this fails (I suspect the router is applying "protection"). When it fails dhcpcd falls back to a self assigned IP and deletes the route/dns for the 'good' lease. Or at least inserts a higher priority route to nowhere.

I unfortunately cannot debug this one because my wlan0 interface is not available when dhcpcd launches, so it only tries to get a lease once due to udev add event. I've seen Drakonas' logs and their wlan0 device is detected a full 2 seconds before mine so I suspect this is just due to variations in USB setups (hub, other devices etc).

Either /var/db/dhcpcd needs to be writeable or dhcpcd needs to use a different database directory. You can symlink /var/db/dhcpcd to /media/fat/dhcpcd and it will work. Or you could recompile dhcpcd and set DBDIR to /media/fat/dhcpcd (configure --dbdir=/media/fat/dhcpcd)

2) udhcpc is still being run for wlan0 which means there are two dhcp clients running on those interfaces with possibly unpredictable results.

If you change /etc/network/interfaces so the line like 'iface wlan0 inet dhcp' is instead 'iface wlan0 inet manual' the ifup script won't try to invoke a dhcp client, but will still invoke the pre-up scripts for wpa_supplicant. Then dhcpcd will handle the dhcp lease when it runs.
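The /etc/network/interfaces change described above can be sketched as a small text rewrite. This is only an illustration: the function name `set_wlan_manual` and the sample file contents (including the `pre-up` line) are made up for the example and are not taken from the MiSTer image.

```python
import re

# Matches "iface wlanN inet dhcp" on a single line (spaces/tabs only, so
# neighbouring lines are never swallowed by the pattern).
_WLAN_DHCP = re.compile(
    r"^([ \t]*iface[ \t]+wlan\d+[ \t]+inet[ \t]+)dhcp[ \t]*$",
    re.MULTILINE,
)

def set_wlan_manual(interfaces_text):
    """Switch wlan stanzas from 'inet dhcp' to 'inet manual'; leave the rest alone."""
    return _WLAN_DHCP.sub(r"\1manual", interfaces_text)

# Hypothetical sample, not the real MiSTer file.
sample = """auto lo
iface lo inet loopback

iface eth0 inet dhcp

iface wlan0 inet dhcp
    pre-up wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant.conf
"""

print(set_wlan_manual(sample))
```

Only the wlan stanza changes; the eth0 stanza and the pre-up line are passed through untouched, which matches the intent that ifup still runs the wpa_supplicant hook but no longer starts a DHCP client itself.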

sorgelig commented 2 years ago

This has nothing to do with buildroot. The boot scripts are in the image and can be read and tweaked. For debugging, the root fs can be mounted read/write at boot (by uncommenting the line in inittab). If that fixes the problem, then we need to check which directory needs to be mounted rw (as tmpfs).
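To see whether a given mount point is currently read-only, the mount table can be inspected. The sketch below parses text in the /proc/mounts format; the helper name `is_read_only` and the sample lines are assumptions made for the example, not output captured from a MiSTer.

```python
def is_read_only(mounts_text, mount_point="/"):
    """Return True/False for the last entry mounted at mount_point, or None if absent.

    Expects /proc/mounts style lines: device, mount point, fs type, options, ...
    Only exact mount points are matched; a path inside a mount is not resolved.
    """
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries for the same mount point override earlier ones.
            state = "ro" in fields[3].split(",")
    return state

# Hypothetical sample; on a real system pass open("/proc/mounts").read() instead.
sample = (
    "/dev/root / ext4 ro,relatime 0 0\n"
    "tmpfs /tmp tmpfs rw,nosuid,nodev 0 0\n"
)

print(is_read_only(sample))          # root entry
print(is_read_only(sample, "/tmp"))  # tmpfs entry
```

Running this before and after uncommenting the inittab line would show whether root is actually rw at the moment dhcpcd tries to write its lease files.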

prenetic commented 1 year ago

I have what seems to be the same or at least a very similar problem, outlined in greater detail in this thread on the Mister FPGA forums:

https://misterfpga.org/viewtopic.php?p=58198#p58198 https://misterfpga.org/viewtopic.php?p=58210#p58210 https://misterfpga.org/viewtopic.php?p=58274#p58274

Essentially two leases are allocated to the MiSTer on Wi-Fi (haven't tested whether this happens wired as well). Same MAC address, but one is registered without a hostname and with a vendor ID of "udhcp", the other with the "MiSTer" hostname but no vendor ID (from dhcpcd). When the device comes up, there is a brief moment of connectivity, followed by 10-20 seconds of disruption, and then connectivity again. You can't see both addresses with ifconfig, but you CAN see both with ip address. The flags differ too, with the udhcpc address showing perpetual validity (doesn't seem to respect the DHCP lease time).

Looking at the DHCP datagrams it's clearer -- the udhcpc requests come in with the MAC address (DUID) as the client identifier, but the dhcpcd request comes in with the MAC address (DUID) PLUS an IAID as the client identifier. In the eyes of the DHCP server, these each require a unique IP address despite having the same base MAC address.

For testing/as a workaround, I changed the following option from duid to clientid, which causes dhcpcd to only send the MAC address as part of the DHCP request, so now the client identifiers between udhcpc and dhcpcd match and only one lease is provided, as confirmed by logs on my DHCP server (dnsmasq on my router).

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

So given the behavior I think we're running into the same thing here. I'm under the impression that when the choice was made to transition to dhcpcd from udhcpc, the latter wasn't fully disabled in the base image and the two DHCP clients are conflicting and causing issues -- so +1 to sticking to one or the other regardless of any other fixes.

Separately, if people are still running into new IP addresses every startup/polluting DHCP pools even after disabling one of the two DHCP clients, then the config change above should take care of that problem since the MAC address shouldn't be changing every boot. There's really no reason to include IAID as part of the client identifier for the case of MiSTer as far as I can tell (though it shouldn't be cycling with every boot anyway), and it is typically omitted for compatibility purposes for IPv4 anyway.
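A quick way to confirm which of the two identifier options is actually active in a dhcpcd.conf is to ignore comments and look at the remaining keywords. This is a sketch only; the helper name `active_options` is made up, and the sample text is the fragment quoted above.

```python
def active_options(conf_text, names=("clientid", "duid")):
    """List the option keywords from `names` that appear on uncommented lines."""
    active = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        keyword = stripped.split()[0]
        if keyword in names:
            active.append(keyword)
    return active

# The fragment from this thread, after the workaround has been applied.
fragment = """# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid
"""

print(active_options(fragment))
```

If both keywords show up as active, the file was probably edited only halfway.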

Drakonas commented 1 year ago
# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

To clarify, in what file is this change supposed to be made? I assume /etc/dhcpcd.conf

prenetic commented 1 year ago
# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

To clarify, in what file is this change supposed to be made? I assume /etc/dhcpcd.conf

Yep, that's the one. Sorry I forgot to include that here.

Akuma-Git commented 1 year ago

That's because ifup/ifdown are hardcoded to use udhcpc. Try renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc.

Drakonas commented 1 year ago

That's because ifup/ifdown are hardcoded to use udhcpc. Try renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc.

Why is this then? It seems, based on this, that two separate DHCP clients can (or maybe always?) load on startup, given the right scenario. Does Busybox handle both for the MiSTer setup? Pretty sure that fixing this should really warrant a change having everything hardcoded for one DHCP client, instead of requiring dirty workarounds

Also, please see Zakk's statements a couple months ago in this issue. There is more to the issue than two DHCP daemons. @Akuma-Git

Akuma-Git commented 1 year ago

Why is this then?

Because udhcpc, ifup, ifdown are busybox components.

It seems, based on this, that two separate DHCP clients can (or maybe always?) load on startup, given the right scenario.

Correct

Does Busybox handle both for the MiSTer setup? Pretty sure that fixing this should really warrant a change having everything hardcoded for one DHCP client, instead of requiring dirty workarounds

Idk, afaict the dhcpcd package is unnecessary

Also, please see Zakk's statements a couple months ago in this issue. There is more to the issue than two DHCP daemons.

Correct, this is due to configuration errors resulting in some fighting between:

prenetic commented 1 year ago

Idk, afaict the dhcpcd package is unnecessary

I'm curious what sparked this change, as it seems like it was an intentional choice to switch to dhcpcd. Possibly @sorgelig can provide more context, if there was originally an issue with udhcpc that can be addressed here.

sorgelig commented 1 year ago

because udhcpc didn't work well. I will re-check it. Make sure your solution works with Ethernet connection too.

prenetic commented 1 year ago

Via wired Ethernet without the dhcpcd.conf client identifier change (duid) -- datagrams include IAID of 04050607:

dnsmasq DHCP logs

user@router1:~$ tail -f -n 0 /var/log/dnsmasq.log | grep -i 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPDISCOVER(bond0.10) 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPOFFER(bond0.10) 192.168.10.212 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPREQUEST(bond0.10) 192.168.10.212 02:03:04:05:06:07
Aug 17 16:27:14 dnsmasq-dhcp[32608]: DHCPACK(bond0.10) 192.168.10.212 02:03:04:05:06:07 MiSTer

dnsmasq DHCP lease

02:03:04:05:06:07 192.168.10.212 MiSTer ff:04:05:06:07:00:03:00:01:02:03:04:05:06:07

Datagram contents

Option: (61) Client identifier
    Length: 15
    IAID: 04050607
    DUID Type: link-layer address (3)
    Hardware type: Ethernet (1)
    Link layer address: 02:03:04:05:06:07

Active IP addresses

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.212/24 brd 192.168.10.255 scope global dynamic noprefixroute eth0
       valid_lft 82688sec preferred_lft 71888sec

Via wired Ethernet with the dhcpcd.conf client identifier change (clientid) -- datagrams do not include IAID:

dnsmasq DHCP logs

user@router:~$ tail -f -n 0 /var/log/dnsmasq.log | grep -i 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPDISCOVER(bond0.10) 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPOFFER(bond0.10) 192.168.10.211 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPREQUEST(bond0.10) 192.168.10.211 02:03:04:05:06:07
Aug 17 16:11:22 dnsmasq-dhcp[30909]: DHCPACK(bond0.10) 192.168.10.211 02:03:04:05:06:07 MiSTer

dnsmasq DHCP lease

02:03:04:05:06:07 192.168.10.211 MiSTer 01:02:03:04:05:06:07

Datagram contents

Option: (61) Client identifier
    Length: 7
    Hardware type: Ethernet (0x01)
    Client MAC address: MS-NLB-PhysServer-03_04:05:06:07 (02:03:04:05:06:07)

Active IP addresses

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.211/24 brd 192.168.10.255 scope global dynamic noprefixroute eth0
       valid_lft 86327sec preferred_lft 75527sec

Regardless of which configuration is used, I'm only seeing one set of requests show up from dhcpcd -- nothing comes in from udhcpc over wired Ethernet from what I can tell. This seems to indicate the udhcpc/duplicate lease issue is limited to Wi-Fi (and possibly USB wired Ethernet adapters as well), and that this change doesn't appear to have any negative impact for wired Ethernet other than temporarily creating a second lease when changing the IAID behavior, which you can see above. The second lease in this case should expire gracefully, as the two aren't tied up at the same time like they are via Wi-Fi.
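The two client identifiers in the dnsmasq lease entries above can be decoded by hand. The sketch below assumes the RFC 4361 layout for identifiers starting with ff (type byte, 4-byte IAID, then a DUID) and the classic "hardware type + address" layout otherwise; the function name `decode_client_id` is made up for the example.

```python
def _colon_hex(data):
    """Format bytes as colon-separated lowercase hex."""
    return ":".join(f"{b:02x}" for b in data)

def decode_client_id(text):
    """Decode a colon-separated DHCP option 61 value as shown in the lease file."""
    raw = bytes.fromhex(text.replace(":", ""))
    if raw[0] == 0xFF:
        # RFC 4361 style: 0xff, IAID (4 bytes), DUID (2-byte type + payload).
        duid = raw[5:]
        result = {
            "kind": "iaid+duid",
            "length": len(raw),
            "iaid": raw[1:5].hex(),
            "duid_type": int.from_bytes(duid[0:2], "big"),
        }
        if result["duid_type"] == 3:  # DUID-LL: hardware type + link-layer address
            result["hw_type"] = int.from_bytes(duid[2:4], "big")
            result["mac"] = _colon_hex(duid[4:])
        return result
    # Plain form: 1-byte hardware type followed by the hardware address.
    return {
        "kind": "hardware",
        "length": len(raw),
        "hw_type": raw[0],
        "mac": _colon_hex(raw[1:]),
    }

with_duid = decode_client_id("ff:04:05:06:07:00:03:00:01:02:03:04:05:06:07")
with_clientid = decode_client_id("01:02:03:04:05:06:07")
print(with_duid)
print(with_clientid)
```

Both values carry the same MAC address, but the byte strings differ, which is consistent with the DHCP server treating them as two separate clients.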

@Drakonas while this certainly isn't a long-term fix, does the change to dhcpcd.conf I mentioned above take care of the second/169.x.x.x address issue you were seeing via Wi-Fi? Wondering if your DHCP server just isn't handling the IPv4 IAID that's going out through dhcpcd and fails; that's known behavior for some vendors.

zakk4223 commented 1 year ago

There are at least two 'multiple lease' problems:

1) udhcpc and dhcpcd both try to get leases on an interface. This is not the original poster's problem, but obviously it is happening for some people. Seems likely that udhcpc just needs to be disabled.

2) On some systems 'wlan0' is visible when dhcpcd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process that the filesystem is still read-only, so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to IPv4LL, which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.

The log linked in the initial post seems to indicate the same IAID is used for both requests. Unfortunately I can't reproduce the issue here (my wlan0 is not available when dhcpcd starts up, so it only reacts to the udev event) so I can't see if there's anything different about the request.

dhcpcd needs a way to write lease files, even on startup. I'm not sure if it is feasible to move the startup of dhcpcd so it always starts after the filesystem is remounted rw.

prenetic commented 1 year ago
  1. On some systems 'wlan0' is visible when dhcpcd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process that the filesystem is still read-only, so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to IPv4LL, which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.

So I haven't been able to repro this one on my end, but I wonder if it'd be good enough to instead write the DHCP lease state to ephemeral storage /tmp (and not to SD) since minimizing writes by default seems to be a design philosophy of MiSTer.

sorgelig commented 1 year ago

If you already have a working config (for both Ethernet and Wi-Fi), then please put the modified files here and I will include them in the next Linux release.

ghost commented 1 year ago

The problem is related to two DHCP clients running at the same time: udhcpc being run on demand and dhcpcd running in the background.

It's not a kernel problem but a problem in the userspace Linux image.

Do:
ifdown wlan0
killall -9 dhcpcd
ifup wlan0

Drakonas commented 1 year ago

The problem is related to two DHCP clients running at the same time: udhcpc being run on demand and dhcpcd running in the background.

It's not a kernel problem but a problem in the userspace Linux image.

Do:
ifdown wlan0
killall -9 dhcpcd
ifup wlan0

I believe the fix suggested by @gkrzystek is similar to the proposed change here.

I, for one, think this should be revisited.

In short, the boot script in the linux image doesn't actually bring down the interfaces and kill the dhcp client. From my understanding, they are left running and the boot process starts again, attempting to grab a new lease with the previous one still active. Addressing this should fix this issue, I expect.

The proposed change also unmounts the filesystem. I can neither confirm nor deny that this is necessary.

Drakonas commented 1 year ago

@Drakonas while this certainly isn't a long-term fix, does the change to dhcpcd.conf I mentioned above take care of the second/169.x.x.x address issue you were seeing via Wi-Fi? Wondering if your DHCP server just isn't handling the IPv4 IAID that's going out through dhcpcd and fails; that's known behavior for some vendors.

[Screenshot: obs64_2023-01-16_14-22-50]

Sorry to get back so late on this, but the issue is not resolved by changing to clientid alone and leaving /usr/sbin/udhcpc in place.

Disabling udhcpc (renaming it) has consistently fixed the issue for me over wifi, and I've been using ethernet without issue for months.

I should mention that the udhcpc issue does not affect everyone, but that's because some routers do not get confused by the duplicate lease attempt and handle it properly. Good routers will generally not see this issue. But cheap or poorly made ones (especially those provided by Internet providers, some of which force you to use them) will get confused and hand off a wrong IP or fail to give the lease, and it's easily reproducible. I am using one of these routers, sadly.

@sorgelig Does this give enough information that ethernet is unaffected, and that getting rid of udhcpc should be looked into? I can do more testing if you'd like.

sorgelig commented 1 year ago

so, all i need to do is to remove udhcpc and problem solved?

ghost commented 1 year ago

@Drakonas the statement about "good" routers is a slight miss. All routers are just another Linux. I tested most DHCP server implementations, and by default most of them assign leases on the combination of client ID + MAC; you have to specifically set a flag to ignore the RFC and use MAC-only leases...

So it's not that the routers are bad, but that our Linux distro is badly set up.

ghost commented 1 year ago

What we can do here is:

1) set dhcpcd to actually work only on eth0, which will leave wlan0 for udhcpc
2) alter the wpa_supplicant config so it does not call udhcpc, and leave the DHCP job to dhcpcd

The fact that udhcpc exists in the system (as part of BusyBox) doesn't mean we have to use it at all.

Some background: https://wiki.archlinux.org/title/dhcpcd

ghost commented 1 year ago

Simplest solution for everyone affected who wishes to test: add the following line to /etc/dhcpcd.conf and reboot:
denyinterfaces wlan*
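To see which interfaces a pattern like wlan* would exclude, here is a small sketch. It assumes dhcpcd matches interface names against the pattern with fnmatch(3)-style globbing, as its dhcpcd.conf manual describes; Python's `fnmatchcase` is used as a stand-in, and the helper name `denied_interfaces` is made up.

```python
from fnmatch import fnmatchcase

def denied_interfaces(names, patterns):
    """Return the interface names that match any of the deny patterns."""
    return [name for name in names if any(fnmatchcase(name, p) for p in patterns)]

# Interface names seen in this thread, plus wlan1 from the interfaces file.
interfaces = ["lo", "eth0", "wlan0", "wlan1"]

print(denied_interfaces(interfaces, ["wlan*"]))
```

With this setting dhcpcd would leave every wlan interface alone, so whatever else configures Wi-Fi (udhcpc via ifup in the default image) becomes the only DHCP client there.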

prenetic commented 1 year ago

What we can do here is:

  1. set dhcpcd to actually work only on eth0, which will leave wlan0 for udhcpc
  2. alter the wpa_supplicant config so it does not call udhcpc, and leave the DHCP job to dhcpcd. The fact that udhcpc exists in the system (as part of BusyBox) doesn't mean we have to use it at all. Some background: https://wiki.archlinux.org/title/dhcpcd

I'm not sure option 1 here works as-is. When adding the denyinterfaces wlan* line to /etc/dhcpcd.conf I lose all DNS resolution on the device, and looking at /etc/resolv.conf I'm no longer seeing my DHCP-advertised DNS servers or domain suffix. May make more sense to stick with dhcpcd for everything if it's already handling the generation of resolv.conf.

[01/20/23 11:43:11 AM]
root@MiSTer:~>cd /media/fat/Scripts/ && ./update_all.sh
Launching Update All

No Internet connection, please try again later.
[01/20/23 11:45:39 AM]
root@MiSTer:/media/fat/Scripts>ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:b0:24:29:08:21 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.66/24 brd 192.168.10.255 scope global wlan0
       valid_lft forever preferred_lft forever
[01/20/23 11:46:29 AM]
root@MiSTer:/media/fat/Scripts>cat /etc/resolv.conf
# Generated by dhcpcd
# /etc/resolv.conf.head can replace this line
# /etc/resolv.conf.tail can replace this line

ghost commented 1 year ago

Have you rebooted? I set it on my system, booting with only Wi-Fi connected, and:

root@MiSTer:>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:13:25:4c:17:e4 brd ff:ff:ff:ff:ff:ff
    inet 10.76.175.195/24 brd 10.76.175.255 scope global wlan0
       valid_lft forever preferred_lft forever

Only a single IP on the interface.

root@MiSTer:~>cat /etc/resolv.conf
# Generated by dhcpcd
# /etc/resolv.conf.head can replace this line
# /etc/resolv.conf.tail can replace this line
search ninex.info # wlan0
nameserver 10.76.175.1 # wlan0
[01/20/23 10:55:47 PM]

Note: you should not have both Ethernet and Wi-Fi connected.

There is a small chance that on your system Wi-Fi starts before dhcpcd, which then overwrites resolv.conf.

IMHO we should go with dhcpcd as is, globally, and just reconfigure the wpa_supplicant hook so it does not call udhcpc...

ghost commented 1 year ago

Found the ULTIMATE simple solution: switch all DHCP to dhcpcd (so please revert dhcpcd.conf to default).

In /etc/network/interfaces, change

iface wlan0 inet dhcp
iface wlan1 inet dhcp

to

iface wlan0 inet manual
iface wlan1 inet manual

Explanation:

The network startup script starts udhcpc because the interface is set to dhcp (we do not want that). Setting it to manual makes the startup script assume the user will provide the necessary addresses...

As the dhcpcd daemon listens, it picks up the interface and configures it...

Boom, magic ;)

root@MiSTer:>ps auxw |grep dhcp
645 dhcpcd dhcpcd: [master] [ip4]
646 root dhcpcd: [privileged actioneer]
647 dhcpcd dhcpcd: [network proxy]
648 dhcpcd dhcpcd: [control proxy]
717 dhcpcd dhcpcd: [BPF ARP] wlan0 10.76.175.197
866 root grep dhcp
[01/20/23 11:52:40 PM]
root@MiSTer:~>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 02:03:04:05:06:07 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:13:25:4c:17:e4 brd ff:ff:ff:ff:ff:ff
    inet 10.76.175.197/24 brd 10.76.175.255 scope global dynamic noprefixroute wlan0
       valid_lft 6914sec preferred_lft 6014sec

root@MiSTer:~>cat /etc/resolv.conf
# Generated by dhcpcd from wlan0.dhcp
# /etc/resolv.conf.head can replace this line
domain ninex.info
nameserver 10.76.175.1
# /etc/resolv.conf.tail can replace this line

prenetic commented 1 year ago

Ahh nice, I like the simplicity of this approach. The proposed change to /etc/network/interfaces is working flawlessly for me. I've rebooted (soft and hard) about 40 times now and I'm able to connect via Wi-Fi and resolve DNS records every time.

sorgelig commented 1 year ago

Need more feedbacks. If it will work for others, then i will add it.

Drakonas commented 1 year ago

With this "inet manual" fix and no others, I can reach the machine via Samba (and the wifi symbol shows), but it still doesn't seem to register everything properly:

[Screenshot: obs64_2023-01-26_11-18-58]

I have re-enabled udhcpc for this test. It seems udhcpc still causes the fault on boot. Renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc (or something else to disable it) always fixes this for me.

prenetic commented 1 year ago

With this "inet manual" fix and no others, I can reach the machine via Samba (and the wifi symbol shows), but it still doesn't seem to register everything properly: [Screenshot: obs64_2023-01-26_11-18-58] I have re-enabled udhcpc for this test. It seems udhcpc still causes the fault on boot. Renaming /usr/sbin/udhcpc to /usr/sbin/_udhcpc (or something else to disable it) always fixes this for me.

Assuming your MiSTer is fully up-to-date, can you confirm whether /etc/resolv.conf is being generated properly with inet manual specified?
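One rough way to answer that question is to check whether the generated file contains any nameserver entries at all. The sketch below uses the two /etc/resolv.conf contents posted earlier in this thread as samples; the helper name `nameservers` is made up for the example.

```python
def nameservers(resolv_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_text.splitlines():
        # Drop trailing comments such as "# wlan0" before splitting.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

# Posted earlier with "denyinterfaces wlan*": header comments only.
empty_conf = """# Generated by dhcpcd
# /etc/resolv.conf.head can replace this line
# /etc/resolv.conf.tail can replace this line
"""

# Posted earlier with the "inet manual" change.
working_conf = """# Generated by dhcpcd from wlan0.dhcp
# /etc/resolv.conf.head can replace this line
domain ninex.info
nameserver 10.76.175.1
# /etc/resolv.conf.tail can replace this line
"""

print(nameservers(empty_conf))
print(nameservers(working_conf))
```

An empty list means DNS lookups through the system resolver have nowhere to go, even if the interface itself has a valid address.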

ghost commented 1 year ago

@Drakonas only one fix at a time. Technically you should roll back to default settings and (ONLY!) make the /etc/network/interfaces change. Use only one active interface, either Wi-Fi or Ethernet, not both at once. Then take a screenshot of:
ps auxw
ip a
ip r s
cat /etc/resolv.conf

There is no magic here; the dhcpcd daemon will work for all (assuming your Linux is up to date).

Drakonas commented 1 year ago

@Drakonas only one fix at a time. Technically you should roll back to default settings and (ONLY!) make the /etc/network/interfaces change. Use only one active interface, either Wi-Fi or Ethernet, not both at once. Then take a screenshot of: ps auxw, ip a, ip r s, cat /etc/resolv.conf

There is no magic here; the dhcpcd daemon will work for all (assuming your Linux is up to date).

I said 'With this "inet manual" fix and no others'. I had reverted all other changes to default prior to testing including any dhcp config and executable renames, but I'll do it again to get you this information.

Ethernet is not affected by any of these changes, and I've never had trouble with ethernet regardless of using everything default or not. It's just wifi that is affected.

The following is only with the inet manual change in /etc/network/interfaces. The /etc/dhcpcd.conf is default, and /usr/sbin/udhcpc exists. I've removed screenshot previews because this post would be astronomically long with them.

[Screenshots: With eth0 only; With wlan0 only]

As you can see from the wlan0 screenshot, with the inet manual fix alone, dhcpcd still attempts to get another lease with a 169 address.

[Screenshots: wlan0 - ip a; wlan0 - ip r s]

Furthermore, wlan0 has two IPs registered, one dynamic and one global.

This allows connections between my machine and the MiSTer (albeit hostname-relationships do not work), but the MiSTer scripts cannot get an internet connection.
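Spotting the extra 169.254.x.x address can also be done without eyeballing screenshots. The sketch below scans `ip a` style text for IPv4 addresses and flags link-local ones; the helper name `ipv4_addresses` and the sample addresses are made up for illustration and are not taken from the screenshots.

```python
import ipaddress
import re

def ipv4_addresses(ip_a_text):
    """Return (interface, address, is_link_local) for each 'inet' line."""
    found = []
    iface = None
    for line in ip_a_text.splitlines():
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            iface = header.group(1)  # e.g. "3: wlan0: <...>"
            continue
        inet = re.match(r"^\s+inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", line)
        if inet and iface:
            addr = ipaddress.ip_address(inet.group(1))
            found.append((iface, str(addr), addr.is_link_local))
    return found

# Hypothetical output with a DHCP address and a link-local fallback on wlan0.
sample = """3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:b0:24:29:08:21 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.66/24 brd 192.168.10.255 scope global dynamic noprefixroute wlan0
       valid_lft 86327sec preferred_lft 75527sec
    inet 169.254.10.20/16 brd 169.254.255.255 scope global noprefixroute wlan0
       valid_lft forever preferred_lft forever
"""

for entry in ipv4_addresses(sample):
    print(entry)
```

Any entry with the last field set to True is an IPv4LL address; seeing one next to a normal DHCP address on the same interface is the symptom discussed in this issue.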

[Screenshot: wlan0 - cat /etc/resolv.conf]

DNS still working though. Yay.

Now, if I rename /usr/sbin/udhcpc to /usr/sbin/_udhcpc, but leave this inet manual fix intact, scripts still do not get an internet connection. dhcpcd still obtains a second IP with a 169 address:

[Screenshot: wlan0 - ps auxw (inet manual + no udhcpc)]

I should mention that my initial attempt to run inet manual with no udhcpc was met with it finally only grabbing one IP address. However, as I know this issue is related to certain modems getting confused, I turned the MiSTer off, unplugged the wlan adapter, plugged in ethernet, and then turned it on. It grabbed a new IP. I rebooted once more as-is. Still fine. Then I turned it off as-is, unplugged ethernet and plugged in wlan adapter, and now it gets a 169 address again. So this shows my initial experience with it eventually working after a few reboots but the issue will return again later.

My assumption for this working after a few reboots eventually is the modem stops getting confused. So you have to force the MiSTer to change IP's, then the issue returns when you next try wifi again.

Now, if I leave /usr/sbin/_udhcpc renamed (disabled) and revert /etc/network/interfaces to defaults (inet dhcp), everything works. See below:

[Screenshot: wlan0 - ps auxw (inet dhcp default and no udhcpc)]

And now I will repeat the exact same wlan0 -> ethernet + reboot twice -> wlan0 to prove it will work the first time wlan0 pulls a new IP (I'm writing this before doing it, to show how confident I am that removing udhcpc is all that is needed to fix this issue):

[Screenshots: wlan0 working (only /usr/sbin/udhcpc removed); ps auxw && ip a && ip r s]

TL;DR. Just remove udhcpc. That's all that's needed. There's no reason to get any more complicated.

ghost commented 1 year ago

@Drakonas calm down man, I just wanted to see what is going on here. Now I see the situation on your system, and a new question is raised:

[image]

For some reason dhcpcd has a problem keeping the IP allocation. See your own screenshot: a single dhcpcd keeps 2 addresses, one allocated by your router and one link-local (usually spun up only when no DHCP server is found). Most probably the problem here is not DHCP. You can add the line:

noipv4ll

to your /etc/dhcpcd.conf

which will prevent link-local addresses (169.254.x.x) from being brought up.

However, this will solve only the "no internet" "problem".

Can you please examine the output of the iwconfig command? Such DHCP problems occur mostly when there is a poor signal level / link quality or a high noise level on the Wi-Fi. What I am guessing is that the problem here is poor link quality (and what I am trying to point out is that the problem you are fighting with, however similar to the problem I pointed out, is something different).

Also, what I would suggest is to put wpa_supplicant into debug mode and see whether it reports rapid reconnections, because I am sure dhcpcd isn't the problem here; it is more like a canary.

Thanks for the help with the investigation.

ghost commented 1 year ago

Note for @sorgelig: the general fix for the udhcpc vs dhcpcd conflict should go in the next Linux release. The fix that disables IPv4LL should not, because link-local addressing may be used to directly connect a NAS or something over Ethernet, or in many other scenarios.

Drakonas commented 1 year ago

There are at least two 'multiple lease' problems:

  1. udhcpc and dhcpcd both try to get leases on an interface. This is not the original poster's problem, but obviously it is happening for some people. Seems likely that udhcpc just needs to be disabled.

  2. On some systems 'wlan0' is visible when dhcpcd launches, and it immediately sends a lease request and gets a valid response. However, it is so early in the boot process that the filesystem is still read-only, so it can't write a lease state file. Then for some reason it receives a udev 'device added' event. Without a valid lease file it tries to solicit a new lease. This fails, so it falls back to IPv4LL, which then adds a 2nd IP, but more importantly it messes up the route table and makes the network effectively non-functional.

The log linked in the initial post seems to indicate the same IAID is used for both requests. Unfortunately I can't reproduce the issue here (my wlan0 is not available when dhcpcd starts up, so it only reacts to the udev event) so I can't see if there's anything different about the request.

dhcpcd needs a way to write lease files, even on startup. I'm not sure if it is feasible to move the startup of dhcpcd so it always starts after the filesystem is remounted rw.

https://github.com/MiSTer-devel/Linux-Kernel_MiSTer/issues/29#issuecomment-1218999073

@gkrzystek as stated here, this is the cause. Please read the thread before saying my wifi 6 router being 3 meters away from my MiSTer is the issue.

I have been calm, but after that post I am trying my best to be civil. Lol. I have spent months testing this, and replacing my modem was already something I tried. The problem still wasn't fixed.

I am open to suggestions, but I propose we try to figure out what is causing wlan0 to be visible sometimes when the MiSTer launches, while having @sorgelig move ahead with removing udhcpc/ifup/ifdown, as they're all hardcoded to use udhcpc in BusyBox. This will fix a number of people's issues, but not all (as in problem 2 in the quoted post above).

So, in regards to what we already know, I have a new theory, and I can do more testing for this @zakk4223. From what I have found recently, the hard reboot script for MiSTer does not fully reboot and does not bring down the interfaces or the DHCP client, but the boot script is relaunched. Could that be the cause of some people having multiple leases?

I'm wondering if some people thought rebooting from the MiSTer menu and power cycling it was the same thing, but my recent findings have shown they are not. Looking further up in this thread you'll find a link to someone proposing a script change for the cold reboot script.

I am not sure if this is necessary to address the issue, but I am wondering if cold rebooting might result in a different boot process that is worth testing. I can do some testing later on. I am heading to bed lol.

ghost commented 1 year ago

@Drakonas the trick is, mate, that the / filesystem is always RO; it gets remounted rw on user login or script run. I would have to spend another 2 hours to explain why in detail. Long story short, it's an embedded system with pivot root and without a proper shutdown procedure. (If / were rw the whole time, on every reboot you would get at least a journal recovery or an fsck.)

1. To diagnose this properly, hook the console up to a PC via USB cable.

2. Your diagnosis is almost fine, but the order doesn't match. Look at the PIDs of the dhcp forks: your IPv4LL address gets set after the DHCP lease was obtained (1039 < 1604). That means your system seems to get a proper IP from DHCP and then, for some reason, "thinks" it didn't get one.

I fully understand your frustration, mate, and I am really interested in finding where the problem is. However, as you are more focused on complaining than on actual troubleshooting... I pass. I sorted out my problem and shared with others the steps to fix problems similar to mine, and as you don't wish to cooperate, I am stepping down.

Drakonas commented 1 year ago

I deleted my original post. I was too hasty to respond and for that I apologize.

ghost commented 1 year ago

@Drakonas no hard feelings. Please try modifying dhcpcd.conf to put it into debug mode, switch wpa_supplicant into debug mode as well, and boot the system with the console hooked up over USB (read the docs on how to use the USB terminal with PuTTY). You can then grab the boot text output from PuTTY and share it. There is a small chance the wifi driver is making trouble for us, or something like that. I really love solving such puzzles... I just need more detailed data. This process will take some work from both of us. Please try to help me to help you :)

Drakonas commented 1 year ago

I can put the effort in and am willing to. I will report back tomorrow. It is 6AM here and I must sleep. Thank you for being understanding. I have anger issues and it's been hard work getting them under control for the past few years.

ghost commented 1 year ago

And about reboot from the menu: a short press only restarts the menu process; the OS stays untouched. If you press for long (hard reboot), it just kills the whole system, like a kernel panic, and kicks the bootloader, so the kernel reinitialises everything as after a power loss. So the idea, @zakk4223, is half true: only the short press does not reboot, but it does not actually touch the network; it stays open.

You can check it out: keep an ssh session to your MiSTer (or the USB console) open. On the USB console you will see whether the system fully reboots or not, depending on the cold/hot reboot chosen from the menu. On ssh you will see that your connection is cut only on a hard reboot. And I agree this menu item is counterintuitive.

Drakonas commented 1 year ago

My ssh connection isn't cut even when I power cycle manually, which doesn't make sense to me. It should lose connection. During all my tests with eth0 only and wlan0 only, the ssh connection was never cut when I turned off my MiSTer. I use a splitter to the USB hub and DE10-Nano, with analog IO.

ghost commented 1 year ago

My ssh connection isn't cut even when I power cycle manually, which doesn't make sense to me. It should lose connection.

Huh, that would be magic (sorry for my sarcasm), or your hub is feeding power to the system somehow. There is no way to keep an ssh session alive if the system is properly switched off.

Reset the power by unpowering both the DE10 and the hub. Also, please share what kind of USB network adapter you have. I have 3 different units using the same driver, and all behave differently...

Drakonas commented 1 year ago

Nevermind, if I wait long enough it is lost. I am tired.

Drakonas commented 1 year ago

I have added the debug parameter to the dhcpcd.conf but wpa_supplicant doesn't seem to support this in its conf, only in execution with -d or -dd parameter. Where is wpa_supplicant executed from? I am unable to find that, and I'm unsure if it can be changed.
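One way to answer the "where is wpa_supplicant executed from" question is to search the startup scripts for the string. This is only a sketch: `find_references` is a made-up helper, and pointing it at `/etc` (or `/etc/init.d`, which was mentioned earlier in this thread) is an assumption about where the launch command lives.

```python
import os


def find_references(root, needle):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files
            try:
                with open(path, errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file; skip it
```

For example, `for hit in find_references("/etc", "wpa_supplicant"): print(*hit)` would print every script line that mentions it, which is where a `-d` or `-dd` parameter would have to be added.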

I will check back later on.

ghost commented 1 year ago

@sorgelig can you please rebuild the Linux image with the change I proposed?

/etc/network/interfaces:
    iface wlan0 inet manual
    iface wlan1 inet manual

Plus rebuild the kernel with the updated RTL drivers that you merged from me, so I can try to help @Drakonas with his wifi issue diagnostics. The RTL driver update should definitely improve the dongle compatibility list.

sorgelig commented 1 year ago

https://www.mediafire.com/file/a8dhdti53hntru3/test.7z/file

ghost commented 1 year ago

@Drakonas please download, unpack, and replace linux.img and zImage_dtb in the linux folder on your SD card. This image contains more stable Realtek wifi dongle drivers, as well as the modification I requested, so it should connect more reliably. If not, I will request a list of actions from you, like setting the wifi driver into debug mode, etc. No worries, I will paste a list of precise instructions.

ghost commented 1 year ago

@sorgelig thanks for the quick response. I just finished testing with all 10 dongles I have. Can you consider publishing this image in mister-devel? We need more input from users...

The changes we supplied solve the primary problem of 2 DHCP clients handling wlan0, which (at least for some users) was causing problems. I do not expect more problems to occur because of those changes: one is cosmetic, and the RTL drivers have a bigger userbase in the aircrack-ng community than we have ;) The updated Realtek drivers will also reduce frustration, at least for newcomers. Those drivers are built with debug. I will continue the investigation with @Drakonas, as his case requires some low-level investigation. I am not able to reproduce it in any way other than actually breaking my wifi link into a state where more than 50% of packets are lost, which does not seem to be the root cause of his problem.