berlin-open-wireless-lab / DAWN

Decentralized WiFi Controller
GNU General Public License v2.0

Can't find all APs? #168

Open Cosmos-Break opened 2 years ago

Cosmos-Break commented 2 years ago

I have three APs: a Xiaomi AX6S, a Phicomm K2P and a NewWiFi D2. All three run OpenWrt 22.03.0-rc1.

The AX6S finds all three APs, the D2 finds two, and the K2P finds only one.

AX6S: image

D2: image

K2P: image

This is not a LuCI problem, because 'ubus call dawn get_network' returns the same result.

kmaras77 commented 2 years ago

I have the same problem. I have tested both TCP and UDP connections and the problem persists.

PolynomialDivision commented 2 years ago

Can you check if umdns is working correctly on all APs?

ptpt52 commented 2 years ago

@PolynomialDivision would each AP device see all APs? My case:

AP1-------AP2----internet->

AP2 sees AP1, but AP1 can't find AP2.

I also noticed that AP1 makes a TCP connection to AP2 on port 1026, but AP2 makes no connection to AP1.

The DAWN network config on all my APs is:

config network
    option broadcast_port '1025'
    option tcp_port '1026'
    option network_option '2'
    option shared_key 'Niiiiiiiiiiiiick'
    option iv 'Niiiiiiiiiiiiick'
    option use_symm_enc '0'
    option collision_domain '-1'
    option bandwidth '-1'
    option broadcast_ip '192.168.16.255'

ptpt52 commented 2 years ago

I think my case is the same issue as this one.

PolynomialDivision commented 2 years ago

@PolynomialDivision would each AP device see all APs?

Yes. Sounds like a umdns issue. Can you check:

ubus call umdns browse

ptpt52 commented 2 years ago

on AP1: ubus call umdns browse

{
..........
    "_dawn._tcp": {
        "X-WRT": {
            "port": 1026
        }
    },
......
}

on AP2

.........
    "_dawn._tcp": {
        "X-WRT": {
            "ipv4": "192.168.16.118",
            "ipv6": "fe80::e667:1eff:fe29:68fe",
            "port": 1026
        }
    },
.........

PolynomialDivision commented 2 years ago

The entry on AP1 is wrong. It should contain an IP like on AP2. DAWN does not know where to connect to. Can you restart umdns on both?
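
The difference between the two browse results above can be sketched as a small validity check. This is a minimal, self-contained illustration in C, not DAWN's code: struct dawn_peer_entry, enum peer_status and classify_peer() are hypothetical names, and the real implementation reads these fields from a ubus blobmsg rather than from a plain struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one "_dawn._tcp" entry as returned by
 * "ubus call umdns browse". This is NOT DAWN's real data structure. */
struct dawn_peer_entry {
    const char *ipv4; /* NULL when umdns reported no address */
    int port;         /* 0 when umdns reported no port */
};

enum peer_status {
    PEER_OK,         /* address and port present: safe to connect */
    PEER_NO_ADDRESS, /* port only, like the entry seen on AP1 */
    PEER_INCOMPLETE  /* no entry or no usable port */
};

/* Decide whether an entry carries enough information to open a TCP
 * connection to the peer. */
enum peer_status classify_peer(const struct dawn_peer_entry *e)
{
    if (e == NULL || e->port <= 0 || e->port > 65535)
        return PEER_INCOMPLETE;
    if (e->ipv4 == NULL || e->ipv4[0] == '\0')
        return PEER_NO_ADDRESS;
    return PEER_OK;
}
```

With the values from this thread, the AP1 view (port 1026, no address) would classify as PEER_NO_ADDRESS, and the AP2 view (192.168.16.118, port 1026) as PEER_OK.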

ptpt52 commented 2 years ago

The entry on AP1 is wrong. It should contain an IP like on AP2. DAWN does not know where to connect to. Can you restart umdns on both?

true

Now I notice that it is related to the umdns config:

  1. wrong config:

    config umdns
        option jail '1'
        list network 'lan'
        list network 'meshx0'
  2. good config:

    config umdns
        option jail '1'
        list network 'meshx0'

I think config 1 should work, but it does not. @PolynomialDivision

PolynomialDivision commented 2 years ago

I think config 1 should work, but it does not.

Do you have the "lan" network available? I also had issues when using a network that is not configured on the AP.

ptpt52 commented 2 years ago

I think config 1 should work, but it does not.

Do you have the "lan" network available? I also had issues when using a network that is not configured on the AP.

No, lan is not started on boot. So is it related to this? Could umdns tolerate such a config? I think it should.

PolynomialDivision commented 2 years ago

No, lan is not started on boot. So is it related to this? Could umdns tolerate such a config?

Maybe @blogic can say something to that?

ptpt52 commented 2 years ago

@PolynomialDivision there are many cases that cause this issue. If I restart AP1, the issue happens again. This should not happen.

ptpt52 commented 2 years ago

Also, running /etc/init.d/umdns restart on AP1 causes the issue: afterwards, ubus call umdns browse on AP1 returns _dawn._tcp with no IP.

ptpt52 commented 2 years ago

On AP1, run /etc/init.d/umdns restart, or reboot AP1.

On AP2, ubus call umdns browse then returns:
...
    "_dawn._tcp": {
        "X-WRT": {
            "port": 1026
        }
    },
...

On AP2, run /etc/init.d/umdns reload; the entry then gets the IP:
...
    "_dawn._tcp": {
        "X-WRT": {
            "ipv4": "xxxxx",
            "port": 1026
        }
    },
...

What a strange issue. @PolynomialDivision @blogic

ptpt52 commented 2 years ago

@PolynomialDivision for now I have hacked a workaround for this issue: if DAWN finds an entry with no ipv4, it triggers /etc/init.d/umdns reload to fix it up.

diff --git a/src/utils/ubus.c b/src/utils/ubus.c
index 995b2f8..14d52c1 100644
--- a/src/utils/ubus.c
+++ b/src/utils/ubus.c
@@ -1310,6 +1310,8 @@ int wnm_disassoc_imminent(uint32_t id, const struct dawn_mac client_addr, struct
     return 0;
 }

+static int umdns_need_reload = 0;
+
 static void ubus_umdns_cb(struct ubus_request *req, int type, struct blob_attr *msg) {
     struct blob_attr *tb[__DAWN_UMDNS_TABLE_MAX];

@@ -1341,7 +1343,10 @@ static void ubus_umdns_cb(struct ubus_request *req, int type, struct blob_attr *
             dawnlog_debug("IPV4: %s\n", blobmsg_get_string(tb_dawn[DAWN_UMDNS_IPV4]));
             dawnlog_debug("Port: %d\n", blobmsg_get_u32(tb_dawn[DAWN_UMDNS_PORT]));
         } else {
-            return; // TODO: We're in a loop. Should this be return or continue?
+            if (!tb_dawn[DAWN_UMDNS_IPV4] && tb_dawn[DAWN_UMDNS_PORT]) {
+                umdns_need_reload = 1;
+            }
+            continue;
         }
         add_tcp_connection(blobmsg_get_string(tb_dawn[DAWN_UMDNS_IPV4]), blobmsg_get_u32(tb_dawn[DAWN_UMDNS_PORT]));
     }
@@ -1366,6 +1371,11 @@ int ubus_call_umdns() {
     blob_buf_free(&b);
     dawn_unregmem(&b);

+    if (umdns_need_reload == 1) {
+        umdns_need_reload = 0;
+        system("/etc/init.d/umdns reload");
+    }
+
     return 0;
 }

ptpt52 commented 2 years ago

on AP2:

root@X-WRT:~# lsof -ni :1026
COMMAND  PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
dawn    6858 root    8u  IPv4  993817      0t0  TCP *:1026 (LISTEN)
dawn    6858 root    9u  IPv4 1143895      0t0  TCP 192.168.16.1:40802->192.168.16.118:1026 (ESTABLISHED)
dawn    6858 root   10u  IPv4  995271      0t0  TCP 192.168.16.1:1026->192.168.16.118:41608 (ESTABLISHED)
dawn    6858 root   11u  IPv4 1078948      0t0  TCP 192.168.16.1:1026->192.168.16.118:53928 (ESTABLISHED)
dawn    6858 root   12u  IPv4 1123768      0t0  TCP 192.168.16.1:1026->192.168.16.118:49606 (ESTABLISHED)

But there is a new issue: are old DAWN TCP connections not being closed?

On AP2, I see 3 connections coming in from AP1 but only 1 connection going out to AP1.

On AP1 (just rebooted) there are 2 connections in total, one in and one out, which looks good.

ptpt52 commented 2 years ago

@PolynomialDivision

https://github.com/berlin-open-wireless-lab/DAWN/blob/e596ff131735821684f7ecea73d7634733319f94/src/network/tcpsocket.c#L404

At this line, ustream_write() in most cases just adds the data to the write buffer. The TCP connection buffer is large, so it takes a long time to fill, and the write never fails before it is full. This is why the TCP connection may never be closed.

case 1:

AP1 reboots.
AP2 keeps its TCP connection to AP1 and never closes it,
and DAWN just keeps pushing data into the buffer via ustream_write().

case 2:

AP1 reboots.
AP2 keeps the connection from AP1.

AP1 comes up again and opens a new connection to AP2, then AP1 reboots again.
AP2 then keeps this new connection from AP1 as well.

Repeat rebooting AP1.

On AP2, more and more stale connections accumulate, one for each AP1 reboot.
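
The reasoning above can be illustrated with a toy model of a buffered writer. This is only a sketch under stated assumptions: toy_stream, toy_write(), writes_until_failure() and backlog_exceeded() are hypothetical names, the model is not libubox's ustream, and the 64 KiB capacity used in the note after the code is an arbitrary example value. It shows why writes toward a peer that never reads keep succeeding until the buffer is full, and how a backlog threshold could flag such a connection earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a buffered stream writer. This is NOT libubox's ustream;
 * every name here is hypothetical and only illustrates the reasoning. */
struct toy_stream {
    size_t pending;  /* bytes queued but never delivered to the peer */
    size_t capacity; /* total buffer space before a write fails */
};

/* Queue len bytes. Returns the number of bytes accepted, or 0 when the
 * data does not fit. A dead peer never drains "pending", yet every call
 * still succeeds until the capacity is reached. */
size_t toy_write(struct toy_stream *s, size_t len)
{
    assert(s != NULL && s->pending <= s->capacity);
    if (len > s->capacity - s->pending)
        return 0;
    s->pending += len;
    return len;
}

/* Count how many writes succeed toward a peer that never reads. */
size_t writes_until_failure(struct toy_stream *s, size_t msg_len)
{
    size_t ok = 0;
    while (toy_write(s, msg_len) > 0)
        ok++;
    return ok;
}

/* Backlog check: treat the connection as stale once more than
 * limit + msg_len bytes are waiting to be sent. */
bool backlog_exceeded(size_t pending, size_t msg_len, size_t limit)
{
    return pending > limit + msg_len;
}
```

For example, with a 64 KiB buffer and 1 KiB messages, 64 writes succeed before the first failure, even though the peer has received nothing. A backlog check would flag the connection long before that point.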

ptpt52 commented 2 years ago

To fix this, I hacked again:

diff --git a/src/network/tcpsocket.c b/src/network/tcpsocket.c
index 91bc452..0abcfd7 100644
--- a/src/network/tcpsocket.c
+++ b/src/network/tcpsocket.c
@@ -364,12 +364,15 @@ void send_tcp(char *msg) {
         list_for_each_entry_safe(con, tmp, &tcp_sock_list, list)
         {
             if (con->connected) {
+                int need_close = 0;
                 int len_ustream = ustream_write(&con->stream.stream, final_str, final_len, 0);
                 dawnlog_debug("Ustream send: %d\n", len_ustream);
-                if (len_ustream <= 0) {
+                if (ustream_pending_data(&con->stream.stream, true) >= 24*1024 + final_len /* 24K buffered data is enough? */)
+                    need_close = 1;
+                if (len_ustream <= 0 || need_close) {
                     dawnlog_error("Ustream error(" STR_QUOTE(__LINE__) ")!\n");
                     //ERROR HANDLING!
-                    if (con->stream.stream.write_error) {
+                    if (con->stream.stream.write_error || need_close) {
                         ustream_free(&con->stream.stream);
                         dawn_unregmem(&con->stream.stream);
                         close(con->fd.fd);
@@ -401,12 +404,15 @@ void send_tcp(char *msg) {
         list_for_each_entry_safe(con, tmp, &tcp_sock_list, list)
         {
             if (con->connected) {
+                int need_close = 0;
                 int len_ustream = ustream_write(&con->stream.stream, final_str, final_len, 0);
                 dawnlog_debug("Ustream send: %d\n", len_ustream);
-                if (len_ustream <= 0) {
+                if (ustream_pending_data(&con->stream.stream, true) > 24*1024 + final_len /* 24K buffered data is enough? */)
+                    need_close = 1;
+                if (len_ustream <= 0 || need_close) {
                     //ERROR HANDLING!
                     dawnlog_error("Ustream error(" STR_QUOTE(__LINE__) ")!\n");
-                    if (con->stream.stream.write_error) {
+                    if (con->stream.stream.write_error || need_close) {
                         ustream_free(&con->stream.stream);
                         dawn_unregmem(&con->stream.stream);
                         close(con->fd.fd);

PolynomialDivision commented 2 years ago

@ynezz Is there an option for ustream to handle such situations?

ptpt52 commented 2 years ago

I have also added a patch to close timed-out connections: https://github.com/ptpt52/dawn/commit/b195cfab7d48c52d27b31fe36238ccadd824c66e
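
One possible shape for such a timeout check is sketched here. It is an illustration under assumptions, not the code from the linked commit: conn_activity, conn_mark_rx() and conn_timed_out() are hypothetical names, and the caller is assumed to pass in the current time and a timeout in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-connection bookkeeping; not taken from the linked
 * commit or from DAWN's sources. */
struct conn_activity {
    time_t last_rx; /* time at which the peer last sent us anything */
};

/* Record that data arrived from the peer at time "now". */
void conn_mark_rx(struct conn_activity *c, time_t now)
{
    assert(c != NULL);
    c->last_rx = now;
}

/* True when the peer has been silent for longer than timeout_s seconds,
 * in which case the caller would close and free the connection. */
bool conn_timed_out(const struct conn_activity *c, time_t now, double timeout_s)
{
    assert(c != NULL);
    return difftime(now, c->last_rx) > timeout_s;
}
```

A periodic timer could call conn_timed_out() for every entry in the connection list, so that a connection left over from a rebooted AP is dropped even when writes to it never fail.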

milindpatel63 commented 2 years ago

I had a similar issue. I am running 2 routers, a TP-Link Archer A7 and a Xiaomi WiFi Mini, connected together via 802.11s mesh. The Xiaomi one is connected to the internet via 5 GHz WiFi client mode.

I had previously enabled STP on the br-lan interface, thinking it would help prevent loops forming in the mesh, but that was the cause. After disabling STP, both my routers can see the umdns-advertised dawn, smb, ssh and other services in "ubus call umdns browse", and it is all working perfectly now. One more thing: in /etc/avahi/avahi-daemon.conf I have set enable-reflector=yes, which was no by default. I don't know if this helped, but it is all working now.

Hope this helps someone.