earlephilhower / arduino-pico

Raspberry Pi Pico Arduino core, for all RP2040 and RP2350 boards
GNU Lesser General Public License v2.1

MDNS stops responding after a couple of minutes #1267

Closed obdevel closed 1 year ago

obdevel commented 1 year ago

MDNS seems to stop responding after a couple of minutes. The sketch continues to run and responds to pings to the IP address. MDNS.isRunning() continues to return true. I've tried from both my MacBook and iPhone (on a larger program with a webserver).

My minimum test sketch:


// Pico_MDNS_test.ino

#include <WiFi.h>
#include <LEAmDNS.h>

#define SSID "xxx"
#define PASSWORD "123"
#define HOSTNAME "cbusserver"

void setup() {

  static unsigned long t1 = millis();

  Serial.begin(115200);
  while (!Serial && millis() - t1 < 5000) delay(100);

  // connect to wifi
  Serial.printf("connecting to wifi as STA\n");
  WiFi.mode(WIFI_STA);
  WiFi.begin(SSID, PASSWORD);

  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.printf(".");
  }

  Serial.printf("\n");
  Serial.printf("WiFi connected ok\n");

  // start MDNS server
  if (MDNS.begin(HOSTNAME)) {
    Serial.printf("MDNS responder started with hostname = %s.local\n", HOSTNAME);
    MDNS.addService("http", "tcp", 80);
  } else {
    Serial.printf("** failed to start MDNS responder\n");
  }

  Serial.printf("end of setup\n");
}

void loop() {

  static unsigned long t2 = millis(), counter = 0;

  // prove we're still alive
  if (millis() - t2 >= 10000) {
    t2 = millis();
    Serial.printf("%lu\n", counter++);  // counter is unsigned long, so use %lu
  }

  if (!MDNS.isRunning()) {
    Serial.printf("MDNS is not running\n");
  }

  MDNS.update();
}

I'm testing from the macOS command line with:

$ while [ true ]
do
  ping cbusserver.local -c 4
  sleep 10
done

Is there anything I'm doing wrong? Any debugging I can try?

Core version 3.0.0, Arduino IDE 1.8.19

Screenshot 2023-03-05 at 03 38 11

Thanks.

earlephilhower commented 1 year ago

I never looked into the MDNS code so this might be a tricky one. It's taken verbatim from the ESP8266 core where I think I remember a similar issue.

@d-a-v, sorry to bring you in, but didn't you just look into something similar and find it was actually per-spec?

obdevel commented 1 year ago

I tried a couple of previous versions to see if I could find a regression. Both 2.7.0 and 2.7.3 work correctly for at least 30 mins, so it seems that the regression occurred in 3.0.0, perhaps in the underlying network code or elsewhere. I'll leave the 2.7.3 test running overnight.

There is nothing in 3.0.0 that I urgently need, so I can stick with 2.7.3 for now.

earlephilhower commented 1 year ago

There was just a change to the LWIP stack which may have an impact here (but I'm not equipped to test). I'd recommend trying 3.1.0 when it comes out. I'll close this and we can revisit if needed.

dinther commented 1 year ago

In my case (running version 3.0.0), LEAmDNS doesn't stop responding. It never responded in the first place.

When LEAmDNS starts, it announces itself to the network ("Hey, I am here"). Other MDNS hosts on your network cache this information for 2 minutes. In my case it was Bonjour on Windows doing this. You can check by opening a Windows command window and running dns-sd -Q yourhostname.local. It will keep running and show domains as they are added, and after 2 minutes you can see the entry being removed. With version 2.7.3 the cache duration seems much longer.

Either way, Android will not find "yourhostname.local", because that reply must come from the matching host or from the local MDNS cache on the phone. Since the host can't hear the multicast messages and there is no local cache entry, the IP address is not found.

All your tests were probably done on the same machine, which made it appear that LEAmDNS replied. It didn't; it was Bonjour returning cached data.
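The caching behavior described above can be sketched as a small client-side cache. This is an illustration only, not Bonjour's or LEAmDNS's actual code: the `MdnsCache` class and its methods are hypothetical names, and the 120-second lifetime is simply the value reported in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical client-side cache, for illustration only.
class MdnsCache {
 public:
  // Remember an announced record until nowSeconds + ttlSeconds.
  void store(const std::string& name, const std::string& ip,
             uint32_t nowSeconds, uint32_t ttlSeconds) {
    assert(ttlSeconds > 0);  // a zero TTL would mean "forget this record"
    entries_[name] = Entry{ip, nowSeconds + ttlSeconds};
  }

  // Returns true and fills ipOut only while the record is still fresh.
  bool lookup(const std::string& name, uint32_t nowSeconds,
              std::string& ipOut) const {
    auto it = entries_.find(name);
    if (it == entries_.end() || nowSeconds >= it->second.expiresAt) {
      return false;  // unknown or expired: the client must ask the network
    }
    ipOut = it->second.ip;
    return true;
  }

 private:
  struct Entry {
    std::string ip;
    uint32_t expiresAt;
  };
  std::map<std::string, Entry> entries_;
};
```

If the startup announcement is stored at time 0 with a lifetime of 120 seconds, a lookup at 60 seconds succeeds from the cache alone. A lookup at 120 seconds or later fails unless the responder has answered a fresh query in the meantime, which is what a responder that cannot hear multicast queries never does.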

You can also run sudo tcpdump dst 224.0.0.251 and udp and ip and port 5353 on a Linux machine to see what MDNS traffic is flying around the network.
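For reference when reading such a capture, here is a minimal sketch of what a one-question MDNS query for a hostname looks like on the wire, using standard DNS message encoding. The function name `buildMdnsQuery` is hypothetical. The sketch only builds the bytes; actually sending them to 224.0.0.251 on UDP port 5353 is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Build a DNS query for an A record, as carried in an MDNS packet.
std::vector<uint8_t> buildMdnsQuery(const std::string& hostname) {
  std::vector<uint8_t> pkt = {
      0x00, 0x00,  // ID: zero for multicast queries
      0x00, 0x00,  // flags: standard query
      0x00, 0x01,  // QDCOUNT: one question
      0x00, 0x00,  // ANCOUNT
      0x00, 0x00,  // NSCOUNT
      0x00, 0x00   // ARCOUNT
  };

  // Encode "cbusserver.local" as length-prefixed labels.
  std::size_t start = 0;
  while (start < hostname.size()) {
    std::size_t dot = hostname.find('.', start);
    if (dot == std::string::npos) dot = hostname.size();
    std::size_t len = dot - start;
    assert(len > 0 && len <= 63);  // DNS label length limits
    pkt.push_back(static_cast<uint8_t>(len));
    pkt.insert(pkt.end(), hostname.begin() + start, hostname.begin() + dot);
    start = dot + 1;
  }
  pkt.push_back(0x00);  // root label terminates the name

  pkt.push_back(0x00);
  pkt.push_back(0x01);  // QTYPE: A (IPv4 address)
  pkt.push_back(0x00);
  pkt.push_back(0x01);  // QCLASS: IN
  return pkt;
}
```

A responder that is working should answer a query like this each time it is asked. In the failure described in this thread, queries of this kind would show up in the capture with no matching answer from the Pico W.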

I believe the Raspberry Pi Pico W cannot receive multicast packets, due to an issue that is way beyond my comprehension. I am so glad I found this post. My entire project depends on the ability to use MDNS.

Earle, is there a way I can test the change to the LWIP stack now?

mef51 commented 1 year ago

I had a similar experience to @dinther. I happened to update to 3.0.0 and then started adding LEAmDNS for the first time, and got no response. After downgrading to 2.7.3 I started getting a response.

obdevel commented 1 year ago

So why does name resolution continue to work after 120 secs in 2.7.3 (and earlier) but not in 3.0.0? The clients are identical in all test cases, macOS and iOS. I don't have access to Windows or Android.

earlephilhower commented 1 year ago

@dinther, thanks for the detailed explanation. I think this is all related to the new CYW43 binary blob. It appears to block all multicast MACs by default now (whereas SDK 1.4's passed them thru). LWIP isn't even made aware of them; they're just dropped inside the WiFi chip.

I've gone thru Wikipedia and found the 2 MACs that MDNS seems to use for m-cast and added manual calls to the new APIs in #1290. Can you give that a try? I've had the PicoW up for 800 seconds now and avahi-discover is still showing cbusserver for me, so AFAICT it's now working, but I am by no means an AVAHI (or Ethernet or LWIP) expert so someone else's validation would be much appreciated.

(Also, if you pull #1290 then you will get the stability-improved LWIP stack automatically, since that was merged to master. That's probably not related to this, as it was causing a complete crash or hang in certain cases, not the odd "seems like it's working but doing nothing" behavior described here.)
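The thread does not list the two multicast MACs. As a sketch of where such addresses come from, the standard mappings from IP multicast groups to Ethernet addresses can be applied to the MDNS groups 224.0.0.251 (IPv4) and ff02::fb (IPv6). The function names below are hypothetical, and this is not the code from #1290 or the CYW43 API. It only computes the addresses a filter would need to let through.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

using Mac = std::array<uint8_t, 6>;

// IPv4 multicast: 01:00:5E followed by the low 23 bits of the group address.
Mac ipv4MulticastToMac(uint32_t ip) {
  assert((ip >> 28) == 0xE);  // must be in 224.0.0.0/4
  return {0x01, 0x00, 0x5E,
          static_cast<uint8_t>((ip >> 16) & 0x7F),
          static_cast<uint8_t>((ip >> 8) & 0xFF),
          static_cast<uint8_t>(ip & 0xFF)};
}

// IPv6 multicast: 33:33 followed by the low 32 bits of the group address.
Mac ipv6MulticastToMac(const std::array<uint8_t, 16>& ip) {
  assert(ip[0] == 0xFF);  // must be in ff00::/8
  return {0x33, 0x33, ip[12], ip[13], ip[14], ip[15]};
}

// Format a MAC as colon-separated upper-case hex.
std::string macToString(const Mac& m) {
  char buf[18];
  std::snprintf(buf, sizeof(buf), "%02X:%02X:%02X:%02X:%02X:%02X",
                m[0], m[1], m[2], m[3], m[4], m[5]);
  return buf;
}
```

Passing 0xE00000FB (224.0.0.251) to `ipv4MulticastToMac` gives 01:00:5E:00:00:FB, and passing ff02::fb to `ipv6MulticastToMac` gives 33:33:00:00:00:FB. If the WiFi chip drops frames addressed to these MACs, MDNS queries never reach LWIP, which matches the symptom described above.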

dinther commented 1 year ago

I'll try, Earle. I am incredibly out of my depth here. I know how to use the Arduino IDE and that is the extent of it.

earlephilhower commented 1 year ago

No worries, @dinther. Like I said, you explained very well what was going on at the low level. Made it easy to figure out the underlying problem!

I have just done a test where I ran the test case from the initial post with the name "oldcbusserver", waited 10 mins, then turned on a PC and tried to ping "oldcbusserver.local", which failed (like you saw). I turned off the PC. Then I checked out #1290, changed the name to "newcbusserver" and waited 10 more mins. I turned on the PC and tried to ping "newcbusserver.local", and it worked.

So I think we're good. I'll merge it and 3.1 will be good to go.