adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org

mDNS stops finding peers after some time, then sometimes hard faults #6832

Open anecdata opened 2 years ago

anecdata commented 2 years ago

CircuitPython version

Adafruit CircuitPython 8.0.0-beta.0 on 2022-08-18; Adafruit Feather ESP32-S2 TFT with ESP32S2

Code/REPL

import time
import wifi
import mdns
from secrets import secrets

MDNSFINDTIMEOUT = 5

time.sleep(5)  # connect serial

# to connect in case CP8 web workflow hasn't already
if not wifi.radio.ipv4_address:
    print(f"{time.monotonic_ns()} Connecting to wifi AP ... ", end="")
    wifi.radio.connect(secrets["ssid"], secrets["password"])
    print(f"{wifi.radio.ipv4_address}")

print(f"{time.monotonic_ns()} Starting mDNS server ...")
m = mdns.Server(wifi.radio)

while True:
    print(f"{time.monotonic_ns()} Finding mDNS hosts from {wifi.radio.ipv4_address} ...")
    for service in m.find(service_type="_circuitpython", protocol="_tcp", timeout=MDNSFINDTIMEOUT):
        print(f"{time.monotonic_ns()} {service.service_type} {service.protocol} {service.port} {service.hostname} {service.instance_name}")
    time.sleep(15)

Behavior

The loop displays findings for anywhere from 5-10 minutes to an hour, then finds no results in any subsequent loop, despite still being connected to wifi. Control-C exits, but a Control-D reload either runs and still finds no hosts, or triggers a hard fault. Sometimes it hard faults by itself after some iterations of finding no hosts.

Regression test with:

Adafruit CircuitPython 7.3.2 on 2022-07-20; Adafruit Feather ESP32-S2 TFT with ESP32S2

yields similar behavior.

Not sure if this is related to #6186.

Description

No response

Additional information

Optionally: add a deinit to Server to allow user code to deinit / reinit the mDNS server to work around some issues.
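
A minimal sketch of how user code might use such a deinit, assuming mdns.Server exposes a deinit() method as proposed above. The helper name scan_with_reinit, the make_server factory, and the max_empty threshold are hypothetical and chosen only for illustration; the server is passed in as a factory so the logic can be exercised off-device.

```python
def scan_with_reinit(make_server, scans, max_empty=3, **find_kwargs):
    """Run `scans` find() calls, rebuilding the server after repeated empty scans.

    make_server: zero-argument callable returning an object with find() and
    deinit(). On a board this might be `lambda: mdns.Server(wifi.radio)`
    (assumption: deinit() exists as proposed in this issue).
    Returns (list of per-scan result lists, number of reinits performed).
    """
    server = make_server()
    empty_streak = 0
    reinits = 0
    results = []
    for _ in range(scans):
        found = list(server.find(**find_kwargs))
        results.append(found)
        if found:
            empty_streak = 0
            continue
        empty_streak += 1
        if empty_streak >= max_empty:
            # Too many consecutive empty scans: tear down and start over.
            server.deinit()
            server = make_server()
            reinits += 1
            empty_streak = 0
    return results, reinits
```

On a board this might be called as scan_with_reinit(lambda: mdns.Server(wifi.radio), scans=100, service_type="_circuitpython", protocol="_tcp", timeout=5), again assuming deinit() is available. Whether a reinit actually clears the stalled state is untested here.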

DavePutz commented 2 years ago

@anecdata - I was able to reproduce this issue, but what I saw on a network analyzer is that the mDNS queries were still being sent while the mDNS server (which I had running on a separate ESP32) stopped responding. What are you using for an mDNS server? Also, did you run any Web Workflow activity during your testing? I did, and it caused a hard crash.

anecdata commented 2 years ago

@DavePutz I'll go back and verify, but I believe I tested this with web workflow on and off. I may be misunderstanding something, but there is no mDNS server other than the mdns.Server(wifi.radio) which is used to do the network queries. I have a number of other devices with web workflow running (and I had one with manual pre-web-workflow mDNS running), and they should all, in theory, show up in the .find listing (and often do, until the .find starts coming back empty).

Other behaviors I see when scanning mDNS for _circuitpython _tcp are

The amount of time or number of scans varies before an issue arises.

P.S. I've put connect() in the loop so that before each scan, there is validation that the device is still connected to an AP / has an IPv4.

P.P.S. Yes, just ran the scan on a device that is not running web workflow, and it failed after several minutes with an OSError -2. Another failed after several minutes with ConnectionError: No network with that ssid. BTW, ConnectionError: No network with that ssid is an exception I very rarely see other than this. I have other devices running in this area, one continually doing wifi scans for APs, and they can all connect and show good RSSI to the nearby APs.

dhalbert commented 1 year ago

Actions: look at ESP-IDF issues. Test with Pico W as well.

anecdata commented 1 year ago

Running mDNS finder now on Pico W for comparison:

Adafruit CircuitPython 8.0.0-beta.4-68-g6e40949f6 on 2022-12-02; Raspberry Pi Pico W with rp2040

No crashes yet, but two differences in results:

Addendum: still going strong after running overnight (not surprising since the code / SDK are so different)

tannewt commented 1 year ago

Do we think there should be identical duplicate behavior?

anecdata commented 1 year ago

raspberrypi: more on mDNS duplicates (etc.) in issue #7326

tannewt commented 1 year ago

  • there's a debug message remaining in raspberrypi: found service 0x********

I fixed this in #7445

I'm looking at the reliability issue now.

tannewt commented 1 year ago

I tried to reproduce this on an ESP32-S3 USB OTG but after 30 minutes it was still finding the other CP device. Any idea how many total results you got before it crashed? Maybe we're leaking them. Would you mind testing with a DEBUG build to get the backtrace? Thanks!
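
One way to answer the "how many total results" question is to keep a running count across scans. The following is a rough sketch, not a tested diagnostic: count_results and its find argument are hypothetical names, and the free-memory reading relies on gc.mem_free(), which exists on CircuitPython but not on desktop Python. It only observes the Python heap, so a leak inside the underlying SDK would not necessarily show up in it.

```python
import gc


def count_results(find, scans):
    """Call find() `scans` times and tally how many results came back.

    find: zero-argument callable returning an iterable of results
    (hypothetical wrapper; on a board it might wrap m.find(...)).
    Returns (total results, per-scan counts, per-scan free-memory readings).
    """
    # gc.mem_free is available on CircuitPython; fall back to None elsewhere.
    mem_free = getattr(gc, "mem_free", None)
    total = 0
    per_scan = []
    free = []
    for _ in range(scans):
        n = sum(1 for _result in find())
        total += n
        per_scan.append(n)
        gc.collect()  # collect first so the reading reflects live objects only
        free.append(mem_free() if mem_free else None)
    return total, per_scan, free
```

A steadily falling free-memory reading alongside a growing total would be consistent with results leaking on the Python side; a flat reading would not rule out a leak elsewhere.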

anecdata commented 1 year ago

I'd guess something on the order of half a dozen results per 15-second batch in the loop, sometimes more, sometimes less.

I haven't done much with mDNS recently, but I didn't see it during testing of "Share the web workflow MDNS object with the user" and other recent mDNS changes. I'll try first to just set it up and see if it's still happening. If it is, I can queue it up after 7459, the ESP32-S2 safe mode issue.

tannewt commented 1 year ago

FWIW my test ran two and a half hours and kept finding other devices.

anecdata commented 1 year ago

I loaded up modified test code from above (mostly a more robust connect, and bumped the mDNS timeout to 10 seconds) onto an S2 TFT with Adafruit CircuitPython 8.0.0-beta.6-44-g936ecdd2b on 2023-01-18:

import time
import traceback
import wifi
import mdns
from secrets import secrets

MDNSFINDTIMEOUT = 10

def connect():
    while not wifi.radio.ipv4_address:
        try:
            wifi.radio.connect(secrets["ssid"], secrets["password"])
        except ConnectionError as e:
            traceback.print_exception(e, e, e.__traceback__)
            time.sleep(1)

    # time.sleep(0.100)  # Pico W wifi.radio.ipv4_address can lag wifi.radio.connect by tens of ms
    time.sleep(1)  # ap_info takes a moment to be valid
    rssi = None
    if hasattr(wifi.radio, "ap_info") and wifi.radio.ap_info.rssi:
        rssi = wifi.radio.ap_info.rssi
    return wifi.radio.ipv4_address, rssi

time.sleep(2)  # wait for serial
print(f"{'='*25}")
print(f"{time.monotonic_ns()} Starting mDNS server")
m = mdns.Server(wifi.radio)

while True:
    print(f"{time.monotonic_ns()} Finding mDNS hosts from {connect()}")
    for service in m.find(service_type="_circuitpython", protocol="_tcp", timeout=MDNSFINDTIMEOUT):
        print(f"{time.monotonic_ns()} {service.service_type} {service.protocol} {service.port} {service.hostname} {service.instance_name}")
    time.sleep(15)

Also loaded up Adafruit CircuitPython 8.0.0-beta.6-44-g936ecdd2b on 2023-01-18 onto 4 QT Py S2 and 4 Pico W, web workflow enabled, no code.py.

No safe mode observed, just mDNS quirks that seem a little beyond UDP unreliability:

So I think we can close this issue? If safe mode or extended no-results arise again, this or a new issue can be opened. And leave the quirks and platform differences to future testing.

anecdata commented 1 year ago

The code did eventually start looping with

Traceback (most recent call last):
  File "code.py", line 12, in connect
ConnectionError: No network with that ssid

So maybe something is getting messed up in wifi-land, but it's recoverable with a reload.
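
Since a reload recovers from this state, one possible stopgap is to give up after a bounded number of failed connects and request a reload. This is only a sketch under assumptions: connect_or_reload, max_failures, and delay are hypothetical names, and the connect and reload callables are injected so the logic can run off-device. On a board, reload would presumably be supervisor.reload.

```python
import time


def connect_or_reload(connect, reload, max_failures=5, delay=1.0, sleep=time.sleep):
    """Try connect() up to max_failures times; call reload() if every attempt fails.

    connect: zero-argument callable that raises ConnectionError on failure.
    reload: zero-argument callable invoked once after the final failure.
    Returns True on a successful connect, False if reload() was requested.
    """
    for attempt in range(max_failures):
        try:
            connect()
            return True
        except ConnectionError as e:
            print(f"connect attempt {attempt + 1} failed: {e}")
            sleep(delay)
    reload()
    return False
```

On a board this might look like connect_or_reload(lambda: wifi.radio.connect(secrets["ssid"], secrets["password"]), supervisor.reload). It works around the symptom only and says nothing about why the connect starts failing.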

tannewt commented 1 year ago

So I think we can close this issue? If safe mode or extended no-results arise again, this or a new issue can be opened. And leave the quirks and platform differences to future testing.

I'm not sure we need to close, just re-milestone it.

I am out of stamina for debugging MDNS for the time being.

anecdata commented 1 year ago

Good plan.