adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.09k stars 1.21k forks source link

raspberrypi 8.0.0 socket exceptions when web workflow is enabled #7333

Closed anecdata closed 1 year ago

anecdata commented 1 year ago

CircuitPython version

Adafruit CircuitPython 8.0.0-beta.5 on 2022-12-08; Raspberry Pi Pico W with rp2040

Code/REPL

import time
import traceback
import wifi
import socketpool
import ssl
import adafruit_requests
from secrets import secrets

TEXT_URL = "http://wifitest.adafruit.com/testwifi/index.html"

print("Connecting...")
wifi.radio.connect(secrets["ssid"], secrets["password"])
print("My IP address is", wifi.radio.ipv4_address)

pool = socketpool.SocketPool(wifi.radio)
requests = adafruit_requests.Session(pool, ssl.create_default_context())

while True:
    try:
        print("Fetching text from", TEXT_URL)
        response = requests.get(TEXT_URL)
        print("-" * 40)
        print(response.text)
        print("-" * 40)
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
    time.sleep(5)

Behavior

The above code, slightly modified boilerplate internet connect code, does not work on raspberrypi 8.0.0-beta.5 when web workflow is enabled (wifi credentials in .env file).

Consistently gets exception:

Traceback (most recent call last):
  File "code.py", line 139, in get_http
  File "adafruit_requests.py", line 718, in post
  File "adafruit_requests.py", line 679, in request
OutOfRetries: Repeated socket failures

It seems that the timeout in requests is kicking in within the library. No data is received. Occasionally ETIMEDOUT.

Underlying exceptions to the OutOfRetries occur (discovered through some print-debugging in adafruit_requests) if a manual timeout is supplied. Exceptions are initially OSError: [Errno 116] ETIMEDOUT in recv_into, waiting for the H of HTTP to kick off the receive. Then OSError: 32 in _send takes over on later requests. But again, sometimes there is no error, just no data received.

I suspect some low-level socket shenanigans. There should be 8 sockets available, web workflow uses at least one for the TCP listen, probably something for mDNS, and more with client(s) accessing web workflow features?

Interestingly If requests alternate between two URLs, after several failures for both, the second will succeed once. Rinse, repeat.

No issue with the above code on 8.0.0-beta.4, or on 8.0.0-beta.5 with web workflow disabled (no .env file).

Description

No response

Additional information

No response

tannewt commented 1 year ago

Web workflow uses 1 socket for mdns, 1 socket to accept connections with a buffer of 1, 1 "active" socket for web requests and 1 socket for the websocket connection. So, 4 or 5 depending on how the pending connection is handled.

anecdata commented 1 year ago

On some URLs, with web workflow off, things appear to work. But with the added print-debugs to Requests to expose the pass'd and retried exceptions, it looks like on _send_request there is either OSError: [Errno 9] EBADF (in my case, for an mDNS address) or OSError: 32 [EPIPE?] (in my case, for a numeric local IPv4) on every other try, but the retry gets it, so nothing is reported with the standard library.

This happens in beta.4 and beta.5. So something has been lurking there that may be related to the web workflow beta.5 case.

anecdata commented 1 year ago

Forgot to mention that with web workflow on raspberrypi beta.5, when using the above code, I've tried several URLs. With the Adafruit URL, I don't know what the server sees, but none are successful. When I hit a WAN Apache server I can access (using domain name), the server shows HTTP 200 status codes and occasionally some are successful, but usually error as described above. But a LAN Apache server (using IP address) shows HTTP 408 status codes (server timed out). So it looks like some level of connection and data transfer is made, but the reads tend to fail in most circumstances.

anecdata commented 1 year ago

Did some testing on espressif for comparison: none of the exceptions occur with beta.4 or beta.5 on QT Py ESP32-S2 or QT Py ESP32-S3, including none of the "hidden" exceptions pass'd through Requests' retries.

However, with some URLs, the first try in the _send_request Requests loop gets an empty response (status code 200 on the servers in all observable cases) but the retry is successful.

Addendum: In another espressif code context (QT Py ESP32-S2 running beta.5, requests timeout set to 5 seconds), the first request in _send_request gets OSError: Unhandled ESP TLS error 78 0 8018 -78 every time, but the internal Requests retry is successful.

It does seem that there are several issues at play here, but the most immediate is raspberrypi not playing well with beta.5 and web workflow.

jepler commented 1 year ago

I'm testing with

Adafruit CircuitPython 8.0.0-beta.5-11-gfc13fba6e5-dirty on 2022-12-14; Raspberry Pi Pico W with rp2040

this contains some fixes to make a wider range of socket name lookup failures throw the "gaierror" exception as intended, but should be functionally identical to main.

wifi workflow is enabled. However, I have not connected to the web workflow interface or messed with resolving the device's mdns name from other computers.

To exclude as many factors as possible, I did not use adafruit_requests, but just wrote low level tests of getaddrinfo and socket.connect. all code was run by import from code.py, not sure why it would make a difference to have the import, it's just the way I quickly switch between pieces of test code.

import socketpool
import wifi

if wifi.radio.ipv4_address is None:
    print("connecting to wifi")
    wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv('WIFI_PASSWORD'))
print(f"local address {wifi.radio.ipv4_address}")

socket = socketpool.SocketPool(wifi.radio)

host_names = ['eric.local', 'rat.local', 'jo.local', 'example.com', 'unpythonic.net', 'github.com']

def cycle(*args):
    while True:
        for a in args:
            yield from a

print("Name resolution test")
success = failure = 0
for i, h in zip(range(100), cycle(host_names)):
    try:
        info = socket.getaddrinfo(h, 80)
        print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno')}")
        failure += 1

print(f"Over {success+failure} attempts, {success} success {failure} failure")

print()
print("Causing a failure, should give gaierror")
try:
    print(socket.getaddrinfo('this-should.fail', 80))
except Exception as e:
    print(f"{type(e)}: {e} {getattr(e, 'errno')}")

print("TCP connection test")
target = ('rat.u', 443) # a local computer running a web server

success = failure = 0
for i in range(100):
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.connect(target)
            print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1

print(f"Over {success+failure} attempts, {success} success {failure} failure")

print()
print("SSL connection test")
time.sleep(.1)
ctx = ssl.create_default_context()
success = failure = 0
for i in range(10): 
    try:
        with ctx.wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            print(f"{s=}")
            time.sleep(.1)
            s.connect(target)
            s.send('HEAD / HTTP/1.1\r\nHost: zaphod.unpythonic.net\r\n\r\n')
            read_response(s)
            print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1

print()
print(f"Over {success+failure} attempts, {success} success {failure} failure")

Unfortunately, this doesn't reproduce any problems for me. Typical output:

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
local address 10.0.2.94
Name resolution test
....................................................................................................
Over 100 attempts, 100 success 0 failure

Causing a failure, should give gaierror
socket_resolve_host() returned -2
<class 'gaierror'>: (-2, 'Name or service not known') -2
TCP connection test
....................................................................................................
Over 100 attempts, 100 success 0 failure

SSL connection test
s=<SSLSocket>
b'HTTP/1.1 200 OK\r\nDate: Wed, 14 Dec 2022 21:24:14'
b' GMT\r\nServer: Apache/2.4.54 (Debian)\r\nLast-Modif'
b'ied: Sat, 08 Jul 2017 15:52:04 GMT\r\nETag: "29cd-'
b'553d054d154ae"\r\nAccept-Ranges: bytes\r\nContent-Le'
b'ngth: 10701\r\nVary: Accept-Encoding\r\nContent-Type'
b': text/html\r\n\r\n'
…repetitions snipped…
Over 10 attempts, 10 success 0 failure

Code done running.
RetiredWizard commented 1 year ago

I tried running the original test script on the Pico W running the 12/8 beta.5 bits from s3 and after about 5 minutes I got the "OutOfRetries: Repeated socket failures" message.

I then flashed the latest bits from github and replaced my .env file with a settings.toml file and ran the test script which results in the success message being displayed repeatedly. I've run it now for several minutes without any failures.

RetiredWizard commented 1 year ago

This is probably expected behavior but I hadn't noticed it before. Prior to running this script the Pico W shows up in the web workflow list of "CircuitPython devices on your network" from another idle device, but while the script is running the device doesn't show up in the list.....

jepler commented 1 year ago

This is probably expected behavior but I hadn't noticed it before. Prior to running this script the Pico W shows up in the web workflow list of "CircuitPython devices on your network" from another idle device, but while the script is running the device doesn't show up in the list.....

No, I think it would be expected to still show up. @tannewt can you say for sure?

tannewt commented 1 year ago

I'd expect it to show up. However, mdns does use udp so it may be missed occasionally due to that.

RetiredWizard commented 1 year ago

I did check it out on a Feather ESP32-S3 and didn't see the same behavior so I opened an issue up on it #7346

anecdata commented 1 year ago

I just tested this with latest S3 Adafruit CircuitPython 8.0.0-beta.5-22-ge9f032f46 on 2022-12-20; Raspberry Pi Pico W with rp2040 and .env converted to settings.toml.

This works now with web workflow enabled. (Requests is still often getting an OSError: 32 on the first, hidden, try, but the internal retry is successful.)

Does anyone know what was broken in beta.5, and what fixed it?

anecdata commented 1 year ago

Adafruit CircuitPython 8.0.0-rc.2 on 2023-02-02; Raspberry Pi Pico W with rp2040

Re-opening based on: https://discord.com/channels/327254708534116352/537365702651150357/1071240927936512092 https://github.com/adafruit/Adafruit_CircuitPython_Requests/issues/126 https://discord.com/channels/327254708534116352/537365702651150357/1071516251693777009 It's very odd that when the issue occurs, it persists across microcontroller.reset() and power cycles. Either something is affecting flash(?), or perhaps some non-volatile state in the wifi module(??).

anecdata commented 1 year ago

I suspect there may be some marginal timing (perhaps variable) where (TLS?) sockets aren't ready at code start, and the perception of persistence across power / reset is just the code not working. But once it doesn't work, it will continue to not work until some delay is added and the device is reset.

This works after a reset and ongoing, but if STARTUP_WAIT = 0 it will fail with the OSError: (-29312, 'MBEDTLS_ERR_SSL_CONN_EOF').

import wifi
import time
import socketpool
import ssl

STARTUP_WAIT = 5
ITERATIONS = 10
HOST = "example.com"
PATH = "/"
PORT = 443
MAXBUF = 4096

time.sleep(STARTUP_WAIT)  # wait for serial (and maybe wait for sockets)

print(f"{'='*25}")
print("Web Workflow enabled - already connected to AP")

pool = socketpool.SocketPool(wifi.radio)

for _ in range(ITERATIONS):
    print(f"{'-'*25}\nCreate TCP Client Socket")
    with pool.socket(pool.AF_INET, pool.SOCK_STREAM) as sock:
        s = ssl.create_default_context().wrap_socket(sock, server_hostname=HOST)

        print("Connecting")
        s.connect((HOST, PORT))

        size = s.send(f"HEAD {PATH} HTTP/1.1\r\nHost: {HOST}:{PORT}\r\n\r\n".encode())
        print("Sent", size, "bytes")

        # just get the first hunk and call it a day
        buf = bytearray(MAXBUF)
        size = s.recv_into(buf)
        print('Received', size, "bytes", buf[:size])

Indeed, even @jepler's code behaves the same way if the TLS block is attempted first:

EXPAND... ```py import time import socketpool import wifi import ssl time.sleep(0) # wait for serial (and maybe wait for sockets) print(f"{'-'*25}") if wifi.radio.ipv4_address is None: print("connecting to wifi") wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv('WIFI_PASSWORD')) print(f"local address {wifi.radio.ipv4_address}") socket = socketpool.SocketPool(wifi.radio) print() print("SSL connection test") time.sleep(.1) ctx = ssl.create_default_context() success = failure = 0 for i in range(10): try: with ctx.wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s: print(f"{s=}") time.sleep(.1) s.connect(("example.com", 443)) s.send('HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n') buf = bytearray(4096) size = s.recv_into(buf) print('Received', size, "bytes", buf[:size]) success += 1 except Exception as e: print(f"{type(e)}: {e} {getattr(e, 'errno', None)}") failure += 1 print() print(f"Over {success+failure} attempts, {success} success {failure} failure") ```

Addendum: The issue does not appear to be TLS-specific. The original example fails as well after a reset without a delay:

EXPAND... ```py import time import traceback import wifi import socketpool import ssl import adafruit_requests TEXT_URL = "http://wifitest.adafruit.com/testwifi/index.html" STARTUP_WAIT = 0 time.sleep(STARTUP_WAIT) # wait for serial (and maybe wait for sockets) print(f"{'='*25}") print("Web Workflow enabled - already connected to AP") pool = socketpool.SocketPool(wifi.radio) requests = adafruit_requests.Session(pool, ssl.create_default_context()) while True: try: print("Fetching text from", TEXT_URL) response = requests.get(TEXT_URL) print("-" * 40) print(response.text) print("-" * 40) except Exception as e: traceback.print_exception(e, e, e.__traceback__) time.sleep(5) ```

Perhaps related to #7313 (thanks, @Neradoc) - one second may not be enough.

dhalbert commented 1 year ago

@jepler Is there something we need to wait for when starting up the co-processor?

Sounds like same problem reported here: https://forums.adafruit.com/viewtopic.php?t=198486 (noted by mgmt). I haven't caught up on discord this weekend, so maybe this is all already known and linked.

anecdata commented 1 year ago

hyde00001 confirmed on Discord just now that a delay at the start of the code works as above, once the board is then reset.

jepler commented 1 year ago

I have not seen a documented need for such a delay.

I do not understand how to check the latest reproducer script to see the resulting exception, since without a delay I won't be able to connect and see the printed exception.

What exception does the non-tls version produce?

anecdata commented 1 year ago

No exception, it (non-tls) just times out with zero bytes received. Try your TLS loop but with no TLS right at the top of code, no delay, right after a reset. You'll miss the first prints, but subsequent loops will print.

This works with 5 seconds delay after reset, but not with 0 seconds delay after reset:

EXPAND... ```py import time import socketpool import wifi import ssl import traceback time.sleep(5) # wait for serial (and maybe wait for sockets) print(f"{'='*25}") print("Web Workflow enabled - already connected to AP") socket = socketpool.SocketPool(wifi.radio) print() print("non-SSL connection test") time.sleep(.1) # ctx = ssl.create_default_context() success = failure = 0 for i in range(10): try: with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: print(f"local address {wifi.radio.ipv4_address}") print(f"{s=}") time.sleep(.1) s.connect(("wifitest.adafruit.com", 80)) # "http://wifitest.adafruit.com/testwifi/index.html" s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n') buf = bytearray(1024) size = s.recv_into(buf) print('Received', size, "bytes", buf[:size]) success += 1 except Exception as e: traceback.print_exception(e, e, e.__traceback__) print(f"{type(e)}: {e} {getattr(e, 'errno', None)}") failure += 1 print() print(f"Over {success+failure} attempts, {success} success {failure} failure") ```
anecdata commented 1 year ago

It may not be possible to reproduce it in MicroPython since it's really more like the non-web-workflow case in CircuitPython (where it's likely that the time to set up and connect to the AP is enough). This MicroPython code without a delay (rp2-pico-w-20230203-unstable-v1.19.1-852-g9ea64a36a.uf2) works across resets:

EXPAND... ```py import time ITERATIONS = 10 # Connect to network import network wlan = network.WLAN(network.STA_IF) wlan.active(True) wlan.connect('ssid', 'password') # Should be connected and have an IP address wlan.status() # 3 == success wlan.ifconfig() # Get IP address for destination import socket ai = socket.getaddrinfo("wifitest.adafruit.com", 80) addr = ai[0][-1] for _ in range(ITERATIONS): # Create a socket and make a HTTP request s = socket.socket() s.connect(addr) # s.send(b"GET / HTTP/1.0\r\n\r\n") s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n') # Print the response print(s.recv(512)) s.close() time.sleep(1) time.sleep(10) import machine machine.reset() ```
dhalbert commented 1 year ago

I have not seen a documented need for such a delay.

Sorry - I missed that is with the web workflow only.

dhalbert commented 1 year ago

import wifi has the side effect of calling common_hal_wifi_init(). If you move the delay to before the import wifi, does it affect what happens?

anecdata commented 1 year ago

Interesting. I didn't expect that to behave the same, but it did. The issue appears more fundamental than wifi init.

This code works with a 5-second delay, but not with 0 seconds (again non-SSL version times out with 0 bytes received ; SSL version gets the (-29312, 'MBEDTLS_ERR_SSL_CONN_EOF') exception):

EXPAND... ```py import time time.sleep(5) # wait for serial and cyw43 print(f"{'='*25}") print("Web Workflow enabled - already connected to AP") import wifi import socketpool # import ssl import traceback socket = socketpool.SocketPool(wifi.radio) print() print("non-SSL connection test") time.sleep(.1) # ctx = ssl.create_default_context() success = failure = 0 for i in range(10): try: with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: print(f"local address {wifi.radio.ipv4_address}") print(f"{s=}") time.sleep(.1) s.connect(("wifitest.adafruit.com", 80)) # "http://wifitest.adafruit.com/testwifi/index.html" s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n') buf = bytearray(1024) size = s.recv_into(buf) print('Received', size, "bytes", buf[:size]) success += 1 except Exception as e: traceback.print_exception(e, e, e.__traceback__) print(f"{type(e)}: {e} {getattr(e, 'errno', None)}") failure += 1 print() print(f"Over {success+failure} attempts, {success} success {failure} failure") ```

I guess that makes sense though, even microcontroller.reset() (which is how I typically test this) presumably uses some common hal or driver function to reset the rp2040, which also presumably causes some reset of the cyw43 module.

DavePutz commented 1 year ago

Clearly some timing issue here, though I cannot see where. Putting a "mp_hal_delay_ms(1500);" at the top of common_hal_socketpool_socket() in ports/raspberrypi/common-hal/socketpool/Socket.c makes everything work, but that's quite a hack.

anecdata commented 1 year ago

Tested. Fixed by #7589. Thanks, @gneverov!