adafruit / Adafruit_CircuitPython_Requests

Requests-like interface for web interfacing
MIT License
51 stars 36 forks source link

Repeated socket failures after ENOTCONN #63

Closed anecdata closed 3 years ago

anecdata commented 3 years ago

FeatherS2 with ESP32S2 6.2.0-beta.0 on 2021-01-22 adafruit-circuitpython-bundle-6.x-mpy-20210122

This is intermittent, so it's hard to demonstrate, but after only two of these errors (I would have expected the threshold to be four) mixed among successful HTTPS GETs:

Traceback (most recent call last):
  File "code.py", line 621, in http_get
  File "adafruit_requests.py", line 595, in get
  File "adafruit_requests.py", line 564, in request
  File "adafruit_requests.py", line 562, in request
  File "adafruit_requests.py", line 477, in _send_request
  File "adafruit_requests.py", line 468, in _send
  File "adafruit_requests.py", line 466, in _send
OSError: [Errno 128] ENOTCONN

all subsequent GET attempts yield:

Traceback (most recent call last):
  File "code.py", line 621, in http_get
  File "adafruit_requests.py", line 595, in get
  File "adafruit_requests.py", line 559, in request
  File "adafruit_requests.py", line 454, in _get_socket
RuntimeError: Repeated socket failures

ENOTCONN arises quickly, <0.005 seconds after the GET.

Only workaround found so far is to reload.

jerryneedell commented 3 years ago

@anecdata Do you have a time.sleep() in your code? see https://github.com/adafruit/circuitpython/issues/4061

hierophect commented 3 years ago

@anecdata how rapidly are you making the send calls? I recently did some work to make the internals of Socket non-blocking, which might have resulted in some mis-identified errors if calls were made too fast. I'm still wrangling a bit with the error conventions of the Socket module, so I'm afraid I'm not sure how to go about addressing this problem without further info.

tannewt commented 3 years ago

Maybe Requests isn't catching the ENOTCONN and needs to try and recreate the socket as a result.

anecdata commented 3 years ago

@jerryneedell @hierophect There are no time.sleep() calls in the main loop of my 'production' code. When it starts up, there are two GET requests somewhat in sequence (5-10 seconds apart), but after that, it's minutes between GET requests. The ENOTCONN has arisen even after successful startup sequence, seemingly randomly in later loops, and after any two ENOTCONN it's out of business until reload. I'll try to gather more data.

anecdata commented 3 years ago

Just a small tidbit of information... I put some code in before every network action that tests for connection by checking wifi.radio.ipv4_address (and connects if necessary). wifi.radio.ipv4_address should be None after a connection drops, it's the best check I've found to passively test connection to the AP without risking an exception (e.g., wifi.radio.ap_info.).

Looks like when I get ENOTCONN either time, the connection still shows as active right before the GET:

Connected? True
Traceback (most recent call last):
  File "code.py", line 639, in http_get
  File "adafruit_requests.py", line 595, in get
  File "adafruit_requests.py", line 564, in request
  File "adafruit_requests.py", line 562, in request
  File "adafruit_requests.py", line 477, in _send_request
  File "adafruit_requests.py", line 468, in _send
  File "adafruit_requests.py", line 466, in _send
OSError: [Errno 128] ENOTCONN
Try GET #2 timeout 2s...
Connected? True
2021-01-26T11:08:04-06:00 200 bytearray(b'OK') 1.385s
hierophect commented 3 years ago

@anecdata as noted in the other thread, this is almost certainly related to Send() erroneously reporting this error as ENOTCONN and is actually probably a non blocking or other timing problem. I'd be updating in that other thread as I work on it since I don't think it has anything to do with Requests.

anecdata commented 3 years ago

@hierophect OK, thanks, just ruling out disconnection. I have a row of FeatherS2 boards with 6.1, 6.2, etc. so testing changes will be easy.

hierophect commented 3 years ago

@anecdata great, thanks a bunch for helping out with all this. Proper error handling and socket closing is still my biggest struggle on the low level.

anecdata commented 3 years ago

Behavior is changed, but still present. Using today's artifacts from #4049 commits (6.2.0-beta.1-29-gaf92db28b on 2021-02-01) on multiple FeatherS2 with ESP32S2, I get this intermittently after some successful GETs:

Traceback (most recent call last):
  File "code.py", line 638, in http_get
  File "adafruit_requests.py", line 595, in get
  File "adafruit_requests.py", line 564, in request
  File "adafruit_requests.py", line 562, in request
  File "adafruit_requests.py", line 477, in _send_request
  File "adafruit_requests.py", line 468, in _send
  File "adafruit_requests.py", line 466, in _send
OSError: Failed SSL handshake

Then, after any two of those (successful GETs in between):

Traceback (most recent call last):
  File "code.py", line 638, in http_get
  File "adafruit_requests.py", line 595, in get
  File "adafruit_requests.py", line 559, in request
  File "adafruit_requests.py", line 436, in _get_socket
  File "adafruit_requests.py", line 433, in _get_socket
RuntimeError: Out of sockets
hierophect commented 3 years ago

I would need to see your code to debug this any further. Based on the other issue, SSL failure can occur due to a timeout, but potentially from other problems as well. Running out of sockets will happen anytime you do not close a socket properly. I can track where these occur with LWIP_DEBUG and mp_printf/gdb but I'd need my hands on the code, or a subset of it that makes the issue occur (I don't get it with jerryn's anymore). My own test scripts also don't produce errors.

hierophect commented 3 years ago

Note that we haven't fixed Requests yet, which doesn't properly close the socket unless you make the modifications I discussed in the other thread, so maybe the out of sockets problem related to that?

anecdata commented 3 years ago

@hierophect Can you please point to the Requests modification you're referring to? I'm tracking about 20 networking issues and PRs right now, and I'm drawing a blank.

I'll try to get a minimal example to demo this issue.

hierophect commented 3 years ago

@anecdata Oh sure, sorry. It's number two on this list, https://github.com/adafruit/circuitpython/issues/4061#issuecomment-770274825, broken into this issue: adafruit/Adafruit_CircuitPython_Requests#67. An except that only catches _SendFailed needs to be made more generic so that sockets are closed properly.

anecdata commented 3 years ago

If I modify Requests to also catch OSError along with _SendFailed, I don't get the RuntimeError: Out of sockets. I don't know if this is the right solution generally (for example, are there OSErrors that we shouldn't catch).

It looks like now OSError: Failed SSL handshake happens in my code when there have been no GET requests for several minutes. But then it's immediately retried and works fine until the next time.

anecdata commented 3 years ago

Side note: something is now broken with HTTP?:

import sys
import time
import socketpool
import ssl
import wifi
import secrets
import adafruit_requests as requests
from secrets import secrets

wifi.radio.connect(secrets["ssid"], secrets["password"])
print("RSSI", wifi.radio.ap_info.rssi)

socket = socketpool.SocketPool(wifi.radio)
http = requests.Session(socket, ssl.create_default_context())

URL = "http://wifitest.adafruit.com/testwifi/index.html"  # doesn't work
# URL = "http://httpbin.org/get"  #  doesn't work
# URL = "https://httpbin.org/get"  # works

while True:
    print("-" * 40)
    try:
        response = http.get(URL)
    except (OSError) as e:
        sys.print_exception(e)
    except (RuntimeError) as e:
        sys.print_exception(e)
        # unrecoverable, even with reload or reset:
        """
        Traceback (most recent call last):
          File "code.py", line 26, in <module>
          File "code.py", line 24, in <module>
          File "/lib/adafruit_requests.py", line 597, in get
          File "/lib/adafruit_requests.py", line 560, in request
          File "/lib/adafruit_requests.py", line 429, in _get_socket
        RuntimeError: Sending request failed
        """
    try:
        print(response.status_code, response.reason)
        trash = response.text
        trash = None
        response.close()
    except (NameError) as e:
        sys.print_exception(e)
    time.sleep(5)

Addendum: I realized the right call for HTTP I think should be: http = requests.Session(socket) but the same exception trace occurs.

anecdata commented 3 years ago

OK, I'm finally able to reproduce it with a public URL. I've modified adafruit_requests as described two comments up, and also added sys.print_exception(e). It appears that some state is retained from the prior request, and that is interfering with the subsequent request in a way that manifests as a Failed SSL handshake:

import sys
import time
import socketpool
import ssl
import wifi
import secrets
import adafruit_requests as requests
from secrets import secrets

# issue seems correlated to longer response bodies
URL = "https://api.coindesk.com/v1/bpi/currentprice.json"
# issue seems correlated to lack of a {"connection": "keep-alive"} header
#     ...hard to find in the wild, so force no keep-alive
HEADERS = {"Connection": "close"}
# delay to trigger the issue varies depending on server / response
#     ...in the coindesk case, no delay is necessary
#     ...in another case, a delay of >= 32 seconds will consistently trigger the issue
DELAY = 0

wifi.radio.connect(secrets["ssid"], secrets["password"])
print("RSSI", wifi.radio.ap_info.rssi)

socket = socketpool.SocketPool(wifi.radio)
https = requests.Session(socket, ssl.create_default_context())

print(DELAY, "-" * 40)
for _ in range(0, 5):
    try:
        response = https.get(URL, headers=HEADERS)
    except (OSError) as e:
        sys.print_exception(e)
    except (RuntimeError) as e:
        sys.print_exception(e)

    try:
        print(response.status_code, response.reason, response.headers, end=" ")
        trash = response.text
        print(len(trash))
        trash = None
        response.close()
    except (NameError) as e:
        sys.print_exception(e)
    time.sleep(DELAY)

results in:

code.py output:
RSSI -47
0 ----------------------------------------
200 bytearray(b'OK') {'Content-Length': '677', 'Access-Control-Allow-Origin': '*', 'Date': 'Tue, 02 Feb 2021 19:03:12 GMT', 'X-Amz-Cf-Pop': 'ORD51-C2', 'X-Powered-By': 'Fat-Free Framework', 'X-Amz-Cf-Id': 'VdlXBG50Iro5wh8eoH4gCvePNvA9pBw4gxuhHPBc0AJH4OPRHXphgg==', 'Server': 'nginx/1.14.1', 'Content-Type': 'application/javascript', 'Connection': 'close', 'Via': '1.1 7bae3892a4b9b7cb601b5d663d98de74.cloudfront.net (CloudFront)', 'X-Cache': 'Hit from cloudfront', 'Cache-Control': 'max-age=15', 'Age': '5', 'Expires': 'Tue, 02 Feb 2021 19:04:07 UTC'} 677
Traceback (most recent call last):
  File "/lib/adafruit_requests.py", line 563, in request
  File "/lib/adafruit_requests.py", line 493, in _send_request
  File "/lib/adafruit_requests.py", line 469, in _send
  File "/lib/adafruit_requests.py", line 467, in _send
OSError: Failed SSL handshake
200 bytearray(b'OK') {'Content-Length': '677', 'Access-Control-Allow-Origin': '*', 'Date': 'Tue, 02 Feb 2021 19:03:12 GMT', 'X-Amz-Cf-Pop': 'ORD51-C2', 'X-Powered-By': 'Fat-Free Framework', 'X-Amz-Cf-Id': 'QVwyEh8LNK8LO-V8cLi5N5cf4PYxXOlEyGUjmnGYMAFY78FnDDbXWw==', 'Server': 'nginx/1.14.1', 'Content-Type': 'application/javascript', 'Connection': 'close', 'Via': '1.1 6aec8f1aa09edf6116ef0f638212a7c6.cloudfront.net (CloudFront)', 'X-Cache': 'Hit from cloudfront', 'Cache-Control': 'max-age=15', 'Age': '5', 'Expires': 'Tue, 02 Feb 2021 19:04:07 UTC'} 677
Traceback (most recent call last):
  File "/lib/adafruit_requests.py", line 563, in request
  File "/lib/adafruit_requests.py", line 493, in _send_request
  File "/lib/adafruit_requests.py", line 469, in _send
  File "/lib/adafruit_requests.py", line 467, in _send
OSError: Failed SSL handshake
200 bytearray(b'OK') {'Content-Length': '677', 'Access-Control-Allow-Origin': '*', 'Date': 'Tue, 02 Feb 2021 19:03:12 GMT', 'X-Amz-Cf-Pop': 'ORD51-C2', 'X-Powered-By': 'Fat-Free Framework', 'X-Amz-Cf-Id': 'lsJX0sBdNAPaiWcWt5Slanu7XJb95u-SsLJ4lUaY8DE3_VGpz5MhEg==', 'Server': 'nginx/1.14.1', 'Content-Type': 'application/javascript', 'Connection': 'close', 'Via': '1.1 9f453aa625f8fca7ca79ca9f2de01aaf.cloudfront.net (CloudFront)', 'X-Cache': 'Hit from cloudfront', 'Cache-Control': 'max-age=15', 'Age': '6', 'Expires': 'Tue, 02 Feb 2021 19:04:07 UTC'} 677
Traceback (most recent call last):
  File "/lib/adafruit_requests.py", line 563, in request
  File "/lib/adafruit_requests.py", line 492, in _send_request
  File "/lib/adafruit_requests.py", line 469, in _send
  File "/lib/adafruit_requests.py", line 467, in _send
OSError: Failed SSL handshake
200 bytearray(b'OK') {'Content-Length': '677', 'Access-Control-Allow-Origin': '*', 'Date': 'Tue, 02 Feb 2021 19:03:12 GMT', 'X-Amz-Cf-Pop': 'ORD51-C2', 'X-Powered-By': 'Fat-Free Framework', 'X-Amz-Cf-Id': 'XmMaly1HblL_yhwwsro0mkYZ3BGM-xfhFEXXIA49XRPwrKfBOoosYA==', 'Server': 'nginx/1.14.1', 'Content-Type': 'application/javascript', 'Connection': 'close', 'Via': '1.1 ac7135d7082fee948afa4d75bab8e0e1.cloudfront.net (CloudFront)', 'X-Cache': 'Hit from cloudfront', 'Cache-Control': 'max-age=15', 'Age': '7', 'Expires': 'Tue, 02 Feb 2021 19:04:07 UTC'} 677
Traceback (most recent call last):
  File "/lib/adafruit_requests.py", line 563, in request
  File "/lib/adafruit_requests.py", line 493, in _send_request
  File "/lib/adafruit_requests.py", line 469, in _send
  File "/lib/adafruit_requests.py", line 467, in _send
OSError: Failed SSL handshake
200 bytearray(b'OK') {'Content-Length': '677', 'Access-Control-Allow-Origin': '*', 'Date': 'Tue, 02 Feb 2021 19:03:12 GMT', 'X-Amz-Cf-Pop': 'ORD51-C2', 'X-Powered-By': 'Fat-Free Framework', 'X-Amz-Cf-Id': 'HwkbkE9vaGUvmOI0tH5QaYUDRVZLy8fZciddrfoE5QZ2LRYCRNRBhA==', 'Server': 'nginx/1.14.1', 'Content-Type': 'application/javascript', 'Connection': 'close', 'Via': '1.1 cf1ea76a2398f8d5f0e0d97f61992ee8.cloudfront.net (CloudFront)', 'X-Cache': 'Hit from cloudfront', 'Cache-Control': 'max-age=15', 'Age': '8', 'Expires': 'Tue, 02 Feb 2021 19:04:07 UTC'} 677

I suspect this is not an issue with Requests.

anecdata commented 3 years ago

In the pre-ESP32-S2 days, Requests was HTTP/1.0 and we sent no countermanding Connection header, so each HTTP connection would get closed after the response (by client and server).

In HTTP/1.1 (newly supported in Requests in conjunction with ESP32-S2) HTTP connections are presumed persistent (keep-alive) unless the server sends a Connection: close header, or the client does and then the server must comply.

Requests doesn't currently take any action on the presence or absence of a Connection header, and as far as I can tell, always tries to close the socket after a Response is received (this is HTTP/1.0 default, or Connection: close, behavior), but we do not send a Connection: close header. Presumably, the server will eventually timeout its connection. Our subsequent connections to the same server will use new sockets (unless I'm reading things wrong).

Oddly, the example code above explicitly sends a {"Connection": "close"} header, and the Failed SSL handshake exception occurs immediately on every alternate request.

The timing differences are a mystery too... why some servers / responses fail immediately and some fail only after specific amounts of time have lapsed.

Addendum: looks like for Session (which ESP32-S2 uses), there is a facility to reuse sockets if they share host, port, and proto. This should give persistence in many cases. So the correlation to lack of keep-alive may be a red herring.

anecdata commented 3 years ago

I'm closing this since the original repeated socket failures / out of sockets issue was resolved. The OSError: Failed SSL handshake will still lurk in the background under some circumstances, but PR 70 will mask it from the user by closing the offending socket and retrying with a new one.

hierophect commented 3 years ago

@anecdata I think the reason the SSL handshake occurs every other request (as it did with Jerry's code too) is because of how the socket fails - on the first use, it sees there's no socket and creates one, on the second, it sees there already is one and attempts to use it, but fails, and closes it. Then, because the socket has been closed, it just creates a new one on the next request and has no error again, creating the appearance of alternating failures. Does that make any sense?

anecdata commented 3 years ago

the second fails why... because in the core or IDF it has timed out and been closed out from under Requests?

in the above code, the connection header closes it, but this also occurs in scenarios that should be persistent (I don't know what server or lower-level client timeouts may apply in those cases).

hierophect commented 3 years ago

@anecdata it actually happens even "lower" than my implementation in the core, it's an error returned by esp_tls_conn_new_sync which is from the esp-idf, part of Mbed TLS. It's possible that the timeout that Requests has set up is not aligned with the timeout in MbedTLS.