adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.11k stars 1.22k forks source link

socket timeout or workaround #2073

Closed ssube closed 4 years ago

ssube commented 5 years ago

Hi, I'm looking to use the Feather M4 Express and Ethernet FeatherWing as a way to collect I2C sensor data with Prometheus and threw together a very small HTTP server for that purpose. It works pretty well with light traffic, but if the socket.accept waits too long for an incoming request, the board seems to fall off the network and my code soft-reboots without raising an error:

Headers: {'http': 'HTTP/1.1', 'method': 'GET', 'path': '/favicon.ico'}
Connection: ('10.2.1.193', 8080)

Code done running. Waiting for reload.
soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
['0x40', '0x77']
Online: True
Network: ('0.0.0.0', '0.0.0.0', '0.0.0.0', '0.0.0.0')

When no requests are incoming, I would like the socket.accept to break after a few seconds to let the network driver run and take another sample from the sensors.

The docs and source seem to support socket.settimeout but it appears to invoke a method (syscall?) on the NIC, which throws:

Online: True
Network: ('0.0.0.0', '0.0.0.0', '0.0.0.0', '0.0.0.0')
Online: True
Network: ('10.2.2.254', '255.255.0.0', '10.2.1.1', '0.0.0.0')
Binding: 10.2.2.254:8080
Unable to set socket timeout: [Errno 22] Invalid argument

Is there a way to make the socket timeout, have a timer interrupt it, or work around this somehow?

Running:

Adafruit CircuitPython 4.1.0 on 2019-08-02; Adafruit Feather M4 Express with samd51j19

My socket setup code is:

def start_http_server(port, address='0.0.0.0'):
    bind_address = (address, port)

    http_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    http_socket.bind(bind_address)
    http_socket.listen(1)

    try:
        http_socket.settimeout(5.0)
    except OSError as err:
        print('Unable to set socket timeout:', err)

    return Server(http_socket)

class Server():
    http_socket = False

    def __init__(self, http_socket):
        self.http_socket = http_socket

    def accept(self, router):
        conn, addr = self.http_socket.accept()
        print('Connection: {}'.format(addr))

        req = conn.recv(1024).decode(http_encoding)
        req_headers = self.parse_headers(req)
        print('Headers: {}'.format(req_headers))

        handler = router.select(req_headers['method'], req_headers['path'])
        resp = handler(req_headers, '')

def main():
    ready = False
    bound = False

    registry = CollectorRegistry(namespace='prom_express')
    metric_gas = Gauge('gas',
                       'gas from the bme680 sensor', registry=registry)
    metric_humidity = Gauge('humidity',
                            'humidity from both sensors', ['sensor'], registry=registry)
    metric_pressure = Gauge('pressure',
                            'pressure from the bme680 sensor', registry=registry)
    metric_temperature = Gauge('temperature',
                               'temperature from both sensors', ['sensor'], registry=registry)

    def prom_handler(headers, body):
        return {
            'status': '200 OK',
            'content': '\r\n'.join(registry.print()),
        }

    router = Router([
        ('GET', '/metrics', prom_handler),
    ])
    server = False

    rgb[0] = RED  # starting
    while not ready:
        ready = check_network(eth)
        led.value = ready

    rgb[0] = BLUE  # connected
    while not bound:
        server, bound = bind_server(eth)

    rgb[0] = GREEN  # ready
    while True:
        metric_gas.set(sensor_bme680.gas)
        metric_humidity.labels('bme680').set(sensor_bme680.humidity)
        metric_humidity.labels('si7021').set(sensor_si7021.relative_humidity)
        metric_pressure.set(sensor_bme680.pressure)
        metric_temperature.labels('bme680').set(sensor_bme680.temperature)
        metric_temperature.labels('si7021').set(sensor_si7021.temperature)

        try:
            server.accept(router)
        except OSError as err:
            print('Error accepting request: {}'.format(err))
            server, bound = bind_server(eth)
        except Exception as err:
            print('Unknown error: {}'.format(err))

With a more complete example here: https://github.com/ssube/prometheus_express/blob/master/examples/feather-m4-i2c.py

ladyada commented 5 years ago

im impressed it works cause we haven't done any ethernet projects yet :D @nickzoic added it so will know the most. we're mostly supporting networking with esp32spi https://github.com/ladyada/Adafruit_CircuitPython_ESP32SPI

ssube commented 5 years ago

I don't have one of the esp32 wifi wings on hand right now, but will look into picking one up, at least for testing. Are they strictly wifi, or is it possible to solder on an rj45 port? Not sure all the final locations for these sensors will have strong wifi signal, and I was hoping to use PoE splitters to power the boards from a single switch.

Not sure if this is related, but on some further testing, excessive load seems to also confuse the socket after a while. Something like this:

> ./wrk -t 1 -c 1 -d 5s http://10.2.2.255:8080/metrics
Running 5s test @ http://10.2.2.255:8080/metrics
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    82.34ms   21.98ms 108.68ms   67.24%
    Req/Sec    12.04      4.43    20.00     75.56%
  58 requests in 5.01s, 45.04KB read
  Socket errors: connect 0, read 1, write 21, timeout 0
Requests/sec:     11.59
Transfer/sec:      9.00KB

Will cause an OSError with errno 3, 4, or 7. Usually, creating a new socket and listening again fixes it, but may leave wrk hanging (while [[ true ]]; do curl can also trigger the errors, but is not left hanging).

Headers: {'http': 'HTTP/1.1', 'method': 'GET', 'path': '/metrics'}
Connection: ('10.2.1.193', 8080)
Headers: {'http': 'HTTP/1.1', 'method': 'GET', 'path': '/metrics'}
Error accepting request: [Errno 128] ENOTCONN
Binding: 10.2.2.255:8080
Unable to set socket timeout: [Errno 22] Invalid argument
Error accepting request: [Errno 128] ENOTCONN
Binding: 10.2.2.255:8080
Traceback (most recent call last):
  File "code.py", line 102, in <module>
  File "code.py", line 97, in main
  File "code.py", line 97, in main
  File "/lib/prometheus_express/utils.py", line 24, in bind_server
  File "/lib/prometheus_express/server.py", line 12, in start_http_server
OSError: 4

With the -c 1 -t 1 flags, wrk should be running a single connection, but I'm not sure how that client connection would prompt an OSError (or a few different ones).

ladyada commented 5 years ago

the esp's are def just wifi

nickzoic commented 5 years ago

See also #1880 ... exceptions could be a lot better on 'socket', 'network' and 'wiznet' libraries, we just have to decide what to standardize on ...

phrogger commented 4 years ago

@brentru I don't think they are the same problem. The issue I am reporting is that the Wiznet 5500 has no on-board long term memory for a MAC address, so if you don't give it one after boot, it makes up a random number for the MAC, and causes an IP address change every time you reboot. The Arduino API has a function to assign/force a MAC address. The CP API does not. As far as I know, it is stable as long as power is still applied, so should not cause this issue. I didn't go beyond discovering it was hopping IP address every power cycle, and I put it back in the desk drawer. --- graham (discord), aka phrogger(github)

nickzoic commented 4 years ago

PS: The ESP32 has support for RMII so you can attach an Ethernet interface like the LAN8270 that way, although it is a rather different beast than the way WizNet works.

See: https://github.com/micropython/micropython-esp32/issues/172 and https://github.com/micropython/micropython/issues/4502 and the "network.LAN" class.

ssube commented 4 years ago

Revisited this over the break using v4.1.2, and with a few tweaks to my code (and a fixed IP address), things seem more stable with the Feather M4/Ethernet Wing setup:

[09:38:29] ssube@cerberus ~/code/ssube/prometheus-express (master@93aa0347) > time ./scripts/bench-curl.sh
...
Request: 9800 of 10000
Request: 9900 of 10000
Request: 10000 of 10000

real    6m47.389s
user    1m27.865s
sys     0m51.520s

With the fixed IP, rebooting after an OSError takes a few seconds for the device to come back at the same address, and Prometheus' retry logic can handle that. My use-case is monitoring on a 60s interval, so that simulates a week's worth of requests, which is good enough uptime to start deploying them.

Still seeing issues accepting new connections under load. Specifically, running wrk (this script) puts the chip into an immediate error loop:

[09:55:14] ssube@cerberus ~/code/ssube/prometheus-express (master@93aa0347) > time ./scripts/bench-wrk.sh
Running 1m test @ http://10.2.1.220:8080/metrics
  1 threads and 1 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 5.80s, 0.00B read
  Socket errors: connect 0, read 0, write 64203, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

real    0m5.836s
user    0m0.344s
sys     0m0.971s

produces:

Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN
Error accepting request: [Errno 128] ENOTCONN

It doesn't appear to recover from this without a ^C ^D reboot or hard reset (maybe it would after logging all 64k?). I can only reproduce this by running wrk, curl in a tight loop doesn't trigger the same behavior. Changing the socket accept depth doesn't appear to make a difference, but I'm not sure what values would make sense here.

I'm going to test this on an Airlift Wing next and see how it works with the ESP32, then pick up a few of those Olimex chips. Built in POE would save me an adapter :D

nickzoic commented 4 years ago

Interesting. I'll check it out.

ssube commented 4 years ago

Got a chance to try my code on the ESP32-POE and -EVB today (running https://github.com/ssube/prometheus_express/blob/master/examples/olimex-esp32-random.py), with fantastic results on both wifi and wired. Socket timeouts appear to work there, allowing me to collect sensor information in between requests. Not seeing the ENOTCONN errors from wrk either, those only appear with the Wiznet hardware so far. Chrome, curl, and most important Prometheus are scraping the ESPs just fine.

nickzoic commented 4 years ago

The two issues raised here have split into a couple of different places, so I'm going to close this one for clarity and work can continue over there:

The docs and source seem to support socket.settimeout but it appears to invoke a method (syscall?) on the NIC, which throws:

See #2097 for work on adding settimeout to WizNet 5500. Once I actually get something working I'll backport it to MicroPython too.

The issue I am reporting is that the Wiznet 5500 has no on-board long term memory for a MAC address, so if you don't give it one after boot, it makes up a random number for the MAC

See #2274 for an approach to providing a stable MAC address