OpenCyphal / pycyphal

Python implementation of the Cyphal protocol stack.
https://pycyphal.readthedocs.io/
MIT License
117 stars 106 forks source link

OSError No buffer space available during firmware update / large data transfer with socketcan #233

Closed wiboticalex closed 2 years ago

wiboticalex commented 2 years ago

Occasionally during a firmware update Yakut will crash and the socket that pycyphal has created will claim that there is "No buffer space available" aka ENOBUFS.

Traceback:

2022-07-05 18:02:02 0101068 ERR pycyphal.presentation._port._server: Server(dtype=uavcan.file.Read.1.1, input_transport_session=CANInputSession(InputSessionSpecifier(data_specifier=ServiceDataSpecifier(service_id=408, role=<Role.REQUEST: 1>), remote_node_id=None), PayloadMetadata(extent_bytes=300))) task failure: [Errno 105] No buffer space available
Traceback (most recent call last):
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/presentation/_port/_server.py", line 201, in task_function
    await self.serve_for(handler, _LISTEN_FOREVER_TIMEOUT)
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/presentation/_port/_server.py", line 182, in serve_for
    return await self.serve(handler, monotonic_deadline=loop.time() + timeout)
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/presentation/_port/_server.py", line 170, in serve
    await self._do_send(response, meta, response_transport_session, loop.time() + self._send_timeout)
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/presentation/_port/_server.py", line 309, in _do_send
    return await session.send(transfer, monotonic_deadline)
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/transport/can/_session/_output.py", line 252, in send
    return await self._do_send(can_id, transfer, monotonic_deadline)
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/transport/can/_session/_output.py", line 163, in _do_send
    if await self._send_handler(transaction):
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/transport/can/_can.py", line 329, in _do_send
    num_sent = await self._maybe_media.send(
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/transport/can/media/socketcan/_socketcan.py", line 153, in send
    raise err
  File "/home/alex/.local/share/virtualenvs/uavcan_v1-Kt-sRht1/lib/python3.9/site-packages/pycyphal/transport/can/media/socketcan/_socketcan.py", line 139, in send
    await asyncio.wait_for(
  File "/usr/lib64/python3.9/asyncio/tasks.py", line 479, in wait_for
    return fut.result()
  File "/usr/lib64/python3.9/asyncio/selector_events.py", line 441, in sock_sendall
    n = sock.send(data)
OSError: [Errno 105] No buffer space available

I have found that patching pycyphal's transport/can/media/socketcan/_socketcan.py in _make_socket with

_SO_SNDBUF = 7
s.setsockopt(socket.SOL_SOCKET, _SO_SNDBUF, 1)

allows updates (and I assume other large data transfers) to occur with no issues from my testing. I assume a proper fix using this method would probably set SNDBUF to something greater than 1 though.

There is some discussion of this issue in section 3.4 of http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf, although I found that I did not have to adjust the default queue length of 10 in my case even with using the minimum SNDBUF.

A couple other ways that this could be worked around that I found are increasing the default queue length (although is just hiding the issue?) or adding some retry logic to the send function when OSError's errno == errno.ENOBUFS.

Any thoughts?

pavel-kirienko commented 2 years ago

As I understand it, the two solutions are both technically valid: either reduce the SO_SNDBUF size such that it is no greater than twice the size of the TX queue (multiplied by the CAN frame control structure size) or increase the TX queue size to the same effect. I normally use the latter:

https://gist.github.com/pavel-kirienko/32e395683e8b7f49e71413aebf5e1a89#file-setup_slcan-L101

But we can easily modify PyCyphal to set SO_SNDBUF to some small value to make the default-size txqueuelen block correctly as well. What you suggested is correct, except that setting the SNDBUF to 1 byte seems arbitrary -- the kernel will override that setting to some minimum anyway, so you could just as well use zero.

Would you send a pull request?