Open mjm opened 1 month ago
There is a comment in the code that explains why gen_udp has problems with this:
`/* "code" analysis is the same for both SCTP and UDP above,
So, EINTR is "not supposed" to be possible. Clearly, on a Unix domain socket, this can happen (on Linux)...
Should have asked this before, but what flavor and version of Linux did you test this with?
> So, EINTR is "not supposed" to be possible.

`EINTR` is documented as a valid error for all of `send`, `sendto` and `sendmsg` if you get a signal, so the comment is wrong. Unless the VM traps it using a signalfd, that is :)
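To illustrate the point about `EINTR` at the system-call level, here is a minimal C sketch of the common "retry on `EINTR`" pattern around `sendto`. The helper name `sendto_retry_eintr` is hypothetical and is not taken from the inet driver; it only shows one way a caller can treat an interrupted call as retryable instead of as a failure.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not from inet_drv.c): call sendto() and retry
 * for as long as the call fails because a signal interrupted it.
 * Any other error is returned to the caller with errno left intact. */
static ssize_t sendto_retry_eintr(int fd, const void *buf, size_t len,
                                  int flags, const struct sockaddr *addr,
                                  socklen_t addrlen)
{
    ssize_t n;

    do {
        n = sendto(fd, buf, len, flags, addr, addrlen);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

On a connected datagram socket, `addr` can be `NULL` and `addrlen` can be `0`. Whether retrying is the right policy for the driver is a separate design question; the sketch only shows that `EINTR` is an ordinary, recoverable return from `sendto`.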
I mentioned the comment as an explanation of the behavior, not a justification. Regardless, I have done some testing:
On FreeBSD (14.1), OpenIndiana (Hipster 2023.10), MacOS (14.4.1/23.4.0), NetBSD (9.0) the result is 'enoent'.
I have also tested this on the following versions of Linux without being able to reproduce the issue: Ubuntu 22.04.5 (6.8.0-47-generic), Ubuntu 20.04.6 (5.4.0-196-generic), Linux Mint 21 (5.15.0-122-generic), LMDE 5 (5.10.0-33-amd64), SLES 12 (3.12.60-52.54-default), SLES 12-SP2 (4.4.74-92.35-default).
Here is a PR for testing: https://github.com/bmk/otp/tree/bmk/kernel/20241030/gen_udp_blocking_send_on_local
> Should have asked this before, but what flavor and version of Linux did you test this with?

In production, we're running on Google Kubernetes Engine, so the nodes are running Container-Optimized OS cos-113-18244-151-27. When I was creating the reproduction example, I was running on Docker Desktop on macOS 4.34.3 (170107). I'm not sure what version of Linux that's using on the VM it manages.
In both contexts, `sysctl net.unix.max_dgram_qlen` appears to be 10. I think it being so low is why this happens.
Aha. On my machine:
    $ sysctl net.unix.max_dgram_qlen
    net.unix.max_dgram_qlen = 512
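For comparing environments without the `sysctl` tool (for example inside a minimal container), the same setting can be read from procfs on Linux. The following is a small C sketch; the function name `read_max_dgram_qlen` is hypothetical, and it assumes the standard Linux path `/proc/sys/net/unix/max_dgram_qlen`.

```c
#include <stdio.h>

/* Hypothetical helper: return the value of net.unix.max_dgram_qlen,
 * or -1 if it cannot be read (for example on a non-Linux system,
 * where this procfs file does not exist). */
static long read_max_dgram_qlen(void)
{
    FILE *f = fopen("/proc/sys/net/unix/max_dgram_qlen", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller could log this value at startup to check whether a node is running with a small queue limit such as the 10 reported above.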
If you can, please test my branch, and see if that solves the problem.
Okay, I'll see if I can get that built today in a context where I've actually had the problem.
I was able to build your branch in a Docker container and test it alongside both 27.1.2 and 25.3.2.15. The former reproduces the bug, while the latter does not, because the special-case handling of EINTR doesn't exist yet in that version.
Your branch did not reproduce the problem!
Describe the bug
When using the `gen_udp` module with Unix domain sockets, sending packets can return an `EINTR` error, which seems to be unexpected by the `sendto` implementation in `inet`, as it responds with an `{inet_reply, Port, Ref}` message (no reply value) that goes unhandled by `sendto` and ends up in the calling process's mailbox.
To Reproduce
I've reduced this to a small reproduction in Elixir: https://gist.github.com/mjm/490abd286e526fceaeb0e373414e1214
It reproduces for me on Linux but not on macOS, so I used `docker run -it elixir /bin/bash` to get a Linux Elixir environment. Then you can paste the module in the gist into two `iex` sessions, and run `UdsBlockExample.test_listen()` in one, and `UdsBlockExample.test_socket()` in the other. `test_socket()` will raise an error that it received an unexpected `inet_reply` message.
Expected behavior
This example code should run without error, as inet_reply messages should not leak out of these calls.
In production, this is manifesting as some of our genservers suddenly receiving these unexpected messages after we switched to using Unix domain sockets for reporting telemetry to statsd.
Using the new `socket` `inet_backend` also causes this to work as expected.
Affected versions
In production we hit this on OTP 26.2.5 but it also reproduces on the latest OTP 27.
Additional context
The undesired messages come from this code path in the inet driver.
A comment a short bit above this suggests that `EINTR` should not happen for UDP, and that seems to be true, but it appears that it can happen for AF_UNIX datagram sockets, at least on Linux.
And here is where `sendto` is not handling this shape of message, which is what allows it to leak. The implementation of `send` above this has a case for handling 3-tuples, but `sendto` assumes that won't happen.