dbus/socket: handle unexpected return of 1 from SIOCOUTQ

This fixes a problem we encountered where all messages to a single peer became blocked while messages to other peers continued to be delivered.

After much tracing it was discovered that the socket associated with that peer could get stuck in a state where dbus-broker was waiting for an EPOLLOUT event that would never happen, because the kernel socket buffer was already empty. The EPOLLOUT event that was generated when the socket buffer became empty was effectively ignored.

The problem is caused by the use of the SIOCOUTQ ioctl() in src/dbus/socket.c:socket_dispatch_write()...

Sometimes, after an EPOLLOUT notification from the kernel, which would normally cause a pending message to be written to the socket, the SIOCOUTQ ioctl returns 1 rather than 0. This causes dbus-broker to assume that there is an outstanding message in the kernel buffer and that it will receive another EPOLLOUT event when the buffer becomes empty, so it refrains from writing and clears the cached EPOLLOUT event using dispatch_file_clear(). The subsequent EPOLLOUT event that was assumed would be coming never happens because in reality there is nothing left to write in the socket buffer, and as epoll is used in edge-driven mode (using EPOLLET), the socket never wakes up again for writing and so the output queue is stalled.

Obviously 1 is an unlikely value for SIOCOUTQ, because dbus never sends such short messages. The reason it is returning 1 is (probably) because the kernel counter of the write space used (sk_wmem_alloc in struct sock) is overloaded as a reference counter...

sk_wmem_alloc starts at 1 (a simple initial reference) and is incremented by skb->truesize in skb_set_owner_w() when a packet is sent. The 1 of the initial reference is subtracted in sk_wmem_alloc_get(), the function that is called to implement SIOCOUTQ. The sock_wfree() function decrements sk_wmem_alloc when an skb is destroyed and calls the sk_write_space socket op, which is unix_write_space() for a unix domain socket. unix_write_space() uses unix_writable() to determine if the socket is writable and wakes up any pollers with EPOLLOUT if it is.

However, in order to avoid the possibility of a race condition, sock_wfree() subtracts 1 from the amount that it decrements from sk_wmem_alloc before it calls the sk_write_space op, thereby holding a temporary reference. This reference is decremented from sk_wmem_alloc after the sk_write_space call. So, if a task woken by unix_write_space() performs the SIOCOUTQ ioctl() before the final decrement takes place, it can be returned a value of 1 rather than 0, thereby giving the false impression that there is still data waiting to be sent.

You could argue that this is a kernel bug. However, a simple fix is for socket_dispatch_write() in dbus-broker to require the value returned by SIOCOUTQ to be greater than 1 rather than greater than 0 before assuming that there will be another EPOLLOUT event on the socket.

The temporary reference held by sock_wfree() has been in the kernel since commit d99927f4d93f36553699573b279e0ff98ad7dea6, linux 2.6.32. We've only encountered this issue on an ARMv8 system running on a multiple core processor, but with only one core enabled. It has not been encountered with the same software running on a single core ARMv5 system, or on a cut-down version of the software running on an x86_64 system. Presumably this discrepancy is due to scheduling differences.

That nobody else has reported this kernel bug (if that is what it is), is probably because few application are brave enough to use epoll in edge-triggered mode!

Thanks for figuring this out and for the detailed write-up! I'll leave the review to @dvdhrm :)

However, in order to avoid the possibility of a race condition, sock_wfree() subtracts 1 from the amount that it decrements from sk_wmem_alloc before it calls the sk_write_space op, thereby holding a temporary reference. This reference is decremented from sk_wmem_alloc after the sk_write_space call. So, if a task woken by unix_write_space() performs the SIOCOUTQ ioctl() before the final decrement takes place, it can be returned a value of 1 rather than 0, thereby giving the false impression that there is still data waiting to be sent.

You could argue that this is a kernel bug.

Oh dear! I think this is a kernel bug, and I think your analysis is correct.

Every sock_wfree() retains a single reference when calling into ->sk_write_space(). Technically, there could even be multiple threads in parallel, all retaining a single reference, all finish calling into ->sk_write_space(), and then simultaneously dropping the last ref-count. With kernel-preemption enabled, this can be an arbitrary number of threads, and they can be blocked for an arbitrary amount of time, and thus SIOCOUTQ might even return >1. Unlikely, but I would at least like to try fixing this as well.

                                       However, a simple fix is for
socket_dispatch_write() in dbus-broker to require the value returned by SIOCOUTQ to be greater than 1 rather than greater than 0 before assuming that there will be another EPOLLOUT event on the socket.

I think I agree. The fact that the behavior has been like that for decades calls for a workaround in user-space. Nevertheless, I will send a message to netdev@vger next week. Maybe we can find a better solution together. I don't see an easy way to fix that, which does not sacrifice performance needlessly.

Regarding the fix: The wmem counter is incremented with skb->truesize, which itself includes the size of struct skbuff. Hence, we might be able to find a value >1 but <sizeof(skbuff) which would even allow us to work in situations were multiple threads release their skb in parallel. I will have to check next week, whether that idea holds.

The temporary reference held by sock_wfree() has been in the kernel since commit d99927f4d93f36553699573b279e0ff98ad7dea6, linux 2.6.32. We've only encountered this issue on an ARMv8 system running on a multiple core processor, but with only one core enabled. It has not been encountered with the same software running on a single core ARMv5 system, or on a cut-down version of the software running on an x86_64 system. Presumably this discrepancy is due to scheduling differences.

Technically, single-core machines with kernel-preemption enabled could hit this, but it is quite unlikely (or maybe I am missing something?). Furthermore, we only use SIOCOUTQ for messages that transmit file-descriptors (it is an accounting workaround which I would love to get rid of, if the kernel gains suitable replacement mechanisms). So yeah, I believe that this is hard to hit.

Out of curiosity, you see this on Fedora? Or do you deploy dbus-broker manually?

That nobody else has reported this kernel bug (if that is what it is), is probably because few application are brave enough to use epoll in edge-triggered mode!

It was a headache making it work, but we have gotten excellent benchmarks so far. Given that few people use it, I am aware that there might be lurking kernel bugs. But so far, you are the first to report one, so I am still eager to keep it.

Thanks a lot for the report. I will try to investigate more next week. Your proposed workaround should fix this, and I think your analysis is spot on.

[snip]

I think I agree. The fact that the behavior has been like that for decades calls for a workaround in user-space. Nevertheless, I will send a message to netdev@vger next week. Maybe we can find a better solution together. I don't see an easy way to fix that, which does not sacrifice performance needlessly.

I can reproduce the problem without much difficulty, so I'm happy to test any proposed kernel patches, although I'll have to back port them to the 4.14 kernel used on the embedded system in question, so there may be some delay.

Regarding the fix: The wmem counter is incremented with skb->truesize, which itself includes the size of struct skbuff. Hence, we might be able to find a value >1 but <sizeof(skbuff) which would even allow us to work in situations were multiple threads release their skb in parallel. I will have to check next week, whether that idea holds.

Agreed - that would be a sensible modification to my approach.

[snip]

Technically, single-core machines with kernel-preemption enabled could hit this, but it is quite unlikely (or maybe I am missing something?). Furthermore, we only use SIOCOUTQ for messages that transmit file-descriptors (it is an accounting workaround which I would love to get rid of, if the kernel gains suitable replacement mechanisms). So yeah, I believe that this is hard to hit.

As far as I can see the SIOCOUTQ ioctl() is used in socket_dispatch_write() whenever there are pending messages, regardless of whether they contain fd arguments. We do use messages with file descriptors quite a lot in our application.

Out of curiosity, you see this on Fedora? Or do you deploy dbus-broker manually?

It's running on an embedded system with a completely custom user space, though it is fairly conventional - e.g. it uses glibc.

BTW - thanks a lot for dbus-broker. It works really well for us in the embedded environment (with a few patches). The main benefits of its design are the lack of bloated library dependencies and a well controlled memory footprint (although 1 of my patches addresses this). I'll post my patches in case you are interested in what we had to change to make it work for our use case.

[snip]

[off topic] Here is a branch with my embedded system changes in case you're interested: https://github.com/cgpe-a/dbus-broker/commits/embedded Let me know if you want me to submit pull requests for any of the changes.

I think I agree. The fact that the behavior has been like that for decades calls for a workaround in user-space. Nevertheless, I will send a message to netdev@vger next week. Maybe we can find a better solution together. I don't see an easy way to fix that, which does not sacrifice performance needlessly.

I can reproduce the problem without much difficulty, so I'm happy to test any proposed kernel patches, although I'll have to back port them to the 4.14 kernel used on the embedded system in question, so there may be some delay.

I sent a message to netdev@vger.kernel.org. Lets see, maybe people are interested to fix this, but I suspect we will just have to carry the workaround forever.

Regarding the fix: The wmem counter is incremented with skb->truesize, which itself includes the size of struct skbuff. Hence, we might be able to find a value >1 but <sizeof(skbuff) which would even allow us to work in situations were multiple threads release their skb in parallel. I will have to check next week, whether that idea holds.

Agreed - that would be a sensible modification to my approach.

I now pushed your fix plus a follow-up that adds an in-depth comment and increases the counter to 128.

Technically, single-core machines with kernel-preemption enabled could hit this, but it is quite unlikely (or maybe I am missing something?). Furthermore, we only use SIOCOUTQ for messages that transmit file-descriptors (it is an accounting workaround which I would love to get rid of, if the kernel gains suitable replacement mechanisms). So yeah, I believe that this is hard to hit.

As far as I can see the SIOCOUTQ ioctl() is used in socket_dispatch_write() whenever there are pending messages, regardless of whether they contain fd arguments. We do use messages with file descriptors quite a lot in our application.

Yeah, "pending" here means pending in the kernel buffers. And there we only ever add messages to this queue when they contain file-descriptors. For all other messages we just never track them in kernel buffers, because there is no need to do it.

All in all, this entire tracking of kernel-buffers is only required to reliably track inflight file-descriptors that the kernel accounts on the sender. We need to know how many file-descriptors are queued on outgoing queues, otherwise clients can cause the broker to exceed its inflight quota and thus causing any other system service to be unable to send messages with file-descriptors, anymore. dbus-daemon just ignores this problem, but we wanted to be resilient against such local DoS, so we track inflight FDs.

Out of curiosity, you see this on Fedora? Or do you deploy dbus-broker manually?

It's running on an embedded system with a completely custom user space, though it is fairly conventional - e.g. it uses glibc.

BTW - thanks a lot for dbus-broker. It works really well for us in the embedded environment (with a few patches). The main benefits of its design are the lack of bloated library dependencies and a well controlled memory footprint (although 1 of my patches addresses this). I'll post my patches in case you are interested in what we had to change to make it work for our use case.

Interesting! Always good to know where dbus-broker is deployed. I will have a look at your queued patches and pick them up, as long as they apply in general. But on my first glance they all looked reasonable.

Thanks a lot for the report! This sounds like a nasty race to run into, and I am more than glad for the detailed investigation!

With the holidays over, I gave your patches another look. First of all, nobody ever replied to the netdev email I sent. For the record, here it is:

https://www.spinics.net/lists/netdev/msg708996.html

Maybe, if I find the time, I will try writing SIOCOUTQ_FDS which just tells us the number of FDs queued on the socket. Or SIOCOUTQ_USER, which tells us the amount of user-data queued on the socket, rather than the kernel accounted memory. Both would improve our code significantly and allows us to track queued FDs.

Anyway, not today.

I now pushed a patch that makes memfd_create() optional, so we can run without CONFIG_SHMEM. I chose a slightly different path than you did, and I avoided the goto. I hope you don't mind.

I also pushed code that makes the map size a property of the log (log->map_size), similar to one part of your patches. This should at least simplify the patch you have on top. I also reduced the maximum log-size to 2MiB, because our initial debugging assumptions just never held true. Regarding adding the command-line switch, I wonder why you want this? I mean, this only changes the map-size, nothing is allocated unless such a log-message is actually generated. Did you hit issues with this? Do you use strict overcommit? Also, wouldn't it be enough for you to use CFLAGS=-DLOG_SIZE_MAX=<your-value> to change it in your build?

Lastly, I am no big fan of the SO_PEERGROUPS patch. This is part of the kernel API, and we usually prefer just relying on our API headers to be up-to-date, than cherry-picking everything. Also note that the sockopt is different on SPARC and PARISC, where it is not set to 59. Can't you just use CFLAGS=-DSO_PEERGROUPS=59 in your build-flags? I am not sure which way is cleaner, though.

I will close this PR, as it is now resolved. But feel free to continue the discussion!

bus1 / dbus-broker

dbus/socket: handle unexpected return of 1 from SIOCOUTQ #248