axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Understanding task_work and recvmsg with buffers #1165

Closed PickingUpPieces closed 3 weeks ago

PickingUpPieces commented 1 month ago

During my journey to understand io_uring, a couple of questions arose about the scheduling of task_work and the control messages with recvmsg.

Task_work

I’m confused about the “default (scheduling) mode” of task_work. From my understanding, task_work is executed on retries and when posting completions, and is generally triggered with IPIs [3,6]. I’ve assumed that in the “default mode” my program is interrupted regardless of whether it is executing a syscall or running in userspace, and that it then performs the task_work for the specific task. This is explained by Pavel Begunkov [1] and in the manpage of io_uring_setup (under IORING_SETUP_COOP_TASKRUN) [2]. However, the “networking guide” [1] and the manpage of io_uring_setup (under IORING_SETUP_DEFER_TASKRUN) [2] state that task_work is run whenever an application transitions from kernel to userspace. I think I’m missing a piece of the puzzle about how task_work works.

Additionally, does enabling the IORING_SETUP_DEFER_TASKRUN option imply the IORING_SETUP_COOP_TASKRUN option? From my understanding, with IORING_SETUP_DEFER_TASKRUN enabled, all work is executed only when calling io_uring_enter (with specific flags set). Is there any advantage to enabling IORING_SETUP_COOP_TASKRUN together with IORING_SETUP_DEFER_TASKRUN?

Getting msg_control from recvmsg with provided buffers

When using multishot recvmsg, the control messages of the msghdr and the payload are written into the provided buffer [4,5]. But how do I get the cmsg messages if I use the “normal” recvmsg (io_uring_prep_recvmsg) with provided buffers? When I checked the returned buffer on a recvmsg request, it held only the payload data and nothing else. I need the control messages to use GRO with recvmsg. Am I missing something about how to get the msghdr->msg_control information for a request, or is it currently only supported with multishot recvmsg?

This issue is more of a documentation/clarification issue; I hope it is fine that I opened it in this repo. If not, ignore it! Thanks a lot!

Resources

[1] https://kernel-recipes.org/en/2023/schedule/on-the-way-to-io_uring-networking/ (slides 21-29)

[2] https://www.man7.org/linux/man-pages/man2/io_uring_setup.2.html

[3] https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#task-work

[4] https://man7.org/linux/man-pages/man3/io_uring_prep_recvmsg.3.html

[5] https://man7.org/linux/man-pages/man3/io_uring_recvmsg_out.3.html

[6] https://kernel-recipes.org/en/2022/whats-new-with-io_uring/

axboe commented 3 weeks ago

This probably should have been raised as two separate discussions, rather than a single issue...

DEFER_TASKRUN does not imply the same behavior as COOP_TASKRUN. With COOP_TASKRUN, the task_work will be run on any transition back from the kernel to userspace. With DEFER_TASKRUN, it's only run when the task waits on completions, either by using one of the io_uring_wait_cqe() variants, or by calling io_uring_get_events(). I don't think the kernel will complain if you set COOP_TASKRUN with DEFER_TASKRUN, but it won't make a difference. COOP_TASKRUN refers to task_work that the kernel knows about (it's referenced off the task struct itself), while DEFER_TASKRUN uses an io_uring private type of task_work that the kernel doesn't have any insight into.
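As a rough illustration of the two modes, here is a minimal sketch of choosing ring setup flags. The enum taskrun_mode and the helper taskrun_setup_flags() are hypothetical names made up for this example, not part of liburing or the kernel API. The #ifndef fallbacks only cover older kernel headers, and the sketch assumes, per io_uring_setup(2), that DEFER_TASKRUN has to be combined with SINGLE_ISSUER.

```c
#include <assert.h>
#include <linux/io_uring.h>

/* Fallbacks for older kernel headers; values mirror <linux/io_uring.h>. */
#ifndef IORING_SETUP_COOP_TASKRUN
#define IORING_SETUP_COOP_TASKRUN   (1U << 8)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER  (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN  (1U << 13)
#endif

/* Hypothetical enum for this sketch, not a liburing or kernel type. */
enum taskrun_mode { TASKRUN_DEFAULT, TASKRUN_COOP, TASKRUN_DEFER };

/* Hypothetical helper: returns a value for io_uring_params.flags. */
static unsigned int taskrun_setup_flags(enum taskrun_mode mode)
{
    switch (mode) {
    case TASKRUN_COOP:
        /* task_work runs on the next transition from kernel to userspace. */
        return IORING_SETUP_COOP_TASKRUN;
    case TASKRUN_DEFER:
        /*
         * task_work runs only when the task waits on or reaps completions.
         * Adding COOP_TASKRUN here would make no difference.
         */
        return IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
    default:
        /* Anything else is treated as the default mode: no extra flags. */
        assert(mode == TASKRUN_DEFAULT);
        return 0;
    }
}
```

The returned value would be stored in the flags field of struct io_uring_params before calling io_uring_queue_init_params(). In the DEFER case, the pending task_work is then run by io_uring_get_events() or one of the io_uring_wait_cqe() variants.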

For non-multishot receive, msg_control is delivered just like it is with a recvmsg(2) syscall. Multishot has to place the data somewhere else, or it would be overwritten by the next trigger of the receive. And since you can have many come in before you process them in the application, that would not work so well.
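For the GRO case from the question, a minimal sketch of reading the control data after a non-multishot recvmsg completes could look like the following. It assumes the msghdr handed to io_uring_prep_recvmsg() had msg_control and msg_controllen pointing at an application-owned buffer that stays valid until the completion arrives, and that UDP_GRO was enabled on the socket. The helper name gro_segment_size() is hypothetical, and the #ifndef fallbacks are only there for older libc headers.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>

/* Fallbacks for older libc headers. */
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_GRO
#define UDP_GRO 104
#endif

/*
 * Hypothetical helper: walk the control messages of a completed recvmsg
 * and return the GRO segment size, or -1 if no UDP_GRO cmsg is present.
 */
static int gro_segment_size(struct msghdr *msg)
{
    struct cmsghdr *cm;

    assert(msg != NULL);
    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == SOL_UDP && cm->cmsg_type == UDP_GRO &&
            cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
            int seg_size;

            /* CMSG_DATA may be unaligned for int, so copy it out. */
            memcpy(&seg_size, CMSG_DATA(cm), sizeof(seg_size));
            return seg_size;
        }
    }
    return -1;
}
```

The same loop works for a plain recvmsg(2) call, which is the point of the answer above: only the multishot variant packs the control data into the provided buffer.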

axboe commented 3 weeks ago

Closing this one, please open as a discussion if there are further questions.