Falling back to syscall on EACCES

XanClic commented 1 month ago

Hi,

I wonder why UffdBuilder::open_file_descriptor() falls back to the syscall only if opening the device file failed with ENOENT. On both of the main systems I use for work, /dev/userfaultfd is not accessible by users other than root (mode rw-------, owner root:root), and I can’t remember having explicitly configured it this way, so I assume this is the default. Opening /dev/userfaultfd as a non-root user thus fails with EACCES, which leads to open_file_descriptor() just failing instead of trying the syscall as well. One of my systems has 1 in /proc/sys/vm/unprivileged_userfaultfd, so the syscall would work if we were to use it, but we never do, because the device file is there, just not accessible.

Is there a reason why open_file_descriptor() only falls back to the syscall on ENOENT, and not when encountering other errors?

Background: We’re planning to add postcopy migration support to virtiofsd, which is a VM device emulation for filesystem passthrough between host and guest. To do that, we want to use the support that the rust-vmm vhost crates provide, which rely on userfaultfd-rs to do so.

virtiofsd can sandbox itself, which, as a side effect, will hide /dev/userfaultfd. Consequently, on the system where unprivileged userfaultfd is allowed (but the device file is not accessible by non-root users), a sandboxed virtiofsd can successfully create a userfaultfd with this crate (because opening the device file returns ENOENT, so the syscall is used), but a non-sandboxed virtiofsd cannot (because opening the device file returns EACCES, so the call fails immediately).
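To make the asymmetry concrete, here is a minimal sketch of the behaviour described above. It is not the crate's actual code: the function name `open_enoent_only`, the closures standing in for the device-file open and the syscall, and the hard-coded Linux errno values are all assumptions made for illustration.

```rust
use std::io;

// Linux errno values, hard-coded so the sketch needs no external crate.
const ENOENT: i32 = 2;
const EACCES: i32 = 13;

/// Hypothetical model of the behaviour described in this issue: try the
/// device file first and use the syscall only if the device file does not
/// exist (ENOENT). Any other error is returned to the caller as-is.
fn open_enoent_only<T>(
    open_device: impl FnOnce() -> io::Result<T>,
    syscall: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    match open_device() {
        Ok(fd) => Ok(fd),
        Err(e) if e.raw_os_error() == Some(ENOENT) => syscall(),
        Err(e) => Err(e),
    }
}

fn main() {
    // Simulates unprivileged_userfaultfd = 1: the syscall would succeed.
    let syscall_ok = || Ok::<&str, io::Error>("fd from syscall");

    // Sandboxed: the device file is hidden, so opening it yields ENOENT.
    let sandboxed =
        open_enoent_only(|| Err(io::Error::from_raw_os_error(ENOENT)), syscall_ok);

    // Not sandboxed: the device file exists but is root-only, so EACCES.
    let plain =
        open_enoent_only(|| Err(io::Error::from_raw_os_error(EACCES)), syscall_ok);

    println!("sandboxed succeeds: {}", sandboxed.is_ok());
    println!("non-sandboxed succeeds: {}", plain.is_ok());
}
```

In this model the syscall closure is never invoked in the non-sandboxed case, even though it would have succeeded.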

I’d be happy to send a PR, but the code is very explicit about only falling back on ENOENT (including the comment above), which is why I’m hesitant. Clearly there’s a reason, but I can’t see it from the code or the commit log.

pchickey commented 1 month ago

@bchalios, you contributed open_file_descriptor, can I assign this to you?

bchalios commented 1 month ago

Hi @XanClic and sorry for the delayed response.

/dev/userfaultfd was introduced in Linux 6.1 to address some security issues related to creating UFFDs via the system call[1]. TL;DR: being able to handle page faults triggered in kernel space by an arbitrary user-space process introduces an attack vector. That is why the userfaultfd system call mandates that the calling process have the CAP_SYS_PTRACE capability. The problem with CAP_SYS_PTRACE is that it allows the process to do many more things than create userfault file descriptors.

/dev/userfaultfd avoids these issues because it determines if a process can create such file descriptors through the file permissions of /dev/userfaultfd. It is up to the system administrator to decide who can or can't create file descriptors like that, by assigning the appropriate permissions.

So, the answer to your question

Is there a reason why open_file_descriptor() only falls back to the syscall on ENOENT, and not when encountering other errors?

is yes :)

If your process gets an EACCES when trying to open /dev/userfaultfd, it is probably because your system administrator doesn't want it to open it. So, we return an appropriate error to reflect that.

One of my systems has 1 in /proc/sys/vm/unprivileged_userfaultfd, so the syscall would work if we were to use it, but we never do, because the device file is there, just not accessible.

IMHO this seems like a misconfigured system.

The default value of /proc/sys/vm/unprivileged_userfaultfd is 0. This means that someone (probably your system admin) changed it to 1. If an admin is willing to let every user on the system create UFFDs via the system call, you could ask them to allow the same via /dev/userfaultfd by setting up appropriate file permissions.

[1] https://lwn.net/Articles/819834/

XanClic commented 1 month ago

Thanks for the reply!

I’m the admin of my system, and it’s plausible that I “misconfigured” my system some time (i.e. years) ago so I could use/test postcopy migration with qemu, finding that setting unprivileged_userfaultfd to 1 made it work. In fact, given that Linux 6.1 was only released in December 2022, I assume I actually didn’t misconfigure my system then, but simply set unprivileged_userfaultfd at a time when /dev/userfaultfd wasn’t available yet, and just let it stay this way since then. So regarding “[my] system administrator doesn't want it to open it”, I do want it to open it. My configuration is just old.

I understand that you find this configuration doesn’t make much sense, and I do agree. I should update /dev/userfaultfd’s permissions. However, it is a possible configuration, and nothing stopped me from enacting it, so I still wouldn’t dismiss it. It clearly does exist in practice, specifically on my system.

In any case, I couldn’t gather from your reply why it would be wrong to attempt the syscall after opening the device file failed with EACCES. As far as I understand, on a well-configured system, the syscall would then fail with EPERM as well, i.e. nothing would break or change (except for doing a second syscall in the error case). It sounds like it would only make a difference on a system configured like mine.
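A minimal sketch of what such a wider fallback could look like, again with closures standing in for the real device-file open and syscall. The name `open_with_wider_fallback` and the hard-coded Linux errno values are assumptions for illustration, not the crate's API. One design choice here is also an assumption: if the fallback fails too, the original device-file error is returned, so callers on a well-configured system would see the same error as before.

```rust
use std::io;

// Linux errno values, hard-coded so the sketch needs no external crate.
const EPERM: i32 = 1;
const ENOENT: i32 = 2;
const EACCES: i32 = 13;

/// Hypothetical variant that also tries the syscall when the device file
/// exists but is not accessible (EACCES).
fn open_with_wider_fallback<T>(
    open_device: impl FnOnce() -> io::Result<T>,
    syscall: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    let dev_err = match open_device() {
        Ok(fd) => return Ok(fd),
        Err(e) => e,
    };
    match dev_err.raw_os_error() {
        // No device file: use the syscall and report its outcome.
        Some(ENOENT) => syscall(),
        // Device file not accessible: try the syscall, but keep the
        // device-file error if the syscall is refused as well.
        Some(EACCES) => syscall().or(Err(dev_err)),
        // Anything else is returned unchanged.
        _ => Err(dev_err),
    }
}

fn main() {
    let denied = || Err::<&str, io::Error>(io::Error::from_raw_os_error(EACCES));

    // Well-configured system: the syscall is refused too (EPERM).
    let strict =
        open_with_wider_fallback(denied, || Err(io::Error::from_raw_os_error(EPERM)));
    // System with unprivileged_userfaultfd = 1: the syscall succeeds.
    let legacy = open_with_wider_fallback(denied, || Ok("fd from syscall"));

    println!("strict system succeeds: {}", strict.is_ok());
    println!("legacy system succeeds: {}", legacy.is_ok());
}
```

Under these assumptions, the only observable difference from the ENOENT-only behaviour is on systems where the device file is inaccessible but the syscall is permitted.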

What I’m a bit afraid of is that we enable postcopy migration in virtiofsd (via the vhost-user-backend crate, which uses the userfaultfd crate to do so) and then users that have the same configuration as I do (which I don’t find too implausible, honestly) find that postcopy migration works with qemu alone, but not when using virtio-fs.

bytecodealliance / userfaultfd-rs

Falling back to syscall on EACCES #68