eminence / procfs

Rust library for reading the Linux procfs filesystem
Other
367 stars 106 forks source link

Process struct should carry an fd and use openat #125

Closed ghost closed 2 years ago

ghost commented 3 years ago

Under Linux, procfs can suffer from TOCTTOU caused by pid reuse. This is evident if you take the self_memory example and have it run on external process. I'll show some output from strace to illustrate the problem:

openat(AT_FDCWD, "/proc/1234/statm", O_RDONLY|O_CLOEXEC) = 3
... snip ...
openat(AT_FDCWD, "/proc/1234/status", O_RDONLY|O_CLOEXEC) = 3

In between calls to openat, our process could be pre-empted, and then the pid could potentially be killed and re-used by another process before our process resumes. This would result in the call to Process::status() returning wrong information, when really it should return an error because the original process 1234 it was referencing is dead. To fix the problem, you'd want the syscalls to look more like this:

openat(AT_FDCWD, "/proc/1234", O_RDONLY|O_CLOEXEC) = 3
openat(3, "statm", O_RDONLY|O_CLOEXEC) = 4
openat(3, "status", O_RDONLY|O_CLOEXEC) = 4

The easiest way to do this would be to replace every area where file paths are joined with an intentional call to openat. The net effect of this would be that Process would no longer store root, it would store a handle to the directory in procfs. This should also be done when reading from entries in /proc/[pid]/task/[tid]. Another change that should probably happen along with this is that process::all_processes() should return an Iterator instead of a Vec, that way it will prevent a huge number of file descriptors being opened at once. This is an API break so I'm proposing it here before submitting any pull request.

eminence commented 3 years ago

Hi,

Thanks for opening this issue. At first glance, this sounds like a nice improvement. Your example with "statm" and "status" is well motivated. I would be definitely willing to see a pull request in this area. One reason that the current API for all_processes() returns a Vec instead of an iterator is to reduce the chances of a process disappearing if doing a slow iteration. I'm not sure if openat really helps for all_processes(), though I guess it'll be a more consistent "snapshot"

You have a good point about needed to manage the large number of open FDs. How does returning an iterator help? If the caller immediately collects everything into a Vec, we would still have an open FD for every running process, right?

eminence commented 3 years ago

Also, I don't see any support for openat in the stdlib. Would you rely on nix for this functionality? Or use libc?

ghost commented 3 years ago

One reason that the current API for all_processes() returns a Vec instead of an iterator is to reduce the chances of a process disappearing if doing a slow iteration.

That will ensure you get a list of pids that was active at the time of iteration but will not prevent the pids from becoming stale. The method I'm suggesting will be slower; that's unavoidable because it's moving from just holding the pid number to now having to do a syscall to grab a reference to a kernel resource. But it will also technically be more correct, the iterator can skip over pids that became stale during processing of the directory.

If we want to preserve the quick iteration, I think it would make more sense to have the method return a Vec of u32 (or more accurately pid_t). That will communicate more what is going on: the pid is just a number that could become stale, you could then pass it to a constructor later that will try to open a handle and return a Result<Process> accordingly.

How does returning an iterator help? If the caller immediately collects everything into a Vec, we would still have an open FD for every running process, right?

Yes. The idea is that for implementing something like top, programs would map over the iterator and pull out the information they need into a different structure. On Linux the default max fds open is 1024, and it's definitely possible to have more than 1024 processes open, so we probably want to avoid pushing up against that limit.

Also, I don't see any support for openat in the stdlib. Would you rely on nix for this functionality? Or use libc?

libc can probably be used if trying to keep dependencies to a minimum, the nix crate is not much more than some minimal safe wrappers around libc functions. Both would work.

lparcq commented 3 years ago

The result from all_processes is always wrong whichever method you use. It's better to minimize the number of opened file descriptors. Btw, it makes no sense to use a program like top on a system where pid sare reused faster than reading the result of all_processes(). However, using openat for Process::new would be much more reliable on heavily loaded systems.

ghost commented 3 years ago

It won't necessarily be wrong if a small number of fds are kept open, for example a top program might want to keep the fd to the currently highlighted process open. That process could then be signaled in a race-free way using pidfd. If the pids are re-used faster than top can read them (i.e. the openat to /proc/<pid>/stat succeeds but /proc/<pid>/cmdline fails) then that process would be discarded and not shown in the output since it's known that the process is already dead.