axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

What's the proper way to read all the PID stat files "/proc/[pid]/stat" in an io_uring-friendly way? #1174

Closed 2 days ago by markg85

markg85 commented 3 months ago

Hi,

As a fun exercise in IO performance optimization, I started playing with btop to optimize its IO usage. I got it down to about 1/10 of what it was, using mostly C or simpler constructs in C++, and it's been a nice way to learn new things.

But now I want to go a step further and use io_uring to accomplish the same. I'm a little stuck wrapping my head around how to fully utilize it. The goal is obviously to reduce syscalls as much as I can, but just moving a file read operation to io_uring (with open and close still being syscalls) seems like not much gain for a lot of added code complexity.

I'm curious if there's some nice "this is how to do it efficiently" sample code somewhere. I'm particularly looking for an example that gives me the data of a predefined list of files where open -> read -> close are all done in io_uring. In my case I'd be receiving the content of /proc/1/stat, /proc/2/stat, /proc/567876/stat, etc... you get the point.
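[Editor's note: one way to get open -> read -> close fully inside the ring is to chain the direct-descriptor opcodes with IOSQE_IO_LINK, so the linked read and close address the file by its fixed-file slot instead of a regular fd. A minimal sketch, not code from this thread, assuming liburing >= 2.2 and a recent kernel; error handling omitted:]

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main()
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[1024];

    io_uring_queue_init(8, &ring, 0);
    // Reserve one slot in the fixed-file table; the open below lands in slot 0.
    io_uring_register_files_sparse(&ring, 1);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat_direct(sqe, AT_FDCWD, "/proc/1/stat", O_RDONLY, 0, 0);
    sqe->flags |= IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, 0 /* fixed-file slot, not an fd */, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_FIXED_FILE | IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_close_direct(sqe, 0);

    io_uring_submit(&ring);                 // one syscall covers all three ops

    for (int i = 0; i < 3; i++) {           // reap open, read, close completions
        io_uring_wait_cqe(&ring, &cqe);
        printf("res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}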

What would be even more awesome is if io_uring could do the readdir itself and act upon the entries in that readdir. A tricky part is that, for my case, I only need to know the PID folders in /proc, which is just a subset of all the entries it has. Side note: this is just how btop did it already and I'm merely trying to optimize it. If there's an alternate - fast - way to get a list of all PIDs, I'd love to hear that too!

I also read something about multishot. I'm not entirely sure if files under /proc qualify for multishot. In this application you can set a millisecond interval (in 100ms increments) at which to update. Would such a thing be possible in io_uring too? That is, read a batch of files every x ms?
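[Editor's note: a periodic batch can be driven from inside the ring with a timeout request. A sketch of the shape, not thread code, using only the long-standing io_uring_prep_timeout API; newer kernels also offer IORING_TIMEOUT_MULTISHOT to avoid the re-arm:]

#include <liburing.h>
#include <cstdint>

static const uint64_t TIMER_TAG = ~0ULL;   // arbitrary marker for timer CQEs

static void arm_timer(struct io_uring *ring, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_timeout(sqe, ts, 0, 0);   // count 0: pure timeout, fires once
    io_uring_sqe_set_data64(sqe, TIMER_TAG);
}

void run_periodic(struct io_uring *ring)
{
    struct __kernel_timespec ts {};
    ts.tv_nsec = 100 * 1000 * 1000;         // 100 ms interval

    arm_timer(ring, &ts);
    io_uring_submit(ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            break;
        if (cqe->user_data == TIMER_TAG) {
            // res is -ETIME when the timer fires normally. Queue the next
            // batch of /proc/<pid>/stat reads here, then re-arm.
            arm_timer(ring, &ts);
            io_uring_submit(ring);
        } else {
            // completion of one of the batched reads
        }
        io_uring_cqe_seen(ring, cqe);
    }
}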

zyklotomic commented 1 month ago

Just another user of / someone excited about io_uring here. Do you have a forked branch you're working off of to share? Curious to see!

I agree, I don't think multishot was made for this. It's an interesting case that io_uring doesn't accommodate yet, based on what I know about it.

zyklotomic commented 1 month ago

I also don't think reducing syscalls is the only thing that matters. It may end up being less efficient overall too.

zyklotomic commented 1 month ago

Also, see https://github.com/torvalds/linux/blob/f6eb0fed6a3957c0b93e3a00c1ffaad84d4ffc31/include/linux/io_uring_types.h#L220-L222. This could be a perfect use case for the indirection. Have an array of SQEs:

pid_read_stats_sqes *sqe = /* array of OP_READ SQEs for /proc/<pid>/stat, one per pid */;

Then, to read /proc/PID_NUM/stat, enqueue pid_read_stats_sqes[PID_NUM].
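[Editor's note: a sketch of that idea with illustrative names (pid_slot is not liburing API). Since an SQE is a plain struct that the kernel only reads at submit time, the reads can be prepped once into templates and enqueued by copy:]

#include <liburing.h>
#include <cstdint>
#include <vector>

struct pid_slot {
    int fd;                        // already-open /proc/<pid>/stat fd
    char buf[1024];
    struct io_uring_sqe tmpl;      // pre-filled read SQE for this pid
};

// One-off initialization: fill a template read SQE per pid.
void prefill(std::vector<pid_slot> &slots)
{
    for (size_t i = 0; i < slots.size(); i++) {
        io_uring_prep_read(&slots[i].tmpl, slots[i].fd,
                           slots[i].buf, sizeof(slots[i].buf), 0);
        io_uring_sqe_set_data64(&slots[i].tmpl, i);
    }
}

// Per refresh: enqueue by copy, no per-iteration prep work.
void refresh(struct io_uring *ring, std::vector<pid_slot> &slots)
{
    for (auto &s : slots) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        *sqe = s.tmpl;
    }
    io_uring_submit(ring);
}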

markg85 commented 1 month ago

Please edit your posts instead of making a new post per thought. Thank you!

It is difficult to find the right io_uring way to do this. I ended up playing with this as a stand-alone test case but kinda moved on to different things and never really finished it. So the code below is not done at all, not optimized or tuned by any means, and not an example of how things should be done.

With those disclaimers out of the way, here's the code:

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <liburing.h>
#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <ctype.h>
#include <vector>
#include <chrono>
#include <string>
#include <functional>
#include <iostream>
#include <unordered_map>

#define QD  1024 // must cover the largest submitted batch: 2 SQEs (read+close) per pid

struct iobuf {
   int fd;
   char buf[1024];
   char path[32];
};

inline uint64_t bitshift_stoull(const char* value, const int length) {
    uint64_t result = 0;

    char const* p = value;
    char const* q = p + length;
    while (p < q) {
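        // (result << 1) + (result << 3) == result * 10: strength-reduced decimal accumulate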
        result = (result << 1) + (result << 3) + *(p++) - '0';
    }
    return result;
}

inline uint64_t bitshift_stoull(const char* value) {
    return bitshift_stoull(value, strlen(value));
}

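// Record layout returned by the legacy SYS_getdents syscall; d_type is stored
// in the last byte of each record rather than in a named field.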
struct linux_dirent {
    unsigned long  d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

std::vector<int> getdents(const char* path)
{
    int                  fd;
    char                 d_type;
    char                 buf[1024];
    long                 nread;
    struct linux_dirent  *d;
    std::vector<int>     pids;

    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return pids;

    for (;;) {
        nread = syscall(SYS_getdents, fd, buf, sizeof(buf));

        // 0 means end of directory, negative means an error.
        if (nread <= 0)
            break;

        for (long bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            d_type = *(buf + bpos + d->d_reclen - 1);
            if (d_type == DT_DIR) {
                if (isdigit(d->d_name[0])) {
                    const size_t convertedPid = bitshift_stoull(d->d_name);
                    pids.push_back(convertedPid);
                }
            }
            bpos += d->d_reclen;
        }
    }

    close(fd);
    return pids;
}

int sample_iouring()
{
    struct io_uring ring;
    int i, ret, pending, done;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    ssize_t fsize;

    std::unordered_map<int, iobuf> pid_data;

    ret = io_uring_queue_init(QD, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    std::vector<int> pids = getdents("/proc");

    // Open all the files in io_uring. We need their fd for the read and close.
    for (int pid: pids)
    {
        // Fill the map entry in place: the path buffer handed to openat must
        // outlive this loop iteration, because the SQE is only consumed at
        // io_uring_submit() time. (References into an unordered_map remain
        // valid across later insertions.)
        iobuf &buf = pid_data[pid];
        buf.fd = -1;
        snprintf(buf.path, sizeof(buf.path), "/proc/%d/stat", pid);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat(sqe, AT_FDCWD, buf.path, O_RDONLY, 0);

        // Store our pid. We need this to know which pid we opened with which fd.
        io_uring_sqe_set_data64(sqe, (uint64_t)pid);
    }

    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    }

    // We now want to wait till all files are open.
    ret = io_uring_wait_cqe_nr(&ring, &cqe, pids.size());
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        return 1;
    }

    int nr = 0;
    unsigned head;

    // iterate all the CQEs we got
    io_uring_for_each_cqe(&ring, head, cqe) {
        pid_data[cqe->user_data].fd = cqe->res;
        nr++;
    }
    // now advance CQ ring with the number we've seen
    io_uring_cq_advance(&ring, nr);

    for (int pid: pids)
    {
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, pid_data[pid].fd, pid_data[pid].buf, sizeof(pid_data[pid].buf), 0);
        // Link the read to the following close so the fd can't be closed
        // before the read has executed.
        sqe->flags |= IOSQE_IO_LINK;
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_close(sqe, pid_data[pid].fd);
    }

    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    }

    done = 0;
    pending = ret;
    fsize = 0;
    for (i = 0; i < pending; i++) {
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0) {
            fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
            return 1;
        }

        done++;
        fsize += cqe->res;
        io_uring_cqe_seen(&ring, cqe);
    }

    printf("Submitted=%d, completed=%d, bytes=%lu\n", pending, done,
           (unsigned long) fsize);
    io_uring_queue_exit(&ring);

    std::cout << pid_data.size();

    return 0;
}

int sample_c()
{
    std::vector<int> pids = getdents("/proc");

    //
    // NOTE! The opendir route here isn't any faster than the getdents function.
    //
    // DIR* dirp = opendir("/proc");
    // std::vector<int> pids;

    // struct dirent* dp;

    // while ((dp = readdir(dirp)) != NULL) {
    //     // Skip "." and ".." entries
    //     if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0) {
    //         continue;
    //     }

    //     // Check if it's a directory using d_type
    //     if (dp->d_type == DT_DIR) {
    //         if (isdigit(dp->d_name[0])) {
    //             const size_t convertedPid = bitshift_stoull(dp->d_name);
    //             pids.push_back(convertedPid);
    //         }
    //     }
    // }

    // closedir(dirp);

    std::unordered_map<int, iobuf> pid_data;

    for (int pid: pids)
    {
        iobuf buf;
        buf.fd = -1;
        snprintf(buf.path, sizeof(buf.path), "/proc/%d/stat", pid);

        buf.fd = open(buf.path, O_RDONLY);
        read(buf.fd, buf.buf, sizeof(buf.buf));
        close(buf.fd);
        pid_data[pid] = buf;
    }

    std::cout << pid_data.size();

    return 0;
}

int sample_cpp_stl()
{
    // Never implemented; see note 2 below.
    return 0;
}

int main(int argc, char *argv[])
{
    auto timer = [](std::string pre, std::function<void()> func)
    {
        auto t1 = std::chrono::high_resolution_clock::now();
        func();
        auto t2 = std::chrono::high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
        std::cout << pre << " took: " << duration << " milliseconds\n";
    };

    timer("io_uring", sample_iouring);
    timer("c", sample_c);
    // timer("c", sample_cpp_stl);

    return 0;
}

And here's the CMakeLists.txt file to compile this:

cmake_minimum_required(VERSION 3.29)
project(proc_reader)

find_package(PkgConfig REQUIRED)
pkg_check_modules(liburing REQUIRED IMPORTED_TARGET liburing)

add_executable(proc_reader main.cpp)
# Linking the imported pkg-config target also propagates its include directories.
target_link_libraries(proc_reader PUBLIC PkgConfig::liburing)
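And a typical out-of-source build and run:

cmake -B build
cmake --build build
./build/proc_reader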

Some notes:

  1. The plain C function is about 20-30 times faster than the io_uring version.
  2. I wanted to start a C++ STL version too (as that's what btop uses) but never bothered making it. It would be slower for sure.
  3. While nearly nothing is optimized, the getdents function is (and it's used in both the io_uring and C versions). Its goal is to use as few cycles as possible to find the numeric folders (mostly thanks to bitshift_stoull). And yes, I did in fact benchmark this: this way is about as fast as you can get without dropping to CPU-specific instructions. The defaults in C and C++ (stoull) are slow.
  4. The std::cout << pid_data.size(); is just there to confirm that both functions find an equal number of processes.
  5. Before one says "remove the vector or the unordered_map and io_uring will be faster": the results count here. I need a blob of data from each process to parse in another function, so I need all these blobs in memory. As long as those conditions are met, it's all good.

I am keen on hearing improvements to the io_uring version. It's a mix and match of examples I found, "gluing" it all together. I would imagine io_uring should be able to be faster, or at least on par with C.

zyklotomic commented 1 month ago

Is the following line for sure necessary?

    ret = io_uring_wait_cqe_nr(&ring, &cqe, pids.size());

I think this defeats the whole point of io_uring as an asynchronous syscall facility, by waiting for all requests to be ready before reading the responses.
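[Editor's note: a sketch of one completion-driven alternative, reusing the iobuf map from the code above, under the assumption that opens tag user_data with the pid while reads/closes tag it 0 (safe, since no pid is 0). Each pid's read is queued as soon as its own open completes:]

#include <liburing.h>
#include <cstdint>
#include <unordered_map>

void handle_open_cqes(struct io_uring *ring,
                      std::unordered_map<int, iobuf> &pid_data,
                      unsigned opens_inflight)
{
    struct io_uring_cqe *cqe;
    while (opens_inflight) {
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            break;
        uint64_t tag = cqe->user_data;   // pid for opens, 0 for reads/closes
        int res = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        if (tag == 0)                    // read/close completion, reaped elsewhere
            continue;
        opens_inflight--;
        if (res < 0)                     // open failed, e.g. the process exited
            continue;

        int pid = (int)tag;
        pid_data[pid].fd = res;

        // Chase the open with its read immediately, linked to a close.
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, res, pid_data[pid].buf,
                           sizeof(pid_data[pid].buf), 0);
        io_uring_sqe_set_data64(sqe, 0);
        sqe->flags |= IOSQE_IO_LINK;
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_close(sqe, res);
        io_uring_sqe_set_data64(sqe, 0);
        io_uring_submit(ring);           // could be batched; kept simple here
    }
}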

markg85 commented 1 month ago

Perhaps.

But worst case, io_uring should be on par with serial C, right?

zyklotomic commented 1 month ago

I don't think so for the general case, actually; io_uring isn't a panacea. It has to be the right workload for the benefits of io_uring to shine. I do think it could work for this case.

For example, as an analogy, imagine an API service that serves requests one by one as they come in. When made asynchronous, the service will have higher throughput, but most likely also higher individual latencies. This applies to io_uring too, in my experience.

I'm down to have a go at trying to optimize that code too. If I understand correctly, we would like to periodically read from /proc/<pid>/stat, right?

markg85 commented 1 month ago

You're making the wrong analogy here. If I were building a webserver, you'd be right.

I'm not!

I want to know all the values in the stat files once every x milliseconds, and whatever is fastest wins here. Yes, that means periodically reading all the /proc/<pid>/stat files, as you can see in the example code. Throughput (io_uring should shine!) and latency (done as fast as possible) are what count in this use case.

isilence commented 1 month ago

io_uring can be asynchronous when, for example, reading from a disk, because the actual work, i.e. the data transfer, is done asynchronously by the disk. Reading proc files, however, is just executing kernel code that copies some data to the user; it can't be done truly async, the CPU has to run the code.

Now, for such files io_uring should usually try to execute reads while submitting requests (in the submission syscall), basically turning it into sequential processing of the requests one by one, and that should give you perf comparable to synchronous syscalls like read(2). If you're seeing a 20-30x difference, something is wrong. Possibly the file doesn't support non-blocking execution, which is why io_uring needs to do additional work by offloading execution to another thread. That we'd need to look into.
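[Editor's note: a quick, though not airtight, way to see which of the two is happening is to peek for the CQE right after io_uring_submit(). A read executed inline during submission has its CQE posted before submit returns; a punted one usually does not yet. A sketch, not thread code:]

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[512];

    io_uring_queue_init(4, &ring, 0);
    int fd = open("/proc/self/stat", O_RDONLY);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // If the read ran inline in the submission syscall, the CQE is already
    // here; -EAGAIN strongly suggests it was punted to an io-wq thread.
    if (io_uring_peek_cqe(&ring, &cqe) == 0) {
        printf("completed by submit time, res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    } else {
        printf("not complete yet, likely punted\n");
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}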

markg85 commented 1 month ago

If you're seeing a 20-30x difference, something is wrong. Possibly the file doesn't support non-blocking execution, which is why io_uring needs to do additional work by offloading execution to another thread. That we'd need to look into.

Please don't hesitate to help. The io_uring side and its calls are magic to me. Yeah, I can hack it until it works, but I don't know the intricate details that might make a substantial difference here. The code is above; please do help.

zyklotomic commented 1 month ago

@markg85 Would you happen to know the mechanism with which btop (or top, htop, or any process viewer for that matter) handles refreshing the processes list on a high level?

Since processes can of course spawn and die, the pids we want to monitor within /proc/<pid> may change. I'm wondering if these process viewers are smart enough to iteratively update the process list, or do we have to getdents on each refresh? It might not even be more efficient to do iterative updates.

The reason I'm asking is that there are potential optimizations with io_uring if we can do some initialization up front, such as having pre-filled-out SQEs that we can just resubmit for each iteration. In such a case, I think the one-off initialization cost can be entirely amortized away.

I think the fact that procfs isn't a real fs definitely alters the performance characteristics here, as suggested by Pavel. We also don't have to re-open the files each time we want to read from procfs since it's not a real fs. Each read should provide fresh and complete data.
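[Editor's note: if that assumption holds, the per-refresh work shrinks to one read SQE per pid against a long-lived fd, reading at offset 0 every time. A sketch reusing pids, pid_data, and ring from the code earlier in the thread; whether procfs returns a fresh, complete snapshot per such read is the assumption being made here:]

// Per refresh: no openat/close SQEs at all, just one read per pid at offset 0.
for (int pid : pids) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, pid_data[pid].fd, pid_data[pid].buf,
                       sizeof(pid_data[pid].buf), 0);
    io_uring_sqe_set_data64(sqe, (uint64_t)pid);
}
io_uring_submit(&ring);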

Down to look at the sources for btop, htop, and co. in the meantime though! I'll report back. 😄

markg85 commented 1 month ago

@markg85 Would you happen to know the mechanism with which btop (or top, htop, or any process viewer for that matter) handles refreshing the processes list on a high level?

I kinda know for btop. It essentially does what my example code shows: read the list of pids and display them every x milliseconds. There's no "pid cache", so if a process dies it just vanishes from the GUI at the next refresh.

Note that while I'm using getdents, btop uses plain C++ mechanisms. That shouldn't matter much for the idea, but it's definitely a lot slower. You might want to look at their proc handling code here: https://github.com/aristocratos/btop/blob/00c90884c328eb3e44a0ab094e833161092aae9c/src/linux/btop_collect.cpp#L2404. There are more files being read besides /proc/<pid>/stat, but that one is the hot path in my benchmarks, taking up most CPU cycles. My above example is based on this loop: https://github.com/aristocratos/btop/blob/00c90884c328eb3e44a0ab094e833161092aae9c/src/linux/btop_collect.cpp#L2606.

What would help is if you could combine file system monitoring and proc file reading within io_uring. So if there were a mechanism to essentially tell io_uring "monitor this folder and give me the data of changed files", that would prevent quite a bit of complexity, as you'd then "only" parse data from pid files.

We also don't have to re-open the files each time we want to read from procfs since it's not a real fs. Each read should provide fresh and complete data.

But the syscalls aren't there to support that, right? A read requires an open file descriptor, and an open descriptor has to be closed. Hence open->read->close. Or am I missing something here?

zyklotomic commented 1 month ago

That's the thing with trying to use regular fs mechanisms on /proc though, unfortunately. Were you thinking of something like the inotify API to monitor fs changes? You can't monitor changes for "files" in /proc; they're not real files.

Thanks for the references, will read them soon!


markg85 commented 1 month ago

Were you thinking of something like the inotify API to monitor fs changes? You can't monitor changes for "files" in /proc, they're not real files.

Ah, I didn't know that, thank you for telling me! I wasn't planning on doing that. Consider it just thinking out loud.

But since monitoring it isn't possible then it might make even more sense to have a special io_uring function for procfs, does it not?

isilence commented 1 month ago

But since monitoring it isn't possible then it might make even more sense to have a special io_uring function for procfs, does it not?

It doesn't; you're talking about quite a niche use case, and I don't think there are many people who really care about its performance. It might make sense, but only if that special handling is really, really small. You could just as well ask the inotify folks to make it work with procfs (if it doesn't, as @zyklotomic said); otherwise, something tells me it means re-implementing inotify in io_uring, with blackjack on top.