axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Spurious (?) multishot poll wakeups #849

Open mihalicyn opened 1 year ago

mihalicyn commented 1 year ago

Dear colleagues,

I'm trying to debug a problem with spurious poll events when multishot mode (IORING_POLL_ADD_MULTI) is enabled.

Unfortunately, I haven't managed to make a minimal reproducer for the issue yet.

Currently, it's reproducible with LXC when it's built with -Dio-uring-event-loop=true. The reproducer itself is fairly simple: run ./build/src/lxc/tools/lxc-start -F test_ct -l TRACE -o lxcstart.log, enter a few characters in the terminal (or press Enter a few times), and it'll get stuck forever.

The reason for the hang is that the LXC process gets a poll CQE, then performs a read from the FD; right after the read it gets a new CQE for the same FD, goes to read it again, and blocks forever.

I started investigating this problem from the userspace (LXC) side and found that making the terminal FDs non-blocking (without disabling multishot mode) makes reproductions rarer (https://github.com/lxc/lxc/commit/715fb4effaa7f35e642056d2a10d5b8c393bfac0).

That was a hint for me, and I started digging into the problem on the kernel side. The main point of interest in the kernel is the function io_poll_check_events():

static int io_poll_check_events(struct io_kiocb *req, bool *locked)
{
    int v;

    /* req->task == current here, checking PF_EXITING is safe */
    if (unlikely(req->task->flags & PF_EXITING))
        return -ECANCELED;
printk("%px %s:%d io_poll_check_events %px %ps\n", req, __func__, __LINE__, req->file, req->file->f_op);
    do {
        v = atomic_read(&req->poll_refs);
printk("%px %s:%d iteration %d %d %px %ps\n", req, __func__, __LINE__, v, req->cqe.res, req->file, req->file->f_op);
        if (unlikely(v != 1)) {
            /* tw should be the owner and so have some refs */
            if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
                return IOU_POLL_NO_ACTION;
            if (v & IO_POLL_CANCEL_FLAG)
                return -ECANCELED;
            /*
             * cqe.res contains only events of the first wake up
             * and all others are to be lost. Redo vfs_poll() to get
             * up to date state.
             */
            if ((v & IO_POLL_REF_MASK) != 1)
                req->cqe.res = 0;

            if (v & IO_POLL_RETRY_FLAG) {
                req->cqe.res = 0;
                /*
                 * We won't find new events that came in between
                 * vfs_poll and the ref put unless we clear the
                 * flag in advance.
                 */
                atomic_andnot(IO_POLL_RETRY_FLAG, &req->poll_refs);
                v &= ~IO_POLL_RETRY_FLAG;
            }
        }

        /* the mask was stashed in __io_poll_execute */
printk("%px %s:%d PRE vfs_poll %d %px %px %ps\n", req, __func__, __LINE__, req->cqe.res, &req->cqe.res, req->file, req->file->f_op);
        if (!req->cqe.res) {
            struct poll_table_struct pt = { ._key = req->apoll_events };
            req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
printk("%px %s:%d vfs_poll %d %px %ps\n", req, __func__, __LINE__, req->cqe.res, req->file, req->file->f_op);
            /*
             * We got woken with a mask, but someone else got to
             * it first. The above vfs_poll() doesn't add us back
             * to the waitqueue, so if we get nothing back, we
             * should be safe and attempt a reissue.
             */
            if (unlikely(!req->cqe.res)) {
                /* Multishot armed need not reissue */
                if (!(req->apoll_events & EPOLLONESHOT))
                    continue;
                return IOU_POLL_REISSUE;
            }
        }
        if (req->apoll_events & EPOLLONESHOT)
            return IOU_POLL_DONE;
printk("%px %s:%d continue iteration %px %ps\n", req, __func__, __LINE__, req->file, req->file->f_op);
        /* multishot, just fill a CQE and proceed */
        if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
            __poll_t mask = mangle_poll(req->cqe.res &
                            req->apoll_events);
printk("%px %s:%d writing CQE %px %llx\n", req, __func__, __LINE__, req->file, req->cqe.user_data);
            if (!io_aux_cqe(req->ctx, *locked, req->cqe.user_data,
                    mask, IORING_CQE_F_MORE, false)) {
printk("%px %s:%d overflow\n", req, __func__, __LINE__);
                io_req_set_res(req, mask, 0);
                return IOU_POLL_REMOVE_POLL_USE_RES;
            }
        } else {
printk("%px %s:%d io_poll_issue\n", req, __func__, __LINE__);
            int ret = io_poll_issue(req, locked);
            if (ret == IOU_STOP_MULTISHOT)
                return IOU_POLL_REMOVE_POLL_USE_RES;
            if (ret < 0)
                return ret;
        }

        /* force the next iteration to vfs_poll() */
        req->cqe.res = 0;
printk("%px %s:%d req->cqe.res = 0 %px %px %llx\n", req, __func__, __LINE__, &req->cqe.res, req->file, req->cqe.user_data);
        /*
         * Release all references, retry if someone tried to restart
         * task_work while we were executing it.
         */
    } while (atomic_sub_return(v & IO_POLL_REF_MASK, &req->poll_refs) &
                    IO_POLL_REF_MASK);

    return IOU_POLL_NO_ACTION;
}

I found a strange thing: sometimes after req->cqe.res = 0, vfs_poll() isn't called on the next iteration. That seemed odd, and I found that req->cqe.res is changed to a non-zero value between iterations by the io_poll_wake() function.

It's worth mentioning that the problem happens with tty_fops and ptmx_fops FDs, but we also use signalfd and socket FDs in the same io_uring instance.

The purpose of filing this issue is to share my experience (maybe someone else has hit this problem too) and to get some debugging advice. :)

Kind regards, Alex

isilence commented 1 year ago

https://lore.kernel.org/lkml/ccf6cea1-1139-cd73-c4e5-dc9799708bdd@living180.net/T/

This quite suspiciously reminds me of that issue. Long story short, multishot poll should only be used with nonblocking files. I assumed Christian was going to fix it up, but apparently that didn't happen. I'd encourage you to submit a patch to LXC.

Making the terminal FDs non-blocking (without disabling multishot mode) makes reproductions rarer (https://github.com/lxc/lxc/commit/715fb4effaa7f35e642056d2a10d5b8c393bfac0)

I tested it back then, worked for me. Is there a chance there is yet another blocking file you missed? It might also be a separate problem.

mihalicyn commented 1 year ago

Hi Pavel,

https://lore.kernel.org/lkml/ccf6cea1-1139-cd73-c4e5-dc9799708bdd@living180.net/T/

This quite suspiciously reminds me of that issue. Long story short, multishot poll should only be used with nonblocking files. I assumed Christian was going to fix it up, but apparently that didn't happen. I'd encourage you to submit a patch to LXC.

yes, I came to the same conclusion during debugging and committed this patch (https://github.com/lxc/lxc/commit/715fb4effaa7f35e642056d2a10d5b8c393bfac0). It makes reproduction rarer.

I tested it back then, worked for me. Is there a chance there is yet another blocking file you missed? It might also be a separate problem.

I'm almost sure that I've covered all the places, but I'll check again. Since reproduction became rarer after making the FDs non-blocking, it's harder to debug what's happening. But my understanding is that we get into an infinite loop: a poll CQE arrives, we go to read (and get something); then a new poll CQE arrives, we go to read (get nothing); another poll CQE arrives, we go to read (get nothing), and so on forever. My suspicion was that for some reason read wakes POLLIN for ttys... but I've checked the git log for the TTY line discipline driver and ptmx and found nothing suspicious.

Kind regards, Alex

isilence commented 1 year ago

yes, I came to the same conclusion during debugging and committed this patch (lxc/lxc@715fb4e). It makes reproduction rarer.

Ah, right, and thank you for finally fixing this one!

I tested it back then, worked for me. Is there a chance there is yet another blocking file you missed? It might also be a separate problem.

I'm almost sure that I've covered all the places, but I'll check again. Since reproduction became rarer after making the FDs non-blocking, it's harder to debug what's happening. But my understanding is that we get into an infinite loop: a poll CQE arrives, we go to read (and get something); then a new poll CQE arrives, we go to read (get nothing); another poll CQE arrives, we go to read (get nothing), and so on forever. My suspicion was that for some reason read wakes POLLIN for ttys... but I've checked the git log for the TTY line discipline driver and ptmx and found nothing suspicious.

Ok, so it's busy looping doing nothing useful. We need to understand what wakes it up. Can you run bpftrace for 10-20 seconds after that state of nothingness kicks in? This one should do:

bpftrace -e 'kprobe:io_poll_wake { @[kstack] = count(); }'

You can specify a pid and an output log file with -p and -o respectively.

isilence commented 1 year ago

Any news on that one?

mihalicyn commented 7 months ago

Hi Pavel,

I'm really sorry for the long delay in replying. I just missed a notification and then forgot to get back to the issue. After we landed https://github.com/lxc/lxc/pull/4304 and stopped using multishot poll, everything works just fine.

So I don't think there is any issue on the io_uring side. If I find anything interesting there, or if I see any further issues, I'll definitely do some kernel debugging and tracing myself and then contact you with some insights ;-)

I think we can close this for now.

Sorry for taking your time!

Kind regards, Alex