tcp plugin hang when using blocking file descriptor

carns commented 8 years ago

I've encountered a CCI hang (when used with Mercury) in which the ctp_tcp plugin blocks on read() within ctp_tcp_get_event() when trying to drain the pipe.

If I understand the desired behavior correctly, we want to drain the pipe if the event queue is empty at this point. I think the safest way to do this is to make the read fd for the pipe nonblocking by doing something like this:

fcntl(tep->pipe[0], F_SETFL, O_NONBLOCK);

... after the pipe is created to make sure that there is no danger of a drain attempt blocking. Then change the draining logic to run the do/while loop as long as the read() result == sizeof(a) instead of != sizeof(a);

scottatchley commented 8 years ago

On Jun 6, 2016, at 4:38 PM, carns notifications@github.com wrote:

I've encountered a CCI hang (when used with Mercury) in which the ctp_tcp plugin blocks on read() within ctp_tcp_get_event() when trying to drain the pipe.

If I understand the desired behavior correctly, we want to drain the pipe if the event queue is empty at this point. I think the safest way to do this is to make the read fd for the pipe nonblocking by doing something like this:

fcntl(tep->pipe[0], F_SETFL, O_NONBLOCK);

... after the pipe is created to make sure that there is no danger of a drain attempt blocking. Then change the draining logic to run the do/while loop as long as the read() result == sizeof(a) instead of != sizeof(a);

Hi Phil,

CCI uses level-triggered semantics rather than edge-triggered. Each new event should add a byte to the pipe and getting the event should drain one byte from the pipe. We cannot drain all bytes from the pipe or the caller would never block even though there are events ready for reaping.

I assume that we are reading in get_event(), which is about to return an event to the caller. Is that correct?

How easy is this to trigger?

We might be adding an event somewhere without writing a byte to the pipe. Are there other events queued for reaping? They would be on ep->evts. If there are, how many (you will need to walk the list)?

Scott

carns commented 8 years ago

It is entirely possible that I misunderstood the intent of the code :)

The (unmodified) code snippet in question is:

        /* drain fd so that they can block again */
        if (TCP_CONN_IS_BLOCKING(tep)) {
                char a[1];
                do {
                        /* Nothing explicit to do here */
                } while (tcp_event_queue_is_empty (ep) &&
                         read (tep->pipe[0], a, sizeof(a)) != sizeof (a));
        }

Which is in the ctp_tcp_get_event() function. The problem is easy to trigger with Mercury if the Mercury na_cci plugin uses file descriptor polling to determine when to get events; it eventually blocks in the read() call in the above snippet.

The CCI user in this case (the na_cci code in Mercury) calls poll() to block until their is a reason to call cci_get_event(). The poll() call has a timeout, though, and cci_get_event() is called unconditionally following the poll() (whether poll found events to check, or whether the poll hit a timeout), so it seems like the cci_get_event() function must be made safe to call whether an event is present or not. I'm not sure in the hang that I see whether there was really an event there or not.

At any rate, it's totally fine with me if there is a one-to-one mapping of pulling exactly one byte out of the pipe per event retrieved, but it just didn't look like the existing code was set up that way when I first looked at it.

Would it be helpful if I created a standalone reproducer program? Alternatively if you see how to make a proper fix I'm happy to test it with my Mercury example.

carns commented 8 years ago

I just looked at the verbs plugin, and it's corresponding logic looks more like what you described:

        if (ev) {
                char one = 0;
                verbs_ep_t *vep = ep->priv;

                TAILQ_REMOVE(&ep->evts, ev, entry);
                if (vep->fd && TAILQ_EMPTY(&ep->evts)) {
                        debug(CCI_DB_EP, "%s: reading from pipe", __func__);
                        read(vep->pipe[0], &one, 1);
                        assert(one == 1);
                }
        } else {
                ret = CCI_EAGAIN;
        }

It seems like the TCP and Verbs plugins have different assumptions about how to use the pipe. The Verbs code drains at most one byte, and only when an event was found. The Verbs code works as far as I know but I have not been running exactly the same test scenario for each.

scottatchley commented 8 years ago

On Jun 7, 2016, at 9:17 AM, carns notifications@github.com wrote:

It is entirely possible that I misunderstood the intent of the code :)

The (unmodified) code snippet in question is:
    /* drain fd so that they can block again */
    if (TCP_CONN_IS_BLOCKING(tep)) {
            char a[1];
            do {
                    /* Nothing explicit to do here */
            } while (tcp_event_queue_is_empty (ep) &&
                     read (tep->pipe[0], a, sizeof(a)) != sizeof (a));
    }

This is wrong. :-)

Perhaps I did not explain the intent to Geoffroy correctly.

Which is in the ctp_tcp_get_event() function. The problem is easy to trigger with Mercury if the Mercury na_cci plugin uses file descriptor polling to determine when to get events; it eventually blocks in the read() call in the above snippet.

The CCI user in this case (the na_cci code in Mercury) calls poll() to block until their is a reason to call cci_get_event(). The poll() call has a timeout, though, and cci_get_event() is called unconditionally following the poll() (whether poll found events to check, or whether the poll hit a timeout), so it seems like the cci_get_event() function must be made safe to call whether an event is present or not. I'm not sure in the hang that I see whether there was really an event there or not.

At any rate, it's totally fine with me if there is a one-to-one mapping of pulling exactly one byte out of the pipe per event retrieved, but it just didn't look like the existing code was set up that way when I first looked at it.

Would it be helpful if I created a standalone reproducer program? Alternatively if you see how to make a proper fix I'm happy to test it with my Mercury example.

Not yet. We need to fix the above and then run your tests.

CCI / cci

tcp plugin hang when using blocking file descriptor #42