Partial writes not handled?

MoritzMaxeiner commented 8 years ago

Prelude

From the posix standard regarding send and write it follows, that should a call from libasync to send be interrupted after it has already written some bytes it - and subsequently libasync's send wrapper functions for sockets - will return the number of bytes succesfully written (i.e. moved into the operating system send buffer) before the interrupt occurred. Furthermore, EAGAIN or EWOULDBLOCK are only set by a call to send if the write would block, not if it was interrupted, so libasync will (correctly) not create a WRITE event later. See here, here, and here.

The issue

From what I can see here and here asynchronous - to the best of my understanding - seems to expect libasync to create a WRITE event when a send was not able to move all the requested bytes into the OS send buffer (as shown above, it will not).

Fortunately asynchronous buffers the data correctly, but as it stands, if it gets interrupted after sending some bytes, the rest of the data will not get sent until the next time someone actively tries to send data (resulting in a potentially significant - and theoretically unbounded - delay). AFAIK loops akin to the following in both write and onWrite are necessary to ensure interruption of the send system call will not have such a detrimental effect.

if (this.writeBuffer.empty) {
    uint sent = void;
    do {
        sent = connection.send(cast(ubyte[]) data);
        data = data[sent .. $];
    } while (sent > 0 && !data.empty);
    if (sent == 0) writeBuffer = data.dup;
}

if (!writeBuffer.empty) {
    uint sent = void;
    do {
        sent = connection.send(cast(ubyte[]) writeBuffer);
        writeBuffer = writeBuffer[sent .. $];
    } while (sent > 0 && !writeBuffer.empty);
}

dcarp commented 8 years ago

Do you have a test for it? Is it ok to call send in the blocking loop?

MoritzMaxeiner commented 8 years ago

Testing whether the issue happens is - I think - extremely hard, considering that you would need to ensure that the thread currently in the send system call gets interrupted by a signal only after moving some bytes into the OS send buffer (OSSB). But to clarify the point, here are the relevant quotes regarding Linux' write and send:

If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written.

The above applies to any calls to write, as there are no further specifications for being interrupted regarding sockets (blocking or non-blocking), pipes, or FIFOs in the man page.

The send() function is equivalent to sendto() with a null pointer dest_len argument, and to write() if no flags are used.

The above applies, as libasync dos not use flags in its calls to send.

If you mean whether I have a test regarding the proposed changes to asynchrounous' write and onWrite: I'm using something just like it in my local changes for the next version of i3ipc-d, which will support dispatching events with libasync and it works fine.

Regarding

Is it ok to call send in the blocking loop?

My apologies, I'm not sure what you are exactly asking here, as the above loops cannot block. Calls to libasync's send will either:

make progress by sending some bytes, returning a value greater than zero and at most the number of bytes requested to write
have the send system call fail (i.e. returning -1) and return zero because either the OSSB is already completely full or at least too full for the requested bytes to fit into (which of the two is operating system specific, as POSIX AFAIK allows both variants. Linux uses the former, always trying to send at least some bytes, if possible, until the OSSB is completely full; other OSs' may choose the latter; neither FreeBSD nor OS X for example, say which one they use, so we can't make any assumptions here).

That is also the reason why - when a call to libasync's send returns a value greater than zero and less than the number of bytes requested to be written - we have to keep trying to write, as on platforms that choose the former (like Linux) it is impossible for us to distinguish between

the send system call wrote until it was interrupted by a signal and returned the number of bytes written into the OSSB up until that interruption -> The OSSB was not necessarily full (It could have been, but it needn't be, we just don't know) -> No WRITE event may be generated
the send system call wrote until the OSSB was full and returned the number of bytes it wrote into the latter -> A WRITE event will be generated. In other words, we cannot know whether a WRITE event will come and have to keep trying to write until either all the bytes were successfully written, or we know for certain a WRITE event will be generated for us once the OSSB.

The same logic applies to when we are handling a WRITE event, of course, since there is no guarantee that the WRITE event will only be generated once enough bytes are available in the OSSB to write all of our data. On platforms like Linux, it might be generated once a single bytes becomes free in the OSSB (unlikely, I grant you, but possible), so we - again - have to try writing until we are certain either all of our data was written, or a WRITE event will be generated for us in the future.

MWumpusZ commented 8 years ago

I can confirm and reproduce that, when send fails to send all data, pauseWriting is called (so far so good), but resumeWriting never is because no WRITE event is ever generated - matching the behavour described by Cairama.

I'm having some difficulty getting a minimal code sample together, since in our application there are some intermediate wrappers around your libasynchronous, but in broad terms I'm just sending a lot of data "at once" over a TCP connection to a local listener, and reliably getting pauseWriting called at around the 7 MB mark - that's a non-obvious number, so there's probably PC speed / OS network implementation / other factors influencing that.

I hacked up asynchronous-0.6.3/src/asynchronous/libasync/events.d to log calls to pauseWriting and resumeWriting as well as all received TCPEvents, and as Cairama inidicated, the expected WRITE event never occurs after the pauseWriting (or at all, in my simple example).

If you still can't reproduce the issue, I can send you the not-very-minimal example I have - it's not like the code should be unfamiliar to you ;-) Ghosts from the past :)

MoritzMaxeiner commented 8 years ago

POSIX.1-2013 also requires this behaviour explicitly:

When writing to a STREAM that is not a pipe or FIFO: If O_NONBLOCK is set and part of the buffer has been written while a condition in which the STREAM cannot accept additional data occurs, write() shall terminate and return the number of bytes written.

On a side-note: An analogous issue happens when receiving on stream sockets (technically all sockets that aren't message-based/datagram-based) as POSIX.1-2013 specifies for recv/recvfrom/recvmsg:

For message-based sockets, such as SOCK_DGRAM and SOCK_SEQPACKET, the entire message shall be read in a single operation.

as well as for read, which recv is equivalent to with no flags:

If -1 is returned when any data is transferred, it is difficult to recover from the error on a seekable device and impossible on a non-seekable device. Most new implementations support this behavior. The behavior required by POSIX.1-2008 is to return the number of bytes transferred.

POSIX.1-2008 does not specify when an implementation that buffers read()s actually moves the data into the user-supplied buffer, so an implementation may choose to do this at the latest possible moment. Therefore, an interrupt arriving earlier may not cause read() to return a partial byte count, but rather to return -1 and set errno to [EINTR].

Consideration was also given to combining the two previous options, and setting errno to [EINTR] while returning a short count. However, not only is there no existing practice that implements this, it is also contradictory to the idea that when errno is set, the function responsible shall return -1.

TLDR:

On datagram/message-based sockets (SOCK_DGRAM, SOCK_SEQPACKET, SOCK_RAW, SOCK_RDM) one call to recv/recvfrom/recvmsg - if it succeeds - must always return an entire datagram (as much as fits into the buffer, at least). Any subsequent such system call (ignoring MSG_PEEK) on that socket cannot return any part of the already returned datagram.
On all other sockets (I know only SOCK_STREAM) a recv/recvfrom/revcmsg is specifically allowed to return a short count (e.g. when it was interrupted after transferring some bytes into the provided buffer). Herein lies the issue: Since we are using edge-triggering (at least on epoll and kqueue) that means if we receive a short count we will not get another READ event. Unfortunately POSIX.1-2013 (see above) makes it explicitly impossible in this case to distinguish between receive was interrupted after transferring some bytes and receive transferred all available bytes from the receive buffer into the user's buffer and there is still space left in the user's buffer. So, yet again, we have to keep trying to receive until we are certain that we have retrieved all bytes from OS the receive buffer and there are no "stale" bytes left, for which no READ event gets generated more new bytes arrive.

duselbaer commented 8 years ago

Hi Dragos,

is there any progress on this issue? Can we somehow support?

dcarp commented 8 years ago

Hi R., a small test program to demonstrate the problem would help. Till now I couldn't reproduce it, but I'll work on it this week-end.

MoritzMaxeiner commented 8 years ago

a small test program to demonstrate the problem would help.

As I stated in the above, it's not a matter of testing. It's a matter of specification conformance. Whether or not one can reliably trigger the bug is immaterial to the problem of not conforming to the API specification.

Till now I couldn't reproduce it, but I'll work on it this week-end.

Don't try to reproduce it, read the specification I linked to in depth, decide whether or not you agree with my interpretation of it (it is always possible that I did not interpret it correctly), and document your decision. If you agree with my interpretation, change the code appropriately; if you don't, close the issue (and please give a reason as to why you disagree). Trying to reliable trigger a bug that may or may not happen depending on implementation choices made by the OS (and in the case of Linux may change with every kernel release - though it's unlikely) is a shortcut to hell: With this I mean that the OS could theoretically remember which process starts a system call on a nonblocking socket and not send signals for the duration of that system call (making it impossible to trigger the bug on that particular OS, despite it possibly happening on another OS). I doubt Linux does that, but the point is that unless one reads the OS' source code (which is not possible for e.g. OS X), one doesn't know. So one has to write compliant code and (optionally) be angry at the people back then who designed readiness IO.

MWumpusZ commented 8 years ago

In the send direction, both duselbaer and I have seen the described issue show up in (different) "real" code. I can have another go at getting some minimal sample code together on Monday, but Calrama's caveat - that it may still Work For You because of OS differences - is certainly correct.

Regarding the receive direction: I'm strongly inclined to agree with Calramas conclusion, although I believe that the short read on a stream socket isn't necessarily a decisive point. More on that latter when I've had a chance to digest the details.

MWumpusZ commented 8 years ago

Regarding the Issue in reading, which the patch doesn't address (which is fine; if it's an issue it deserves it's own ticket): I've gone over the opengroup stuff again and am still unsure either way; there would be a problem only if recv is permitted to return

more than nothing
AND less than the requested buffer size
AND less than it has available

... when it is interrupted. Is this allowed? Or is it only allowed to return -1 when interrupted? I haven't been able to tell from the spec :-(

MoritzMaxeiner commented 8 years ago

Regarding the Issue in reading, which the patch doesn't address (which is fine; if it's an issue it deserves it's own ticket):

Since I'm currently not using asynchronous actively (for reasons that have nothing to do with this), I'm conflicted about opening another issue I probably won't be able to dedicate enough time to, which is why I'm going to quote what I believe to be the relevant specification passages below for now.

if recv is permitted to return

more than nothing

AND less than the requested buffer size

AND less than it has available

... when it is interrupted. Is this allowed? Or is it only allowed to return -1 when interrupted? I haven't been able to tell from the spec :-(

POSX.1-2013/functions/recv:

The recv() function is equivalent to recvfrom() with null pointer address and address_len arguments, and to read() if the socket argument refers to a socket and the flags argument is 0.

POSIX.1-2013/functions/read:

Earlier versions of this standard allowed two very different behaviors with regard to the handling of interrupts. In order to minimize the resulting confusion, it was decided that POSIX.1-2008 should support only one of these behaviors. Historical practice on AT&T-derived systems was to have read() and write() return -1 and set errno to [EINTR] when interrupted after some, but not all, of the data requested had been transferred. However, the US Department of Commerce FIPS 151-1 and FIPS 151-2 require the historical BSD behavior, in which read() and write() return the number of bytes actually transferred before the interrupt. If -1 is returned when any data is transferred, it is difficult to recover from the error on a seekable device and impossible on a non-seekable device. Most new implementations support this behavior. The behavior required by POSIX.1-2008 is to return the number of bytes transferred.

POSIX.1-2008 does not specify when an implementation that buffers read()s actually moves the data into the user-supplied buffer, so an implementation may choose to do this at the latest possible moment. Therefore, an interrupt arriving earlier may not cause read() to return a partial byte count, but rather to return -1 and set errno to [EINTR].

Highlighting by me. To me the above leaves only one possible interpretation for how a POSIX.1-2013 compliant recv implementation must behave: It's three semantic steps are as follows (any of which may be interrupted):

Do internal work that does not touch the user provided buffer.
If any bytes are available in the OS receive buffer, transfer a single byte from it to the user provided buffer.
Transfer any remaining available bytes from the OS receive buffer into the user provided buffer.

If recv gets interrupted

in step 1: It must return -1 and must set EINTR,
in step 2 before transferring this first byte: It must return -1 and must set EINTR,
in step 2 after transferring this first byte: It must return 1,
in step 3: It must return the number of bytes transferred from the OS receive buffer into the user provided buffer before the interrupt occurred.

dcarp commented 8 years ago

@MWumpusZ , @duselbaer, is your issue fixed in v0.6.5? Regarding the read, I hope that it works correctly. Have you seen any problems there? Otherwise I think that libasync cannot help there and a rewrite using epoll for GNU/Linux will be necessary.

duselbaer commented 8 years ago

No problems here using read so far.

dcarp / asynchronous

Partial writes not handled? #9

Prelude

The issue