facebook / wdt

Warp speed Data Transfer (WDT) is an embeddedable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.
https://www.facebook.com/WdtOpenSource
Other
2.86k stars 391 forks source link

reads and writes crashing with EAGAIN #165

Open sdm900 opened 7 years ago

sdm900 commented 7 years ago

I have an issue where we get errors when sending large files over long fast links.

I instrumented the code (so I could see what the errno was) and got

E0901 04:06:28.172408 38457 SenderThread.cpp:389] wdt>       Thread[5, port: 38185]  Write error -1 (262144). fd = 8. file = randfile. port = 38185  error 11  Resource temporarily unavailable

which is telling me that the write should be "try again"... but WDT just crashes out.

Lines like

    written = socket_->write(buffer, size, /* retry writes */ true);

are not protected against EAGAIN...

Why? Is their something in the design that is meant to handle this?

Why aren't the error messages spitting out the errno and string version? eg.

<< ". port = " << socket_->getPort() << "  error " << errno << "  " << std::strerror(errno) ;

Thanks.

sdm900 commented 7 years ago

This patch appears to fix it...

diff --git a/util/WdtSocket.cpp b/util/WdtSocket.cpp
index f8043d24d4..354ec1c765 100644
--- a/util/WdtSocket.cpp
+++ b/util/WdtSocket.cpp
@@ -149,6 +149,7 @@ int WdtSocket::writeInternal(const char *buf, int nbyte, int timeoutMs,
     int w = writeWithAbortCheck(buf + written, nbyte - written, timeoutMs,
                                 /* always try to write everything */ true);
     if (w <= 0) {
+      if (errno == EAGAIN || errno == EINTR) continue;
       break;
     }
     written += w;
@@ -158,7 +159,7 @@ int WdtSocket::writeInternal(const char *buf, int nbyte, int timeoutMs,
     }
   }
   if (written != nbyte) {
-    WLOG(ERROR) << "Socket write failure " << written << " " << nbyte;
+    WLOG(ERROR) << "Socket write failure " << written << " " << nbyte << " error " << errno << " " << std::strerror(errno) ;
     writeErrorCode_ = SOCKET_WRITE_ERROR;
     return -1;
   }
sdm900 commented 7 years ago

What I don't understand is why you have a loop in WdtSocket::writeInternal at all. It is calling writeWithAbortCheck which to the best of my understanding is doing the same loop and handling the same errors...

sdm900 commented 7 years ago

I mean I understand that I'm basically setting an infinite timeout on the socket (by ignoring the check in WdtSocket::ioWithAbortCheck - but I'm setting the timeout to read and write timeouts to 20s... so they shouldn't really ever be hit.