Azure / azure-sdk-for-cpp

This repository is for active development of the Azure SDK for C++. For consumers of the SDK we recommend visiting our versioned developer docs at https://azure.github.io/azure-sdk-for-cpp.
MIT License
179 stars 128 forks source link

Idle curl connections already closed by remote side are reused #1716

Closed johnwheffner closed 3 years ago

johnwheffner commented 3 years ago

Idle connections sit in a pool for reuse, for up to 90 seconds. Some servers are pretty eager to close idle connections. In particular, the Azurite storage emulator uses a 5 seconds idle timeout. Connections that are closed by the server while sitting in the idle pool sit in a FIN_WAIT2 state on the client side (and CLOSE_WAIT on the storage service side).

When such a connection gets picked for use, an error will occur after sending out a request header, when waiting for a response. (When the remote end gets the header data, it'll send back a TCP RST.)

This error surfaces up through the SDK pipeline. The upper layers will retry, but (a) we don't want to pay the back-off and retry cost, and (b) if we have enough such idle connections -- more than a few by default, it causes operation failure.

To reproduce, on a build using the curl client, run an Azurite storage emulator, establish at least four concurrent requests, close them all, wait for >5 seconds then make another request, reusing the pooled curl connections. The connections will all have been closed by the remote side, and the request will fail after retrying three times.

johnwheffner commented 3 years ago

The following unit test demonstrates the issue more directly

diff --git a/sdk/core/azure-core/src/http/curl/curl_connection_pool_private.hpp b/sdk/core/azure-core/src/http/curl/curl_connection_pool_private.hpp
index e9c2be4..6f1495e 100644
--- a/sdk/core/azure-core/src/http/curl/curl_connection_pool_private.hpp
+++ b/sdk/core/azure-core/src/http/curl/curl_connection_pool_private.hpp
@@ -24,6 +24,7 @@
 // Define the class name that reads from ConnectionPool private members
 namespace Azure { namespace Core { namespace Test {
   class CurlConnectionPool_connectionPoolTest_Test;
+  class CurlConnectionPool_pooledConnectionClosed_Test;
 }}} // namespace Azure::Core::Test
 #endif

@@ -40,6 +41,7 @@ namespace Azure { namespace Core { namespace Http {
 #if defined(TESTING_BUILD)
     // Give access to private to this tests class
     friend class Azure::Core::Test::CurlConnectionPool_connectionPoolTest_Test;
+    friend class Azure::Core::Test::CurlConnectionPool_pooledConnectionClosed_Test;
 #endif
   public:
     /**
diff --git a/sdk/core/azure-core/src/http/curl/curl_connection_private.hpp b/sdk/core/azure-core/src/http/curl/curl_connection_private.hpp
index 906fe68..14273e6 100644
--- a/sdk/core/azure-core/src/http/curl/curl_connection_private.hpp
+++ b/sdk/core/azure-core/src/http/curl/curl_connection_private.hpp
@@ -15,6 +15,14 @@
 #include <curl/curl.h>
 #include <string>

+#if defined(TESTING_BUILD)
+// Define the class name that reads from ConnectionPool private members
+namespace Azure { namespace Core { namespace Test {
+  class CurlConnectionPool_connectionPoolTest_Test;
+  class CurlConnectionPool_pooledConnectionClosed_Test;
+}}} // namespace Azure::Core::Test
+#endif
+
 namespace Azure { namespace Core { namespace Http {

   namespace Details {
@@ -83,6 +91,11 @@ namespace Azure { namespace Core { namespace Http {
    * @brief CURL HTTP connection.
    */
   class CurlConnection : public CurlNetworkConnection {
+#if defined(TESTING_BUILD)
+    // Give access to private to this tests class
+    friend class Azure::Core::Test::CurlConnectionPool_pooledConnectionClosed_Test;
+#endif
+
   private:
     CURL* m_handle;
     curl_socket_t m_curlSocket;
diff --git a/sdk/core/azure-core/src/http/curl/curl_session_private.hpp b/sdk/core/azure-core/src/http/curl/curl_session_private.hpp
index 49bd8d9..1355c06 100644
--- a/sdk/core/azure-core/src/http/curl/curl_session_private.hpp
+++ b/sdk/core/azure-core/src/http/curl/curl_session_private.hpp
@@ -23,6 +23,7 @@
 // Define the class name that reads from ConnectionPool private members
 namespace Azure { namespace Core { namespace Test {
   class CurlConnectionPool_connectionPoolTest_Test;
+  class CurlConnectionPool_pooledConnectionClosed_Test;
 }}} // namespace Azure::Core::Test
 #endif

@@ -43,6 +44,7 @@ namespace Azure { namespace Core { namespace Http {
 #ifdef TESTING_BUILD
     // Give access to private to this tests class
     friend class Azure::Core::Test::CurlConnectionPool_connectionPoolTest_Test;
+    friend class Azure::Core::Test::CurlConnectionPool_pooledConnectionClosed_Test;
 #endif
   private:
     /**
diff --git a/sdk/core/azure-core/test/ut/curl_connection_pool.cpp b/sdk/core/azure-core/test/ut/curl_connection_pool.cpp
index 86ab67a..818555b 100644
--- a/sdk/core/azure-core/test/ut/curl_connection_pool.cpp
+++ b/sdk/core/azure-core/test/ut/curl_connection_pool.cpp
@@ -10,6 +10,8 @@
 #include "azure/core/http/curl/curl.hpp"
 #endif

+#include <sys/socket.h>
+
 #include <iostream>
 #include <string>
 #include <thread>
@@ -172,5 +174,47 @@ namespace Azure { namespace Core { namespace Test {
 #endif
     }

+    TEST(CurlConnectionPool, pooledConnectionClosed)
+    {
+      Azure::Core::Http::CurlConnectionPool::ClearIndex();
+      // Make sure there are nothing in the pool
+      EXPECT_EQ(Azure::Core::Http::CurlConnectionPool::ConnectionPoolIndex.size(), 0);
+
+      std::vector<std::unique_ptr<Azure::Core::Http::HttpPolicy>> policies;
+      policies.emplace_back(std::make_unique<Azure::Core::Http::TransportPolicy>());
+      Azure::Core::Http::HttpPipeline pipeline(policies);
+
+      Azure::Core::Http::Url url("http://httpbin.org/get");
+      Azure::Core::Http::Request request(Azure::Core::Http::HttpMethod::Get, url);
+
+      {
+        std::unique_ptr<Azure::Core::Http::RawResponse> response;
+        ASSERT_NO_THROW(response = pipeline.Send(Azure::Core::GetApplicationContext(), request));
+        auto responseCode = response->GetStatusCode();
+        int expectedCode = 200;
+        EXPECT_EQ(expectedCode, static_cast<int>(responseCode));
+      }
+
+      // Check that after the connection is gone, it is moved back to the pool
+      ASSERT_EQ(Azure::Core::Http::CurlConnectionPool::ConnectionPoolIndex.size(), 1);
+      auto connectionFromPool = Azure::Core::Http::CurlConnectionPool::ConnectionPoolIndex.begin()
+                                    ->second.begin()
+                                    ->get();
+
+      {
+        auto curlConnection = dynamic_cast<Azure::Core::Http::CurlConnection *>(connectionFromPool);
+        ASSERT_NE(nullptr, curlConnection);
+        ::shutdown(curlConnection->m_curlSocket, SHUT_RDWR);
+      }
+
+      {
+        std::unique_ptr<Azure::Core::Http::RawResponse> response;
+        ASSERT_NO_THROW(response = pipeline.Send(Azure::Core::GetApplicationContext(), request));
+        auto responseCode = response->GetStatusCode();
+        int expectedCode = 200;
+        EXPECT_EQ(expectedCode, static_cast<int>(responseCode));
+      }
+    }
+
 #endif
 }}} // namespace Azure::Core::Test
johnwheffner commented 3 years ago

An approach that's worked for me is upon removing from the pool to first check the connection's liveness by doing a recv with MSG_PEEK; if this returns success or EAGAIN then use the connection, otherwise discard and try again immediately.

vhvb1989 commented 3 years ago

I have created a draft PR for this.

The problem with this fix, is that we still have to pay the performance implication of calling recv() for every connection, even when it might be valid already. So, we are basically moving the check from the retry policy to the connection pool.

I wonder if Azurite can be configurable to change the time for closing idle connections?

johnwheffner commented 3 years ago

The syscall overhead is about 1 usec, a tax I'm happy enough to pay in this context. The other option would be to handle this condition in the error path when using the connection, which is arguably more correct but does not encapsulate the complexity.

I'm not sure whether the Azurite timeout can be tuned -- I didn't find anything when I looked, but I didn't look too hard. However, it's not too surprising for other systems like proxies to close idle connections.

vhvb1989 commented 3 years ago

@johnwheffner what you mentioned made me remember that the curl transport adapter implementation should be taking care of getting a new connection if the one that it used initially didn't work. Since you have the repro env already, can you tell me what do you get from this line: https://github.com/Azure/azure-sdk-for-cpp/blob/master/sdk/core/azure-core/src/http/curl/curl.cpp#L178

while reproducing the issue? I would expect libcurl to return CURLE_UNSUPPORTED_PROTOCOL because the connection is closed and request a new connection from the pool.

vhvb1989 commented 3 years ago

Updating to a new PR where the curl transport adapter will react to CURLE_SEND_ERROR and request a new connection.

@johnwheffner I was expecting CURLE_UNSUPPORTED_PROTOCOL but it looks like CURLE_SEND_ERROR will also be used when the data transfer didn't work.

johnwheffner commented 3 years ago

I tried out this change and unfortunately it doesn't solve the problem. When the remote server closes the connection, it puts the client connection into a CLOSE_WAIT state, in which writes to the socket are still allowed. So sending the request will not return CURLE_SEND_ERROR immediately, and the error will only be surfaced later while trying to read the response.

You can see this in the unit test by changing the CurlConnection::Shutdown call to use SHUT_RD instead of SHUT_RDWR.