birdofpreyru / react-native-static-server

Embedded HTTP server for React Native
https://dr.pogodin.studio/docs/react-native-static-server
Other
135 stars 21 forks source link

[Android] Sporadic failures to shutdown the server (gracefully) #106

Closed birdofpreyru closed 3 months ago

birdofpreyru commented 3 months ago

Does not seem to be a problem in the example app when I move it between background and foreground; but it does happen in my app when Google.Play purchase flow is started (the RN app itself moves to background state then, and sometimes the server fails to stop, never exiting the main loop on Lighttpd site). I wonder whether it is related to a bug fixed in the latest Lighttpd v1.4.76.

birdofpreyru commented 3 months ago

No, v1.4.76 does not help.

A note for myself, on what I figured out so far happens during a failed shutdown attempt, in Lighttpd core:

birdofpreyru commented 3 months ago

Hi @gstrauss , if you have a free moment, will you help me out here, please?

After tracing stuff a bit further I figured out this:

At this point, it feels like a bug in Lighttpd to me, I guess here once connection is marked as errored because of graceful_expire timeout, it should not be repeated for the same connection in the next main loop iteration? Or, perhaps, it should not be marked as errored if it is already in CON_STATE_CLOSE state? Then it will allow the connection to reach HTTP_LINGER_TIMEOUT and, presumably to be closed.

Or... do I misunderstand something here?

birdofpreyru commented 3 months ago

Yeah, I have verified that changing the if-block from

        if (graceful_expire) {
            connection_set_state_error(r, CON_STATE_ERROR);
            changed = 1;
        }

into

        if (graceful_expire && r->state != CON_STATE_CLOSE) {
            connection_set_state_error(r, CON_STATE_ERROR);
            changed = 1;
        }

solves the problem for me — in case when the server does not gracefully shutdown rapidly, it successfully shuts down after the expected 8 + 5 seconds timeout.

gstrauss commented 3 months ago

Thank you for the detailed troubleshooting. I have to ruminate on this a bit.

gstrauss commented 3 months ago

Instead of your suggested patch, my initial take (untested) is that connection_handle_shutdown() should not assign con->close_timeout_ts = log_monotonic_secs; if the state is already r->state == CON_STATE_CLOSE, but I have to read through the code to also assure myself that the state won't bounce between CON_STATE_CLOSE and CON_STATE_ERROR.

gstrauss commented 3 months ago

connection_handle_response_end_state() is called for CON_STATE_RESPONSE_END and for CON_STATE_ERROR. Calling connection_handle_shutdown() changes the state to CON_STATE_CLOSE if the connection is not yet closed, or cleans up the connection and goes to the reset state of CON_STATE_CONNECT if all finished.

Since the intent is that connection_handle_shutdown() should not be called more than once, and it is clearly being called more than once, that appears to be a bug. My hunch is that I need to look carefully to find where CON_STATE_CLOSE might transition to CON_STATE_ERROR.

gstrauss commented 3 months ago

h1_check_timeout(), h2_check_timeout() and connection_revents_err() might play roles, but connection_graceful_shutdown_maint() definitely transitions CON_STATE_CLOSE to CON_STATE_ERROR, as you pointed out. However, we still want changed = 1, so please try this:

--- a/src/connections.c
+++ b/src/connections.c
@@ -908,7 +908,8 @@ connection_graceful_shutdown_maint (server * const srv)
         }

         if (graceful_expire) {
-            connection_set_state_error(r, CON_STATE_ERROR);
+            if (r->state != CON_STATE_CLOSE)
+                connection_set_state_error(r, CON_STATE_ERROR);
             changed = 1;
         }
gstrauss commented 3 months ago

h1_check_timeout(), h2_check_timeout() and connection_revents_err() do not change the state if r->state == CON_STATE_CLOSE, so it looks like you correctly identified the problem in connection_graceful_shutdown_maint(). Does my modified patch above work for you?

birdofpreyru commented 3 months ago

@gstrauss your variant sure works as well, though I am not sure why to set changed if we haven't modified the state? I saw it works without it, thus the connection state machine still is called by the main loop regardless?

Another thought I got — in this scenario, when we arrive to connection_handle_close_state() after expiring the graceful shutdown timeout, perhaps there is no need to additionally wait for HTTP_LINGER_TIMEOUT? I mean, once we already expired the graceful shutdown one, perhaps it makes sense to terminate ASAP, without any further delays?

Also, I noticed that graceful-shutdown-timeout option is never mentioned in Lighttpd docs at https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ConfigurationOptions

gstrauss commented 3 months ago

though I am not sure why to set changed if we haven't modified the state?

CON_STATE_ERROR triggers various request/connection cleanups prior to transition to CON_STATE_CLOSE. If already in CON_STATE_CLOSE, setting changed and calling connection_state_machine() optimistically attempts to read from the connection, looking for EOF. As you suggest, maybe that is unnecessary and can wait for read event notification. On the other hand, the optimistic read does sink (read and discard) a bit of extra data from the socket if the kernel socket buffers were full.

Near the top of connection_graceful_shutdown_maint(), if the connection is in CON_STATE_CLOSE and the HTTP_LINGER_TIMEOUT has expired, changed = 1 is set, so as you said, maybe it is unnecessary to set changed = 1 if graceful_expire is set but the state is already CON_STATE_CLOSE.

Another thought I got — in this scenario, when we arrive to connection_handle_close_state() after expiring the graceful shutdown timeout, perhaps there is no need to additionally wait for HTTP_LINGER_TIMEOUT?

As above, CON_STATE_ERROR does some things prior to transition to CON_STATE_CLOSE, e.g. calling shutdown() on the socket and then setting con->close_timeout_ts. One second later, connection_graceful_shutdown_maint() will be run again, will find the connection in CON_STATE_CLOSE, and will adjust con->close_timeout_ts further in the past so that lighttpd will wait at most one more second before HTTP_LINGER_TIMEOUT expires, giving the cilent a chance to respond to the shutdown() on the socket. Therefore, if the default server.graceful-shutdown-timeout is 8 seconds, everything should finish shutting down in 8 seconds + a maximum of 2 more seconds following shutdown() on the sockets on which the graceful shutdown timeout expired. Are you seeing that? With that patch, are you seeing an extra second or two, or are you seeing an extra 5 seconds (HTTP_LINGER_TIMEOUT)?

Also, I noticed that graceful-shutdown-timeout option is never mentioned in Lighttpd docs at https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ConfigurationOptions

server.feature-flags documents "server.graceful-shutdown-timeout"

birdofpreyru commented 3 months ago

_Are you seeing that? With that patch, are you seeing an extra second or two, or are you seeing an extra 5 seconds (HTTP_LINGERTIMEOUT)?

You are right, with the patch in place, the server shuts down within a second after the graceful shutdown expiration. My bad.

gstrauss commented 3 months ago

Thanks for confirming. I have committed the following patch with your patch to fix the issue here and a small change to more quickly close connections after the graceful shutdown timeout has expired and the connection state is CON_STATE_CLOSE. There may still be a 1 second delay after graceful shutdown timeout has expired. This allows for lighttpd to set CON_STATE_ERROR to trigger sending shutdown(fd, SHUT_WR) and transitioning to CON_STATE_CLOSE, and then a second later to adjust the con->close_timeout_ts so that HTTP_LINGER_TIMEOUT is expired, too.

--- a/src/connections.c
+++ b/src/connections.c
@@ -884,6 +884,7 @@ connection_graceful_shutdown_maint (server * const srv)
              * (from zero) *up to* one more second, but no more */
             if (HTTP_LINGER_TIMEOUT > 1)
                 con->close_timeout_ts -= (HTTP_LINGER_TIMEOUT - 1);
+            con->close_timeout_ts -= (graceful_expire << 1); /*(-2 if expired)*/
             if (log_monotonic_secs - con->close_timeout_ts > HTTP_LINGER_TIMEOUT)
                 changed = 1;
         }
@@ -907,7 +908,7 @@ connection_graceful_shutdown_maint (server * const srv)
             changed = 1;
         }

-        if (graceful_expire) {
+        if (graceful_expire && r->state != CON_STATE_CLOSE) {
             connection_set_state_error(r, CON_STATE_ERROR);
             changed = 1;
         }
gstrauss commented 3 months ago

I should note that the issue you are seeing here in #106 is a bug in lighttpd, but that bug was exposed due to your clients ignoring the shutdown() sent by lighttpd. Is there something particular about the way the clients are handling shutdown() that could be improved, too?

birdofpreyru commented 3 months ago

By clients do you mean the HTTP client on the other side of connection that fails to close gracefully? I should investigate it… though, in my case the client connecting to the server is the Android’s WebView, not some custom client I’ve implemented myself; but the problem occurs when I’m moving my app into background, sounds like it may interfere with normal handling of connections by WebView.

gstrauss commented 3 months ago

By clients do you mean the HTTP client on the other side of connection that fails to close gracefully? I should investigate it… though, in my case the client connecting to the server is the Android’s WebView, not some custom client I’ve implemented myself; but the problem occurs when I’m moving my app into background, sounds like it may interfere with normal handling of connections by WebView.

Yes, the HTTP client on the other side of the connection. In the scenario you describe, are the clients receiving the shutdown() sent by lighttpd? Are the clients subsequently calling close() on the socket prior to expiration of the graceful shutdown timeout? Is lighttpd receiving a socket event indicating that the client closed the connection? Is lighttpd processing that socket event? Somewhere along the line, something seems not to be happening. It might be WebView. It might be backgrounding your app on Android. It might be lighttpd. It might be something else.

Overall, my question is: What are the reason(s) that the client connections are not being closed prior to the graceful shutdown timeout in lighttpd?

birdofpreyru commented 3 months ago

I have no idea. I guess, some bug on the client side, but for now just falling back to non-graceful shutdown is good enough for me. I'll return to investigating it at a better time later :man_shrugging: