Stiffstream / restinio

Cross-platform, efficient, customizable, and robust asynchronous HTTP(S)/WebSocket server C++ library with the right balance between performance and ease of use

acceptor.h seems to block #68

Closed AmarOk1412 closed 4 years ago

AmarOk1412 commented 4 years ago

Hi,

I am using restinio on a server receiving a lot of connections. I noticed that sometimes the server hangs and doesn't accept any new connections (but current connections continue to work without issues).

Just before hanging, this is what I see in the logs:

an exception in 'accept_current_connection': remote_endpoint: Transport endpoint is not connected

So, my assumption is that https://github.com/Stiffstream/restinio/blob/bd4382c2fd355efa8ee2f1fffe62f0134b472e30/dev/restinio/impl/acceptor.hpp#L308 is not recovering after that exception. (This line is throwing the exception: https://github.com/Stiffstream/restinio/blob/bd4382c2fd355efa8ee2f1fffe62f0134b472e30/dev/restinio/impl/acceptor.hpp#L372)
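
For context, this failure mode can be reproduced with plain Asio, outside RESTinio. A minimal sketch (assuming standalone Asio; the port is a placeholder): remote_endpoint() throws when the peer resets the connection between accept() and that call, and if the exception escapes the accept handler, the accept chain is never re-armed:

#include <asio.hpp>
#include <iostream>

int main()
{
    asio::io_context io;
    asio::ip::tcp::acceptor acceptor(
            io, asio::ip::tcp::endpoint( asio::ip::tcp::v4(), 8080 ) );

    asio::ip::tcp::socket socket( io );
    acceptor.async_accept( socket, [&]( const std::error_code & ec ) {
        if( !ec ) {
            try {
                // Throws "Transport endpoint is not connected" (ENOTCONN)
                // if the peer already reset this freshly accepted socket.
                std::cout << "accepted from "
                        << socket.remote_endpoint() << '\n';
            }
            catch( const asio::system_error & x ) {
                std::cerr << "remote_endpoint: " << x.what() << '\n';
            }
        }
    } );
    io.run();
}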

I'm currently investigating, but if you have any clue, it will be appreciated :)

eao197 commented 4 years ago

Hi!

What happens next? Do you have to restart your server, or does the server start accepting new connections again after some of them were rejected?

AmarOk1412 commented 4 years ago

For now, we have to restart the program each time it happens; every curl I run against the server hangs forever. The server never accepts new connections.

eao197 commented 4 years ago

If you run on Linux, maybe you've reached the limit of open file descriptors, and RESTinio just can't get a newly accepted socket?
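
On Linux this is easy to rule out with getrlimit() (a hypothetical check program, not part of RESTinio); compare its output with the number of entries in /proc/<pid>/fd:

#include <sys/resource.h>
#include <cstdio>

int main()
{
    // Query the soft/hard limits on open file descriptors (RLIMIT_NOFILE).
    rlimit lim{};
    if( getrlimit( RLIMIT_NOFILE, &lim ) == 0 )
        std::printf( "fd limits: soft=%llu hard=%llu\n",
                static_cast<unsigned long long>( lim.rlim_cur ),
                static_cast<unsigned long long>( lim.rlim_max ) );
}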

AmarOk1412 commented 4 years ago

It's definitely not the descriptors: even if a connection closes, a new connection will not be accepted, and the total number of open files is low (407 for now; last time I saw the server hanging, it was around 230 open files).

These are the logs I have:

[938158.389029] [proxy:server] [connection:177360] read socket error: Connection reset by peer; parsed bytes: 0 # don't care, this doesn't lead to any error
[938158.397347] [proxy:server] [connection:177359] read socket error: Connection reset by peer; parsed bytes: 0
[938158.425681] [proxy:server] [connection:177344] read socket error: Connection reset by peer; parsed bytes: 0
[938158.575548] [proxy:server] [connection:177409] read socket error: Connection reset by peer; parsed bytes: 0
[938158.578418] [proxy:server] [connection:177411] read socket error: Connection reset by peer; parsed bytes: 0
[938158.611023] [proxy:server] [connection:177413] read socket error: Connection reset by peer; parsed bytes: 0
[938158.612191] [proxy:server] [connection:177414] read socket error: Connection reset by peer; parsed bytes: 0
[938158.793192] [proxy:server] [connection:177413] try to write response, while socket is closed
[938158.793254] [proxy:server] [connection:177413] try to write response, while socket is closed
[938158.888432] [proxy:server] @@@ ERROR: remote_endpoint: Transport endpoint is not connected
[938158.888455] [proxy:server] an exception in 'accept_current_connection': std::exception # Now the server will not accept any new connections
[938159.136770] [proxy:server] [connection:177312] read socket error: Connection reset by peer; parsed bytes: 0 # This is from a current connections
# No new connections (and no log at all from restinio apart from the current connections)

eao197 commented 4 years ago

Ok. I'll try to investigate this tomorrow. Now it's too late in our town.

AmarOk1412 commented 4 years ago

Yeah, np, I'm on it too. I will update the issue as soon as I have new information :)

Good night

AmarOk1412 commented 4 years ago

Ok, it seems that if I catch the exception (try/catch around remote_endpoint) and continue the function, the server keeps going. However, I haven't seen the exception logged since then, so I will confirm this tomorrow.
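
For reference, a rough sketch of that workaround (not the final patch; it assumes the acceptor internals linked above). Asio's remote_endpoint() also has a non-throwing overload taking an error code, so the accept path can skip a dead socket instead of unwinding:

// Hypothetical fragment inside accept_connection_for_socket_with_index():
std::error_code ec;
auto remote_endpoint = this->socket( i ).lowest_layer().remote_endpoint( ec );
if( ec )
{
    // The peer is already gone (e.g. ENOTCONN); drop this socket and
    // keep the accept loop alive instead of letting the exception escape.
    m_logger.warn( [&]{
        return fmt::format( "remote_endpoint failed: {}", ec.message() );
    } );
    return;
}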

eao197 commented 4 years ago

I can reproduce this problem by myself by introducing some error-producing code into RESTinio. I'll try to solve this issue.

eao197 commented 4 years ago

@AmarOk1412 can you try the updated version from https://github.com/Stiffstream/restinio/tree/0.6-dev-issue-8 branch?

AmarOk1412 commented 4 years ago

Oh, thanks for such quick feedback. I will try this version today.

AmarOk1412 commented 4 years ago

@eao197 So, yeah, I patched the server; this exception doesn't seem to occur anymore, and it seems more stable. However, the patch is applied on 20 services, and two are still blocking (somewhere in the HTTP server, because the service behind it is working correctly).

The backtrace is just useless:

Thread 6 (Thread 0x7ffff1368700 (LWP 2457)):          
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185                                                                                                                         
#1  0x00007ffff72427dd in asio::detail::posix_event::wait<asio::detail::conditionally_enabled_mutex::scoped_lock>(asio::detail::conditionally_enabled_mutex::scoped_lock&) (this=0x7fffec09a530, lock=...)
    at /usr/include/asio/detail/posix_event.hpp:105                          
#2  0x00007ffff726e3bb in asio::detail::conditionally_enabled_event::wait (this=0x7fffec09a528, lock=...) at /usr/include/asio/detail/conditionally_enabled_event.hpp:88
#3  0x00007ffff7270cd2 in asio::detail::scheduler::do_run_one (this=0x7fffec09a4c0, lock=..., this_thread=..., ec=...) at /usr/include/asio/detail/impl/scheduler.ipp:408
#4  0x00007ffff727077e in asio::detail::scheduler::run (this=0x7fffec09a4c0, ec=...) at /usr/include/asio/detail/impl/scheduler.ipp:153
#5  0x00007ffff7271165 in asio::io_context::run (this=0x7fffec008d60) at /usr/include/asio/impl/io_context.ipp:61
#6  0x00007ffff7151baf in asio::detail::resolver_service_base::work_io_context_runner::operator()() (this=0x7fffec040df8) at /usr/include/asio/detail/impl/resolver_service_base.ipp:31       
#7  0x00007ffff736876c in run (this=0x7fffec040df0) at /usr/include/asio/detail/posix_thread.hpp:85
#8  0x00007ffff7150d02 in asio_detail_posix_thread_function (arg=0x7fffec040df0) at /usr/include/asio/detail/impl/posix_thread.ipp:73
#9  0x00007ffff6ed04a4 in start_thread (arg=0x7ffff1368700) at pthread_create.c:456            
#10 0x00007ffff5c6fd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 5 (Thread 0x7ffff1b69700 (LWP 2438)):                               
#0  0x00007ffff5c70303 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84        
#1  0x00007ffff726f82a in asio::detail::epoll_reactor::run (this=0x5555557e6e60, usec=-1, ops=...) at /usr/include/asio/detail/impl/epoll_reactor.ipp:470
#2  0x00007ffff7270c0d in asio::detail::scheduler::do_run_one (this=0x5555557e6780, lock=..., this_thread=..., ec=...) at /usr/include/asio/detail/impl/scheduler.ipp:384
#3  0x00007ffff727077e in asio::detail::scheduler::run (this=0x5555557e6780, ec=...) at /usr/include/asio/detail/impl/scheduler.ipp:153
#4  0x00007ffff7271165 in asio::io_context::run (this=0x5555557e2a70) at /usr/include/asio/impl/io_context.ipp:61
#5  0x00007ffff7214752 in dht::DhtProxyServer::DhtProxyServer(std::pair<std::shared_ptr<dht::crypto::PrivateKey>, std::shared_ptr<dht::crypto::Certificate> >, std::shared_ptr<dht::DhtRunner>, unsigned short, std::__cxx11::basic_string<cha
r, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dht::Logger>)::{lambda()#2}::operator()() const [clone .lto_priv.1489] () at /root/ansible/src/opendht/src/dht_proxy_server.cpp:272
#6  0x00007ffff714887e in _M_invoke (this=0x5555557cfd88) at /usr/include/c++/6/functional:1391
#7  0x00007ffff7146de9 in operator() (this=0x5555557cfd88) at /usr/include/c++/6/functional:1380
#8  0x00007ffff7145050 in std::thread::_State_impl<std::_Bind_simple<dht::DhtProxyServer::DhtProxyServer(std::pair<std::shared_ptr<dht::crypto::PrivateKey>, std::shared_ptr<dht::crypto::Certificate> >, std::shared_ptr<dht::DhtRunner>, uns
igned short, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dht::Logger>)::{lambda()#2} ()> >::_M_run() [clone .lto_priv.463] (this=0x5555557cfd80) at /usr/include/c++/6/thread:197
#9  0x00007ffff64fae6f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff6ed04a4 in start_thread (arg=0x7ffff1b69700) at pthread_create.c:456
#11 0x00007ffff5c6fd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 3 (Thread 0x7ffff319a700 (LWP 2436)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x00007ffff729b774 in __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) [clone .lto_priv.3730] (__cond=0x5555557e30b8, __mutex=0x5555557e3088, __abs_timeout=0x7ffff31975d0)
    at /usr/include/x86_64-linux-gnu/c++/6/bits/gthr-default.h:871
#2  0x00007ffff727f07b in std::condition_variable::__wait_until_impl (this=0x5555557e30b8, __lock=..., __atime=...) at /usr/include/c++/6/condition_variable:166
#3  0x00007ffff727add1 in std::condition_variable::wait_until (this=0x5555557e30b8, __lock=..., __atime=...) at /usr/include/c++/6/condition_variable:119
#4  0x00007ffff7277232 in std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, dht::DhtRunner::run(dht::DhtRunner::Config const&, dht::DhtRunner::Context&&)::{lambd
a()#2}::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, dht::DhtRunner::run(dht::DhtRunner::Confi
g const&, dht::DhtRunner::Context&&)::{lambda()#2}::operator()() const::{lambda()#1}) [clone .lto_priv.3860] (this=0x5555557e30b8, __lock=..., __atime=..., __p=...) at /usr/include/c++/6/condition_variable:129
#5  0x00007ffff729e3b7 in dht::DhtRunner::run(dht::DhtRunner::Config const&, dht::DhtRunner::Context&&)::{lambda()#2}::operator()() const [clone .lto_priv.3438] () at /root/ansible/src/opendht/src/dhtrunner.cpp:185
#6  0x00007ffff725f6ec in std::_Bind_simple::_M_invoke (this=0x5555557c50c8) at /usr/include/c++/6/functional:1391
#7  0x00007ffff725f567 in std::_Bind_simple::operator() (this=0x5555557c50c8) at /usr/include/c++/6/functional:1380
#8  0x00007ffff725f34c in std::thread::_State_impl<std::_Bind_simple<dht::DhtRunner::run(dht::DhtRunner::Config const&, dht::DhtRunner::Context&&)::{lambda()#2} ()> >::_M_run() [clone .lto_priv.779] (this=0x5555557c50c0)
    at /usr/include/c++/6/thread:197
#9  0x00007ffff64fae6f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff6ed04a4 in start_thread (arg=0x7ffff319a700) at pthread_create.c:456                            
#11 0x00007ffff5c6fd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97  

And I don't see any exception from restinio or our code. So, for now, I can't really provide more details. The bug could also come from somewhere else.

eao197 commented 4 years ago

Thanks for the additional info. I'll investigate this issue further; there may be a mistake in RESTinio's code that leads to some kind of logical deadlock.

eao197 commented 4 years ago

However, the patch is applied on 20 services, and two are still blocking (somewhere in the HTTP server, because the service behind it is working correctly).

What do you mean by "blocking"? That the service doesn't accept new incoming connections? That it doesn't send data to already established connections?

Can you provide fragments of logs from blocked services?

AmarOk1412 commented 4 years ago

The service doesn't accept any new connections but is still sending data to current connections.

Do you want any particular logs? For now, when a service is not accepting new connections, there is nothing different in the logs, just:

[100277.293968] [proxy:server] [connection:10133] try to write response, while socket is closed
[100277.297841] [proxy:server] [connection:10133] try to write response, while socket is closed
[100378.008576] [proxy:server] [connection:10060] read socket error: Connection reset by peer; parsed bytes: 0
[100378.008654] [proxy:server] [connection:10061] read socket error: Connection reset by peer; parsed bytes: 0
[100378.008845] [proxy:server] [connection:10058] read socket error: Connection reset by peer; parsed bytes: 0
[100378.008899] [proxy:server] [connection:10059] read socket error: Connection reset by peer; parsed bytes: 0
[100464.511176] [proxy:server] [connection:9653] read socket error: Connection reset by peer; parsed bytes: 0
[100470.178069] [proxy:server] [connection:10404] try to write response, while socket is closed
[100477.049493] [proxy:server] [connection:10435] read socket error: Connection reset by peer; parsed bytes: 0
[100479.481203] [proxy:server] [connection:10444] read socket error: Connection reset by peer; parsed bytes: 0
[100484.064116] [proxy:server] [connection:729] read socket error: Connection timed out; parsed bytes: 0
[100504.494775] [proxy:server] [connection:10389] read socket error: Connection reset by peer; parsed bytes: 0
[100504.494819] [proxy:server] [connection:10391] read socket error: Connection reset by peer; parsed bytes: 0
[100504.495012] [proxy:server] [connection:10390] read socket error: Connection reset by peer; parsed bytes: 0
[100504.495048] [proxy:server] [connection:10392] read socket error: Connection reset by peer; parsed bytes: 0
[100531.168115] [proxy:server] [connection:681] read socket error: Connection timed out; parsed bytes: 0
[100533.216112] [proxy:server] [connection:693] read socket error: Connection timed out; parsed bytes: 0
[100553.696111] [proxy:server] [connection:753] read socket error: Connection timed out; parsed bytes: 0
[100770.797973] [proxy:server] [connection:10803] read socket error: End of file; parsed bytes: 2
[100771.850288] [proxy:server] [connection:10802] read socket error: End of file; parsed bytes: 2

But these logs also appear when everything is working.

eao197 commented 4 years ago

Thanks! I think this info can help.

ngrodzitski commented 4 years ago

Maybe a log starting from somewhere before the last successful accept would shed some light on the issue?

AmarOk1412 commented 4 years ago

Ok, so I added some logs, and I reproduced the problem.

These are the logs:

[273659.386618] [proxy:server] @@accept_current_connection begin                                               
[273659.386623] [proxy:server] @@accept_connection_for_socket_with_index begin                                 
[273659.386627] [proxy:server] @@do_accept_current_connection begin                                            
[273659.386631] [proxy:server] @@do_accept_current_connection end                                              
[273659.386633] [proxy:server] @@accept_connection_for_socket_with_index end                                   
[273659.386635] [proxy:server] @@accept_next begin                                                             
[273659.386638] [proxy:server] @@accept_next end                                                               
[273659.386640] [proxy:server] @@accept_current_connection end                                                 
[273659.387915] [proxy:server] @@accept_current_connection begin                                               
[273659.387920] [proxy:server] @@accept_connection_for_socket_with_index begin                                 
[273659.387925] [proxy:server] @@do_accept_current_connection begin                                            
[273659.387928] [proxy:server] @@do_accept_current_connection end                                              
[273659.387931] [proxy:server] @@accept_connection_for_socket_with_index end                                   
[273659.387933] [proxy:server] @@accept_next begin                                                         
[273659.387937] [proxy:server] @@accept_next end                                                           
[273659.387939] [proxy:server] @@accept_current_connection end                                                
[273659.388312] [proxy:server] @@accept_current_connection begin                                          
[273659.388316] [proxy:server] @@accept_connection_for_socket_with_index begin                                 
[273659.388319] [proxy:server] @@do_accept_current_connection begin                                            
[273659.388323] [proxy:server] @@do_accept_current_connection end                                              
[273659.388325] [proxy:server] @@accept_connection_for_socket_with_index end                                   
[273659.388328] [proxy:server] @@accept_next begin                                                            
[273659.388331] [proxy:server] @@accept_next end                                                              
[273659.388333] [proxy:server] @@accept_current_connection end                                                 
[273659.394440] [proxy:server] @@accept_current_connection begin                                               
[273659.394451] [proxy:server] @@accept_connection_for_socket_with_index begin                                 
[273659.394459] [proxy:server] @@do_accept_current_connection begin                                            
[273659.394466] [proxy:server] @@do_accept_current_connection end                                              
[273659.394469] [proxy:server] @@accept_connection_for_socket_with_index end                                   
[273659.394473] [proxy:server] @@accept_next begin                                                         
[273659.394481] [proxy:server] @@accept_next end                                                           
[273659.394485] [proxy:server] @@accept_current_connection end
# DOWN!
[273659.395147] [proxy:server] [connection:248053] read socket error: Connection reset by peer; parsed bytes: 0
[273659.481553] [proxy:server] [connection:248011] read socket error: Connection reset by peer; parsed bytes: 0
[273661.451500] [proxy:server] [connection:247006] try to write response, while socket is closed               
[273662.084901] [proxy:server] [connection:247550] try to write response, while socket is closed
[273662.263634] [proxy:server] [connection:247538] try to write response, while socket is closed
[273663.737180] [proxy:server] [connection:247651] try to write response, while socket is closed
[273663.892809] [proxy:server] [connection:247709] try to write response, while socket is closed
[273663.910638] [proxy:server] [connection:247677] try to write response, while socket is closed
[273664.083379] [proxy:server] [connection:247746] try to write response, while socket is closed               
[273664.258629] [proxy:server] [connection:247569] try to write response, while socket is closed               
[273664.747252] [proxy:server] [connection:247790] try to write response, while socket is closed               
[273665.735901] [proxy:server] [connection:247896] try to write response, while socket is closed               
[273665.932827] [proxy:server] [connection:247915] try to write response, while socket is closed

The patch I use:

 dev/restinio/impl/acceptor.hpp | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/dev/restinio/impl/acceptor.hpp b/dev/restinio/impl/acceptor.hpp
index 10eed75..e8c96d7 100644
--- a/dev/restinio/impl/acceptor.hpp
+++ b/dev/restinio/impl/acceptor.hpp
@@ -201,6 +201,10 @@ class acceptor_t final
        void
        open()
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@open begin");
+           } );
            if( m_acceptor.is_open() )
            {
                const auto ep = m_acceptor.local_endpoint();
@@ -248,7 +252,14 @@ class acceptor_t final
                        return fmt::format( "init accept #{}", i );
                    } );

+
+                   m_logger.warn( [&]{
+                       return fmt::format("@@call accept next");
+                   } );
                    accept_next( i );
+                   m_logger.warn( [&]{
+                       return fmt::format("@@call accept next end");
+                   } );
                }

                m_logger.info( [&]{
@@ -265,12 +276,20 @@ class acceptor_t final

                throw;
            }
+
+           m_logger.warn( [&]{
+               return fmt::format("@@open end");
+           } );
        }

        //! Close listener if any.
        void
        close()
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@close begin");
+           } );
            if( m_acceptor.is_open() )
            {
                close_impl();
@@ -281,6 +300,9 @@ class acceptor_t final
                    return fmt::format( "server already closed" );
                } );
            }
+           m_logger.warn( [&]{
+               return fmt::format("@@close end");
+           } );
        }

        //! Get an executor for close operation.
@@ -307,6 +329,10 @@ class acceptor_t final
        void
        accept_next( std::size_t i ) noexcept
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_next begin");
+           } );
            m_acceptor.async_accept(
                this->socket( i ).lowest_layer(),
                asio_ns::bind_executor(
@@ -317,6 +343,10 @@ class acceptor_t final
                            ctx->accept_current_connection( i, ec );
                        }
                    } ) );
+           
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_next end");
+           } );
        }

        //! Accept current connection.
@@ -330,6 +360,10 @@ class acceptor_t final
            std::size_t i,
            const std::error_code & ec ) noexcept
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_current_connection begin");
+           } );
            if( !ec )
            {
                restinio::utils::suppress_exceptions(
@@ -353,6 +387,10 @@ class acceptor_t final

            // Continue accepting.
            accept_next( i );
+
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_current_connection end");
+           } );
        }

        /*!
@@ -368,8 +406,14 @@ class acceptor_t final
            //! socket index in the pool of sockets.
            std::size_t i )
        {
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_connection_for_socket_with_index begin");
+           } );
+           
+           auto incoming_socket = this->move_socket( i );
+
            auto remote_endpoint =
-                   this->socket( i ).lowest_layer().remote_endpoint();
+                   incoming_socket.lowest_layer().remote_endpoint();

            m_logger.trace( [&]{
                return fmt::format(
@@ -379,7 +423,6 @@ class acceptor_t final

            // Since v.0.5.1 the incoming connection must be
            // inspected by IP-blocker.
-           auto incoming_socket = this->move_socket( i );
            const auto inspection_result = this->inspect_incoming(
                    incoming_socket );

@@ -403,6 +446,10 @@ class acceptor_t final
                        remote_endpoint );
            break;
            }
+
+           m_logger.warn( [&]{
+               return fmt::format("@@accept_connection_for_socket_with_index end");
+           } );
        }

        void
@@ -410,6 +457,10 @@ class acceptor_t final
            stream_socket_t incoming_socket,
            endpoint_t remote_endpoint )
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@do_accept_current_connection begin");
+           } );
            auto create_and_init_connection =
                [sock = std::move(incoming_socket),
                factory = m_connection_factory,
@@ -441,12 +492,19 @@ class acceptor_t final
            {
                create_and_init_connection();
            }
+           m_logger.warn( [&]{
+               return fmt::format("@@do_accept_current_connection end");
+           } );
        }

        //! Close opened acceptor.
        void
        close_impl()
        {
+
+           m_logger.warn( [&]{
+               return fmt::format("@@close_impl begin");
+           } );
            const auto ep = m_acceptor.local_endpoint();

            // An exception in logger should not prevent a call of close()
@@ -461,6 +519,10 @@ class acceptor_t final
            m_logger.info( [&]{
                return fmt::format( "server closed on {}", ep );
            } );
+
+           m_logger.warn( [&]{
+               return fmt::format("@@close_impl end");
+           } );
        }

        //! Server endpoint.

I am on top of master (https://github.com/aberaud/restinio/, a fork that allows custom methods).

eao197 commented 4 years ago

I don't see anything criminal here. It looks like there is some external problem with the port you use for accepting new connections. Maybe this port is blocked by a firewall, or another instance of the service has started and intercepts this address/port by using the REUSE_ADDR/REUSE_PORT options.
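
For illustration, this is how the REUSE_PORT scenario could look (a hypothetical, Linux-specific sketch; port 8080 is a placeholder). If a second process sets SO_REUSEPORT on the same port, the kernel distributes incoming connections between the listeners, so one of them may appear deaf:

#include <asio.hpp>
#include <sys/socket.h>

int main()
{
    asio::io_context io;
    asio::ip::tcp::acceptor acceptor( io );
    asio::ip::tcp::endpoint ep( asio::ip::tcp::v4(), 8080 );

    acceptor.open( ep.protocol() );
    acceptor.set_option( asio::socket_base::reuse_address( true ) );
    // SO_REUSEPORT has no portable Asio option; set it via the native handle.
    int one = 1;
    setsockopt( acceptor.native_handle(), SOL_SOCKET, SO_REUSEPORT,
            &one, sizeof( one ) );
    acceptor.bind( ep );
    acceptor.listen();
    // If a second process does the same, Linux load-balances incoming
    // connections between the two listeners.
}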

However, the advanced logging needs to be extended this way:

            auto create_and_init_connection =
                [sock = std::move(incoming_socket),
                factory = m_connection_factory,
                ep = std::move(remote_endpoint),
                logger = &m_logger]() mutable noexcept {
                    // NOTE: this code block shouldn't throw!
                    restinio::utils::suppress_exceptions(
                            *logger,
                            "do_accept_current_connection.create_and_init_connection",
                            [&] {
                                logger->warn([]{ return fmt::format("@@create_and_init_connection begin"); });
                                // Create new connection handler.
                                auto conn = factory->create_new_connection(
                                        std::move(sock), std::move(ep) );

                                // If connection handler was created,
                                // then start waiting for request message.
                                if( conn ) {
                                    conn->init();
                                    logger->warn([]{ return fmt::format("@@create_and_init_connection initialized"); });
                                }
                                logger->warn([]{ return fmt::format("@@create_and_init_connection end"); });
                            } );
                };

This is a fragment of the do_accept_current_connection() method.

AmarOk1412 commented 4 years ago

Yeah, me neither (nothing criminal).

Maybe this port is blocked by a firewall, or another instance of the service has started and intercepts this address/port by using the REUSE_ADDR/REUSE_PORT options.

This is very unlikely. There are not a lot of services installed, and if the service cannot bind to the port, it doesn't even start. The host is a server from OVH, so it's very unlikely that the firewall blocks port 80 or port 100, and when connections are not accepted, restarting the server immediately fixes it (so again, the port is not used by anything else).

I extended the logs. Need to wait for now.

eao197 commented 4 years ago

Need to wait for now.

OK. Let's see what happens next.

I wrote a small emulator to issue requests to a test RESTinio service, but there weren't any problems even after 200K requests were processed. So, I'm afraid, if there is an error in RESTinio, it only triggers in your setup.
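
For what it's worth, a minimal sketch of such an emulator (hypothetical; assumes standalone Asio, and host/port/request count are placeholders):

#include <asio.hpp>
#include <iostream>
#include <string>

int main()
{
    asio::io_context io;
    asio::ip::tcp::resolver resolver( io );
    // Placeholder host/port of the RESTinio service under test.
    const auto endpoints = resolver.resolve( "127.0.0.1", "8080" );

    for( int i = 0; i != 200000; ++i )
    {
        asio::ip::tcp::socket socket( io );
        asio::connect( socket, endpoints );

        const std::string request =
                "GET / HTTP/1.1\r\n"
                "Host: 127.0.0.1\r\n"
                "Connection: close\r\n\r\n";
        asio::write( socket, asio::buffer( request ) );

        // Read the whole reply until the server closes the connection.
        std::string reply;
        std::error_code ec;
        asio::read( socket, asio::dynamic_buffer( reply ), ec );
        if( ec && ec != asio::error::eof )
        {
            std::cerr << "request #" << i << " failed: "
                    << ec.message() << '\n';
            return 1;
        }
    }
}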

AmarOk1412 commented 4 years ago

I will log with those changes. If you don't see anything else, I will close the issue and continue to dig.

eao197 commented 4 years ago

I'm just trying to say that I can't find this issue without your help. So please don't close the issue until we find a solution (or the source of the problem).

It seems that there are the following cases:

  1. This is some system-related bug. It means that Asio and RESTinio work correctly and the source of the problem is somewhere inside the system or the system's settings.
  2. This is a consequence of some other bug. For example, memory corruption somewhere in the application that leads to invalidating some sockets' descriptors.
  3. This is a bug in Asio's code.
  4. This is a bug in RESTinio's code. Or an inability of RESTinio to handle some cases of Asio's or system's behavior.

I can't help if this is case no.1 or no.2. But if this is case no.3 or no.4, it can affect other users of RESTinio, and I want to find the reason. At least as long as you have some time for that.

AmarOk1412 commented 4 years ago

Latest logs:

[366601.865899] [proxy:server] @@accept_current_connection begin
[366601.865904] [proxy:server] @@accept_connection_for_socket_with_index begin
[366601.865910] [proxy:server] @@do_accept_current_connection begin
[366601.865914] [proxy:server] @@do_accept_current_connection end
[366601.865916] [proxy:server] @@accept_connection_for_socket_with_index end
[366601.865920] [proxy:server] @@accept_next begin
[366601.865925] [proxy:server] @@accept_next end
[366601.865928] [proxy:server] @@accept_current_connection end
[366601.865948] [proxy:server] @@create_and_init_connection begin
[366601.865966] [proxy:server] @@create_and_init_connection initialized
[366601.865970] [proxy:server] @@create_and_init_connection end
# DOWN!
[366601.877397] [proxy:server] [connection:563465] read socket error: Connection reset by peer; parsed bytes: 0
[366601.877458] [proxy:server] [connection:563467] read socket error: Connection reset by peer; parsed bytes: 0
[366602.165376] [proxy:server] [connection:563162] try to write response, while socket is closed
[366602.197852] [proxy:server] [connection:563425] read socket error: Connection reset by peer; parsed bytes: 0

eao197 commented 4 years ago

Thanks for the logs. I still don't see any problem with RESTinio's logic. So maybe it's time to dive into Asio's internals. Which version of Asio do you use?

Also, I want to clarify some things. It seems that you have several instances of the same RESTinio-based service. Those services are identical, and all of them run on different virtual machines (virtual servers). And only one or two instances show that problem. Are they always the same instances, or are they different instances on different VMs?

eao197 commented 4 years ago

when connections are not accepted

What diagnostic does a client receive when it tries to connect to a hanging service? Is the connection broken immediately with a reason like "connection refused by peer", or does the connection just time out without any incoming data?

AmarOk1412 commented 4 years ago

The services are identical and on the same instance. New connections are just timing out without any incoming data. One of the services receives way more requests than the others, which can explain why it hangs more easily, but the second one that hangs is no different from the rest of the services, so it's really weird.

The asio version we use is https://github.com/aberaud/asio/ (master), so it's basically 1.12.2 with some patches to support LibreSSL.

eao197 commented 4 years ago

New connections are just timing out without any incoming data.

Thanks for the info!

I think we can remove all the additional tracing code except this one:

        void
        do_accept_current_connection(
            stream_socket_t incoming_socket,
            endpoint_t remote_endpoint )
        {
            auto create_and_init_connection =
                [sock = std::move(incoming_socket),
                factory = m_connection_factory,
                ep = std::move(remote_endpoint),
                logger = &m_logger]() mutable noexcept {
                    // NOTE: this code block shouldn't throw!
                    restinio::utils::suppress_exceptions(
                            *logger,
                            "do_accept_current_connection.create_and_init_connection",
                            [&] {
                                // Create new connection handler.
                                auto conn = factory->create_new_connection(
                                        std::move(sock), std::move(ep) );

                                // If connection handler was created,
                                // then start waiting for request message.
                                if( conn ) {
                                    conn->init();
                                    logger->warn( [&]{ return fmt::format( "@@new_connection_initialized: {}",
                                            conn->connection_id() ); } );
                                }
                            } );
                };

The purpose of this code is to see the connection-id of a new connection in the log after the server hangs. I suspect that there can be lines like:

[366601.877397] [proxy:server] [connection:563465] read socket error: Connection reset by peer; parsed bytes: 0

for that connection-id.

AmarOk1412 commented 4 years ago

[459696.915797] [proxy:server] @@accept_connection_for_socket_with_index begin
[459696.915802] [proxy:server] @@do_accept_current_connection begin
[459696.915805] [proxy:server] @@do_accept_current_connection end
[459696.915807] [proxy:server] @@accept_connection_for_socket_with_index end
[459696.915809] [proxy:server] @@accept_next begin
[459696.915812] [proxy:server] @@accept_next end
[459696.915814] [proxy:server] @@accept_current_connection end
[459696.915840] [proxy:server] @@new_connection_initialized: 54302
[459696.919655] [proxy:server] @@accept_current_connection begin
[459696.919660] [proxy:server] @@accept_connection_for_socket_with_index begin
[459696.919666] [proxy:server] @@do_accept_current_connection begin
[459696.919672] [proxy:server] @@do_accept_current_connection end
[459696.919675] [proxy:server] @@accept_connection_for_socket_with_index end
[459696.919678] [proxy:server] @@accept_next begin
[459696.919685] [proxy:server] @@accept_next end
[459696.919688] [proxy:server] @@accept_current_connection end
[459696.919726] [proxy:server] @@new_connection_initialized: 54303
# DOWN!
[459944.932061] [proxy:server] [connection:7660] read socket error: Connection timed out; parsed bytes: 0
[459944.932122] [proxy:server] [connection:7605] read socket error: Connection timed out; parsed bytes: 0

AmarOk1412 commented 4 years ago

I see another difference on my side: the services that stop accepting connections after some time are not launched with systemctl; they run inside gdb. I'll launch the service via systemctl and launch another one in gdb, to see if the one in gdb stops working after some time.

eao197 commented 4 years ago

I think that a modification of that logging code (https://github.com/aberaud/restinio/blob/master/dev/restinio/impl/connection.hpp#L559-L567) can also be useful. Change it to:

                    m_logger.warn( [&]{
                        return fmt::format(
                                "[connection:{}] request received (#{}): {} {}",
                                connection_id(),
                                request_id,
                                http_method_str(
                                    static_cast<http_method>( parser.method ) ),
                                parser_ctx.m_header.request_target() );
                    } );

and we'll see whether a new incoming request is really read and parsed or not. If it is read but nothing is sent as a reply, then there is a problem in the request dispatching or request handling logic. If there are no "request received" lines for new incoming requests in the log, then the problem is in the I/O part.

AmarOk1412 commented 4 years ago

mmh

[525112.969642] [proxy:server] [connection:225263] @@ request received (#0): GET /                         
[525113.100842] [proxy:server] @@accept_current_connection begin                                                                
[525113.100851] [proxy:server] @@accept_connection_for_socket_with_index begin                                                  
[525113.100856] [proxy:server] @@do_accept_current_connection begin                                                             
[525113.100862] [proxy:server] @@do_accept_current_connection end                                          
[525113.100864] [proxy:server] @@accept_connection_for_socket_with_index end                               
[525113.100866] [proxy:server] @@accept_next begin                                                         
[525113.100869] [proxy:server] @@accept_next end                                                           
[525113.100871] [proxy:server] @@accept_current_connection end                                             
[525113.100905] [proxy:server] @@new_connection_initialized: 225264                                            
[525113.101643] [proxy:server] [connection:225253] @@ request received (#0): GET /                         
[525113.198214] [proxy:server] @@accept_current_connection begin                                           
[525113.198224] [proxy:server] @@accept_connection_for_socket_with_index begin                                 
[525113.198229] [proxy:server] @@do_accept_current_connection begin                                        
[525113.198233] [proxy:server] @@do_accept_current_connection end                                          
[525113.198235] [proxy:server] @@accept_connection_for_socket_with_index end                               
[525113.198238] [proxy:server] @@accept_next begin                                                         
[525113.198242] [proxy:server] @@accept_next end                                                           
[525113.198244] [proxy:server] @@accept_current_connection end                                                 
[525113.198290] [proxy:server] @@new_connection_initialized: 225265                                                        
[525113.201124] [proxy:server] [connection:225265] @@ request received (#0): POST /b52ec22ecdd48414d8c6c62c3136cb2cba747ccb
[525113.340196] [proxy:server] [connection:225238] read socket error: Connection reset by peer; parsed bytes: 0            
[525113.341693] [proxy:server] [connection:225241] @@ request received (#0): POST /e90dbec03fe749cba90255a999f777d62f4dd1b3
[525113.401816] [proxy:server] [connection:225264] @@ request received (#0): GET /     

Does that mean that the request is received and that it is blocked in the POST /... callback?

eao197 commented 4 years ago

Does that mean that the request is received and that it is blocked in the POST /... callback?

No. Not blocked. The request_handle is stored somewhere in your application, but create_response() for it is never called. The request handler returned control to RESTinio in the normal way; otherwise, there wouldn't be other log messages.
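
To illustrate (a hedged sketch, not your actual code; pending_requests is a hypothetical queue): a handler can accept a request, stash the handle for asynchronous processing, and return, and the client only gets an answer if something later completes the response:

#include <restinio/all.hpp>
#include <vector>

// Hypothetical storage for requests answered later by another component.
std::vector< restinio::request_handle_t > pending_requests;

restinio::request_handling_status_t handler( restinio::request_handle_t req )
{
    // Stash the handle; some other component is expected to answer later by:
    //   req->create_response().set_body( "done" ).done();
    // If that never happens, the connection stays open and the client
    // times out, exactly as observed above.
    pending_requests.push_back( req );

    // Control returns to RESTinio immediately; this is the normal way.
    return restinio::request_accepted();
}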

AmarOk1412 commented 4 years ago

Ok, so good news: I removed the services from gdb and used systemd instead... and I don't see the issue anymore. So my theory for now is that a signal is emitted under gdb (on some devices, SIGUSR1 is emitted). To confirm that, I will log received signals and will try to launch a service directly (without gdb or systemd).
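
A minimal sketch of such signal logging (hypothetical, not the service's actual code; only async-signal-safe calls inside the handler):

#include <signal.h>
#include <unistd.h>

// Async-signal-safe handler: write the two-digit signal number to stderr.
extern "C" void log_signal( int signo )
{
    char msg[] = "received signal 00\n";
    msg[ 16 ] = static_cast<char>( '0' + signo / 10 );
    msg[ 17 ] = static_cast<char>( '0' + signo % 10 );
    write( STDERR_FILENO, msg, sizeof( msg ) - 1 );
}

int main()
{
    struct sigaction sa {};
    sa.sa_handler = log_signal;
    sigemptyset( &sa.sa_mask );
    sa.sa_flags = SA_RESTART; // avoid interrupting blocking syscalls
    for( int s : { SIGUSR1, SIGUSR2, SIGHUP, SIGPIPE } )
        sigaction( s, &sa, nullptr );

    // ... start the service here ...
    pause(); // placeholder: wait for signals
}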

I don't think the issue is located in restinio anymore (this patch: https://github.com/Stiffstream/restinio/tree/0.6-dev-issue-8 resolves the original issue :) ).

eao197 commented 4 years ago

This patch: https://github.com/Stiffstream/restinio/tree/0.6-dev-issue-8 resolves the original issue

This patch is already in v.0.6.3, released today.

AmarOk1412 commented 4 years ago

I think we can close this, as it seems to be gdb-related rather than restinio-related now. I won't run the service in gdb for now; I will create a new instance when I have some time to dig into that. If I hit the issue again, I will just post a comment here.

Anyway, thank you for your help and quick answers!