EOSIO / eos

An open source smart contract platform
https://developers.eos.io/manuals/eos
MIT License

Solution: HTTP Service time-out - WebsocketPP Fix (Linux Kernel > 4.4) #2016

Closed roelandp closed 6 years ago

roelandp commented 6 years ago

I fired up a dedibox, following all requirements and preferred systems, so I expected smooth sailing, seeing so many nodes running already.

However, I then discovered that the web service / HTTP service was not able to respond and timed out after a while. I noticed that whenever I turned off nodeos, the HTTP service would immediately respond with 'unavailable', whilst when nodeos was running the request was just 'connecting' for a while (30 seconds or so) before timing out.

While I was browsing the source at https://github.com/EOSIO/eos/blob/d8db1d3a05e768f5459b46ace8d2bba92aab89d9/plugins/http_plugin/http_plugin.cpp#L218, it slowly started to remind me of a fix I found for the dreaded steem/graphene 'Timer Expired' error when trying to connect to the localhost RPC for the wallet: https://github.com/steemit/steem/issues/35#issuecomment-315463930

This appeared to happen only on kernels newer than 4.4.

So I manually adjusted the file libraries/fc/vendor/websocketpp/websocketpp/transport/asio/endpoint.hpp to reflect the fix in the websocketpp source (this fix is taken from the dev version of websocketpp, which is not the default included submodule version).

L37 ADD: #include <websocketpp/common/asio.hpp>

L95 (OR SOMEWHERE) REPLACE: , m_listen_backlog(0) with , m_listen_backlog(lib::asio::socket_base::max_connections)
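To illustrate what the backlog value controls, here is a minimal sketch using plain POSIX sockets rather than websocketpp or asio. The helper name open_listener is hypothetical and not part of any of the libraries mentioned. It passes the given backlog straight to listen(2), which is where a value like m_listen_backlog ultimately ends up. It is assumed here that asio's socket_base::max_connections corresponds to the platform's SOMAXCONN constant.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not from websocketpp or EOSIO): opens a TCP listener
// on 127.0.0.1 with a kernel-chosen port and the given listen backlog.
// Returns the socket descriptor, or -1 on failure.
int open_listener(int backlog) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port

    // The second argument to listen() is the backlog discussed in this issue:
    // 0 before the patch, SOMAXCONN (assumed) after it.
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        ::listen(fd, backlog) != 0) {
        ::close(fd);
        return -1;
    }
    return fd;
}
```

Calling open_listener(0) and open_listener(SOMAXCONN) on an affected box and then connecting to each with a client would be one way to compare the two behaviours outside of nodeos.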

This fixes the timeouts occurring with the HTTP service on kernels > 4.4.

I think you can also Cherry Pick the fix as summarized by Abit for Bitshares: https://github.com/bitshares/bitshares-core/issues/701

jgiszczak commented 6 years ago

Listen backlog is only an issue when there's very high traffic to a single listening port, causing queuing of inbound connections in the kernel. A single connection is not queued. System administrators deploying nodeos should set listen queue depth to an appropriate value themselves. Applications should not attempt to override that value. We anticipate that very busy nodes will deploy traditional http load balancers, obviating much of the need to set higher queue depths on machines running nodeos instances.

The comment in http_plugin.cpp you reference is currently spurious. The http_plugin uses an application-wide instance of boost::asio::io_context (formerly known as boost::asio::io_service), which does not stop running until the application exits, assuming at least one of net_plugin or http_plugin is configured.

roelandp commented 6 years ago

I found that my HTTP service would not run on my node, which also used to have 'Timer expired' errors (Bitshares / Steem -> localhost wallet connection) in the past. With the code modified as described above, it does work.

This is only for Kernel > 4.4. Please see this comment in the websocketpp (dev branch) fix commit:

After a change in Linux Kernel 4.4 the value of 0 causes all connections to be rejected rather than the default value being used. The default is now the asio::socket_base::max_connections value instead (which is the default asio uses when no value is provided).

https://github.com/zaphoyd/websocketpp/commit/0bb33e4bca4ccc42a36aa2321e4fb97f2562e519

And yeah, I noticed that the ilog("http io service exit"); message is always shown :)

jgiszczak commented 6 years ago

I've been running nodeos/eosiod/eosd on a post-4.4 kernel since August and it never times out either net_plugin or http_plugin. Currently running on 4.13 with no issues. I suggest investigating your hosting provider's settings. For instance, I have

$ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
1024

Your provider is defaulting to needlessly aggressive (and arguably wrong) settings if you're experiencing spurious SYN flooding warnings in the log and dropped packets.

roelandp commented 6 years ago

so you can reach the API endpoints when visiting?

roelandp commented 6 years ago

These are the ones I mean: http://mowgli.jungle3.eos.roelandp.nl:8765/v1/chain/get_info

roelandp commented 6 years ago

@jgiszczak it is not about SYN flooding warnings and dropped packets. I have the same output for tcp_max_syn_backlog.

I don't really want to go into an endless discussion. I think the commit in the official next version of websocketpp is pretty self-explanatory: https://github.com/zaphoyd/websocketpp/commit/0bb33e4bca4ccc42a36aa2321e4fb97f2562e519

Maybe I am mistakenly mixing up the HTTP service and the RPC API endpoint, and they are not the same thing?

In my case the HTTP API was not responding, and requests idled until the browser timed out. After I implemented the above fix, it worked instantly.

I am talking about the NODEIP:NODEPORT/v1/chain/get_info APIs 💤

roelandp commented 6 years ago

Lastly, here is another discussion about it: https://github.com/zaphoyd/websocketpp/issues/623

As far as I understand from your comments, you are way deeper into this, but I hope you can give it a shot and help me understand how I should revert to m_listen_backlog(0) and what I should change on my box instead. I still feel it is the correct fix, as the maintainer of websocketpp admits the error and it is also updated in the websocketpp 0.8 dev branch (EOS uses the current latest release, 0.7, from 2016).

m_listen_backlog(0) apparently instructs the system to use the default setting, but some kernels / boxes interpret it not as 'default' but as 'drop all'. As this is rather unpredictable, they changed it in the 0.8 dev branch to a new default in the code, max_connections; at least that is what I understand.

I left my comment here because this sometimes pops up; I have seen people hit it in the past with other graphene chains, and it really is a pretty annoying bug that goes unresolved in many cases because it is so deeply hidden in a submodule's library.

jgiszczak commented 6 years ago

Also check the allowed maximum connections:

$ cat /proc/sys/net/core/somaxconn
128
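The same checks can be done from code. The following is a minimal sketch, assuming a Linux procfs layout; the helper name read_int_setting is hypothetical and is not part of nodeos.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads a single integer from a procfs-style file such
// as /proc/sys/net/core/somaxconn. Returns `fallback` if the file cannot be
// opened or does not start with an integer.
long read_int_setting(const std::string& path, long fallback) {
    std::ifstream in(path);
    long value = 0;
    if (in >> value) return value;
    return fallback;
}
```

For example, read_int_setting("/proc/sys/net/core/somaxconn", -1) and read_int_setting("/proc/sys/net/ipv4/tcp_max_syn_backlog", -1) would return the two values shown with cat in this thread, or -1 on a system without those files.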

From reading the linked ticket, also disable ufw if your system is using it. It seems to be hyperaggressive about something it shouldn't be. Compose your own firewall rules with iptables if you need them.

I've spent a good deal of time with strace today, and I'm not entirely sure how websocketpp has been working on most everyone's system, including mine. I see the 0 argument in the system call, and my perusal of both the kernel source and glibc source leads me to believe it will be passed unmodified. I am reluctant to just arbitrarily patch websocketpp, but fortunately we're using an actual copy of it rather than a git submodule, so it can be done easily.

Given the still somewhat mysterious nature of the root cause, I've submitted a pull request with the recommended fix.

roelandp commented 6 years ago

@jgiszczak Please note the commit should also have added the asio include (as far as I understand from the websocketpp patch: https://github.com/zaphoyd/websocketpp/commit/0bb33e4bca4ccc42a36aa2321e4fb97f2562e519): #include <websocketpp/common/asio.hpp>

jgiszczak commented 6 years ago

ATC:

Run nodeos with strace from the build directory as follows:

strace -e trace=listen programs/nodeos/nodeos

Verify the following two lines appear:

listen(11, 128) = 0
listen(12, 128) = 0
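If the check needs to be automated, the strace lines can be parsed mechanically. This is a minimal sketch, assuming the output format shown above; the function name parse_listen_backlog is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: extracts the backlog argument from a strace line of
// the form "listen(<fd>, <backlog>) = <result>". Returns the backlog if the
// line matches and the call succeeded (result 0), otherwise -1.
int parse_listen_backlog(const std::string& line) {
    int fd = 0, backlog = 0, result = 0;
    int matched = std::sscanf(line.c_str(), "listen(%d, %d) = %d",
                              &fd, &backlog, &result);
    if (matched == 3 && result == 0) return backlog;
    return -1;
}
```

A return value of 0 for any listen line would indicate that the unpatched backlog is still in use; a positive value such as 128 indicates the patched default.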

jgiszczak commented 6 years ago

@roelandp Adding the include was not necessary. The constant was already available and was being used by the copy constructor, line 134.

andriantolie commented 6 years ago

ATC passes