sifnoded RPC denial of service for UI requests

Background

The issue described here was originally reported in ticket #1460, but was later moved to its own ticket when we discovered that it concerns a different scenario. Nevertheless, we should keep in mind that the two issues might be connected and that one might benefit from solutions/fixes for the other. Anybody working on this ticket should also read the comments on #1460 to get a better understanding of this issue.

The problem

@gzukel found out that sifnoded (v0.13.1 at the time) would sporadically return errors for RPC queries that are called by the UI. He was able to reproduce the problem by running 4 static RPC queries (HTTP POST for GetLiquidityProviderData, GetRewardParams, GetPools and GetPmtpParams). These queries do not change any state and are supposed to return a valid result at any time. However, if they are called in rapid succession (e.g. 100 times) by a number of parallel threads (e.g. 50), the queries start to return HTTP errors.

Under a light load (i.e. 1 thread, 5 s sleep time between requests) we did not see any errors. Any investigation should therefore focus on finding the root cause of the errors that appear under heavy load.
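When reproducing the problem, it may help to record what kind of failure each request produced instead of only counting failures. The following is a minimal sketch, not part of the original test script. The function name and bucket names are hypothetical, and the JSON layout it checks (`result.response.code`) is assumed from the `failed to write responses` log line quoted further down in this ticket.

```python
import json
import socket
from typing import Optional


def classify_outcome(status: Optional[int], body: Optional[str],
                     exc: Optional[BaseException] = None) -> str:
    """Map the outcome of one RPC request to a coarse bucket name.

    status/body describe an HTTP response; exc is the exception raised
    by the client if no response was received at all.
    """
    if exc is not None:
        # socket.timeout is only an alias of TimeoutError on newer Pythons,
        # so both are checked explicitly.
        if isinstance(exc, (TimeoutError, socket.timeout)):
            return "timeout"
        if isinstance(exc, ConnectionRefusedError):
            return "connection_refused"
        if isinstance(exc, ConnectionResetError):
            return "connection_reset"
        return "other_exception:" + type(exc).__name__
    if status != 200:
        return f"http_{status}"
    try:
        payload = json.loads(body or "")
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(payload, dict):
        return "invalid_json"
    if "error" in payload:
        # JSON-RPC 2.0 reports failures in a top-level "error" member.
        return "jsonrpc_error"
    result = payload.get("result")
    response = result.get("response") if isinstance(result, dict) else None
    code = response.get("code", 0) if isinstance(response, dict) else 0
    # Assumption: a non-zero "code" means the application rejected the query.
    return "abci_error" if code != 0 else "ok"
```

Feeding the returned bucket names into a `collections.Counter` per test run would show whether the failures are mostly timeouts, refused connections, or HTTP-level errors.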

What we observed

The issue does not seem to be caused by a lack of memory. Memory usage does not increase significantly under test load. The errors can (and do) happen when there is ample free memory available. The OS does not report any out-of-memory conditions and the sifnoded process does not die or get killed (this is the key difference from #1460).
The issue does not seem to be caused by a CPU bottleneck. When errors start happening, there are plenty of free CPU resources.
There does not seem to be a significant difference whether we run the load on a 4 CPU/8G RAM machine or a 36 CPU/72G RAM machine (this is also a differentiating factor from #1460).
The evidence that the problem is caused by a disk bottleneck is inconclusive. During test load, we did observe an increase in disk load above the baseline (which is expected), but there were no characteristic signs of saturation, i.e. the system was still responding in a timely fashion and within expected tolerances.

At the same time as the RPC endpoint starts returning errors, we see a significant increase in these messages in the sifnoded logs:

8:35AM INF Dialing peer address={"id":"fdaa88f2a0bacd93590d6ce8f0a9e584ec306afc","ip":"62.133.229.14","port":36656} module=p2p
8:35AM INF Dialing peer address={"id":"3e3307fe457940a8f5a3a4315401f55fe6c016db","ip":"18.211.58.165","port":26656} module=p2p
8:35AM ERR dialing failed (attempts: 1): dial tcp 62.133.229.14:36656: connect: connection refused addr={"id":"fdaa88f2a0bacd93590d6ce8f0a9e584ec306afc","ip":"62.133.229.14","port":36656} module=pex
8:35AM INF Starting Peer service impl="Peer{MConn{178.63.44.171:26656} 30f2c8299d132d8b10b07b85da6a97271e61bfe0 out}" module=p2p peer={"id":"30f2c8299d132d8b10b07b85da6a97271e61bfe0","ip":"178.63.44.171","port":26656}
8:35AM INF Starting MConnection service impl=MConn{178.63.44.171:26656} module=p2p peer={"id":"30f2c8299d132d8b10b07b85da6a97271e61bfe0","ip":"178.63.44.171","port":26656}
8:35AM ERR dialing failed (attempts: 1): dial tcp 18.211.58.165:26656: i/o timeout addr={"id":"3e3307fe457940a8f5a3a4315401f55fe6c016db","ip":"18.211.58.165","port":26656} module=pex

8:36AM INF minted coins from module account amount=112403081268641695195rowan from=mint module=x/bank
8:36AM INF Timed out dur=3000 height=6764540 module=consensus round=0 step=3
8:36AM INF minted coins from module account amount=225000000000000000000rowan from=dispensation module=x/bank

8:37AM ERR failed to write responses err="write tcp 172.31.26.58:26657->172.31.28.185:37288: i/o timeout" module=rpc-server res=[{"id":125454479185,"jsonrpc":"2.0","result":{"response":{"code":0,"codespace":"","height":"6764542","index":"0","info":"","key":null,"log":"","proofOps":null,"value":"..."}}}]

It should be noted that some of these errors (in particular connection refused) are part of normal/expected behaviour, but the increased frequency shows that there is a strong correlation with the problem caused by test load.

Other than that, we did not see any characteristic error messages in sifnoded logs.
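To quantify the increase in these messages rather than judging it by eye, the error lines could be tallied per minute and per module, once for a baseline run and once for a run under test load. This is a rough sketch that assumes the log line layout shown in the excerpt above (`<time> <LEVEL> <message> ... module=<name>`); the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed layout: "8:35AM ERR dialing failed ... module=pex"
LINE_RE = re.compile(r"^(?P<time>\d{1,2}:\d{2}[AP]M) (?P<level>[A-Z]{3}) (?P<rest>.*)$")
MODULE_RE = re.compile(r"\bmodule=(\S+)")


def tally_errors(lines: Iterable[str]) -> Counter:
    """Count ERR lines, keyed by (minute timestamp, module name)."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "ERR":
            continue  # skip INF lines and anything that does not parse
        module = MODULE_RE.search(match.group("rest"))
        key: Tuple[str, str] = (match.group("time"),
                                module.group(1) if module else "unknown")
        counts[key] += 1
    return counts
```

Comparing the counts for `pex` and `rpc-server` between the two runs would show how strongly the increase tracks the test load. The timestamps in the excerpt only have minute resolution, so that is the finest granularity this gives.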

Next things to do

Examine what exact errors are being returned over RPC, and if the return values provide any hint about why they are failing.
Investigate where exactly in the code the execution flow switches from "OK path" to "error path" and what is the condition that triggers this switch. Then investigate what specific circumstances and mechanisms trigger the switch.
Run the simulation with disks that have different IOPS (i.e. HDD vs. GP2 vs. provisioned GP3 vs. NVMe) and see if it makes any difference. Any difference would suggest that the issue is caused by a disk bottleneck, in which case the bottleneck itself should be investigated for the root cause.
Measure performance: use an EC2 instance that offers a good set of performance counters, in particular for disk and I/O. When running the tests so far, our instance did not report disk usage.
Any tests/investigations should be done against a fully synchronized, non-archive node that is not serving any other requests, and on a non-burstable, non-throttled machine (to minimize the influence/variance due to the environment).
Test the hypothesis of running out of system resources, such as network sockets, file descriptors, etc.
Test the requests at a slow rate, but with an external (non-sifnoded) disk load. If we see the same errors, it would suggest that they are caused by disk. Again, this should be investigated for root cause.
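For the system resources hypothesis, one option is to sample the number of open file descriptors and sockets of the sifnoded process while the test load is running and compare them with the process limit. The sketch below is Linux-only and reads `/proc`; the function names are hypothetical, the PID has to be supplied by the caller, and reading another user's process usually requires running as that user or as root.

```python
import os
from typing import Dict, Optional


def parse_open_files_limit(limits_text: str) -> Optional[int]:
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # columns: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    return None


def resource_snapshot(pid: int) -> Dict[str, Optional[int]]:
    """Return open fd count, socket count and soft fd limit for a process."""
    fd_dir = f"/proc/{pid}/fd"
    names = os.listdir(fd_dir)
    sockets = 0
    for name in names:
        try:
            if os.readlink(os.path.join(fd_dir, name)).startswith("socket:"):
                sockets += 1
        except OSError:
            continue  # fd was closed between listing and readlink
    with open(f"/proc/{pid}/limits") as limits_file:
        soft_limit = parse_open_files_limit(limits_file.read())
    return {"open_fds": len(names), "sockets": sockets, "soft_limit": soft_limit}
```

Calling `resource_snapshot(pid)` every second or so during a run, and checking whether `open_fds` approaches `soft_limit` at the moment the RPC errors start, would either support or rule out descriptor exhaustion.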

How to get the test load script

For the time being, I have not committed the test load script to any public repository due to the risk of abuse. The original script can be obtained from ChainOps, whereas a slightly improved version (with command-line parametrization of the URL) is also available on request from @jzvikart.

Sifchain / sifnode