duality-labs / hapi-indexer

A Node.js based indexer for the Duality Cosmos chain

Investigate possible network issues when syncing indexer in cloud environment #31

Open dib542 opened 11 months ago

dib542 commented 11 months ago

During syncing in a cloud environment, it can sometimes take a very long time for the import progress to become fully synced. In the example below it appears to reach fully synced after about 7 hours.

My leading theory for why error: Cannot read properties of undefined (reading 'txs') appears after roughly an hour's delay is that some of the network requests are being dropped and only time out after almost an hour. To solve this, there should be some logic to abort long-running requests during syncing.
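For illustration, a minimal sketch of that abort logic, assuming Node 18+ (global fetch and AbortSignal.timeout); the timeout value and fetchTxPage name are placeholders, not the indexer's actual API:

```ts
// Illustrative sketch only: abort sync fetches that hang instead of letting
// them run for close to an hour. REQUEST_TIMEOUT_MS and fetchTxPage are
// placeholder names, not the indexer's actual code.
const REQUEST_TIMEOUT_MS = 30_000;

async function fetchTxPage(url: string): Promise<unknown> {
  // AbortSignal.timeout() (Node 17.3+) aborts the request after the delay,
  // turning a silently dropped connection into an immediate, retryable error.
  const response = await fetch(url, {
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });
  if (!response.ok) {
    throw new Error(`status: ${response.status}`);
  }
  return response.json();
}
```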

Here are some charts of the fetching and processing times:

[charts: fetching and processing times during sync]

There is a clear cutoff between requests shorter than 11,082 ms and requests longer than 16,751 ms, for reasons that are not obvious. Request durations approach 11 seconds, then may jump to anywhere between 16 and 60 seconds, and 3 requests were observed to take more than 65 seconds.

It should be noted that the fetching times may include multiple fetches across multiple back-off retries. In particular, the first payload is only 1 item in size, so its single recorded fetch time of 480 seconds must have covered timing out on a page of 20 items, then timing out on a page of 2 items, then finally receiving a page of 1 item. It would have been helpful to collect data for just the last (successful) request time.
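A minimal sketch of how that could be measured, assuming a generic retry wrapper; fetchWithRetries and its shape are hypothetical, not the indexer's existing retry code:

```ts
// Illustrative sketch only: record the duration of the last (successful)
// attempt separately from the total time spent across back-off retries.
async function fetchWithRetries<T>(attempt: () => Promise<T>, maxRetries = 3) {
  const totalStart = Date.now();
  let lastError: unknown;
  for (let tries = 0; tries <= maxRetries; tries += 1) {
    const attemptStart = Date.now();
    try {
      const result = await attempt();
      return {
        result,
        lastAttemptMs: Date.now() - attemptStart, // the figure worth charting
        totalMs: Date.now() - totalStart, // includes timed-out earlier attempts
      };
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```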

dib542 commented 11 months ago

Further investigation using the code in PR #33 found that info: keeping up: Unable to sync during poll occurs not after a single long pause or request, but after a very long run of fetch failures (error message: fetch failed), with each loop iteration starting at ~15 seconds and the linear back-off increasing to ~1 minute. At ~50 requests this continues for about an hour, and the final loop exited with error: status: 500, indicating the service had stopped.

These timings coincided with periods when the chain node being queried was at around 100% CPU usage and >30% memory usage.

[chart: chain node CPU and memory usage]

The status: 500 errors occur when the node reaches maximum memory usage, suggesting a possible out-of-memory error is causing the status: 500 responses.

Outside of these periods of high CPU and memory usage, the indexer was able to import transactions fine.
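Given that the hour-long stall comes from ~50 back-off iterations rather than one long request, one possible mitigation is to cap the total time budget of the back-off loop so a failing node surfaces Unable to sync during poll within minutes. A sketch of that idea, with placeholder constants and a hypothetical retryingFetch helper rather than the indexer's actual code:

```ts
// Illustrative sketch only: bound the total time spent in the back-off loop
// so a failing node produces "Unable to sync during poll" in minutes, not an
// hour. Constants and the retryingFetch name are placeholders.
const BACKOFF_START_MS = 15_000;
const BACKOFF_STEP_MS = 5_000;
const BACKOFF_MAX_MS = 60_000;
const TOTAL_BUDGET_MS = 5 * 60_000;

async function retryingFetch(url: string): Promise<Response> {
  const deadline = Date.now() + TOTAL_BUDGET_MS;
  let delay = BACKOFF_START_MS;
  while (Date.now() < deadline) {
    try {
      const response = await fetch(url);
      if (response.ok) return response;
      // e.g. status: 500 from an overloaded node: fall through and back off
    } catch {
      // "fetch failed": node unreachable; fall through and back off
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay + BACKOFF_STEP_MS, BACKOFF_MAX_MS); // linear back-off
  }
  throw new Error('Unable to sync during poll: retry budget exhausted');
}
```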

dib542 commented 11 months ago

Doubling the node's CPU and memory has stabilized it. The instance is now at 2 vCPU and 16 GB of memory.

The following images highlight the non-responsive periods of the follower node before the instance upgrade.

[charts: follower node non-responsive periods before the upgrade]