duality-labs / hapi-indexer

A Node.js based indexer for the Duality Cosmos chain

Investigate possible network issues when syncing indexer in cloud environment #31

Open dib542 opened 11 months ago

dib542 commented 11 months ago

During syncing in a cloud environment, it can sometimes take a very long time for the import progress to become fully synced. In the example below it appears to reach fully synced after about 7 hours.

My leading theory for why error: Cannot read properties of undefined (reading 'txs') appears after roughly an hour's delay is that some of the network requests are being dropped and only time out after almost an hour. To solve this, there should be some logic to abort long-running requests during syncing.
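For illustration, a minimal sketch of that abort logic, assuming Node 18+ (global fetch and AbortSignal.timeout); the timeout value and fetchTxPage name are placeholders, not the indexer's actual API:

```ts
// Illustrative sketch only: abort sync fetches that hang instead of letting
// them run for close to an hour. REQUEST_TIMEOUT_MS and fetchTxPage are
// placeholder names, not the indexer's actual code.
const REQUEST_TIMEOUT_MS = 30_000;

async function fetchTxPage(url: string): Promise<unknown> {
  // AbortSignal.timeout() (Node 17.3+) aborts the request after the delay,
  // turning a silently dropped connection into an immediate, retryable error.
  const response = await fetch(url, {
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });
  if (!response.ok) {
    throw new Error(`status: ${response.status}`);
  }
  return response.json();
}
```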

Here are some charts of the fetching and processing times:

[charts: fetching and processing times during sync]

There is a clear cutoff between requests shorter than 11,082 ms and requests longer than 16,751 ms, for reasons that are not obvious. Request durations approach 11 seconds, then may jump to anywhere between 16 and 60 seconds, and 3 requests were observed to take more than 65 seconds.

It should be noted that the fetching times may include multiple fetches across multiple back-off retries. In particular, the first payload is only 1 item in size, so its single recorded fetch time of 480 seconds must have covered timing out on a page of 20 items, then timing out on a page of 2 items, then finally receiving a page of 1 item. It would have been helpful to collect data for just the last (successful) request time.
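A minimal sketch of how that could be measured, assuming a generic retry wrapper; fetchWithRetries and its shape are hypothetical, not the indexer's existing retry code:

```ts
// Illustrative sketch only: record the duration of the last (successful)
// attempt separately from the total time spent across back-off retries.
async function fetchWithRetries<T>(attempt: () => Promise<T>, maxRetries = 3) {
  const totalStart = Date.now();
  let lastError: unknown;
  for (let tries = 0; tries <= maxRetries; tries += 1) {
    const attemptStart = Date.now();
    try {
      const result = await attempt();
      return {
        result,
        lastAttemptMs: Date.now() - attemptStart, // the figure worth charting
        totalMs: Date.now() - totalStart, // includes timed-out earlier attempts
      };
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```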

dib542 commented 11 months ago

Further investigation using the code in PR #33 found that info: keeping up: Unable to sync during poll occurs not after a single long pause or request, but after a very long run of fetch failures (error message: fetch failed), with each loop iteration starting at ~15 seconds and the linear back-off increasing to ~1 minute. At ~50 requests this continues for about an hour, and the final loop exited with error: status: 500, indicating the service had stopped.

These timings coincided with periods when the chain node being queried was at around 100% CPU usage and >30% memory usage.

[chart: chain node CPU and memory usage]

The status: 500 errors occur when the node reaches maximum memory usage, suggesting a possible out-of-memory error is causing the status: 500 responses.

Outside of these periods of high CPU and memory usage, the indexer was able to import transactions fine.
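Given that the hour-long stall comes from ~50 back-off iterations rather than one long request, one possible mitigation is to cap the total time budget of the back-off loop so a failing node surfaces Unable to sync during poll within minutes. A sketch of that idea, with placeholder constants and a hypothetical retryingFetch helper rather than the indexer's actual code:

```ts
// Illustrative sketch only: bound the total time spent in the back-off loop
// so a failing node produces "Unable to sync during poll" in minutes, not an
// hour. Constants and the retryingFetch name are placeholders.
const BACKOFF_START_MS = 15_000;
const BACKOFF_STEP_MS = 5_000;
const BACKOFF_MAX_MS = 60_000;
const TOTAL_BUDGET_MS = 5 * 60_000;

async function retryingFetch(url: string): Promise<Response> {
  const deadline = Date.now() + TOTAL_BUDGET_MS;
  let delay = BACKOFF_START_MS;
  while (Date.now() < deadline) {
    try {
      const response = await fetch(url);
      if (response.ok) return response;
      // e.g. status: 500 from an overloaded node: fall through and back off
    } catch {
      // "fetch failed": node unreachable; fall through and back off
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay + BACKOFF_STEP_MS, BACKOFF_MAX_MS); // linear back-off
  }
  throw new Error('Unable to sync during poll: retry budget exhausted');
}
```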

dib542 commented 11 months ago

Doubling the node's CPU and memory has stabilized it. The instance is now at 2 vCPU and 16 GB of memory.

The following images highlight the non-responsive periods of the follower node before the instance upgrade.

[charts: follower node non-responsive periods before the upgrade]