Open dib542 opened 11 months ago
Further investigation using the code in PR #33 found that `info: keeping up: Unable to sync during poll` occurs not after a long pause or request, but after a very long chain of recursive fetch failures (error message: `fetch failed`), with each loop duration starting at ~15 seconds and increasing with the linear backoff to ~1 minute. With ~50 requests this would continue for about an hour, and the final loop exited with an error: `status: 500`, indicating that the service had stopped.
These timings were consistent with the times when the chain node being queried was at around 100% CPU usage and >30% memory usage. The `status: 500` query errors occur when the node reaches maximum memory usage, indicating that a possible out-of-memory error is causing the `status: 500` response.
Outside of these areas of high CPU and memory usage, the indexer was able to import transactions fine.
Doubling the CPU and memory of the node has stabilized it. The instance is now at 2 vCPU and 16 GB of memory.
The following images highlight the previous non-responsive times of the follower node before the upgrade of the instance.
During syncing in a cloud environment it can sometimes take a very long time for the import progress to reach fully synced. In the example below it seems to be fully synced after about 7 hours.
[nothing]
[nothing]
info: keeping up: Unable to sync during poll
error: Cannot read properties of undefined (reading 'txs')
[nothing]
info: keeping up: Unable to sync during poll
error: Cannot read properties of undefined (reading 'txs')
[nothing]
info: keeping up: Unable to sync during poll
error: Cannot read properties of undefined (reading 'txs')
[nothing]
info: keeping up: Unable to sync during poll
error: Cannot read properties of undefined (reading 'txs')
[nothing]
info: keeping up: Unable to sync during poll
error: Cannot read properties of undefined (reading 'txs')
info: keeping up: still polling ...
info: keeping up: still polling ...
info: keeping up: still polling ...
info: keeping up: still polling ...
My leading theory for why `error: Cannot read properties of undefined (reading 'txs')` is happening after an hour delay is that some of the network requests are being dropped somehow and only time out after almost an hour. I think that to solve this, there should be some logic to abort long-running requests during syncing.
Here are some charts of the fetching and processing times:
There is a clear cutoff between requests shorter than 11,082 ms and requests longer than 16,751 ms, for some odd reason. Requests approach 11 seconds in length and then may randomly take between 16 and 60 seconds, and 3 requests were observed to take more than 65 seconds.
It should be noted that the fetching times may include multiple fetches for multiple back-off retries. In particular, the first payload is 1 item in size, so its single request time must have covered timing out on a page of 20 items, then timing out on a page of 2 items, then receiving a page of 1 item in 480 seconds. It would have been helpful to collect data for just the last (successful) request time.
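A minimal sketch of how the two ideas above could fit together: aborting long-running requests, and timing each back-off attempt separately so that the last (successful) request time can be read on its own. All names here (`Attempt`, `PageFetcher`, `fetchWithPageBackoff`, `lastSuccessfulDurationMs`) are hypothetical and not taken from the indexer code. The page sizes 20, 2, 1 mirror the retries described above, and the timeout value is an assumption to be tuned.

```typescript
// Hypothetical types and names for illustration only; not from the indexer.
interface Attempt {
  pageSize: number;
  durationMs: number;
  ok: boolean;
}

// The fetcher must honor the AbortSignal for the timeout to take effect,
// for example by passing it through as fetch(url, { signal }).
type PageFetcher<T> = (pageSize: number, signal: AbortSignal) => Promise<T>;

// Try each page size in turn, abort any attempt that exceeds timeoutMs,
// and record the duration of every attempt separately.
async function fetchWithPageBackoff<T>(
  fetchPage: PageFetcher<T>,
  pageSizes: number[],
  timeoutMs: number,
): Promise<{ result?: T; attempts: Attempt[] }> {
  const attempts: Attempt[] = [];
  for (const pageSize of pageSizes) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    const startedAt = Date.now();
    try {
      const result = await fetchPage(pageSize, controller.signal);
      attempts.push({ pageSize, durationMs: Date.now() - startedAt, ok: true });
      return { result, attempts };
    } catch {
      // A timeout (abort) or a failed fetch both land here; try the next size.
      attempts.push({ pageSize, durationMs: Date.now() - startedAt, ok: false });
    } finally {
      clearTimeout(timer);
    }
  }
  return { attempts };
}

// Duration of only the final attempt, and only if it succeeded.
function lastSuccessfulDurationMs(attempts: Attempt[]): number | undefined {
  const last = attempts[attempts.length - 1];
  return last && last.ok ? last.durationMs : undefined;
}
```

Called with page sizes `[20, 2, 1]`, the returned `attempts` array would separate the time spent on timed-out pages from the time of the final request, instead of reporting one combined fetching time.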