Eth1provider logging is broken

nflaig commented 6 months ago

Describe the bug

It has been reported by multiple users (1, 2) that in some cases we spam the logs with

Eth1Provider is back online oldState=ERROR, newState=ONLINE

but the actual error is not logged.

The spam happens because some calls work and some fail, it's likely from the eth1 deposit data tracker.

See related comments

Expected behavior

[ ] We should not spam the logs obviously but we also need to make sure that we properly log errors. It does not make sense to change the state because of a single failed request, good state should be ONLINE or OFFLINE but error it not that helpful, or if we track that state, we need to think about how it is updated as a single successful request changes it to ONLINE again and so on, causing the logs to be spammed with "back online" log above.
[x] log errors (once a minute) even if errors happen more frequently to provide feedback to user that there is a persistent issue https://github.com/ChainSafe/lodestar/pull/6863

Steps to reproduce

Based on reports, these two setups can be used to reproduce the issue

run Lodestar with a syncing execution client, preferably Nethermind as issue has been observed there
or run openexecution as a middleware to connect to EL client

Additional context

Operating system

Linux

Lodestar version or commit hash

v1.16.0

philknows commented 3 months ago

Can confirm that this is still happening with Lodestar v1.18.0 and Nethermind v1.26.0.

~~In addition, the logs also include the timeout request errors such as this one:~~

May-13 10:39:32.918[eth1]            error: Eth1Provider faced error state=ERROR, lastErrorAt=5:44:59 AM - Timeout request
Error: Timeout request
    at JsonRpcHttpClient.fetchJsonOneUrl (file:///usr/app/packages/beacon-node/src/eth1/provider/jsonRpcHttpClient.ts:300:17)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at runNextTicks (node:internal/process/task_queues:64:3)
    at listOnTimeout (node:internal/timers:540:9)
    at processTimers (node:internal/timers:514:7)
    at JsonRpcHttpClient.fetchJson (file:///usr/app/packages/beacon-node/src/eth1/provider/jsonRpcHttpClient.ts:227:16)
    at wrapWithEvents.payload.payload (file:///usr/app/packages/beacon-node/src/eth1/provider/jsonRpcHttpClient.ts:146:37)
    at JsonRpcHttpClient.wrapWithEvents (file:///usr/app/packages/beacon-node/src/eth1/provider/jsonRpcHttpClient.ts:206:24)
    at Eth1Provider.getBlockNumber (file:///usr/app/packages/beacon-node/src/eth1/provider/eth1Provider.ts:186:28)
    at Eth1DepositDataTracker.update (file:///usr/app/packages/beacon-node/src/eth1/eth1DepositDataTracker.ts:197:32)

nflaig commented 3 months ago

In addition, the logs also include the timeout request errors such as this one:

This is a different issue, see https://github.com/ChainSafe/lodestar/issues/5935. The log message changed since we refactored the Eth1provider code, but the error is not really related to what is reported in this issue.

nazarhussain commented 3 months ago

The only workflow which can trigger such spam is that eth1 providers identified as providerUrls return error/response alternatively. a.k.a. one request return valid response and next request return error response. This way you will see the Eth1Provider is back online very often than the error case, because we log the error once per minute.

nflaig commented 3 months ago

a.k.a. one request return valid response and next request return error response

yeah that's the issue, we need to change the state change logic to not spam the logs, this scenario happens if nethermind is still syncing as it can serve only a subset of requests, doesn't mean it's offline though

because we log the error once per minute.

if we have continuous errors (frequency < 1min) then we will never log the error again due to a bug in isOneMinutePassed.

I guess these are two separate issues and could be fixed independently.

nazarhussain commented 3 months ago

Let's consolidate discussion here in this issue, otherwise it will lost on the merge PRs.

I went over the discussions I had for it's implementation, that was the inspiration or this behavior:

also maybe every minute log the error message as well if the eth1 is still down ... ( i.e. if we make the request and its still error and the last error display message > 60 seconds , log the message and update last error display time)

So yes there is a bug in the implementation, but not in the createElapsedTimeTracker. That one seems to be good as what is intended for, it's just tracking time since first invocation.

For eth1 provider error we should do via rate limiting helpers for rate limit error per minute.

nflaig commented 3 months ago

So yes there is a bug in the implementation, but not in the createElapsedTimeTracker. That one seems to be good as what is intended for, it's just tracking time since first invocation.

See the test case here

I would expect isOneMinutePassed to be a rate limiter to log an error max once a minute, but current behavior is to log only once if there are continuous errors with a frequency < 1min. It only will log an error again if there was no error for longer than 1min.

nazarhussain commented 3 months ago

The behavior or the createElapsedTimeTracker is fine, there is no reset behavior for it. It just track time since it's first creation.

Create a tracker which keeps track of the last time a function was called

So if the error appears in first 1 minute of creation of the provider it will not be logged. So we need to change this and instead of using createElapsedTimeTracker use some frequency tracker here in the provider.

nflaig commented 3 months ago

So if the error appears in first 1 minute of creation of the provider it will not be logged

this is not true, there is even a test case for it https://github.com/ChainSafe/lodestar/blob/49c16890a97cc44ea280acfae6712913e5091879/packages/utils/test/unit/waitFor.test.ts#L40

I am pretty sure it will trigger the error log if it's called 1 minute after instantiation

The problem is really that intermediate calls reset the timer, which does not make sense to me. Because the logic of isOneMinutePassed should not be "has one minute passed since I was last invoked" but rather "has one minute passed since I last returned true".

ChainSafe / lodestar