cometbft / cometbft

CometBFT (fork of Tendermint Core): A distributed, Byzantine fault-tolerant, deterministic state machine replication engine
https://docs.cometbft.com
Apache License 2.0
626 stars 444 forks source link

consensus/logs: multiple "error adding vote" entries from the same peer #2551

Open czarcas7ic opened 6 months ago

czarcas7ic commented 6 months ago

Bug Report

Setup

CometBFT version (use cometbft version or git rev-parse --verify HEAD if installed from source): v0.37.4 from https://github.com/osmosis-labs/cometbft

Have you tried the latest version: no

Environment: Osmosis v23 mainnet

Config

No config changes

What happened?

Every now and then, we get error log spam (maybe 40 lines) of a single peer failing to process a VoteMessage. The time stamp slowly increments, so it seems like multiple attempts from the peer and not just a single function call.

What did you expect to happen?

If the peer fails to add a vote, it seems like this should be only attempted once and logged once. I could be misunderstanding the logic flow though, so what is happening here could make sense.

How to reproduce it

Running a node on Osmosis mainnet v23. It could have always been present, but our infra has just set up a betterstack logs instance, and I am going through all the errors that happen regularly and attempting to reduce the noise we see from all parts of the stack.

Logs

Screenshot 2024-03-09 at 1 08 22 PM

czarcas7ic commented 6 months ago

Also an important note, these error logs are showing maybe every 200 blocks or so, but every time it is the same peer.

faddat commented 6 months ago

@czarcas7ic the "same peer" part is very interesting.

This should be a validator, have you reached out?

czarcas7ic commented 6 months ago

The furthest I researched was this screenshot sadly, was hoping there was a more straight forward explanation for what was happening

cason commented 6 months ago

So, with Info log level, you can see the reason:

https://github.com/cometbft/cometbft/blob/v0.37.x/consensus/state.go#L2055-L2063

I agree that the operation here is not really great, as the returned error is standard, a generic "error adding vote", and this is the only information provided in Error level.

cason commented 6 months ago

If the peer fails to add a vote, it seems like this should be only attempted once and logged once.

So, when a node receives several invalid votes from the same peer, do you expect us to only produce a single Error log entry? I assume that once per peer and per height (and per vote type)?

cason commented 6 months ago

The point is that different errors can be provided for different votes. Also for this reason, it is not really trivial to only the first error of some type, vote type, height, etc.

cason commented 6 months ago

An alternative solution is really to punish peers that send us invalid votes: https://github.com/tendermint/tendermint/issues/1281

czarcas7ic commented 6 months ago

That would be the desire yes, both reducing the error to a single log and punishing the peer. From watching our logs, these typically come from the exact same peer every time, so maybe there could be some automatic temporary blacklisting that is stored (and continues to be stored even after node reboot).

If the former is not possible I think that is fine, as long as the later prevents them from continually causing this to appear in logs.

cason commented 6 months ago

I don't see any immediate action we can take in order to address this issue. It is not clear whether the peer that sends several invalid votes is misbehaving or if it is just slow and sending old votes.

We would need to improve the logging for this situation, as the produced Error level message is insufficient to understand the reason for the error.

czarcas7ic commented 6 months ago

That makes sense to me, so should the action item be to just increase logging in this area?