Open czarcas7ic opened 6 months ago
Also an important note, these error logs are showing maybe every 200 blocks or so, but every time it is the same peer.
@czarcas7ic the "same peer" part is very interesting.
This should be a validator, have you reached out?
The furthest I researched was this screenshot sadly, was hoping there was a more straight forward explanation for what was happening
So, with Info log level, you can see the reason:
https://github.com/cometbft/cometbft/blob/v0.37.x/consensus/state.go#L2055-L2063
I agree that the operation here is not really great, as the returned error is standard, a generic "error adding vote", and this is the only information provided in Error level.
If the peer fails to add a vote, it seems like this should be only attempted once and logged once.
So, when a node receives several invalid votes from the same peer, do you expect us to only produce a single Error log entry? I assume that once per peer and per height (and per vote type)?
The point is that different errors can be provided for different votes. Also for this reason, it is not really trivial to only the first error of some type, vote type, height, etc.
An alternative solution is really to punish peers that send us invalid votes: https://github.com/tendermint/tendermint/issues/1281
That would be the desire yes, both reducing the error to a single log and punishing the peer. From watching our logs, these typically come from the exact same peer every time, so maybe there could be some automatic temporary blacklisting that is stored (and continues to be stored even after node reboot).
If the former is not possible I think that is fine, as long as the later prevents them from continually causing this to appear in logs.
I don't see any immediate action we can take in order to address this issue. It is not clear whether the peer that sends several invalid votes is misbehaving or if it is just slow and sending old votes.
We would need to improve the logging for this situation, as the produced Error level message is insufficient to understand the reason for the error.
That makes sense to me, so should the action item be to just increase logging in this area?
Bug Report
Setup
CometBFT version (use
cometbft version
orgit rev-parse --verify HEAD
if installed from source): v0.37.4 from https://github.com/osmosis-labs/cometbftHave you tried the latest version: no
Environment: Osmosis v23 mainnet
Config
No config changes
What happened?
Every now and then, we get error log spam (maybe 40 lines) of a single peer failing to process a VoteMessage. The time stamp slowly increments, so it seems like multiple attempts from the peer and not just a single function call.
What did you expect to happen?
If the peer fails to add a vote, it seems like this should be only attempted once and logged once. I could be misunderstanding the logic flow though, so what is happening here could make sense.
How to reproduce it
Running a node on Osmosis mainnet v23. It could have always been present, but our infra has just set up a betterstack logs instance, and I am going through all the errors that happen regularly and attempting to reduce the noise we see from all parts of the stack.
Logs