HelixNetwork / pendulum

Pendulum is a distributed messaging protocol that enables globally available tamper proof timestamps :hourglass_flowing_sand:
https://dev.hlx.ai

Key Rotation failing on some nodes #215

Closed dnck closed 4 years ago

dnck commented 4 years ago

This is a reprint from the Telegram group.

Here are the relevant log lines from the mainnet nodes:

https://pastebin.com/raw/f0sxHUqA

Description

On Oct 21, the syncCheck metric for main_relayer_1 in the mainnet started to increase substantially. This seems to be an issue related to the CandidateTrackerImpl.

Here's what happened on the relayers that remained in sync.

  1. Main_relayer_2 and Main_relayer_3 processed the candidate transaction:

002d043307d026c87fcfaedf292ec82de6e4c56b9d2d72db847d7514146d5754.

  2. Both relayers added the new candidate address:

d1bdb61f8a0e7291cf123b646377f5eef42a807ffb07509af0662652cc817b68.

  3. Afterwards, they both removed the candidate:

bcac8d8c67df165e74c0202ca4c04b672442a69cf4e6f1f928d28a8b79f844a0.

  4. Finally, they found the first processed candidate transaction VALID, and added the new address:

d1bdb61f8a0e7291cf123b646377f5eef42a807ffb07509af0662652cc817b68 for round 173052.

HOWEVER, the OUT OF SYNC node, main_relayer_1, ONLY DID step 1.

oracle58 commented 4 years ago

I think the issue might be that we are trying to get the tail before we check solidity, so the tail of the bundle is null. In MilestoneTracker, for instance, the analyzed transaction is the tail, hence this case need not be considered there.
Ideally, a candidate transaction should, like a milestone, be identifiable by its tail. I will commit a temporary fix in which false (which means the candidate is INCOMPLETE, i.e. not removed from the queue) is returned when the tail is null, and this will be logged. INCOMPLETE candidates are re-analyzed, and once the bundle is solid, the candidate transaction should be processed correctly and key rotation executed properly.
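
To make the proposed behaviour concrete, here is a minimal sketch of the "return false when the tail is null, then retry" idea. It is not the actual CandidateTrackerImpl code: the class name `CandidateRetrySketch`, the `tailByCandidate` map (a stand-in for whatever resolves a bundle's tail), the `pending` queue, and both method names are assumptions made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class CandidateRetrySketch {

    // Hypothetical stand-in for the tail lookup: candidate hash -> tail hash.
    // A missing entry models a bundle that is not solid yet.
    static final Map<String, String> tailByCandidate = new HashMap<>();

    // Hypothetical queue of candidate transactions waiting to be analyzed.
    static final Queue<String> pending = new ArrayDeque<>();

    // Returns true when the candidate was fully processed, and false when it
    // is INCOMPLETE and therefore has to stay in the queue.
    static boolean analyzeCandidate(String candidateHash) {
        String tail = tailByCandidate.get(candidateHash);
        if (tail == null) {
            // Tail not available yet: log it and report INCOMPLETE.
            System.out.println("Tail is null for candidate " + candidateHash + ", will re-analyze");
            return false;
        }
        // Placeholder: validation and key rotation would happen here.
        return true;
    }

    // One pass over the queue. INCOMPLETE candidates are put back so that a
    // later pass can re-analyze them. Returns how many were processed.
    static int processPending() {
        int processed = 0;
        int size = pending.size();
        for (int i = 0; i < size; i++) {
            String candidate = pending.poll();
            if (analyzeCandidate(candidate)) {
                processed++;
            } else {
                pending.add(candidate);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        pending.add("candidate-tx");
        // First pass: no tail is known, so the candidate stays queued.
        System.out.println("processed: " + processPending() + ", queued: " + pending.size());
        // The bundle becomes solid and the tail can now be resolved.
        tailByCandidate.put("candidate-tx", "tail-tx");
        // Second pass: the candidate is processed and leaves the queue.
        System.out.println("processed: " + processPending() + ", queued: " + pending.size());
    }
}
```

The point of the sketch is that a null tail no longer drops the candidate: it stays queued until the tail can be resolved, which is what should let a node like main_relayer_1 get past step 1.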