HelixNetwork / pendulum

Pendulum is a distributed messaging protocol that enables globally available, tamper-proof timestamps :hourglass_flowing_sand:
https://dev.hlx.ai

Fix 242 #244

Closed dzhelezov closed 4 years ago

dzhelezov commented 4 years ago

This fix tries to load the validator set from the database (#242).

dnck commented 4 years ago

Actually, I think I may have tracked down a possible source of confusion. There are two possible causes of the odd behavior when the node restarts with the db of another node.

It's worth the time to go through the protocol.

Every 10 minutes, we're zipping up the data directory of a running validator and sending it to s3.

Suppose now that @dt93 stops the validator, and replaces its data dir with an earlier state of the data dir.

Possible problem 1: @dzhelezov has suggested there may be an issue with copying the data dir of a running node. Perhaps the script copies the db while some important change is still being committed, so we end up with a corrupted db. The solution would be to make sure our copies of the db are isolated from ongoing changes. Obviously, we can't stop the node, let it shut down its db connections properly, and then cp the data dir, so we need something that will happen "in-place".
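To make "isolated from ongoing changes" a bit more concrete, here is a minimal sketch of one heuristic: fingerprint the data dir (size and mtime per file), copy it, fingerprint again, and only trust the copy if nothing moved in between. This is not how the backup script currently works and it is not a substitute for a database-level snapshot or checkpoint. The class and method names (`StableCopySketch`, `copyIfStable`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Pendulum.
class StableCopySketch {

    // Map each regular file (path relative to dir) to "size:lastModifiedMillis".
    static Map<String, String> fingerprint(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> out = new TreeMap<>();
        for (Path p : files) {
            out.put(dir.relativize(p).toString(),
                    Files.size(p) + ":" + Files.getLastModifiedTime(p).toMillis());
        }
        return out;
    }

    // Copy src into dst and report whether src looked unchanged during the copy.
    // A false result (or an IOException, e.g. a file vanished mid-copy) means
    // the copy should be discarded and retried.
    static boolean copyIfStable(Path src, Path dst) throws IOException {
        Map<String, String> before = fingerprint(src);
        for (String rel : before.keySet()) {
            Path target = dst.resolve(rel);
            Files.createDirectories(target.getParent());
            Files.copy(src.resolve(rel), target, StandardCopyOption.REPLACE_EXISTING);
        }
        // Same file set, sizes, and mtimes before and after the copy.
        return before.equals(fingerprint(src));
    }
}
```

A backup job could call `copyIfStable(dataDir, stagingDir)` in a small retry loop and only zip and upload the staging dir when it returns `true`. The check only catches writes that change a file's size or mtime, so it narrows the window rather than closing it.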

Possible problem 2: Notice I said that the script copies only the data dir of the validator and puts it in s3. What we fail to do, and this was my own oversight, is copy the current validator key file along with the data dir state. I notice now that when I do include the key file, I end up with a very different pattern in the log file for a restarted validator. The solution here is that I will rendezvous with @dt93 to make sure that when we restart a validator, we're doing so with the correct key file at some higher idx.
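As a rough illustration of keeping the key file and the data dir state together, the sketch below writes both into a single zip and refuses to produce a snapshot if the key file is missing. The class name, method name, and the `data/` and `key/` entry prefixes are assumptions made for the example, not the layout the existing script uses. The s3 upload is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Pendulum.
class ValidatorSnapshotSketch {

    // Zip every file under dataDir plus the validator key file into zipOut.
    // zipOut must live outside dataDir. Returns the entry names written.
    static List<String> snapshot(Path dataDir, Path keyFile, Path zipOut) throws IOException {
        // Fail early: a data dir without its matching key file is the bug described above.
        if (!Files.isRegularFile(keyFile)) {
            throw new IOException("validator key file missing: " + keyFile);
        }
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dataDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> entries = new ArrayList<>();
        try (OutputStream out = Files.newOutputStream(zipOut);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path p : files) {
                // Zip entry names always use forward slashes.
                String name = "data/" + dataDir.relativize(p).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(p, zip);
                zip.closeEntry();
                entries.add(name);
            }
            String keyName = "key/" + keyFile.getFileName();
            zip.putNextEntry(new ZipEntry(keyName));
            Files.copy(keyFile, zip);
            zip.closeEntry();
            entries.add(keyName);
        }
        return entries;
    }
}
```

Because the key file travels in the same archive, whoever restores the snapshot gets the data dir and the key that was current when it was taken, rather than pairing an old db with whatever key happens to be on the host.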

Now, back to the commit and PR. I see no reason why this shouldn't be merged. It works fine. However, I notice that I only ever see a return from the first conditional in recoverValidators:

        try {
            RoundViewModel latest = RoundViewModel.latest(tangle);
            if (latest == null) {
                log.debug("Latest round is null");
                return;
            }

So, I'm not sure that this is having the effect that is intended. Instead, I think the solution might be "devops" related. Sorry y'all.