@karek314: I will give my explanation of why this fork occurs.
This type of fork is caused by a mismatch between the expected public key and the received block's generator public key when processing a block.
The following is an example of the error messages which are recorded when a node enters this type of fork:
Note: In 0.2.1, we improved the logging in this case to give more detailed information, and within this explanation I have added comments to the code.
{
  "level": "error",
  "message": "Expected generator: 7ac9d4b708fb19eaa200eb883be56601ddceed96290a3a033114750b7fda9d0b Received generator: fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0",
  "timestamp": "2016-05-07 03:40:44"
}
{
  "level": "info",
  "message": "Fork",
  "timestamp": "2016-05-07 03:40:44",
  "data": {
    "delegate": "fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0",
    "block": {
      "id": "13609465205005996164",
      "timestamp": 34054830,
      "height": 61118,
      "previousBlock": "1280968842307106697"
    },
    "cause": 3
  }
}
In this example, the node has received the following generator public key: fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0
The expected public key is determined like so:
private.getKeysSortByVote = function (cb) {
  // Select the top 101 delegates, sorted by vote (descending),
  // then by public key (ascending) to break ties
  modules.accounts.getAccounts({
    isDelegate: 1,
    sort: { "vote": -1, "publicKey": 1 },
    limit: slots.delegates
  }, ["publicKey"], function (err, rows) {
    if (err) {
      return cb(err); // Return early so cb is not called twice on error
    }
    cb(null, rows.map(function (el) {
      return el.publicKey;
    }));
  });
}
Delegates.prototype.generateDelegateList = function (height, cb) {
  private.getKeysSortByVote(function (err, truncDelegateList) {
    if (err) {
      return cb(err);
    }
    var seedSource = modules.round.calc(height).toString(); // Establish the current round as the seed source
    var currentSeed = crypto.createHash('sha256').update(seedSource, 'utf8').digest(); // Create the first sha256 seed from the current round
    // Iterate over the list of 101 delegates sorted by vote
    for (var i = 0, delCount = truncDelegateList.length; i < delCount; i++) {
      // Iterate over each byte in the current hash (up to the 4th byte)
      for (var x = 0; x < 4 && i < delCount; i++, x++) {
        var newIndex = currentSeed[x] % delCount; // Determine the new index
        var b = truncDelegateList[newIndex]; // Get the delegate currently at the new index
        // Swap delegate stack positions
        truncDelegateList[newIndex] = truncDelegateList[i];
        truncDelegateList[i] = b;
      }
      // Generate a new hash seed based on the current seed
      currentSeed = crypto.createHash('sha256').update(currentSeed).digest();
    }
    cb(null, truncDelegateList);
  });
}
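Note that the shuffle above is seeded purely by the round number, so any two nodes that agree on the vote-sorted input list will derive an identical delegate order; it is disagreement about the input ordering (i.e. the votes) that desynchronises them. To illustrate, here is a standalone sketch of the same shuffle using only Node's crypto module, with a hypothetical three-key input:

var crypto = require('crypto');

// Standalone sketch of the round-seeded shuffle above. Any node
// computing this for the same round and the same sorted input gets
// the identical result; the input keys here are hypothetical.
function shuffleForRound(round, sortedKeys) {
  var list = sortedKeys.slice(); // Copy so the input is not mutated
  var seed = crypto.createHash('sha256').update(round.toString(), 'utf8').digest();
  for (var i = 0, n = list.length; i < n; i++) {
    for (var x = 0; x < 4 && i < n; i++, x++) {
      var j = seed[x] % n; // Pick a swap target from the current seed byte
      var tmp = list[j];
      list[j] = list[i];
      list[i] = tmp;
    }
    seed = crypto.createHash('sha256').update(seed).digest(); // Re-seed
  }
  return list;
}

console.log(shuffleForRound(694, ['keyA', 'keyB', 'keyC']));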
Delegates.prototype.validateBlockSlot = function (block, cb) {
  // Generate the delegate list based on the current block height (round)
  self.generateDelegateList(block.height, function (err, activeDelegates) {
    if (err) {
      return cb(err);
    }
    // Get the slot number according to the given block timestamp
    var currentSlot = slots.getSlotNumber(block.timestamp); // 34054830 / 10 second block time = 3405483
    var delegate_id = activeDelegates[currentSlot % slots.delegates]; // 3405483 % 101 delegates = index 66
    // Expected: 7ac9d4b708fb19eaa200eb883be56601ddceed96290a3a033114750b7fda9d0b (66)
    // Received: fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0
    // Compare public keys
    if (delegate_id && block.generatorPublicKey == delegate_id) {
      return cb(); // Keys match
    } else {
      // Keys do not match
      library.logger.error("Expected generator: " + delegate_id + " Received generator: " + block.generatorPublicKey);
      return cb("Failed to verify slot: " + currentSlot);
    }
  });
}
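To make the slot arithmetic in the comments above concrete, here is a minimal standalone sketch, assuming this network's 10-second block time and 101 active delegates:

// Minimal sketch of the slot calculation from validateBlockSlot,
// assuming a 10-second block time and 101 active delegates.
var BLOCK_TIME = 10;  // seconds per slot
var DELEGATES = 101;  // active delegates per round

function getSlotNumber(timestamp) {
  // Seconds since the epoch divided by the block time gives the slot
  return Math.floor(timestamp / BLOCK_TIME);
}

var slot = getSlotNumber(34054830); // 3405483
var index = slot % DELEGATES;       // 66
// activeDelegates[66] is the expected generator for this block; any
// disagreement about the list ordering changes which key sits at 66.
console.log('slot:', slot, 'expected delegate index:', index);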
If the received and expected public keys do not match, the node enters a fork, designated cause 3.
I hope this is a full enough explanation of why this fork is occurring. I am currently working on a solution, but am posting this explanation to open the discussion and improve understanding.
Upon checking the database snapshot which everyone is seeding their nodes from, I've found a problem:
For example, as of writing, the current round is:
lisk_test=# select max(round) from mem_round;
max
-----
694
(1 row)
Rounds are normally applied and "flushed" when they end. However, this is the current state of the db:
lisk_test=# select round, count(*) as actions from mem_round group by round having round < 694;
-[ RECORD 1 ]
round | 137
actions | 192
-[ RECORD 2 ]
round | 4
actions | 1136
-[ RECORD 3 ]
round | 679
actions | 14499
Therefore, there are a total of 3 rounds which have not been applied properly. As a consequence, this affects delegate votes, which in turn affects the delegate listing sorted by vote mentioned in this issue, causing the number of forks we've been experiencing.
Taking one block which occurred in round 4 as an example: in block https://explorer.lisk.io/block/15972574090333109954 at height 376, there are 25 transactions, all of which are still present in mem_round
as 373 cascading actions. The block was generated and signed correctly, but its actions were not consistently applied across all nodes, especially on any which rebuilt from the snapshot.
Conclusions and next steps: I will verify all blocks in the current snapshot, replace the current one at downloads.lisk.io, and then later implement the required checking for the next release. Any snapshots I take in the meantime I will manually check for "unapplied" rounds.
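As a sketch of how the "unapplied rounds" check could be automated when taking snapshots, the following re-runs the query from this comment using the node-postgres (pg) client; the connection string is a placeholder and the script itself is illustrative, not part of Lisk:

var pg = require('pg');

// Placeholder connection string; point it at the snapshot database.
var client = new pg.Client('postgres://user:pass@localhost/lisk_test');

client.connect(function (err) {
  if (err) { throw err; }
  // Any round below the current maximum should have been flushed, so
  // remaining rows indicate an unapplied round.
  var sql = 'SELECT round, count(*) AS actions FROM mem_round ' +
            'GROUP BY round ' +
            'HAVING round < (SELECT max(round) FROM mem_round) ' +
            'ORDER BY round;';
  client.query(sql, function (err, result) {
    if (err) { throw err; }
    if (result.rows.length > 0) {
      console.error('Unapplied rounds found:', result.rows);
      process.exitCode = 1; // Fail so the snapshot is not published
    } else {
      console.log('No unapplied rounds; snapshot looks clean.');
    }
    client.end();
  });
});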
Interestingly, there were many forks while active delegates were changing (like yesterday). Provided that the numerous forks were due to this issue (which we don't know for sure), that would explain why many nodes are miscalculating the delegate votes.
As far as I understand, the mem_* tables are a shortcut to simplify and accelerate blockchain computation. I guess we should expect, over time, regular desynchronisation between nodes on these tables (for instance, when a SIGKILL occurs).
If a fork 3 happens (or indeed whenever an error check on the mem_* tables fails), recomputing these tables from scratch could be a good strategy.
This would explain why Crypti would be unstable for a while after a major change in voting from large transactions shuffled delegate ranks, though it eventually resolved itself. Time syncing has also been a chronic issue.
Another idea: remove the mem_round table and perform the computation in memory for the current round.
I believe we have found definitive proof of fork cause 3 occurring because of delegates shifting.
Here is an error message from earlier:
error 2016-05-09 08:07:33 Expected generator: e63756373d8bab04d6234851f4fc643ba77d7e8462aa14e89aa864ed0a9d3a4f Received generator: 6beaa7c569c1000f4fcef4ce3133b18609aea52adf95d5992970ea5e0cedda87
info 2016-05-09 08:07:33 Fork { delegate: '6beaa7c569c1000f4fcef4ce3133b18609aea52adf95d5992970ea5e0cedda87',
block:
{ id: '10723986915118145364',
timestamp: 34243170,
height: 75386,
previousBlock: '4610472515399194013' },
cause: 3 }
Below we can see two nodes, one healthy and the other not, and on the right the two delegates' public keys, which can be compared to the two wallets. It clearly shows that the delegates swapped places, and the public keys above belong to the same two delegates.
I'm not quite sure what can be done to alleviate this problem, but one solution I am proposing is the following:
At the beginning of the round, delegate votes are fixed in place. These votes would be collected among all nodes until the end of the round. On the final block of the round, these votes would be tallied and signed onto that block. It may be helpful for the next block afterwards to pause for longer than the normal block time (if that's possible) to allow this final block to propagate through the network. In this way, the shifting issue is resolved and every node is allowed a brief period to catch up and get fully in sync to generate the newest delegate list for the next block.
I agree, the nodes don't have the same information about the votes, especially if votes occur in the last blocks at the end of each round. So they generate an erroneous list of active delegates if the order changes at the end.
Either we prevent this fork from occurring, or we make it recoverable.
I expect @Isabello's solution to be heavy to implement, because it introduces a new type of block into the blockchain. However, waiting 60s to start the next round (i.e. having 60s between two cross-round blocks) could be enough to start the mainnet and prevent "enough" fork cause 3. We can then work on a better solution later.
I'll admit to not understanding every aspect of the code, but the solution mentioned sounds good to me. A new temporary table in the DB that is a static snapshot of the delegate list at the beginning of the round, and gets replaced at the end of each round. All the other code and structure can remain the same; just for forging, the temporary/static table of delegates would be used instead of the current data.
Was just talking to rabid about it. I could be missing something, but how much overhead can there be with one more table with 101 rows and 3 columns: rank, publicKey, and forging slot (since this is randomized)?
Edit: GreXX's suggestion of just another column in the current table, containing the previous round's rank as of that round's last block, seems more efficient than a whole new table.
Edit 2: Looking at the code, I wonder if it would be easier to wait to apply votes until the last or first block in a round, instead of recording the previous round's ranks.
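To make the static-snapshot idea above concrete, here is a rough in-memory sketch; the name roundStartLists and the cache shape are hypothetical, not actual Lisk internals, and a DB table as proposed would work similarly:

// Hypothetical sketch of freezing the delegate list at the start of a
// round, so mid-round vote changes cannot reorder it. `generate` would
// be the existing generateDelegateList; `modules.round.calc` is the
// same round calculation used above.
var roundStartLists = {}; // round number -> list frozen at round start

function getFrozenDelegateList(height, generate, cb) {
  var round = modules.round.calc(height);
  if (roundStartLists[round]) {
    // Reuse the list frozen when the round began, ignoring any vote
    // changes that arrived mid-round.
    return cb(null, roundStartLists[round]);
  }
  generate(height, function (err, list) {
    if (err) { return cb(err); }
    roundStartLists[round] = list;
    delete roundStartLists[round - 2]; // Evict finished rounds
    cb(null, list);
  });
}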
I'm thinking there should be 2 blockchains. One is the complete history: transactions of a complete forging round, voting, and delegate rank in a block, with a block cycle of once every forging round. The other holds the blocks forged by each delegate, going back a number of forging rounds for redundancy.
The permanent blockchain is verified in the background by the delegates, as a separate task from forging on the short-term temporary chain.
Voting and changes of delegate ranking seem to cause network instability, and snapshots are coming out invalid. Having a permanent blockchain with a long block time allows better validation and prevents partial, truncated, invalid data from corrupting the main chain. A working short-term chain going back several cycles for forging gives some redundancy and can be processed faster.
Namecoin and Blockstack both use a 2-step process for registering names on a blockchain. The first step is a "pre-order" that records the user's request on the blockchain but does not officially register it. After 6 blocks (BTC/NMC), the user must send a registration transaction to complete it. If the pre-order is not followed by the registration within Y blocks, the pre-order expires and is discarded. Similarly, a registration without a prior pre-order is ignored as invalid.
Perhaps a similar system for casting votes would help?
Both the "record" and "cast" transactions would have to be present, within the correct time spacing and sequence, to be valid.
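For illustration, a hypothetical sketch of that two-step validation applied to votes; the window lengths, transaction shape, and names are invented for the example:

// Hypothetical two-phase vote validation, modelled on the
// namecoin/blockstack pre-order flow described above.
var RECORD_MIN_AGE = 6;  // a "cast" must wait this many blocks
var RECORD_MAX_AGE = 60; // ...and the "record" expires after this many

// pendingRecords maps a vote hash to the height at which the "record"
// transaction was included in a block.
function validateCastVote(castTx, currentHeight, pendingRecords) {
  var recordedAt = pendingRecords[castTx.voteHash];
  if (recordedAt === undefined) {
    return 'Invalid: cast without a prior record';
  }
  var age = currentHeight - recordedAt;
  if (age < RECORD_MIN_AGE) {
    return 'Invalid: cast arrived before the record matured';
  }
  if (age > RECORD_MAX_AGE) {
    delete pendingRecords[castTx.voteHash];
    return 'Invalid: record expired before the cast';
  }
  delete pendingRecords[castTx.voteHash]; // Consume the record
  return null; // Valid
}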
There is something similar built into Lisk, though you don't see it:
What can happen (very unlikely) is that 2 usernames get registered in different areas of the network. After a few blocks, although the user saw a successful registration of his name, it will disappear (and his fees be returned) because there was no consensus.
We had a deep discussion recently. What came out: slots.getSlotNumber(block.timestamp) would become slots.getSlotNumber(block.roundrank), which is much more reliable.
Do nodes do any type of P2P timestamp comparison with each other, like in Bitcoin? https://en.bitcoin.it/wiki/Block_timestamp
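On the roundrank idea above, here is a hypothetical sketch contrasting the timestamp-based lookup with the proposed one (block.roundrank, a per-round block counter, is a proposed field, not an existing one):

// Current approach: the expected generator depends on the block
// timestamp, and therefore on clock synchronisation between nodes.
function expectedByTimestamp(block, activeDelegates) {
  var slot = Math.floor(block.timestamp / 10); // 10-second slots
  return activeDelegates[slot % activeDelegates.length];
}

// Proposed approach: `block.roundrank` is a hypothetical per-round
// block counter, making the lookup independent of wall clocks.
function expectedByRoundRank(block, activeDelegates) {
  return activeDelegates[block.roundrank % activeDelegates.length];
}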
Why not drop back a few blocks instead of relying on the current block?
jfeldis wrote:
namecoin and blockstack both use a 2 step process for registering names on a
blockchain...
I could see this being used with multisig also. A reject should be an option for multisig, not only approval, in my opinion. A message field giving the reason might be nice also.
As the issue has been closed a couple of times, I assume some of these fixes were applied in 0.2.3, but the issue still occurs. I'm aware it's in progress now, but maybe this will be helpful.
log 2016-05-22 14:07:39 GET /api/multisignatures/pending?publicKey=d912ed2548ad2428d7aa274ed5af3c4feab73ac0cc69502163f8348b5ec454dd from 10.0.0.2
info 2016-05-22 14:07:40 Loading 1439 blocks from 99.198.173.42:7000
error 2016-05-22 14:07:40 Expected generator: 80455bf1ac7dc974452b1880f4fe4f4af4c74d26ac060586d1ca42b63b82cb70 Received generator: 2e4d19a7eb30bc047a29d661c884672b5a4eaee1c212ef5469d5b4cad0cb6d64
info 2016-05-22 14:07:40 Fork { delegate: '2e4d19a7eb30bc047a29d661c884672b5a4eaee1c212ef5469d5b4cad0cb6d64',
block:
{ id: '12418910332365484318',
timestamp: 35369160,
height: 27076,
previousBlock: '5350555175686575766' },
cause: 3 }
warn 2016-05-22 14:07:40 Block 12418910332365484318 is not valid, ban 60 min 99.198.173.42:7000
warn 2016-05-22 14:07:40 Failed to verify slot: 3536916
warn 2016-05-22 14:07:40 Failed to load blocks, ban 60 min 99.198.173.42:7000
log 2016-05-22 14:07:41 POST /peer/blocks from 108.61.123.223
log 2016-05-22 14:07:41 POST /peer/blocks from 51.255.87.64
As far as I heard it's fixed, but I have just reproduced it with 0.3.0:
error 2016-05-25 08:14:09 Expected generator: 366cfff7cc7f96e16c512b9df04f5bc05b694cb253a46e52f1ac8d79be223b06 Received generator: 326544de0b5ae1d73c5b7c233f873de57b02fbec890c90847aa46e621a164b3d
info 2016-05-25 08:14:09 Fork { delegate: '326544de0b5ae1d73c5b7c233f873de57b02fbec890c90847aa46e621a164b3d',
block:
{ id: '15487781969400448893',
timestamp: 45810,
height: 3243,
previousBlock: '14749820397961569831' },
cause: 3 }
Here I have reproduced it twice, because of low internet connectivity. It seems that lagging can trigger fork 3 now.
Yes, I agree; I couldn't reproduce it on a machine with stable connectivity, but one with a poor internet connection enters forks very often. How about implementing a mechanism to remove the last block and its data when a node can't move on with the blockchain?
It is already done, but this case is much more complicated and in most cases involves a wrong database state.
@karmacoma @MaxKK can we close this bug? There is some internal documentation on improving network reliability to prevent such forks. Maybe we could open a discussion in the wiki instead, because it is difficult to keep track of this kind of issue in GitHub issues.
@MaxKK Agree, we will close here and focus on improving network reliability on a case-by-case basis. For example: https://github.com/LiskHQ/lisk/issues/213. We should also improve our fork documentation, either here: https://github.com/LiskHQ/lisk-docs/blob/development/Troubleshooting.md or in a wiki.
As far as I know, DPoS consensus is not BFT, so the solution is to improve the consensus layer. Under bad network conditions, DPoS will cause many nodes to see different new blocks (same height, different hash) produced by different delegate nodes. Sometimes bad network conditions can even prevent a block produced by a delegate from ever being confirmed, because the last irreversible block (LIB) needs (n-1)*2/3 confirmations by witness nodes, so in this condition the block will be suspended forever. So DPoS is not safe and is not BFT; Lisk, BitShares, and Steemit all have the same problem.
@MaxKK @karmacoma @fix
It has been fixed with bash lisk.sh rebuild, but Lisk couldn't automatically fix the db. I guess the issue happened during connectivity problems; I have reported a similar issue with the previous database engine.