Remove artificial limitation in block production

IntersectMBO / ouroboros-consensus

Implementation of a Consensus Layer for the Ouroboros family of protocols

https://ouroboros-consensus.cardano.intersectmbo.org

Apache License 2.0

31 stars 22 forks source link

Remove artificial limitation in block production #668

Open edsko opened 4 years ago

edsko commented 4 years ago

Block production was using anachronisticLedgerView to request a ledger view for the leader check, and failing with TraceNoLedgerView if that failed to produce a ledger view. However, as part of the work on input-output-hk/ouroboros-network#1933 we realized using anachronisticLedgerView here is wrong, and we should use applyChainTick instead: there is no reason why we can't produce a block even if we are far ahead. I have however left in the call to anachronisticLedgerView (soon to be ledgerViewForecastAt), because the consensus tests are currently failing without it. I strongly suspect that this is because of a limitation of the test infrastructure (not quite setting parameters strictly enough) rather than a true bug. We should try to remove it and update the tests.

Note however that once the hard fork timing stuff is merged, we will re-introduce a restriction: if our ledger is far behind, we can't even convert the current UTCTime to a SlotNo, and so we can't do the leader check. That might in fact reintroduce the same limitation, and fix this problem without the current artificial check.

nfrisby commented 4 years ago

there is no reason why we can't produce a block even if we are far ahead.

Given the typical relationships between various constants we've maintained so far, producing a block in this case would create a chain that directly violates the Chain Growth invariant. You and I have discussed this before, and you emphasized that the protocol is supposed to make such violations impossible. And since the protocol ensures the Chain Growth invariant, the node shouldn't have to.

However, when a Byron node is "catching up" (which the relevant comments in the source identify as the only scenario in which we expect this case to arise), it is beyond the scope of the analyses in the paper. Only Ouroboros Genesis is designed to support "catch up" -- everything else (Praos, (P)BFT, etc) is just supposed to "do its best" as I understand it. The Praos paper in particular assumes that any new stakeholder has, by the time it joins, somehow already selected a chain that some other pre-existing node has most recently selected. (Maybe stakepool registration is enough to ensure this?)

So I think -- until we have the Genesis rule -- the node should go willingly choose to not forge if it can clearly see doing so will create a chain that violates Chain Growth.

(I agree that the hard fork clock concerns might make this question moot, and you and I already have plans to discuss that further. I'm just recording this prompt here somewhat anachronistically.)

What do you think?

nfrisby commented 1 year ago

We looked over this in the Consensus call. I do not think we should merge this PR. The main reason:

Forecasting and ticking use separate ledger calculations.
The forecasting calculation can-and-should-be cheaper than ticking.
The node kernel computes the forecast (ideally cheap) every time it checks whether it leads; but then node kernel computes the tick (comparatively expensive) only when it actually does lead. In other words: it does the cheap thing frequently and the expensive thing rarely.

The performance bug under recent scrutiny is that forecasting is way more expensive than it can-and-should-be. Once that's fixed, the current code's behavior will be exactly what we want.

nfrisby commented 1 year ago

I had an idea (along the lines of what Javier did in his PR) to "centralize and cache the ticking". Something like the ChainDB maintains a ledger state that has been ticked to the wallclock (usually that means: it ticks by one slot every one second). The node kernel could use that for the leadership check and the new block.

The catch is that I think the ChainDB receives the newest remote block more than a second after that block was minted. And so the "ledger state ticked to the wall clock" would often be ahead of the ledger state needed to validate the newest remote block. If instead the ChainDB tended to receive the block during the block's own slot, then both that "wallclock ledgerstate" would suffice for both the node kernel's leadership check and the ChainDB's validation of the next remote block.

Edit: so, since we don't really have an easy way to force blocks to propagate within 1s (:D), this idea is probably a dead-end.

dnadales commented 1 year ago

Would this idea correspond to input-output-hk/ouroboros-network#4054 ?

nfrisby commented 1 year ago

I would say that idea is only partly related to Issue 4054. They both incrementally maintain a ticked ledger state as time passes. However, 4054 is just treating a symptom of a performance bug. My idea above (though I think it's a dead-end; see my edit above) would be trying to combine what are currently two separate ticking computations, and doing so would happen to remove the need for forecasting (which is the same as my comment on 4054).

nfrisby commented 9 months ago

Some comments on the above discussions, FWIW.

The window in which the HFC can translate the current wallclock to a slot is much larger than the window in which it can forecast. So there is still an artificial limit that could be removed.
This ledger PR made it so that the expensive epoch-boundary-crossing calculation only happens once by creating the thunk early enough that all later extensions will share it. So the forecast should be cheap every time except the first time that thunk is forced.
Theoretically, the current long-term general plan of Quantitative Timing Agreements should eventually prevent any surprise performance regression that would have been defended (ie _masked!) by a defense strategy a la Javier's caching.
The current Consensus API does not allow for incremental ticks: you can't tick a second time until you apply a block to the result of the first tick. There are first-principle reasons to relax this invariant, but it's not entirely unmotivated.
Generally, the only time we'd care about a node minting a block that would obviously violate the Chain Growth property is if there is a network-wide disaster that lasted several hours (~8hr or more) and it would perhaps be beneficial for blocks to mint anyway during that disaster. The anticipate disaster plan involves a off-chain-organized cooperative effort to retcon the chain during such a disaster, so the blocks minted during it probably don't matter anyway.

I would suggest that that last bullet point is pre-requisite for this Issue: is there even any scenario in which the block minted when the current slot's ledger view is not forecast-able would be relevant/useful?