cardano-scaling / hydra

Implementation of the Hydra Head protocol
https://hydra.family/head-protocol/
Apache License 2.0

Support timed transactions #196

Closed · ch1bo closed this 1 year ago

ch1bo commented 2 years ago

Why

We want to have support for transactions with validity ranges (Figure 3, page 10 in Alonzo spec). This is the only piece of the Alonzo specification which is currently unsupported.

What

The hydra-node should not reject transactions with a lower or upper validity bound set. For that, it will need access to the current slot when applying transactions to its ledger.

As the Hydra Head is "isomorphic", it is fine to have the slot only update when the layer one progresses, i.e. a transaction will be valid on the L2 if it would be valid on the L1.

While a more granular resolution on time (not only on each block) would be possible using wall clock time, this is out of scope for this feature. We will add that later.
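To make the needed change concrete, here is a minimal sketch of the phase-1 check that requires a current slot. The types are simplified stand-ins, not the actual hydra-node or cardano-ledger API:

```haskell
import Numeric.Natural (Natural)

newtype SlotNo = SlotNo Natural
  deriving (Eq, Ord, Show)

-- Alonzo-style validity range: both bounds are optional.
data ValidityInterval = ValidityInterval
  { invalidBefore    :: Maybe SlotNo -- tx is invalid in slots before this
  , invalidHereafter :: Maybe SlotNo -- tx is invalid in this slot and after
  }

-- The ledger can only run this check if it knows the slot at which
-- the transaction is being applied.
isInsideValidityInterval :: SlotNo -> ValidityInterval -> Bool
isInsideValidityInterval slot (ValidityInterval lower upper) =
  maybe True (<= slot) lower && maybe True (slot <) upper
```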

How

To be discussed

ch1bo commented 2 years ago

Some notes on tracking time

ch1bo commented 2 years ago

As we see user requests and questions about this, we decided to prioritize it a bit higher, aiming to do this within the scope of a 1.0.0 hydra-node.

ch1bo commented 2 years ago

Came across this limitation when trying to fix when ReadyToFanout messages are sent out. Our internal notion of slots (or time in general) is lacking and it's not easy to find the right time when the fanoutTx could be posted (e.g. in e2e tests).

If we track the progress of time (in slots) on every seen block, we could also replace the Delay on wall clock time for ShouldPostFanout with a Wait that checks whether enough time has passed on chain. We still might want a greater fidelity of time on the internal ledger though. (USP?)
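A rough sketch of that idea, with hypothetical names (the actual HeadLogic outcome types differ): instead of delaying on wall clock time, fanout readiness is re-checked whenever a seen block advances the tracked slot:

```haskell
import Numeric.Natural (Natural)

newtype ChainSlot = ChainSlot Natural
  deriving (Eq, Ord, Show)

data Outcome
  = PostFanout                -- deadline passed, safe to post the fanoutTx
  | WaitOnChainTime ChainSlot -- re-evaluated on every seen block

-- Checked against the slot tracked from seen blocks rather than
-- against a wall-clock Delay.
checkFanout :: ChainSlot -> ChainSlot -> Outcome
checkFanout currentSlot contestationDeadline
  | currentSlot > contestationDeadline = PostFanout
  | otherwise = WaitOnChainTime contestationDeadline
```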

ch1bo commented 1 year ago

This has been mentioned as good to have by @Quantumplation. A first version which just uses L1 time would be sufficient for them (which we now have since #483).

ch1bo commented 1 year ago

Meanwhile, we have implemented ADR20 and have a Tick UTCTime event from the Chain to the HeadLogic layer. Also, the protocol logic holds a chainState which always has a latest chainStateSlot :: a -> ChainSlot. This should make this change fairly simple.

Only weird thing about this: redundant information is kept around, i.e. it's not guaranteed that the UTCTime and ChainSlot of a Tick are consistent. But the alternative would be to have the means to do the conversion. For Cardano, this would be a systemStart and eraHistory, which is annoying, and if it's kept in the chain layer, it would mean another round trip / more state to keep there.
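For reference, a simplified sketch of the shape being discussed (hypothetical, reduced to the relevant constructor): the chain layer reports both pieces of information in one event, with the caveat from above that nothing enforces their consistency:

```haskell
import Data.Time (UTCTime)
import Numeric.Natural (Natural)

newtype ChainSlot = ChainSlot Natural
  deriving (Eq, Ord, Show)

-- Event from the Chain to the HeadLogic layer; other constructors
-- (observations, rollbacks) omitted for brevity.
data ChainEvent
  = Tick
      { chainTime :: UTCTime   -- wall clock time, e.g. for clients
      , chainSlot :: ChainSlot -- what the L2 ledger validates against
      }
      -- Redundant by construction: chainTime and chainSlot travel
      -- together, so they can only be consistent if the chain layer
      -- converts them using systemStart + eraHistory.
```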

abailly-iohk commented 1 year ago

Possible next steps:

pgrange commented 1 year ago

FYI, this is how lucid adds the validity time to a transaction

ch1bo commented 1 year ago

So with lucid there would not be a need for providing a SlotNo because they are projecting any point in time onto the network-local slots here and here.

pgrange commented 1 year ago

> So with lucid there would not be a need for providing a SlotNo because they are projecting any point in time onto the network-local slots here and here.

Exactly. That makes the interface trivial in a way: you only give a UTC time when you build your transaction.

Also, note that this is done in user land, in a library embedded in the client software. In our case, that would mean keeping the responsibility with the client to compute the slot number from the UTC time and then send us the transaction expressed in slots. Our responsibility would then be to ensure our slot numbers match a predictable conversion from UTC time so that they can do the computation. Using the slot numbers coming from L1 should do the trick.
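A minimal sketch of such a predictable conversion, assuming a single era with a fixed slot length (true on current mainnet, where slots are 1s); the TimeHandle name and fields are illustrative, not necessarily the hydra-node internals:

```haskell
import Data.Time (NominalDiffTime, UTCTime, addUTCTime, diffUTCTime)
import Numeric.Natural (Natural)

newtype SlotNo = SlotNo Natural
  deriving (Eq, Ord, Show)

data TimeHandle = TimeHandle
  { systemStart :: UTCTime         -- network start, known to all parties
  , slotLength  :: NominalDiffTime -- fixed per era, e.g. 1s on mainnet
  }

-- Client side: pick validity bounds in UTC, convert to slots.
slotFromUTCTime :: TimeHandle -> UTCTime -> SlotNo
slotFromUTCTime h t =
  SlotNo (floor (diffUTCTime t (systemStart h) / slotLength h))

-- Node side: convert back, e.g. for reporting time to clients.
slotToUTCTime :: TimeHandle -> SlotNo -> UTCTime
slotToUTCTime h (SlotNo n) =
  addUTCTime (fromIntegral n * slotLength h) (systemStart h)
```

This linear conversion only holds while the slot length does not change, which is why the full solution on Cardano needs systemStart plus eraHistory.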

abailly-iohk commented 1 year ago

So what you are saying @pgrange is: We just need to expose SlotNo from the underlying L1 as our measure of time on L2?

pgrange commented 1 year ago

I think it's worth exploring that and checking if it makes sense, yes.

uhbif19 commented 1 year ago

> Add current time + slot to GetUTxO query response and to the TxValid, TxSeen server outputs.

Maybe this could be a separate command/response? Having the ability to request the Head state would be nice too (though we do not depend on this).

ch1bo commented 1 year ago

Isn't there more to this? Especially as there is even a section in the Hydra paper appendix about complications with time (or are we good because we do the "coordinated" protocol)?

@GeorgeFlerovsky raised a good point today: it's not only about providing a notion of time (a slot) to the validation, but also about deciding when transactions get validated by the hydra nodes. Currently a transaction gets validated twice:

GeorgeFlerovsky commented 1 year ago

@uhbif19 You now have a lot of experience with the practical side of constructing L2 transactions.

It would be useful to get your input on what the desired API would be to query Hydra state regarding time and construct/submit time-bounded transactions. 🙏

GeorgeFlerovsky commented 1 year ago

@ch1bo @pgrange @ffakenz With Ouroboros on L1, each transaction can be validated in four different contexts:

  1. When the local tx submission server receives a client request to submit a new transaction, it tries to add the transaction to the local node's mempool. The transaction is applied to the "seen" ledger state (confirmed ledger state + mempool) with validation using the mempool's current slot number, which is updated to the blockchain tip's slot number every time it changes. If validation fails, the transaction is not added to the mempool and not propagated to peers, instead returning an error to the client.
  2. When the node receives a new transaction from a remote peer, it tries to add the transaction to its mempool in the same way as it would for a local transaction, but it reacts differently in case of validation failure. If the transaction fails phase-1 validation, it is not added to the mempool or propagated; if it fails phase-2 validation, it is converted into a collateral-spending transaction, added to the mempool, and propagated to peers.
  3. Each block-producing node checks its chain/ledger state at every slot (updated by simply polling the node's system clock) to determine whether it should forge a block as a slot leader. When it is the slot leader in a given slot, it ticks its ledger state to that slot, determines a snapshot of its mempool that is consistent with the ticked ledger, and forges a new block with the largest prefix of the mempool transactions that fits the block constraints. Determining this snapshot requires the mempool transactions to be re-validated with respect to the ticked ledger's slot, which may be different from the slot with which they were originally validated when they were added to the mempool. During this re-validation, some transactions that were previously valid may be invalidated at the ticked ledger's slot, with the corresponding consequences depending on whether phase-1 or phase-2 validation fails.
  4. Each node fetches blocks from peers, forming several candidate chains that the node considers for adoption. The node periodically checks the longest prefix of valid blocks in each candidate chain by validating its new blocks sequentially until an invalid block is detected. During this validation, each block is applied (via the Cardano ledger's whole-block transition) against the accumulated ledger state using the block's slot number.

Each of these contexts serves a different purpose in the Cardano protocol:

  1. Local transaction validation ensures that honest transaction submitters only broadcast valid new transactions, avoiding penalties for invalid transactions.
  2. Remote transaction validation mainly ensures that honest nodes do not store or propagate received invalid transactions further across the network. It also penalizes phase-2 invalid transactions by replacing them with collateral-spending transactions and propagating those substitutes to the network.
  3. Transaction validation during block forging ensures that honest slot leaders only broadcast valid new blocks.
  4. Transaction validation during block fetching ensures that nodes only adopt valid new blocks.

In the Hydra L2 context, the rationales for validation in the local transaction and block forging/fetching contexts still apply, but the rationale for remote transaction validation arguably no longer applies. Unlike Ouroboros L1, the Hydra Head L2 protocol always broadcasts new transactions to all protocol participants regardless of their validity. Furthermore, the Hydra Head protocol does not need the DoS protections of Ouroboros because the protocol already provides a simple and legitimate method for any participant to deny service (close/ignore peer connections and close the head) and for any participant to react to DoS attempts (close the malicious peer connection and close the head).

Therefore, the Hydra Head L2 protocol can get away with only requiring transactions to be validated by transaction submitters, snapshot leaders constructing snapshots, and participants verifying received snapshots. When participants who aren't the snapshot leader receive new transactions, they can simply cache them without validation in a basic (tx-id, tx) store, until those transactions are required for snapshot construction or validation. The substitution of collateral-spending transactions for phase-2 invalid transactions (if it's still needed on L2) can just as easily be done during snapshot construction.
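A sketch of that deferred-validation store, with hypothetical types: transactions received from peers go into a plain map and are only resolved (and then validated) once a snapshot references them:

```haskell
import Data.ByteString (ByteString)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

newtype TxId = TxId ByteString
  deriving (Eq, Ord, Show)

-- Transactions received from peers, cached without validation.
type TxCache tx = Map TxId tx

cacheTx :: TxId -> tx -> TxCache tx -> TxCache tx
cacheTx = Map.insert

-- Only when a snapshot is constructed or verified are the referenced
-- transactions looked up; validation against the ledger happens then.
resolveSnapshotTxs :: [TxId] -> TxCache tx -> Maybe [tx]
resolveSnapshotTxs txIds cache = traverse (`Map.lookup` cache) txIds
```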

Regarding how the passage of time and time-bounded transactions should work on Hydra, I suggest the following:

GeorgeFlerovsky commented 1 year ago

It's getting a little late, so I'll have to comment later on the actual API that I would recommend to help app developers build time-bound transactions.

In general, I think using actual time instead of slots in API methods (with conversion behind the scenes) may be preferable.

abailly-iohk commented 1 year ago

Thanks a lot for the very detailed thread @GeorgeFlerovsky! I will need to mull over all this to unpack it but I already have 2 comments:

GeorgeFlerovsky commented 1 year ago

> We would rather rely on the time provided by the chain which is coarser but safer.

Incrementing L2 time only when L1 blocks arrive would result in an even coarser time resolution than L1 (~20s vs 1s), which would be kinda funny given L2's higher frequency of snapshot confirmation...

What if we used local L2 clocks but monitored their skew relative to L1?

GeorgeFlerovsky commented 1 year ago

> Snapshots are (currently) not verified by other parties once issued by a leader, beyond the multisignature check

This may no longer be valid when we start allowing time to pass between snapshots, because parties' mempools would contain transactions that they validated relative to a different slot than the next snapshot. Furthermore, you don't know when to expect the next snapshot to arrive.

I don't think you can avoid snapshot validation if you allow time to progress.

GeorgeFlerovsky commented 1 year ago

> When the ledger is ticked, the transactions in the mempool are reapply-ed to the ledger at the given slot which means the phase-2 validations (and signatures check) are not run again so the tx is either valid or rejected (e.g no collateral can be consumed at this point)

Interesting. Yeah, I suppose that makes sense — phase-2 and signatures are time invariant.

GeorgeFlerovsky commented 1 year ago

> we would rather rely on the time provided by the chain which is coarser but safer.

Also, in what way would relying on time provided by L1 absolve L2 participants from having to synchronize their clocks on L2?

If different participants are communicating with different L1 nodes, then they might be seeing different L1 chain tips.

If they are all talking to the same L1 node, how is that different from using local clocks synchronized to the same NTP server?

abailly-iohk commented 1 year ago

> Also, in what way would relying on time provided by L1 absolve L2 participants from having to synchronize their clocks on L2?

The slot number observed on L1 (and currently propagated to the Head Logic through the Tick event) is guaranteed to be the same (up to eventual consistency of the chain) for all nodes observing the same chain. Indeed, it might be the case that different L2 nodes connected to different L1 nodes would observe different blocks and therefore different slots but that would be a transient state and ultimately those blocks would be rolled back.

Which raises the interesting question of: What happens in a hydra-node when we observe such a rollback to a previous slot?

GeorgeFlerovsky commented 1 year ago

> Which raises the interesting question of: What happens in a hydra-node when we observe such a rollback to a previous slot?

Indeed... 😅

Also, due to how slowly things progress on L1, hydra nodes might suffer this time inconsistency for a significant duration, relative to the frequency of transactions/snapshots on L2.

GeorgeFlerovsky commented 1 year ago

Actually, L1 chain selection avoids adopting blocks that are too far in the future. (See Cardano Consensus and Storage Layer, section 11.5)

> When we have validated a block, we then do one additional check, and verify that the block's slot number is not ahead of the wallclock time (for a detailed discussion of why we require the block's ledger state for this, see chapter 17, especially section 17.6.2). If the block is far ahead of the wallclock, we treat this as any other validation error and mark the block as invalid.
>
> Marking a block as invalid will cause the network layer to disconnect from the peer that provided the block to us, since non-malicious (and non-faulty) peers should never send invalid blocks to us. It is however possible that an upstream peer's clock is not perfectly aligned with us, and so they might produce a block which we think is ahead of the wallclock but they do not. To avoid regarding such peers as malicious, the chain database supports a configurable permissible clock skew: blocks that are ahead of the wallclock by an amount less than this permissible clock skew are not marked as invalid, but neither will chain selection adopt them; instead, they simply remain in the volatile database available for the next chain selection.

Furthermore, I think that it would be quite infrequent for an honest chain candidate to be replaced by a longer/denser candidate that has a lower chain tip slot number. Thus, when a new candidate chain is selected, we just need to be robust to the temporary burst of rollbacks and roll-forwards as the L1 adjusts.

With that in mind, and keeping the general validation mechanism the same as you currently describe in the coordinated hydra head spec, I would suggest the following to make L2 robust to L1 slot rollbacks:

These delays may reduce the frequency of L2 transaction confirmation, but I think they would make the L2 distributed clock fairly robust. It may happen that some L2 snapshots end up using slot numbers that are unoccupied by the chain that ultimately prevails on L1; however, I don't really think that's a problem.

What do you think, @abailly-iohk @ch1bo ?

ch1bo commented 1 year ago

I think we have a general consensus here. I would like to comment/correct some things said though

> Incrementing L2 time only when L1 blocks arrive would result in an even coarser time resolution than L1 (~20s vs 1s), which would be kinda funny given L2's higher frequency of snapshot confirmation...

The L1 only "increments time" when it adopts a block. That is in the best case 20s on mainnet. If no block gets adopted, time is not advanced for the ledger, even though the slot length is 1s. We stumbled over this when Handling Time in the Hydra protocol itself on the L1.

> Snapshots are (currently) not verified by other parties once issued by a leader

That is not true. All parties (including the snapshot leader) do validate the transactions in a requested snapshot (ReqSn message) here. This validation is currently done against the last confirmed snapshot and would use the currentSlot last known (from L1) within this story.

> Set the new snapshot's slot to the leader's current slot.

Not sure if we need to specify the slot for a snapshot. On the one hand, it would make the validation more deterministic: if the leader has picked valid transactions against some slot and announces that slot also on the snapshot request, all other honest parties will come to the same conclusion using that slot. On the other hand, we would need to check the requested slot against our local, last known currentSlot to be "not too old or too new" (what's the bound?), which makes this very similar to validating the requested transactions against the same currentSlot directly.

Maybe one is more robust than the other, but I suggest we find out and go with the simplest solution: just validate against the latest slot the Chain layer reported via a Tick. It is crucial though that we make it observable from the outside (at least in logs) why a snapshot was not signed.
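A sketch of that simplest solution, with hypothetical names: apply the requested transactions against the last Tick slot and surface the reason whenever we refuse to sign:

```haskell
import Control.Monad (foldM)
import Numeric.Natural (Natural)

newtype ChainSlot = ChainSlot Natural
  deriving (Eq, Ord, Show)

data SnapshotDecision
  = SignSnapshot
  | RejectSnapshot String -- must be observable (at least logged)

onReqSn
  :: (ChainSlot -> ledger -> tx -> Either String ledger) -- apply one tx
  -> ChainSlot -- latest slot reported by the Chain layer via Tick
  -> ledger    -- ledger state after the last confirmed snapshot
  -> [tx]      -- transactions of the requested snapshot
  -> SnapshotDecision
onReqSn applyTx currentSlot ledger txs =
  case foldM (applyTx currentSlot) ledger txs of
    Left err -> RejectSnapshot ("invalid tx in ReqSn: " <> err)
    Right _  -> SignSnapshot
```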

abailly-iohk commented 1 year ago

Scope refinement:

uhbif19 commented 1 year ago

> They are supposed to know which (L1) network their Head is opened on

So the client should reuse the L1 time handle for slot conversion?

GeorgeFlerovsky commented 1 year ago

@ch1bo @abailly-iohk OK, I think that this is a workable approach/scope for now. 👍

Let's try it and see if/which issues may arise in dApps trying to use time-bounded L2 transactions.

(Ultimately, it would be great to have independent time synchronization on L2, but it's a tricky thing to implement — let's maybe tackle it later on)

GeorgeFlerovsky commented 1 year ago

Nitpick:

> That is in the best case 20s on mainnet.

The average (not best) case is 20s, based on the active slot coefficient of 5%. A block can and does occasionally get produced less than 10s after the previous block. For example, there are only 4s between these two recently forged blocks in epoch 410:
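For reference, the arithmetic behind that figure (standard Praos, not specific to Hydra): each slot independently carries a block with probability $f = 0.05$, so block gaps are geometrically distributed with

$$\mathbb{E}[\text{gap}] = \frac{\text{slot length}}{f} = \frac{1\,\text{s}}{0.05} = 20\,\text{s},$$

while individual gaps are regularly much shorter or much longer than the mean.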

abailly-iohk commented 1 year ago

And sometimes, it takes several minutes to have 2 blocks...


GeorgeFlerovsky commented 1 year ago

true