IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

Cluster-based tests #211

Closed deepfire closed 4 years ago

deepfire commented 5 years ago

Context

We want cluster-based integration tests for the node.

Current scope (to be extended)

  1. Cluster consensus validation: https://github.com/input-output-hk/cardano-node/issues/106
    • status: implementation mostly done
  2. Extend to mixed cluster of old and new nodes, in PBFT era/mode, both producing blocks, connected via proxies: https://github.com/input-output-hk/cardano-node/issues/255
    • status: stuck debugging cardano-sl cluster startup
  3. cardano-byron-proxy test:
    1. Basic functionality
    2. Heap profiling
    3. Strictness check using @edsko's WHNF checker.
  4. Cluster Tx submission

Implementation

NixOS tests that can run a cluster in a VM are a good foundation for many of those.

This basic functionality was merged in https://github.com/input-output-hk/cardano-node/pull/177

deepfire commented 5 years ago

After I tried to make cardano-byron-proxy serve a static chain (with no new block announcements), @avieth suggested the following:

Byron proxy just plays the cardano-sl game: it can't download a chain unless it has the header hash of the tip. If you want it to download from a Byron peer that has a static chain, it can be done without much difficulty. Either:

  1. you know the hash of the tip that you want, and you can patch byron-proxy to request it in particular or
  2. patch cardano-sl to announce its tip header periodically even if it does not change

deepfire commented 5 years ago

So, the latest status is -- as per latest developments in https://github.com/input-output-hk/iohk-ops/tree/serge/cardano-cluster :

  1. The legacy node service definition was modified to provide an instanced systemd service, allowing several legacy nodes to run on a single system -- similar to what was done for the new node.
  2. A new legacy cluster configuration was created in https://github.com/input-output-hk/cardano-sl/tree/serge/ci-genesis -- its keys will be supplied in https://github.com/input-output-hk/cardano-node/tree/serge/mainnet-ci. This will be shared by all components of the mixed cluster: legacy segment, proxy and Byron Rewrite segment.
  3. NTP-to-local-clocks pinning was implemented, in line with @cleverca22's suggestion.
  4. The test legacy cluster still doesn't make blocks, despite full connectivity & not being in recovery mode. There are several suspicions on why that is so.
  5. In particular, @dcoutts suggested that we try starting the cluster in OBFT mode (it starts in Byron Classic mode, by default).

So the next piece of work is trying to figure out how to make the legacy cluster start in OBFT, without.

deepfire commented 4 years ago

There was a discussion on how to simplify #2 -- the mixed cluster -- to try to avoid the issue of the cardano-sl cluster not starting with multiple nodes sharing a single localhost address.

The idea was to use an existing mainnet cluster as the source of blocks (which are necessary for the proxy to function, as per above).

Sadly, this breaks on two points (and a half):

  1. It just won't allow us simultaneous block creation on both sides of the proxy (since it'll be tied to mainnet) -- so this will have to be thrown away and re-done properly (in mainnet-independent fashion) anyway.
  2. It'll create problems with the relay taking a lot of time to sync (since it'll always be behind when it starts).
  3. De-isolation -- allowing the NixOS test environment to talk to real mainnet -- isn't exactly trivial. While it's definitely doable, it'll still take some work. This isn't prohibitive, of course, but it still makes the option less attractive -- it's now a comparable amount of work to the others.

deepfire commented 4 years ago

There is a simpler option to try with cardano-sl potentially being stuck due to all nodes sharing localhost -- we can employ VDE[1] to give distinct nodes distinct, routable IP addresses.

--

  1. https://github.com/virtualsquare/vde-2, available on NixOS.

deepfire commented 4 years ago

The VDE route almost worked.. except the routing itself became interesting -- the kernel was choosing the same route for all packets, since all tapX interfaces are local! ..

..and this leads to the same dreaded error as the previous attempt at using different loopback addresses -- network-transport-tcp sees a mismatch between the stated and the actual address, and fails: https://github.com/input-output-hk/network-transport-tcp/blob/2634e5e32178bb0456d800d133f8664321daa2ef/src/Network/Transport/TCP.hs#L1621

Duh! Should have expected that..
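To make the failure mode concrete, here is a minimal sketch of the kind of check involved: the address a peer states about itself is compared with the address the socket actually observes. This is an illustration in Python, not the network-transport-tcp code; the function name and the sample addresses are hypothetical, and it only handles literal IP addresses (no hostname resolution).

```python
import ipaddress


def peer_address_matches(claimed_host: str, actual_host: str) -> bool:
    """Return True when the address a peer claims equals the observed one.

    Hypothetical helper for illustration only: both arguments must be
    literal IPv4/IPv6 addresses.
    """
    return ipaddress.ip_address(claimed_host) == ipaddress.ip_address(actual_host)


if __name__ == "__main__":
    # Hypothetical scenario: a node advertises 10.1.0.2, but the kernel picked
    # a route whose source address is 10.1.0.1, so the receiver sees a mismatch.
    print(peer_address_matches("10.1.0.2", "10.1.0.2"))
    print(peer_address_matches("10.1.0.2", "10.1.0.1"))
```

With all interfaces local to one kernel, the second case is what every connection looks like to the receiving node, regardless of which address the sender intended to use.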

So I'm currently playing with source routing policies, which would make the kernel choose different interfaces depending on the source address: https://www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.rpdb.simple.html
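A rough sketch of what such a policy looks like, assuming one TAP interface and one address per node: each source address gets its own routing table, and that table's only route points at the matching interface. The interface names, addresses and table numbers below are made up for illustration; the script only prints iproute2 commands and does not run them. Whether rules of this shape are enough when every address is local to the same kernel is not something the sketch establishes.

```python
from typing import List, Tuple


def source_routing_commands(
    nodes: List[Tuple[str, str]], first_table: int = 100
) -> List[str]:
    """Build iproute2 commands for per-source-address routing.

    `nodes` is a list of (interface, address) pairs; both are hypothetical
    values chosen by the caller. Table numbers start at `first_table`.
    """
    commands = []
    for offset, (interface, address) in enumerate(nodes):
        table = first_table + offset
        # Packets sent from this node's address consult a dedicated table...
        commands.append(f"ip rule add from {address} table {table}")
        # ...whose only route sends everything out of that node's interface.
        commands.append(f"ip route add default dev {interface} table {table}")
    return commands


if __name__ == "__main__":
    # Assumed layout: three nodes on tap0..tap2 with addresses 10.1.0.1..3.
    cluster = [(f"tap{i}", f"10.1.0.{i + 1}") for i in range(3)]
    for command in source_routing_commands(cluster):
        print(command)
```

Generating the commands from a single list keeps the interface-to-address pairing explicit in one place, which makes it easier to compare against what the kernel actually does.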

UPDATE: I'm getting different source addresses now, however the problem now is, the mapping between TAP interfaces and the source addresses seems random :joy:

deepfire commented 4 years ago

  1. Cutting out the network-transport-tcp address check did the trick -- the nodes agreed to connect/talk to each other.

However, that didn't resolve the problem with the cardano-sl nodes not making blocks.

So I started looking into switching the legacy nodes into OBFT mode right from the start (they currently start in Ouroboros Classic mode).

  1. Found the OBFT era being determined by the unlockStakeEpoch field of BlockVersionData: https://github.com/input-output-hk/cardano-sl/blob/master/chain/src/Pos/Chain/Update/BlockVersionData.hs#L148

  2. Regenerated genesis with unlockStakeEpoch being equal to the magic OBFT value -- and no MPC messages appear in cardano-sl's logs anymore, which suggests the change was effective.
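The genesis tweak can be pictured with a small sketch. It assumes the genesis spec is available as a nested dictionary with a `blockVersionData` section holding `unlockStakeEpoch`; that layout, the helper name and the placeholder value are assumptions for illustration. The actual magic OBFT value is defined in the cardano-sl source linked above and is deliberately not reproduced here.

```python
import copy


def with_unlock_stake_epoch(genesis_spec: dict, epoch_value) -> dict:
    """Return a copy of `genesis_spec` with unlockStakeEpoch overridden.

    The dictionary layout is an assumption; the input is left untouched so
    the original and the patched spec can be compared.
    """
    patched = copy.deepcopy(genesis_spec)
    patched.setdefault("blockVersionData", {})["unlockStakeEpoch"] = epoch_value
    return patched


if __name__ == "__main__":
    # PLACEHOLDER values only: look up the real OBFT marker in
    # Pos/Chain/Update/BlockVersionData.hs before using this for anything.
    spec = {"blockVersionData": {"unlockStakeEpoch": 0, "slotDuration": 20000}}
    placeholder_obft_value = 12345
    print(with_unlock_stake_epoch(spec, placeholder_obft_value))
```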

No blocks, though..

deepfire commented 4 years ago

Ok, I've gone with the supposedly well-oiled AWS setup of cardano-sl. However, it somehow manages to fare even worse than a multi-node cluster confined to a single machine (although, yes, there are other differences -- because the single-machine cluster required systemd service instancing and a lot of fiddling in general).

The error cardano-sl gives at cluster startup is (with some initial context):

Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Application: cardano-sl:1, last known block version 0.2.0, systemTag: linux64
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Genesis stakeholders (7 addresses, dust threshold 7 coin(s)): GenesisWStakeholders: {33111eddbb08270d: 1, 540fb9f1c0415491: 1, 6132662df7ccd698: 1, 773d6255ced70494: 1, 8dba875898ab11ac: 1, f7dedd2205451763: 1, f825bd9e9df8670d: 1}
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] GenesisDelegation (stakeholder ids): [773d6255ced70494 -> d5f8ce7d1937176c, 33111eddbb08270d -> aa84a9d0f69f2493, 8dba875898ab11ac -> ac68bdca1fae8f14, f7dedd2205451763 -> fcb3a4f1b35e5868, 540fb9f1c0415491 -> 98ca509664413dbf, 6132662df7ccd698 -> f050f7380f318dd4, f825bd9e9df8670d -> f3b7b1477a80fda3]
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] First genesis block hash: 1a28c5b6d7b98239, genesis seed is 76617361206f7061736120736b6f766f726f64612047677572646120626f726f64612070726f766f6461
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Current tip header: GenesisBlockHeader:
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     hash: 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     previous block: 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     epoch: #0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     difficulty: 0
Oct 16 16:27:22 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 132] [2019-10-16 16:27:22.26 UTC] Waiting 303 seconds for system start
...
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node.slotting:Notice:ThreadId 149] [2019-10-16 16:32:26.00 UTC] New slot has just started: 0th slot of 0th epoch
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node.slotting:Debug:ThreadId 149] [2019-10-16 16:32:26.00 UTC] Waiting for 19993571mcs before new slot
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Debug:ThreadId 142] [2019-10-16 16:32:26.00 UTC] Our tip header: GenesisBlockHeader:
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     hash: 1a28c5b6d7b982396995008f856640cc68fbaf923ddbde42ac232b69d972863c
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     previous block: 41a0739cb8cf98a176a990f8a90b2ca616e5413e2377d6c84841c46b5b6026b0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     epoch: #0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]:     difficulty: 0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 142] [2019-10-16 16:32:26.00 UTC] Difference between current slot and tip slot is: 0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Debug:ThreadId 138] [2019-10-16 16:32:26.00 UTC] There are no new confirmed update proposals for our application
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.MonadPseudoRandom:Error:ThreadId 148] [2019-10-16 16:32:26.00 UTC] rollbackSsc: most genesis block is passed to rollback
Oct 16 16:32:51 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.consolidate:Error:ThreadId 119] [2019-10-16 16:32:51.27 UTC] DBMalformed "Can't retrieve genesis block, maybe db is not initialized?"
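For sifting through logs like the above, a small filter that keeps only `Error`-severity entries can help. The sketch below assumes the `[logger:Severity:ThreadId N] [timestamp] message` shape seen in these lines; the regular expression and function name are illustrative, and continuation lines without that prefix (such as the indented header fields) are simply skipped.

```python
import re
from typing import Iterable, List, Tuple

# Matches "[logger:Severity:ThreadId N] [timestamp] message" anywhere in a line.
ENTRY = re.compile(
    r"\[(?P<logger>[^:\]]+):(?P<severity>[A-Za-z]+):ThreadId \d+\] "
    r"\[(?P<time>[^\]]+)\] (?P<message>.*)"
)


def error_entries(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (logger, message) pairs for every Error-severity log entry."""
    found = []
    for line in lines:
        match = ENTRY.search(line)
        if match and match.group("severity") == "Error":
            found.append((match.group("logger"), match.group("message")))
    return found


# Three lines copied from the log above: one Info entry and the two errors.
SAMPLE = '''\
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.node:Info:ThreadId 142] [2019-10-16 16:32:26.00 UTC] Difference between current slot and tip slot is: 0
Oct 16 16:32:26 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.MonadPseudoRandom:Error:ThreadId 148] [2019-10-16 16:32:26.00 UTC] rollbackSsc: most genesis block is passed to rollback
Oct 16 16:32:51 c-b-1 3n073v230xhgd46jpz4zf1n32xcqzc4c-unit-script-cardano-node-legacy-start[2961]: [cardano-sl.consolidate:Error:ThreadId 119] [2019-10-16 16:32:51.27 UTC] DBMalformed "Can't retrieve genesis block, maybe db is not initialized?"
'''

if __name__ == "__main__":
    for logger, message in error_entries(SAMPLE.splitlines()):
        print(f"{logger}: {message}")
```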

There is a lead, of course..

deepfire commented 4 years ago

For the sake of completeness -- the way genesis is generated is via https://github.com/input-output-hk/cardano-sl/blob/master/scripts/prepare-genesis/default.nix

Jimbo4350 commented 4 years ago

@deepfire can we close this?

deepfire commented 4 years ago

@Jimbo4350, I don't think so -- not all of the bullet items are done.

CodiePP commented 4 years ago

will be moved to cardano-benchmarking