[FR] - Live topology updates without node restart

andrejpodzimek commented 3 years ago

External

Area: Other. Any other topic (Delegation, Ranking, ...).

Describe the feature you'd like: While running topologyUpdater.sh hourly, it is recommended to restart the relay node each time to pick up possible topology changes. This is a big problem, because node startup takes more than 10 minutes (from systemctl restart to the point when the socket appears and communicates), despite the fact that my SSDs can serve around 5 GB/s. (The CPU is the bottleneck during initialization.)

This would lead to downtime about 1/6 of the time (10 minutes out of every hour). In a stake pool configuration this is bad, because one could miss block minting or the like. It should be safe to just run topologyUpdater.sh hourly (as recommended) and tell cardano-node to reload the topology each time, ideally without downtime (as long as the topology doesn’t change dramatically).
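As a quick check of that figure, here is a minimal sketch of the arithmetic. The function name and the daily comparison are illustrative assumptions; the only numbers taken from the report are the 10-minute startup and the hourly restart.

```python
from fractions import Fraction


def downtime_fraction(restart_minutes: int, interval_minutes: int) -> Fraction:
    """Share of wall-clock time a relay is down if it is restarted once per
    interval and each restart keeps it offline for restart_minutes."""
    if interval_minutes <= 0 or not 0 <= restart_minutes <= interval_minutes:
        raise ValueError("need 0 <= restart_minutes <= interval_minutes, interval > 0")
    return Fraction(restart_minutes, interval_minutes)


# 10 minutes of startup after every hourly restart.
hourly = downtime_fraction(10, 60)
# Same startup time, but restarting only once per day (illustrative comparison).
daily = downtime_fraction(10, 24 * 60)
```

Since startup is reported as taking more than 10 minutes, the hourly value is a lower bound on the actual downtime.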

Describe alternatives you've considered: I considered running two relay nodes on the same machine on two different ports for redundancy, so that one is always up even while the other is being reloaded. Unfortunately, this would come with storage and computational overhead, and there doesn’t seem to be a guarantee that block minting wouldn’t be missed with one of the relays down. (If there is such a guarantee, then it needs to be documented, i.e., how exactly having multiple hosts and ports could help.)

rdlrt commented 3 years ago

topologyUpdater.sh is not part of the cardano-node repo, but of the guild-operators repo. The recommendation is to run push (non-interruptive) hourly, to report your node as live to the service and available for other peers to fetch, while you'd want to pull and refresh the topology daily (only a recommendation, as a good practice to help each other out until P2P is available, not a necessity).
The restart should not take 10 minutes if you're using SIGINT to kill processes.
As part of the P2P updates, topology refresh will already be available, triggered by a signal to the node.
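To illustrate the general idea of refreshing a topology on a signal instead of restarting, here is a minimal Unix-only sketch. It is not cardano-node code: the comment above does not say which signal the node will use, so SIGHUP is chosen here only because it is a common convention for "reload configuration", and `TopologyHolder` and `install_reload_handler` are hypothetical names.

```python
import json
import signal


class TopologyHolder:
    """Keeps the most recently parsed topology file in memory."""

    def __init__(self, path):
        self.path = path
        self.topology = None
        self.reload_requested = False
        self.load()

    def load(self):
        # Parse the JSON topology file and replace the in-memory copy.
        with open(self.path, encoding="utf-8") as fh:
            self.topology = json.load(fh)

    def on_signal(self, signum, frame):
        # Signal handlers should do as little as possible: only set a flag.
        self.reload_requested = True

    def poll(self):
        """Call from the main loop; reloads if a signal arrived.
        Returns True when a reload happened."""
        if not self.reload_requested:
            return False
        self.reload_requested = False
        self.load()
        return True


def install_reload_handler(holder, signum=signal.SIGHUP):
    # SIGHUP is an assumption (conventional reload signal), not taken from the thread.
    signal.signal(signum, holder.on_signal)
```

A long-running process would call `install_reload_handler(holder)` once at startup and `holder.poll()` on each iteration of its main loop; an operator could then run `kill -HUP <pid>` after rewriting the file, and existing connections would not need to be dropped.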

regel commented 3 years ago

Simple questions:

in which Docker image version will P2P be released? 1.29? 1.30?
what will be the impact (new command line flags, others) to run cardano-node and use this feature?
how will p2p topology updates appear in the logs?
how and when can we experiment with this feature on the testnet?
Where can we find the roadmap, specification, and documentation for this new feature? How can we subscribe to new release announcements?

regel commented 3 years ago

P2P related information found on https://roadmap.cardano.org/.

Timeline:

May 28, 2021:

P2P testnet updates

This week, the team completed the second milestone of the P2P deployment, delivering an engineering testnet, which allows for automatic peer selection in the network. During this stage, the team tested and implemented different user configurations, established interoperability between legacy and P2P nodes, and produced a video that reflects automated peer selection.

The team had a call with SPOs, where they explained P2P project goals, P2P system design, and the concept of hot, cold, and warm peers. They also introduced the goals of the third milestone in P2P deployment (semi-public testnet), explaining that there will be a switch in the node to enable selection of either the new P2P mode or the existing (non-P2P) one. During the semi-public testnet delivery, the team will be inviting a small group of SPOs to help test system functionality.

June 4, 2021:

This week, the team worked on a P2P and non-P2P diffusion API, fixed some issues, and worked on server tests and scheduling within io-sim.

The team fixed the stateTVar signature, worked on simultaneous TCP connections opened by the handshake protocol, and rebased p2p-master branches.

June 11, 2021:

This week, the team worked on the P2P master branch, updated the cardano-ping protocol in line with keep-alive protocol changes, and worked on clean connection shutdown properties. They added a missing API to io-sim-classes and also worked on cardano-cli with the Alonzo team.

June 18, 2021:

This week, the team continued working on P2P testnet functionality, including switching between P2P and non-P2P networks, diagnosis of deadlock events in the connection manager, enhanced connection shutdown properties, and strict TVar interface.

June 25, 2021:

This week, the team developed a reviewable version of the P2P switch feature. They also merged a clean connection shutdown PR (connection-manager part), rebased the p2p-master branch on top of the multiplexer clean connection shutdown (and tested the two in combination), and worked on different schedules of io-sim.

July 2, 2021:

This week, the team worked on the P2P switch feature, which is now in review, improved some logging properties, and cleaned up the P2P master branch in the cardano-node GitHub repository.

July 9, 2021:

This week, the team completed the integration of the P2P switch feature, which allows SPOs to run a node either in P2P mode or with statically configured peers. They worked on error notifications when supplying a wrong topology file, improved logging JSON instances, and made improvements to the cardano-node p2p-master branch.

July 16, 2021:

This week, the team upgraded the P2P switch, worked on network tracers, fixed some tests, and merged the server simulation. They are now in the process of running simulation tests.

July 23, 2021:

This week, the team fixed some tests in the cardano-node repository, made logging improvements, and made changes to the P2P master branch.

July 30, 2021:

The team also restructured the P2P to non-P2P switch for better clarity and easier maintenance of the data diffusion API.

August 20, 2021:

This week, the team worked on the P2P master branch to implement support for node v.1.28.0.

August 27, 2021:

This week, the team continued working on network simulations, resolved some Cardano node issues, and rebased the P2P master branch on top of the Cardano node v.1.28.0. They are now in the process of testing the P2P suite.

This is the last update I could find. Even after gathering this information, I still don't know whether the feature is included in 1.28 or 1.29, on testnet or mainnet, or what the name of the P2P config switch will be, but this is progress. Let's read the code in the p2p-master branch and find out.

Finally, there is also this issue discussed on the forum about the current topology update through api.clio.one and the problems it causes for Kubernetes.

regel commented 3 years ago

As of today, the p2p-master branch does not seem to have been merged into a stable cardano-node release yet:

This branch is 14 commits ahead, 257 commits behind master.

This README file contains information on the new topology file format we can expect, but I guess it may still change in the future if P2P testing is still ongoing. I did not find a reference to the EnableP2P config switch in the documentation. Missing in the docs?
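For orientation, here is a hedged sketch of what migrating a static topology to a P2P-style one might look like. The thread does not show either format, so every field name below is an assumption: `Producers`/`addr`/`port` for the legacy file, and `localRoots`/`publicRoots`/`accessPoints`/`useLedgerAfterSlot` for the P2P file as documented for later P2P-enabled releases. The layout on the p2p-master branch discussed here may differ, so check the README before relying on it. `legacy_to_p2p` is a hypothetical helper.

```python
def legacy_to_p2p(legacy: dict, use_ledger_after_slot: int) -> dict:
    """Turn a static topology into a P2P-style one (assumed field names).

    All legacy producers become one local root group; the valency is set to
    the group size so that every listed peer is kept connected. The caller
    supplies use_ledger_after_slot, which is passed through unchanged.
    """
    access_points = [
        {"address": producer["addr"], "port": producer["port"]}
        for producer in legacy.get("Producers", [])
    ]
    local_roots = []
    if access_points:
        local_roots.append(
            {
                "accessPoints": access_points,
                "advertise": False,
                "valency": len(access_points),
            }
        )
    return {
        "localRoots": local_roots,
        "publicRoots": [],
        "useLedgerAfterSlot": use_ledger_after_slot,
    }
```

For example, `legacy_to_p2p({"Producers": [{"addr": "relay.example", "port": 3001, "valency": 1}]}, 0)` yields a single local root group containing that one access point.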

regel commented 3 years ago

Hey, I just released Helm Charts to run cardano node containers 🐳 in Kubernetes.

It solves the peer-to-peer topology update by repeating this process every 24 hours:

1. reading the on-chain data once per day 🕐 to find registered nodes,
2. then vetting (is it alive, is the metadata hash valid, and so on) 🆗 these registered nodes,
3. finally selecting random nodes in this "valid" set of nodes,
4. publishing the 'new' topology to an internal Redis pub-sub topic,
5. the new 🆕 topology file is received by an internal Redis client, and written in the filesystem mount of the relay container where it triggers a restart of the pod,
6. the pod is restarted automatically and uses the new topology 🍾
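The vetting and random-selection steps above can be sketched roughly as follows. This is not the Helm chart's code: `is_alive` and `select_peers` are hypothetical names, the liveness check is reduced to a plain TCP connect, and the metadata-hash check, the on-chain lookup, and the Redis publishing are left out.

```python
import random
import socket
from typing import Callable, Iterable, List, Optional, Tuple

Relay = Tuple[str, int]  # (host, port)


def is_alive(relay: Relay, timeout: float = 3.0) -> bool:
    """Very rough liveness check: can a TCP connection be opened at all?"""
    host, port = relay
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_peers(
    registered: Iterable[Relay],
    count: int,
    vet: Callable[[Relay], bool] = is_alive,
    rng: Optional[random.Random] = None,
) -> List[Relay]:
    """Keep only relays that pass vetting, then pick up to `count` at random."""
    rng = rng or random.Random()
    valid = [relay for relay in registered if vet(relay)]
    return rng.sample(valid, min(count, len(valid)))
```

Passing `vet` and `rng` as parameters keeps the selection testable without network access: a stub vetting function and a seeded `random.Random` make the result reproducible.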

Everything in this implementation is fully autonomous 🚗, and fully decentralised since it runs locally in the cluster. The topology update process is fair and transparent. Topology is updated and discovered automatically from the blockchain data itself and nothing else!

Link to the code on Github

Link to the peer to peer extension: Github

coot commented 2 years ago

This is available in the (not yet supported) P2P version, and it will not be backported to non-P2P nodes.

IntersectMBO / cardano-node

[FR] - Live topology updates without node restart #3038