ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

StaticNodes / TrustedNodes useless in PoA Setup #23210

Open hickscorp opened 3 years ago

hickscorp commented 3 years ago

Background

We are running a set of sealers and nodes in a docker swarm. Upon restarting the cluster, discovery between nodes doesn't work - so we have to add each node and sealer to each other node and sealer.

Initially (when we were running Geth 1.8), we had an extra container in charge of connecting to each node over HTTP RPC, querying it, and registering it with all the others, also over HTTP RPC. The reason is that docker swarm doesn't guarantee that a container keeps the same IP address, so our extra container ran a script to "scrape" IP addresses and build enode:// URLs to distribute across the nodes.
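For illustration, the scraper's job can be sketched roughly as follows: given each node's ID and current IP address, build one enode:// URL per node, then one admin_addPeer JSON-RPC request for every ordered pair of distinct nodes. This is not the original script. The helper names, the shortened sample IDs, the addresses, and the ports 30303 and 8545 are all assumptions made for this sketch.

```javascript
// Hypothetical helper: build an enode URL from a node ID (hex public key),
// a host (IP address or name) and the TCP listening port.
function buildEnodeUrl(nodeId, host, port) {
  return "enode://" + nodeId + "@" + host + ":" + port;
}

// Hypothetical helper: JSON-RPC 2.0 request body for the admin_addPeer method.
function addPeerRequest(enodeUrl, requestId) {
  return { jsonrpc: "2.0", id: requestId, method: "admin_addPeer", params: [enodeUrl] };
}

// For every node, list the requests that register all *other* nodes with it.
function planRegistrations(nodes) {
  var plan = [];
  var requestId = 1;
  nodes.forEach(function (target) {
    nodes.forEach(function (peer) {
      if (peer.id === target.id) return; // a node does not add itself
      plan.push({
        rpcUrl: target.rpcUrl,
        body: addPeerRequest(buildEnodeUrl(peer.id, peer.ip, peer.port), requestId++)
      });
    });
  });
  return plan;
}

// Made-up sample data. Real node IDs are 128 hex characters long.
var nodes = [
  { id: "aa11", ip: "10.0.0.2", port: 30303, rpcUrl: "http://10.0.0.2:8545" },
  { id: "bb22", ip: "10.0.0.3", port: 30303, rpcUrl: "http://10.0.0.3:8545" },
  { id: "cc33", ip: "10.0.0.4", port: 30303, rpcUrl: "http://10.0.0.4:8545" }
];
var plan = planRegistrations(nodes);
console.log(JSON.stringify(plan, null, 2));
```

Each body would then be POSTed to its rpcUrl, which requires the admin namespace to be exposed over HTTP on that node.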

Problem Statement

More recently, we updated to Geth 1.10 and were pleasantly surprised to discover that enode:// specifications can now accept DNS names (so we wouldn't need this scraper container). We tried it, and an enode://...@dns_name@nodiscover=1 works great. So we decided that instead of having a "scraping service" in an extra container, we would connect our cluster using static-nodes.json (which didn't work, apparently deprecated) and then using a geth.toml file. At this point, our TOML file looked pretty much like this:

[Node.P2P]
NoDiscovery = true
# enode1, enode2, etc. stand for full enode:// URL strings, one per node.
StaticNodes = ["enode1", "enode2"]
TrustedNodes = []

Unfortunately, this attempt fell short because Geth refuses to boot if any of the StaticNodes / TrustedNodes is unreachable - so there's a bit of a catch-22 situation here when restarting the whole cluster.

Note that we tried using either / both StaticNodes and TrustedNodes.

Suggestion 1

It would be quite nice to have these StaticNodes and / or TrustedNodes act as a warning rather than a FATAL error - this way the node would boot, fail to contact the nodes, and retry later on.
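The behaviour requested here could look roughly like the sketch below: each configured peer whose host cannot be resolved yet is logged as a warning and queued for a later retry, rather than aborting startup. This is not Geth's implementation. The names hostOf, splitByResolvability and canResolve are hypothetical, and the resolver is stubbed so the example runs without a network.

```javascript
// Extract the host from an enode URL of the form enode://<id>@<host>:<port>.
// Simplification for this sketch: IPv6 literals are not handled.
function hostOf(enodeUrl) {
  var afterAt = enodeUrl.split("@")[1] || "";
  return afterAt.split(":")[0];
}

// Split the configured peers into those that can be dialed now and those
// that should be retried later. Nothing here is fatal.
function splitByResolvability(staticNodes, canResolve) {
  var ready = [];
  var pending = [];
  staticNodes.forEach(function (enodeUrl) {
    var host = hostOf(enodeUrl);
    if (canResolve(host)) {
      ready.push(enodeUrl);
    } else {
      console.warn("cannot resolve " + host + " yet, will retry later");
      pending.push(enodeUrl);
    }
  });
  return { ready: ready, pending: pending };
}

// Stubbed resolver: pretend only node1 is up so far.
var known = { node1: true };
var result = splitByResolvability(
  ["enode://aa11@node1:30303", "enode://bb22@node2:30303"],
  function (host) { return known[host] === true; }
);
console.log(result.ready.length + " ready, " + result.pending.length + " pending");
```

A real retry loop would call this again on the pending list after a delay, until the list is empty.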

Current State of Things

We didn't stop there. We wrote a script such as this one:

var peers = [ ...all of our enodes... ];

// This function connects to all peers listed in the `peers` variable.
function connectPeers() {
  return peers.map(function (peer) {
    return {
      peer: peer,
      addPeer: admin.addPeer(peer),
      addTrustedPeer: admin.addTrustedPeer(peer)
    };
  });
}

// --- Execution. ---

connectPeers();

We were hopeful that there would somehow be an option to tell Geth to run this script. We discovered that we cannot really do that unless we use geth console or geth attach, which doesn't work in our case since we would still need to run this manually after the cluster has started.

Suggestion 2

Maybe allow for a script to be executed when geth starts without console / attach, so that admin.addPeer / admin.addTrustedPeer could be used.

karalabe commented 3 years ago

This seems odd to me. Could you post the error message you are getting? Static/trusted nodes are merely suggestions. They should not result in any errors if they are offline. My best guess is that parsing the enode IDs fails, which should be apparent from the error message.

hickscorp commented 3 years ago

@karalabe thanks a lot for your help.

I know for a fact that the parsing goes well. The message is:

Fatal: /root/geth.toml, line 2: (p2p.Config.StaticNodes) lookup node1-new on 127.0.0.11:53: no such host

After which geth exits.

EDIT: Just FYI I have found a workaround - sleeping for 5 seconds before booting geth. This way, it gives 5s for all the containers to spin up. Geth crashes a few times, the containers are restarted, and when a happy coincidence of all the nodes being up happens, things connect. But I really think that the "trying to connect to the nodes specified in the TOML file" should not be a Fatal, but a Warning and let Geth boot as usual and retry later...

hickscorp commented 3 years ago

Bump - any chance that connecting to peers specified in StaticNodes / TrustedNodes could gracefully be a warning and retried rather than a boot error?

hickscorp commented 2 years ago

Bump?

hickscorp commented 2 years ago

Anyone please? Is it that I didn't phrase the issue correctly, or that there isn't any interest in addressing it?

haidarabdillah commented 1 year ago

Any update on this issue? I'm really interested in using a domain name rather than an IP; it's more flexible when the IP can't be accessed.