ava-labs / avalanchego

Go implementation of an Avalanche node.
https://avax.network
BSD 3-Clause "New" or "Revised" License

When using alternate public-ip resolution and ipv6 networking, node prefers ipv6 and is sidelined as being offline. #3078

Open haight6716 opened 3 months ago

haight6716 commented 3 months ago

Describe the bug When:

To Reproduce

Expected behavior Node should appear healthy on stats.avax.network. It does not.

Screenshots

$ curl -sX POST --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"info.getNodeIP"
}' -H 'content-type:application/json;' 127.0.0.1:9650/ext/info
{"jsonrpc":"2.0","result":{"ip":"[2601:602:8e00:ce98:86c5:2292:57ec:656b]:9651"},"id":1}
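A minimal Go sketch of how a client could tell which address family a node is advertising, using only the `result.ip` field visible in the response above. The helper name `advertisedFamily` and the struct are assumptions for illustration, not part of avalanchego.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// nodeIPReply mirrors only the fields visible in the info.getNodeIP
// response shown above.
type nodeIPReply struct {
	Result struct {
		IP string `json:"ip"`
	} `json:"result"`
}

// advertisedFamily is a hypothetical helper: it returns "ipv4" or "ipv6"
// for the address found in an info.getNodeIP response body.
func advertisedFamily(body []byte) (string, error) {
	var reply nodeIPReply
	if err := json.Unmarshal(body, &reply); err != nil {
		return "", err
	}
	ap, err := netip.ParseAddrPort(reply.Result.IP)
	if err != nil {
		return "", err
	}
	// Unmap so that an IPv4-mapped IPv6 address counts as IPv4.
	if ap.Addr().Unmap().Is4() {
		return "ipv4", nil
	}
	return "ipv6", nil
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","result":{"ip":"[2601:602:8e00:ce98:86c5:2292:57ec:656b]:9651"},"id":1}`)
	family, err := advertisedFamily(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(family) // prints "ipv6"
}
```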

Operating System Ubuntu 12.04

Additional context Discussed in node-support discord: https://discord.com/channels/578992315641626624/620633143002660874/1246927981780271205

Possible enhancement would be to publish both ip and ipv6 keys, as appropriate, in the info.getNodeIP API result. Let clients decide which they prefer. Currently it seems a single node can serve only ipv4 or ipv6, never both. This might lead to a bifurcated network in the worst case.
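To make the proposal concrete, here is a rough Go sketch of what a two-key result could look like. This is purely hypothetical: the `ipv6` key, the `dualStackReply` type, and `encodeReply` do not exist in avalanchego, and the addresses are documentation-range placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dualStackReply is a hypothetical result shape. Today info.getNodeIP
// returns only the "ip" key.
type dualStackReply struct {
	IP   string `json:"ip,omitempty"`   // IPv4 endpoint, if the node has one
	IPv6 string `json:"ipv6,omitempty"` // IPv6 endpoint, if the node has one
}

// encodeReply serializes whichever endpoints are non-empty.
func encodeReply(v4, v6 string) (string, error) {
	b, err := json.Marshal(dualStackReply{IP: v4, IPv6: v6})
	return string(b), err
}

func main() {
	out, err := encodeReply("203.0.113.7:9651", "[2001:db8::1]:9651")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

A client that cannot route IPv6 would read `ip` and ignore `ipv6`; a v6-only client would do the reverse.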

This suggests that an ipv6 node will always be shown as "offline" by stats.avax.network. The stats-tracker should also be fixed to avoid this - ipv6 addresses should be acceptable public ips, but if the stats server has no ipv6 address, it cannot connect.

Workarounds

StephenButtolph commented 3 months ago

When I attempt to connect to the IP linked above, it appears not to be dial-able:

dial tcp [IP]:port: connect: no route to host

I don't think this is an issue with anything not supporting IPv6... It seems like the service is reporting an IP that isn't dial-able. This isn't necessarily unexpected...

> Possible enhancement would be to publish both ip and ipv6 keys, as appropriate, in the info.getNodeIP API result.

These IPs are gossiped throughout the p2p network. It is expected for a node to report what IP they are dial-able at. I don't think it's reasonable for nodes to report multiple IPs to connect to. (Similarly, DNS servers only report a single IP for each DNS request.)

> This suggests that an ipv6 node will always be shown as "offline" by stats.avax.network.

stats.avax.network does support ipv6 addresses as far as I know.

haight6716 commented 3 months ago

You are probably getting 'no route to host' because of a lack of IPv6 infrastructure on your end. I assure you it is routable from here. That said, I'm no longer listening on that IP because I used the workaround to bring my node back online. The same server is listening on port 80, so you can try that to see how it should work:

This should work:

ETA - IP changed: 2601:602:8e00:ce98:36fd:86c:8e35:9aa0

$ telnet 2601:602:8e00:ce98:86c5:2292:57ec:656b 9651
Trying 2601:602:8e00:ce98:86c5:2292:57ec:656b...
Connected to 2601:602:8e00:ce98:86c5:2292:57ec:656b.

Multiple IPs are completely reasonable. DNS allows the client to decide while making the query, asking for either A or AAAA records, with the default being both. Note also that multiple IPs are returned via DNS, even for one protocol. See Google's result:

$ dig any www.google.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> any www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65219
;; flags: qr rd ra; QUERY: 1, ANSWER: 11, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.google.com.            IN  ANY

;; ANSWER SECTION:
www.google.com.     110 IN  A   74.125.199.105
www.google.com.     110 IN  A   74.125.199.147
www.google.com.     110 IN  A   74.125.199.104
www.google.com.     110 IN  A   74.125.199.103
www.google.com.     110 IN  A   74.125.199.106
www.google.com.     110 IN  A   74.125.199.99
www.google.com.     110 IN  AAAA    2607:f8b0:400e:c02::6a
www.google.com.     110 IN  AAAA    2607:f8b0:400e:c02::93
www.google.com.     110 IN  AAAA    2607:f8b0:400e:c02::63
www.google.com.     110 IN  AAAA    2607:f8b0:400e:c02::67
www.google.com.     21410   IN  HTTPS   1 . alpn="h2,h3"
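As a sketch of the client-side choice described above, the Go snippet below groups a list of returned addresses by family, the way a client might after receiving both A and AAAA records. `splitByFamily` is a hypothetical helper, and the sample addresses are copied from the dig output.

```go
package main

import (
	"fmt"
	"net/netip"
)

// splitByFamily is a hypothetical helper: it parses address strings and
// groups them into IPv4 (A-style) and IPv6 (AAAA-style) lists.
func splitByFamily(addrs []string) (v4, v6 []string, err error) {
	for _, s := range addrs {
		a, perr := netip.ParseAddr(s)
		if perr != nil {
			return nil, nil, perr
		}
		// Unmap so that an IPv4-mapped IPv6 address counts as IPv4.
		if a.Unmap().Is4() {
			v4 = append(v4, s)
		} else {
			v6 = append(v6, s)
		}
	}
	return v4, v6, nil
}

func main() {
	answers := []string{
		"74.125.199.105",
		"74.125.199.147",
		"2607:f8b0:400e:c02::6a",
		"2607:f8b0:400e:c02::93",
	}
	v4, v6, err := splitByFamily(answers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("IPv4:", v4)
	fmt.Println("IPv6:", v6)
}
```

In a live program the input could come from the standard library resolver, for example `net.DefaultResolver.LookupNetIP` with network `"ip4"` or `"ip6"`, which lets the caller ask for one family only.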

If stats.avax.network does support v6, can you show an example of a v6 node being reported as online?

I agree stats.avax.network does not support v6 IPs. And that's a problem if avalanchego does support ipv6. All v6 nodes will be reported incorrectly as "not online" even though they are participating in the network, accepting connections on 9651.

V6 gets no respect.

haight6716 commented 3 months ago

To check your own v6 infrastructure, try connecting to google's v6 webserver:

$ telnet 2607:f8b0:400e:c02::6a 80
Trying 2607:f8b0:400e:c02::6a...
Connected to 2607:f8b0:400e:c02::6a.
Escape character is '^]'.

Proof the IP is listening publicly (dialable): image

http://www.ipv6scanner.com/cgi-bin/main.py

No respect

image

Jk, thanks for all you do! Lemme know what I can do to help, I know a little Go.

StephenButtolph commented 3 months ago

Looking into it - thanks for all the info

haight6716 commented 3 months ago

I lost electrical power a few hours ago. I'll update with my new IP whenever I get back online.

New v6 IP on my node you can test against: 2601:602:8e00:ce98:36fd:86c:8e35:9aa0 (previously 2601:602:8e00:ce98:e50b:f01e:251f:e293; I swear I don't typically reboot this often)

Running servers on a consumer internet connection, whee! But seriously, comcast is pretty rock solid and fast. kudos also for the proper /64 ipv6 delegation from them. Say what you want about the customer-facing stuff.

ETA:

I set up a test node, NodeID-KAAKyFf5rSAXCdEDyC8hWzJFkLJPBN4qX, that does not have the workaround applied. You can see it here: 2601:602:8e00:ce98:3252:fd81:354:a31e

I might shut it down in a few days, but ping me and I can fire it back up.

StephenButtolph commented 3 months ago

So, just to update here... As reported it seems like IPv6 isn't as well adopted as I thought 😓.

I suspect a large number of nodes are unable to make outbound IPv6 requests... Which means that nodes that choose to report IPv6 IPs will not be considered dial-able by these nodes.

Looking around there seem to be around 20 nodes on mainnet reporting IPv6 IPs.

This isn't necessarily the end of the world... But can impact network connectivity for those nodes.

haight6716 commented 3 months ago

Thanks for the update. v6 is a hot mess in general, but bit by bit we move that way.

IMO we need a plan for getting to a future where most nodes are v6. Some thin clients are v6-only already. They will require a v6 server. Ideally we can mix-and-match, serving on 4/6, connecting to other clients on either, depending on our preference.
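A rough Go sketch of the mix-and-match idea: given several endpoints for one peer, pick the one this host can actually reach. Everything here is hypothetical, since avalanchego currently gossips a single IP per node. The function `pickAddr`, the `haveV6` and `preferV6` knobs, and the sample addresses are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// pickAddr is a hypothetical helper: from a peer's candidate endpoints it
// returns one the local host can use. haveV6 says whether this host can
// make outbound IPv6 connections; preferV6 breaks ties when both families
// are available.
func pickAddr(candidates []string, haveV6, preferV6 bool) (string, bool) {
	var v4, v6 string
	for _, c := range candidates {
		ap, err := netip.ParseAddrPort(c)
		if err != nil {
			continue // skip malformed entries
		}
		if ap.Addr().Unmap().Is4() {
			if v4 == "" {
				v4 = c
			}
		} else if v6 == "" {
			v6 = c
		}
	}
	switch {
	case haveV6 && v6 != "" && (preferV6 || v4 == ""):
		return v6, true
	case v4 != "":
		return v4, true
	}
	return "", false // nothing reachable from this host
}

func main() {
	peer := []string{"203.0.113.7:9651", "[2001:db8::1]:9651"}
	addr, ok := pickAddr(peer, false, true)
	fmt.Println(addr, ok) // a v4-only host falls back to the IPv4 endpoint
}
```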

Some thoughts, if I were emperor:

I don't see why nodes should even be required to have ipv4, but that's getting pretty far ahead of the ball.

ETA: I have no personal need for this to be fixed at this point - I'm using v4 ok now, but it seems in the best interest of the project long-term. Definitely does not need to be a priority for my sake.

I suspect v6 may be higher performance in some cases because it avoids layers of NAT, or allows jumbograms on LANs, or ...?

Interesting: https://serverfault.com/questions/513942/convenient-public-ipv6-test-addresses

haight6716 commented 3 months ago

I'm going to shut down the test node for now, but will record some evidence about it first.

From the ipv6 test node:

$ uptime
 10:27:22 up 1 day, 21:52,  1 user,  load average: 0.49, 0.41, 0.43

$ curl -sX POST --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"info.getNodeID"
}' -H 'content-type:application/json;' 127.0.0.1:9650/ext/info|json_pp 
{
   "id" : 1,
   "jsonrpc" : "2.0",
   "result" : {
      "nodeID" : "NodeID-KAAKyFf5rSAXCdEDyC8hWzJFkLJPBN4qX",
      "nodePOP" : {
         "proofOfPossession" : "0x91bacccddd77a95247bb909845f6ad20f6c47753380e5c17d360563b64c1ac7c465730bd3cbab8b0c47fbc51e4ecfe1a108c53224faa169cb67e453a23794c29d4820e69e4d08684b95e2433190267c253d1e95e9db63c0b7fcad0f6f2b2a158",
         "publicKey" : "0x8dc561321417fd71d407c125fe357fab799e3f16b4f4b7417feafdd43f56e4e42b56f4d9c156f0e258ebeaeae4b061ec"
      }
   }
}

$ curl -sX POST --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"info.getNodeIP"
}' -H 'content-type:application/json;' 127.0.0.1:9650/ext/info|json_pp 
{
   "id" : 1,
   "jsonrpc" : "2.0",
   "result" : {
      "ip" : "[2601:602:8e00:ce98:c86c:ebfe:8351:4cc8]:9651"
   }
}

From my production node - it can see the test node:

$ curl -sX POST --data '{
    "jsonrpc":"2.0",
    "id"     :1,
    "method" :"info.peers",
    "params": {
        "nodeIDs": ["NodeID-KAAKyFf5rSAXCdEDyC8hWzJFkLJPBN4qX"]
    }
}' -H 'content-type:application/json;' 127.0.0.1:9650/ext/info|json_pp 
{
   "id" : 1,
   "jsonrpc" : "2.0",
   "result" : {
      "numPeers" : "1",
      "peers" : [
         {
            "benched" : [],
            "ip" : "192.168.0.1:35726",
            "lastReceived" : "2024-06-06T10:35:24-07:00",
            "lastSent" : "2024-06-06T10:35:24-07:00",
            "nodeID" : "NodeID-KAAKyFf5rSAXCdEDyC8hWzJFkLJPBN4qX",
            "objectedACPs" : [],
            "observedSubnetUptimes" : {},
            "observedUptime" : "99",
            "publicIP" : "[2601:602:8e00:ce98:3252:fd81:354:a31e]:9651",
            "supportedACPs" : [],
            "trackedSubnets" : [],
            "version" : "avalanchego/1.11.4"
         }
      ]
   }
}
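For anyone wanting to check how many of their own peers advertise IPv6, here is a minimal Go sketch that scans an info.peers response for IPv6 `publicIP` values. It relies only on the `nodeID` and `publicIP` fields visible in the output above; `ipv6Peers` is a hypothetical helper name, and the second peer in the sample is made up for contrast.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// peersReply mirrors only the info.peers fields used here.
type peersReply struct {
	Result struct {
		Peers []struct {
			NodeID   string `json:"nodeID"`
			PublicIP string `json:"publicIP"`
		} `json:"peers"`
	} `json:"result"`
}

// ipv6Peers is a hypothetical helper: it returns the node IDs of peers
// whose advertised publicIP is an IPv6 address.
func ipv6Peers(body []byte) ([]string, error) {
	var reply peersReply
	if err := json.Unmarshal(body, &reply); err != nil {
		return nil, err
	}
	var ids []string
	for _, p := range reply.Result.Peers {
		ap, err := netip.ParseAddrPort(p.PublicIP)
		if err != nil {
			continue // skip peers with an unparseable publicIP
		}
		if !ap.Addr().Unmap().Is4() {
			ids = append(ids, p.NodeID)
		}
	}
	return ids, nil
}

func main() {
	body := []byte(`{"result":{"peers":[
		{"nodeID":"NodeID-KAAKyFf5rSAXCdEDyC8hWzJFkLJPBN4qX","publicIP":"[2601:602:8e00:ce98:3252:fd81:354:a31e]:9651"},
		{"nodeID":"NodeID-example-v4-peer","publicIP":"203.0.113.7:9651"}
	]}}`)
	ids, err := ipv6Peers(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ids)
}
```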

And finally stats.avax.network:

image

github-actions[bot] commented 4 weeks ago

This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.