ava-labs / avalanchego

Go implementation of an Avalanche node.
https://avax.network
BSD 3-Clause "New" or "Revised" License
2.13k stars 670 forks source link

Node not starting up with v1.11.3 #2865

Closed m-quigley closed 7 months ago

m-quigley commented 7 months ago

Describe the bug Was syncing a new node last night on AGo v11.2 Everything seemed to going well but I noticed things had stalled out with ~10 minutes left on the sync. I upgraded AGo to v11.3 this morning and the script is hanging as below - no logs in 2 hours

Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.840] INFO <P Chain> snowman/transitive.go:97 initializing consensus engine
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.841] INFO health/worker.go:109 registered new check and initialized its state to failing {"namespace": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"]}
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.841] INFO server/server.go:211 adding route {"url": "/ext/bc/11111111111111111111111111111111LpoYY", "endpoint": ""}
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.841] INFO health/worker.go:109 registered new check and initialized its state to failing {"namespace": "readiness", "name": "bootstrapped", "tags": ["application"]}
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.841] INFO health/worker.go:109 registered new check and initialized its state to failing {"namespace": "health", "name": "bootstrapped", "tags": ["application"]}
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.841] INFO <P Chain> bootstrap/bootstrapper.go:159 starting bootstrapper
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.842] INFO chains/manager.go:1306 starting chain creator
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.842] INFO node/node.go:629 writing process context {"path": "/home/nodeuser/.avalanchego/process.json"}
Mar 23 02:50:19 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:19.842] INFO node/node.go:656 API server listening {"uri": "http://[::]:9650"}
Mar 23 02:50:25 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:25.986] INFO <P Chain> bootstrap/bootstrapper.go:390 starting to fetch blocks {"numAcceptedBlocks": 1, "numMissingBlocks": 2}
Mar 23 02:50:49 Ubuntu-2204-jammy-amd64-base avalanchego[499457]: [03-23|02:50:49.791] INFO health/worker.go:255 check started passing {"namespace": "health", "name": "network", "tags": ["application"]}

Curling gives

nodeuser@Ubuntu-2204-jammy-amd64-base:~/nodes/base$ curl -d '{"id":0,"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false]}'   -H "Content-Type: application/json" http://localhost:9651
curl: (52) Empty reply from server

To Reproduce Steps to reproduce the behavior.

Expected behavior Node syncing successfully

Screenshots If applicable, add screenshots to help explain your problem.

Logs If applicable, please include the relevant logs that indicate a problem and/or the log directory of your node. By default, this can be found at ~/.avalanchego/logs/. See above

Metrics If applicable, please include any metrics gathered from your node to assist us in diagnosing the problem.

Operating System Ubuntu

Additional context Add any other context about the problem here.

To best protect the Avalanche community security bugs should be reported in accordance to our Security Policy

judaniubi commented 7 months ago

I encountered the same issue on both Azure and AWS.

EduardoFromTheFuture commented 7 months ago

First of all, there is another issue raised for this: https://github.com/ava-labs/avalanchego/issues/2863

I came here to raise an issue, but instead of creating a third, I will just add the info I had shared on discord, so far:

trying to sync with the fuji network launching nodes using avalanchego --network-id=fuji

in three different machines, it is always getting stuck silently in the same step: [03-22|20:50:52.925] INFO <P Chain> bootstrap/bootstrapper.go:644 fetching blocks {"numFetchedBlocks": 90000, "numTotalBlocks": 142157, "eta": "28s"}

any idea of what it is? maybe a problem with fuji itself... or maybe some network issue on my machine and I need to open some port?

The worst thing is that it's completely silent, I can't find information about what could possibly be going on

still the same thing. failing with "numFetchedBlocks": 90000, "numTotalBlocks": 142163

I was using the latest. just cloned earlier today with git clone ava-labs/avalanchego.git then built it. (ommited the link)

arguably I should be using a tag. so I just tried v1.11.0, v1.11.2 and v1.11.3, after deleting the DB with: rm -r ~/.avalanchego/db/fuji/

and restarting the sync, I even tried to launch it differently: ~/ava-labs/avalanchego/build/avalanchego --network-id=fuji --public-ip=$(curl -s ifconfig.me) --http-host "" ~/ava-labs/avalanchego/build/avalanchego --network-id=fuji --public-ip=$(curl -s ifconfig.me) ~/ava-labs/avalanchego/build/avalanchego --network-id=fuji

and no luck

also tested with machines on the cloud, my local machine. all of them failing in the exact same way.

only the p2p port needs to be open for this step, right?

just tried with a downloaded binary of v1.11.3 (instead of building it), also stop silently at the same point

I am calling this a bug in v1.11.3 because I just moved the DB folder of a previously bootstrapped node to make it bootstrap again. surprise surprise: the bootstrapping also halted at the same place.

I believe the bug manifests on the upstream node that are sending data to my bootstrapping node, and not really on my node. I say this because, well, I managed to bootstrap this node before, but currently not now.

I know that @stephenbuttolph has been working on improving the bootstrapping performance, so maybe you could step in? =)

StephenButtolph commented 7 months ago

Curling gives curl: (52) Empty reply from server

This doesn't really relate to the issue too much, but you seem to be curling 9651. 9651 is the default p2p port. The default http port is 9650. So I don't think that curl is doing what you wanted it to do

StephenButtolph commented 7 months ago

I'm able to replicate this. Looking into it today.

StephenButtolph commented 7 months ago

The linked PR seems to have addressed this issue. Thanks for the reports!

EduardoFromTheFuture commented 7 months ago

Thank you. I can confirm that it is solved in 9833b45b3b766813d36df2811d35d2bd8942897e, syncing like a champ, good job!