algorand / go-algorand

Algorand's official implementation in Go.
https://developer.algorand.org/
Other
1.35k stars 470 forks source link

Node's ready status is incorrect while the node is starting #5923

Open stmax82 opened 8 months ago

stmax82 commented 8 months ago

To determine if a node is ready / fully synched I'm calling the python client's ready() function. Internally this is probably making a HTTP GET /ready request.

I noticed that a few seconds after starting the node, ready returns OK instead of status 503. Only after a while it switches to status 503 and then back to OK once it is fully synched.

Here's a python script to reproduce the problem:

import algosdk
import time

ALGOD_TOKEN="XXX"
ALGOD_SERVER="http://localhost:8080"

def main():
    client = algosdk.v2client.algod.AlgodClient(ALGOD_TOKEN, ALGOD_SERVER)
    while True:
        try:
            client.ready()
            print(f"node is ready")
        except algosdk.error.AlgodHTTPError as e:
            print(f"node is not ready, status {e.code}")
        except Exception as e:
            print(f"connection failed: {e}")
        time.sleep(1)

if __name__ == "__main__":
    main()

To reproduce the problem:

  1. Stop the node
  2. Start the script
  3. Start the node

You'll see the following output:

connection failed: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>
connection failed: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>
connection failed: Remote end closed connection without response
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is not ready, status 503 <-- now it figured out that it's actually not ready yet...
node is not ready, status 503
node is not ready, status 503
node is not ready, status 503
node is not ready, status 503
node is ready <-- now it's really ready
node is ready
node is ready
node is ready
node is ready

As you can see it incorrectly starts with "node is ready", then switches to "node is not ready, status 503" and finally back to "node is ready".

I also repeatedly ran goal node status while the node was starting and it displayed the following (note entry number 4):

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

At this point, client.ready() started returning status 503 (not ready):

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.7s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599047
Time since last block: 0.3s
Sync Time: 24.9s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599048
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
root@3ab791cddba2:/algod#

While the node was starting it shows "Sync Time: 0.0s". The client.ready() returns OK. Only once the sync time starts increasing, the client.ready() returns status 503 (not ready).

Your environment

algorandskiy commented 7 months ago

It appears to be SDK issue

AustP commented 2 months ago

Has the /ready endpoint gotten worse? I use the endpoint in A1CN to know if I need to show the syncing page or the dashboard page. I used to have a timeout on startup to skip the first section of incorrect reporting. It's not ideal, but it worked.

Lately, however, the /ready endpoint has reported that the node is ready after doing a fast catchup but before being fully caught up. There's usually a few thousand blocks that need to be downloaded after a catchup. It used to be that /ready would report that the node wasn't ready while those needed to be downloaded. Now (v3.25.0-stable) /ready reports ready during that final stretch.

Note that I'm not using the SDK. I'm making my own REST calls to the node.