algorand / conduit

Algorand's data pipeline framework.
MIT License

Algo follower node crashes periodically #157

Closed. timurgum closed this issue 2 months ago.

timurgum commented 10 months ago

The node crashes with logs like this:

{"__type":"importer","_name":"algod","level":"error","msg":"called waitForRoundWithTimeout: wrong round returned from status for round: retrieved(32919745) != expected(32919746): status2.LastRound mismatch: context deadline exceeded"}

Who else has had this? How do I fix it?

Here is the Conduit config:

log-level: INFO
retry-count: 10
retry-delay: "1s"
hide-banner: false
metrics:
    mode: OFF
    addr: ":9999"
    prefix: "conduit"
importer:
    name: algod
    config:
        mode: "follower"
        netaddr: "http://algo:18720"
        token: "algo_not_admin_token"
        catchup-config:
            admin-token: "algo_admin_token"
processors:
exporter:
    name: postgresql
    config:
        connection-string: "host=algorand.db port=5432 user=algorand password=algo_pass dbname=algoranddb sslmode=disable"
        max-conn: 20
        delete-task:
            interval: 0
            rounds: 100000
telemetry:
  enabled: false
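The `retry-count` and `retry-delay` settings above control how many times Conduit retries a failed round fetch before giving up. A minimal sketch of that retry pattern (the `fetch_round` callable here is a hypothetical stand-in, not Conduit's actual importer code):

```python
import time


def with_retries(fetch_round, round_number, retry_count=10, retry_delay=1.0):
    """Call fetch_round(round_number), retrying up to retry_count times.

    Mirrors the retry-count / retry-delay settings in the config above.
    fetch_round is a hypothetical stand-in for the importer's round fetch.
    """
    last_err = None
    for _ in range(retry_count + 1):
        try:
            return fetch_round(round_number)
        except Exception as err:  # e.g. a timeout or a round mismatch
            last_err = err
            time.sleep(retry_delay)
    raise RuntimeError(
        f"round {round_number} failed after {retry_count} retries"
    ) from last_err
```

With `retry-count: 10` and `retry-delay: "1s"`, a transient "context deadline exceeded" like the one in the log above is retried for about ten seconds before the pipeline gives up on the round.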
alecalve commented 7 months ago

We are encountering the same issue.

urtho commented 7 months ago

Define crashing. Is the follower really crashing, or just failing to deliver the new round in time, so the indexer shows a timeout and resumes after a while?

Is this happening during re-sync, or once your indexer is synced and just following the latest blocks?

Also, you can now switch to Algonode's virtual follower and just point your Conduit at:

netaddr: "http://mainnet-api.algonode.cloud"
token: ""

This will get you uninterrupted indexing.
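Applied to the config posted above, the importer block would then look something like this (a sketch based on the comment; presumably no admin token is needed for the public endpoint, so the `catchup-config` section is dropped):

```yaml
importer:
    name: algod
    config:
        mode: "follower"
        netaddr: "http://mainnet-api.algonode.cloud"
        token: ""
```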

alecalve commented 7 months ago

For us, it stopped once the node caught up with the network.

urtho commented 7 months ago

OK. Try this setting in the node's config.json: "CatchupParallelBlocks": 32

or, using the command line tool:

algocfg set -p CatchupParallelBlocks -v 32

What happens is that the follower sources blocks from random relays, and you might get unlucky enough to hit relays that are far away and slow, which will cause your Conduit to time out and resume after some time. Setting this parameter to a higher value decreases the chances of that.
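The effect is easy to see in a toy simulation: if each relay fetch has random latency, fetching blocks in batches of `k` in parallel hides the occasional slow relay much better than fetching one block at a time. This is purely illustrative (exponential latencies, naive batching), not Conduit's or algod's actual catchup scheduler:

```python
import random


def catchup_time(parallel, blocks=320, seed=7):
    """Simulated time to fetch `blocks` blocks, `parallel` at a time.

    Each fetch latency is drawn from an exponential with mean 1s;
    a batch of parallel fetches finishes when its slowest fetch does.
    Toy model only, not real conduit/algod code.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(0, blocks, parallel):
        total += max(rng.expovariate(1.0) for _ in range(parallel))
    return total
```

In this model, fetching 320 blocks one at a time costs roughly the sum of 320 individual latencies, while 32-at-a-time costs only 10 batch maxima, so a single slow relay stalls far less of the sync, which is the intuition behind raising `CatchupParallelBlocks`.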

Soon mainnet will migrate to non-archival relays, and followers will start sourcing blocks from a smaller but dedicated set of archivers that should be more reliable in this regard.

There are more tricks that can help, but they are harder to maintain and include a catchup + follower + load-balancer trio for the smoothest ride :)

alecalve commented 5 months ago

It is still happening to us when the node is writing a catchpoint.

gmalouf commented 2 months ago

Hi, checking in: is this still an issue you are encountering?

alecalve commented 2 months ago

It's not been happening recently.