algorand / go-algorand

Algorand's official implementation in Go.
https://developer.algorand.org/

Crash on catchup abort if start fails #6051

Open mxmauro opened 4 months ago

mxmauro commented 4 months ago

While trying to set up a starting point close to a block I need, I started a node catchup with a random block number:

goal node catchup 34000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ -d data

I got this error, which I expected:

Cannot contact Algorand node: HTTP 408 Request Timeout: unable to start catchpoint service for requested catchpoint 34000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ: aborting catchup Start(): checkLedgerDownload(): catchpoint '34000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ' unavailable from peers: no ledger available for given round

So I tried again with a different round:

goal node catchup 33000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ -d data

This gave the first unexpected response, since the previous catchup had failed to start:

Cannot contact Algorand node: HTTP 400 Bad Request: unable to start catchpoint catchup for '33000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ' - already catching up '34000000#OYD7DMAWH3V3LDB66STPSIKIHPEQKCRJ6QSCRZR6L4CNYVSLFFTQ'

So I tried to abort:

goal node catchup --abort -d data

And algod crashed:

echo: http: panic serving 127.0.0.1:64863: runtime error: invalid memory address or nil pointer dereference
goroutine 1908 [running]:
net/http.(*conn).serve.func1()
        net/http/server.go:1898 +0xbe
panic({0x141b0b8e0?, 0x1418ab7a0?})
        runtime/panic.go:770 +0x132
github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).Abort(0xc008d81808)
        github.com/algorand/go-algorand/catchup/catchpointService.go:181 +0x16
github.com/algorand/go-algorand/node.(*AlgorandFollowerNode).AbortCatchup(0xc000455808, {0xc00553650c, 0x3d})
        github.com/algorand/go-algorand/node/follower_node.go:383 +0x1ef
github.com/algorand/go-algorand/daemon/algod/api/server/v2.(*Handlers).abortCatchup(0xc0084ba420, {0x1424e9278, 0xc0060295e0}, {0xc00553650c, 0x3d})
        github.com/algorand/go-algorand/daemon/algod/api/server/v2/handlers.go:1665 +0x73
github.com/algorand/go-algorand/daemon/algod/api/server/v2.(*Handlers).AbortCatchup(0xc0060295e0?, {0x1424e9278?, 0xc0060295e0?}, {0xc00553650c?, 0xc007ff8af8?})
        github.com/algorand/go-algorand/daemon/algod/api/server/v2/handlers.go:1846 +0x25
github.com/algorand/go-algorand/daemon/algod/api/server/v2/generated/nonparticipating/private.(*ServerInterfaceWrapper).AbortCatchup(0xc000833cd0, {0x1424e9278, 0xc0060295e0})
        github.com/algorand/go-algorand/daemon/algod/api/server/v2/generated/nonparticipating/private/routes.go:82 +0x1ab
github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*AuthMiddleware).handler-fm.(*AuthMiddleware).handler.func1({0x1424e9278, 0xc0060295e0})
        github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares/auth.go:100 +0x3c5
github.com/labstack/echo/v4.(*Echo).add.func1({0x1424e9278, 0xc0060295e0})
        github.com/labstack/echo/v4@v4.9.1/echo.go:536 +0x4b
github.com/labstack/echo/v4/middleware.CORSWithConfig.func1.1({0x1424e9278, 0xc0060295e0})
        github.com/labstack/echo/v4@v4.9.1/middleware/cors.go:190 +0x463
github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*LoggerMiddleware).handler-fm.(*LoggerMiddleware).handler.func1({0x1424e9278, 0xc0060295e0})
        github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares/logger.go:52 +0xad
github.com/labstack/echo/v4.(*Echo).ServeHTTP.func1({0x1424e9278, 0xc0060295e0})
        github.com/labstack/echo/v4@v4.9.1/echo.go:640 +0x127
github.com/algorand/go-algorand/daemon/algod/api/server.NewRouter.RemoveTrailingSlash.RemoveTrailingSlashWithConfig.func4.1({0x1424e9278, 0xc0060295e0})
        github.com/labstack/echo/v4@v4.9.1/middleware/slash.go:118 +0x1fd
github.com/algorand/go-algorand/daemon/algod/api/server.NewRouter.MakeConnectionLimiter.func1.1({0x1424e9278, 0xc0060295e0})
        github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares/connectionLimiter.go:42 +0x90
github.com/labstack/echo/v4.(*Echo).ServeHTTP(0xc0080ed208, {0x1424bfe60, 0xc005334e00}, 0xc00060ed80)
        github.com/labstack/echo/v4@v4.9.1/echo.go:646 +0x327
net/http.serverHandler.ServeHTTP({0xc008082270?}, {0x1424bfe60?, 0xc005334e00?}, 0x6?)
        net/http/server.go:3137 +0x8e
net/http.(*conn).serve(0xc00535d050, {0x1424c5220, 0xc00835b110})
        net/http/server.go:2039 +0x5e8
created by net/http.(*Server).Serve in goroutine 833
        net/http/server.go:3285 +0x4b4

It seems the call to cs.abortCtxFunc panicked because it is nil.

Your environment

I'm using the Windows binaries, but this is not related to the OS.

Expected behaviour

Apart from the crash, I don't expect to have to abort a catchpoint catchup that couldn't start.

algorandskiy commented 2 months ago

@mxmauro Could you tell us the version or commit hash it happened at? It sounds like there are a few issues:

  1. A failed catchup attempt to 34000000 has not reset fast catchup state properly.
  2. Abort calls a non-initialized (or cleared) cancellation function, most likely as a consequence of (1).

mxmauro commented 2 months ago

Hi @algorandskiy, the last commit I tested was c8407abca80f4682aac43a5ccc8cd524051f4f63.

And I agree that a failure in the fast catchup process does not reset the state properly.