contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Faktory server crashes on failure of batch success callback #408

Closed by ktowle 2 years ago

ktowle commented 2 years ago
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6c6d71]

goroutine 353 [running]:
github.com/contribsys/faktory/manager.(*manager).processFailure(0xc00022c000, {0xc000357ea8, 0x18}, 0xc00041be00)
        /Users/mperham/src/github.com/contribsys/faktory/manager/retry.go:120 +0x251
github.com/contribsys/faktory/manager.(*manager).Fail(0xc0008fc380?, 0xc00041be00)
        /Users/mperham/src/github.com/contribsys/faktory/manager/retry.go:35 +0x85
github.com/contribsys/faktory/server.fail(0xc0001f43e0, 0xc000130080, {0xc0008fc000, 0x326})
        /Users/mperham/src/github.com/contribsys/faktory/server/commands.go:233 +0xc2
github.com/contribsys/faktory/server.(*Server).processLines(0xc000130080, 0xc0001f43e0)
        /Users/mperham/src/github.com/contribsys/faktory/server/server.go:329 +0x3a2
github.com/contribsys/faktory/server.(*Server).Run.func1({0x930048?, 0xc00019c198?})
        /Users/mperham/src/github.com/contribsys/faktory/server/server.go:147 +0x8e
created by github.com/contribsys/faktory/server.(*Server).Run
        /Users/mperham/src/github.com/contribsys/faktory/server/server.go:140 +0x205

We're experiencing this crash when the success callback job for a finished batch fails. As soon as the worker task running the callback terminates due to the error, something causes the Faktory server to crash as shown above. If the callback finishes successfully, all is fine. This appears to happen even before the worker can attempt the failure API call. We see this in both our test and production environments (Ubuntu 20.04) and when running locally against the macOS version.

We realize this could be something about the worker or the way we're doing things, but we're hoping the stack trace above will suggest what that could be...

mperham commented 2 years ago

What is "Enterprise stable main"? How are you not using an explicit version?

mperham commented 2 years ago

It's crashing because the Retry element is nil. If you manually set the callback's Retry element to an integer like the default of 25, it should work. The Batch enqueue, because it's internal to Faktory, was bypassing this bit of logic:

https://github.com/contribsys/faktory/blob/44668c76d2d2eeca9e7c1b61f8fd9b0c296e53c3/server/commands.go#L154-L157
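To illustrate the failure mode described above, here is a minimal, self-contained Go sketch. It is not Faktory's actual code: the `Job` struct, `applyDefaults`, `retriesAllowedUnsafe`, and `panics` are hypothetical names, and the only detail taken from this thread is that a nil Retry is dereferenced and that the default is 25. It shows why a job that skips the defaulting step panics, and why the same job is fine once the default is applied.

```go
package main

import "fmt"

// defaultRetry mirrors the default of 25 mentioned in this thread.
const defaultRetry = 25

// Job is a hypothetical stand-in for a job record; only the field
// relevant to this issue is modeled.
type Job struct {
	Type  string
	Retry *int // nil when nothing ever set a retry count
}

// retriesAllowedUnsafe dereferences Retry without a nil check,
// so it panics for a job whose Retry was never set.
func retriesAllowedUnsafe(j *Job) int {
	return *j.Retry
}

// applyDefaults fills in a missing Retry, the kind of guard that an
// internally enqueued job could bypass.
func applyDefaults(j *Job) {
	if j.Retry == nil {
		r := defaultRetry
		j.Retry = &r
	}
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	cb := &Job{Type: "BatchSuccess"} // Retry left nil, as for the batch callback

	fmt.Println("panics before defaults:", panics(func() { retriesAllowedUnsafe(cb) }))

	applyDefaults(cb)
	fmt.Println("retry after defaults:", retriesAllowedUnsafe(cb))
}
```

In the sketch the panic is recovered only so both cases can be shown in one run; in the trace above nothing recovers it, so the whole server process exits.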

ktowle commented 2 years ago

Sorry - I'm new to this project (and to Faktory), but our Dockerfile appears to be pulling the latest stable main version each time we build. Per the log, we're getting Faktory Enterprise 1.6.1 linux/amd64.

ktowle commented 2 years ago

Ah - that makes sense - off to try it..

mperham commented 2 years ago

It'll be fixed in 1.6.2. Thank you!

ktowle commented 2 years ago

Setting a retry count works - thanks again.
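For anyone hitting this before upgrading, the workaround amounts to making sure the callback job's payload carries an explicit `retry` value. The Go sketch below is only an illustration under stated assumptions: `CallbackJob`, `withRetry`, and `encode` are hypothetical helpers rather than part of any Faktory client library, and the struct models just the `jid`, `jobtype`, `args`, and `retry` fields of a job payload. The job ID and job type are made-up values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CallbackJob is a hypothetical, trimmed-down job payload.
// Retry is a pointer so that "not set" can be told apart from 0.
type CallbackJob struct {
	Jid   string        `json:"jid"`
	Type  string        `json:"jobtype"`
	Args  []interface{} `json:"args"`
	Retry *int          `json:"retry,omitempty"`
}

// withRetry returns a copy of j with an explicit retry count.
func withRetry(j CallbackJob, n int) CallbackJob {
	j.Retry = &n
	return j
}

// encode renders the payload as JSON.
func encode(j CallbackJob) string {
	b, err := json.Marshal(j)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	cb := CallbackJob{Jid: "abc123", Type: "BatchSuccess", Args: []interface{}{}}

	// Without an explicit retry, the field is absent from the payload.
	fmt.Println(encode(cb))

	// With the workaround, the payload carries "retry":25.
	fmt.Println(encode(withRetry(cb, 25)))
}
```

How the retry count is actually set depends on the worker library in use; the point is only that the success callback should not be left without one on 1.6.1.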