contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Faktory crashing on startup. #417

Closed jeremyowensboggs closed 1 year ago

jeremyowensboggs commented 1 year ago

Faktory Enterprise 1.6.1 linux/amd64 © 2022 Contributed Systems LLC.
I 2022-10-03T14:41:59.701Z Licensed to Pepsico , max 100 connections
I 2022-10-03T14:41:59.701Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
I 2022-10-03T14:41:59.714Z Web server now listening at :7420
I 2022-10-03T14:41:59.715Z Sending statsd metrics to 10.7.200.90:8125 with namespace simple-machine
I 2022-10-03T14:41:59.717Z PID 1 listening at :7419, press Ctrl-C to stop
I 2022-10-03T14:42:00.715Z Dead processed 2 jobs
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6c6d71]

goroutine 18 [running]:
github.com/contribsys/faktory/manager.(*manager).processFailure(0xc00013e0c0, {0xc00002e228, 0x18}, 0xb388e0)
    /Users/mperham/src/github.com/contribsys/faktory/manager/retry.go:120 +0x251
github.com/contribsys/faktory/manager.(*manager).ReapExpiredJobs.func1({0xc0005a6300, 0x2f3, 0x300})
    /Users/mperham/src/github.com/contribsys/faktory/manager/working.go:222 +0x3cf
github.com/contribsys/faktory/storage.(*redisSorted).RemoveBefore(0xc000116150, {0xc000266080?, 0xb405c0?}, 0xa, 0xc0005a2060)
    /Users/mperham/src/github.com/contribsys/faktory/storage/sorted_redis.go:288 +0x31a
github.com/contribsys/faktory/manager.(*manager).ReapExpiredJobs(0xc00013e0c0, {0x0?, 0xc00032ce08?, 0xb405c0?})
    /Users/mperham/src/github.com/contribsys/faktory/manager/working.go:193 +0x135
github.com/contribsys/faktory/server.(*reservationReaper).Execute(0xc000117d88)
    /Users/mperham/src/github.com/contribsys/faktory/server/tasks.go:20 +0x45
github.com/contribsys/faktory/server.(*taskRunner).cycle(0xc0001183c0)
    /Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:99 +0x1e5
github.com/contribsys/faktory/server.(*taskRunner).Run.func1()
    /Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:65 +0xa7
created by github.com/contribsys/faktory/server.(*taskRunner).Run
    /Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:58 +0x72

Are you using an old version? Yes

Have you checked the changelogs to see if your issue has been fixed in a later version? N/A

https://github.com/contribsys/faktory/blob/master/Changes.md
https://github.com/contribsys/faktory/blob/master/Pro-Changes.md
https://github.com/contribsys/faktory/blob/master/Ent-Changes.md

mperham commented 1 year ago

Somehow you got a job into Faktory which has no retry attribute at all. Both PUSH and PUSHB add retry if it's not there so I have no idea how this could happen.
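
To illustrate why a job without a usable retry value can end in a nil pointer dereference, here is a minimal Go sketch. It is not Faktory's actual code: the `job` struct, `parseJob`, and the default of 25 are assumptions made for illustration. It shows that a pointer-typed field decodes to nil when `"retry"` is missing or null, and how a default can be filled in before anything dereferences it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job is a hypothetical, trimmed-down stand-in for a job payload.
// Retry is a pointer, so a missing or null "retry" decodes to nil.
type job struct {
	Jid   string `json:"jid"`
	Retry *int   `json:"retry"`
}

// defaultRetry is an assumed default, used only for this illustration.
const defaultRetry = 25

// parseJob decodes a payload and fills in Retry when it is absent or null,
// so that later code can dereference it safely.
func parseJob(data []byte) (*job, error) {
	var j job
	if err := json.Unmarshal(data, &j); err != nil {
		return nil, err
	}
	if j.Retry == nil {
		n := defaultRetry
		j.Retry = &n
	}
	return &j, nil
}

func main() {
	payloads := []string{
		`{"jid":"a","retry":null}`, // explicit null
		`{"jid":"b"}`,              // attribute missing entirely
		`{"jid":"c","retry":-1}`,   // present, so left untouched
	}
	for _, p := range payloads {
		j, err := parseJob([]byte(p))
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s retry=%d\n", j.Jid, *j.Retry)
	}
}
```

Without the nil check in `parseJob`, the `*j.Retry` dereference would panic for the first two payloads, which is the same class of failure as the trace above.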

https://github.com/contribsys/faktory/blob/93598e9cddee13a6e49f2d911f407c4e0adf8054/server/commands.go#L154-L157

jeremyowensboggs commented 1 year ago

Is there a way we can clear the job out?

mperham commented 1 year ago

Yes. If you start Redis by pointing it to the datafile, you can fire up redis-cli. One of the entries in the working zset is the bad job. You'll want to ZREM the entry which does not have a retry attribute.

redis-server faktory-redis.conf --path /path/to/faktory/db

Here's the faktory-redis.conf:

https://github.com/contribsys/faktory/blob/93598e9cddee13a6e49f2d911f407c4e0adf8054/storage/redis.go#L429

Make sure you remove all entries without retry.

jeremyowensboggs commented 1 year ago

Not seeing any with no retry, but we have a few with a retry of -1. Would a retry of -1 cause this?

jeremyowensboggs commented 1 year ago

Never mind, there is one with a null value: "retry":null

jeremyowensboggs commented 1 year ago

The job with the null retry is created as the result of an on_success batch callback. However, it has run every 20 minutes since Monday of last week and has succeeded quite a few times in the past week without this problem occurring.

mperham commented 1 year ago

Yes, that's a bug that has been fixed but not released. I will release 1.6.2 this week. In the meantime, try to explicitly set "retry" if possible in your client code where you define the callback.