contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Faktory crashed with fatal error: concurrent map writes #483

Closed · seedifferently closed this issue 2 months ago

seedifferently commented 2 months ago

Our Faktory Enterprise process crashed this morning with a fatal error: concurrent map writes as we were ramping up a batch of about 30k jobs. We're running 1.8.0 and don't see anything in the 1.9.0 changelog that indicates there's been a bugfix for something like this.

Here's a snippet from the head of the crash log; I can provide the full output if needed:

fatal error: concurrent map writes
goroutine 433267 [running]:
github.com/mperham/faktory-comm/ent/throttle.(*Throttle).LockAndPop(0x40000b6660, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, 0x4000078150)
/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:371 +0x388
github.com/mperham/faktory-comm/ent/throttle.(*ThrottledFetch).throttledFetch(0x4000596d20, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, {0x4000739e10?, 0xe, 0xffffa9a355b8?})
/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:444 +0x160
github.com/mperham/faktory-comm/ent/throttle.(*ThrottledFetch).Fetch(0x4000647cc8?, {0x559528?, 0x40008499d0?}, {0x4000ae13b0?, 0x4000647ce8?}, {0x4000739e10?, 0x51e98e8e62601?, 0x4000873ae0?})
/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:480 +0x2c
github.com/contribsys/faktory/manager.(*manager).Fetch(0x400019c000, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, {0x4000739e10, 0xe, 0xe})
/Users/mperham/src/github.com/contribsys/faktory/manager/fetch.go:100 +0x118
github.com/contribsys/faktory/server.fetch(0x4000c4fbc0, 0x4000180000, {0x4000733810, 0xa1})
/Users/mperham/src/github.com/contribsys/faktory/server/commands.go:182 +0xe0
github.com/contribsys/faktory/server.(*Server).processLines(0x4000180000, 0x4000c4fbc0)
/Users/mperham/src/github.com/contribsys/faktory/server/server.go:332 +0x3d0
github.com/contribsys/faktory/server.(*Server).Run.func1({0x55aa08?, 0x40000875c8?})
/Users/mperham/src/github.com/contribsys/faktory/server/server.go:148 +0x74
created by github.com/contribsys/faktory/server.(*Server).Run in goroutine 8
/Users/mperham/src/github.com/contribsys/faktory/server/server.go:141 +0x19c
goroutine 1 [chan receive, 24014 minutes]:
main.main()
/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:100 +0x73c
goroutine 18 [syscall, 24014 minutes]:
syscall.Syscall6(0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/syscall/syscall_linux.go:91 +0x2c
os.(*Process).blockUntilWaitable(0x40000c0420)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/wait_waitid.go:32 +0x6c
os.(*Process).wait(0x40000c0420)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec_unix.go:22 +0x2c
os.(*Process).Wait(...)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec.go:134
os/exec.(*Cmd).Wait(0x40000f4160)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec/exec.go:890 +0x38
github.com/contribsys/faktory/storage.bootRedis.func2()
/Users/mperham/src/github.com/contribsys/faktory/storage/redis.go:163 +0x30
created by github.com/contribsys/faktory/storage.bootRedis in goroutine 1
/Users/mperham/src/github.com/contribsys/faktory/storage/redis.go:162 +0x7c8
goroutine 5 [select]:
github.com/contribsys/faktory/server.(*taskRunner).Run.func1()
/Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:67 +0xc0
created by github.com/contribsys/faktory/server.(*taskRunner).Run in goroutine 1
/Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:59 +0x70
goroutine 6 [sleep, 944 minutes]:
time.Sleep(0x4e94914f0000)
/opt/homebrew/Cellar/go/1.21.1/libexec/src/runtime/time.go:195 +0x10c
main.verifyProductionLicense(0x4000180000)
/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:264 +0xd4
created by main.main in goroutine 1
/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:93 +0x6ac
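
For context on the error itself: Go's built-in map is not safe for concurrent mutation, and when the runtime detects two goroutines writing the same map at once it aborts the whole process with this exact unrecoverable fatal error (it is not a panic and cannot be caught with recover). A minimal standalone repro, not Faktory code, looks like this:

package main

import "sync"

// Minimal repro (not Faktory code): many goroutines writing one plain map
// with no synchronization. The Go runtime detects the unsynchronized writes
// and aborts the process with "fatal error: concurrent map writes".
func main() {
	m := map[string]int{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				m["worker"] = n // unsynchronized map write
			}
		}(i)
	}
	wg.Wait()
}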
mperham commented 2 months ago

If you have a per-worker throttle that is heavily contended, I can see this happening. Interesting that no one else has hit this; the code has been stable for 3-4 years now. Working on the fix.
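The throttle code is Enterprise-only and not public, so the following is just a generic sketch of how this class of bug is typically fixed in Go: serialize access to the shared per-worker map behind a sync.Mutex (or switch to sync.Map). The type and method names here (workerThrottle, acquire, release) are hypothetical, not Faktory's actual API.

package main

import "sync"

// Hypothetical illustration, not Faktory's actual Throttle type: a map
// shared by many fetching goroutines, guarded by a mutex so that
// concurrent updates are serialized instead of racing.
type workerThrottle struct {
	mu    sync.Mutex
	locks map[string]int // assumed layout: active lock count per worker ID
}

func (t *workerThrottle) acquire(wid string, max int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.locks[wid] >= max {
		return false
	}
	t.locks[wid]++ // safe: only one goroutine mutates the map at a time
	return true
}

func (t *workerThrottle) release(wid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.locks[wid] > 0 {
		t.locks[wid]--
	}
}

A plain mutex tends to be the right tool here rather than sync.Map, because the check-then-increment in acquire has to be atomic as a unit, which sync.Map alone doesn't provide.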

profsmallpine commented 2 months ago

@mperham thanks for jumping on this! Do you have a sense of when you'll cut a release? I'm planning the changes on our side, but there's no rush.

mperham commented 2 months ago

I'm leaving on vacation at the end of next week, so I'll probably ship it this time next week.

profsmallpine commented 2 days ago

Any updates here?

mperham commented 2 days ago

Let's get that release out! 😂 Expect 1.9.1 soon.

mperham commented 2 days ago

1.9.1 is out.