contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Faktory crashed with fatal error: concurrent map writes #483

Closed: seedifferently closed this issue 4 months ago

seedifferently commented 4 months ago

Our Faktory Enterprise process crashed this morning with "fatal error: concurrent map writes" while we were ramping up a batch of about 30k jobs. We're running 1.8.0 and don't see anything in the 1.9.0 changelog that indicates a bugfix for something like this.

Here's the head of the crash log; I can provide the full output if needed:

fatal error: concurrent map writes

goroutine 433267 [running]:
github.com/mperham/faktory-comm/ent/throttle.(*Throttle).LockAndPop(0x40000b6660, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, 0x4000078150)
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:371 +0x388
github.com/mperham/faktory-comm/ent/throttle.(*ThrottledFetch).throttledFetch(0x4000596d20, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, {0x4000739e10?, 0xe, 0xffffa9a355b8?})
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:444 +0x160
github.com/mperham/faktory-comm/ent/throttle.(*ThrottledFetch).Fetch(0x4000647cc8?, {0x559528?, 0x40008499d0?}, {0x4000ae13b0?, 0x4000647ce8?}, {0x4000739e10?, 0x51e98e8e62601?, 0x4000873ae0?})
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/throttle/throttle.go:480 +0x2c
github.com/contribsys/faktory/manager.(*manager).Fetch(0x400019c000, {0x559528, 0x40008499d0}, {0x4000ae13b0, 0xd}, {0x4000739e10, 0xe, 0xe})
	/Users/mperham/src/github.com/contribsys/faktory/manager/fetch.go:100 +0x118
github.com/contribsys/faktory/server.fetch(0x4000c4fbc0, 0x4000180000, {0x4000733810, 0xa1})
	/Users/mperham/src/github.com/contribsys/faktory/server/commands.go:182 +0xe0
github.com/contribsys/faktory/server.(*Server).processLines(0x4000180000, 0x4000c4fbc0)
	/Users/mperham/src/github.com/contribsys/faktory/server/server.go:332 +0x3d0
github.com/contribsys/faktory/server.(*Server).Run.func1({0x55aa08?, 0x40000875c8?})
	/Users/mperham/src/github.com/contribsys/faktory/server/server.go:148 +0x74
created by github.com/contribsys/faktory/server.(*Server).Run in goroutine 8
	/Users/mperham/src/github.com/contribsys/faktory/server/server.go:141 +0x19c

goroutine 1 [chan receive, 24014 minutes]:
main.main()
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:100 +0x73c

goroutine 18 [syscall, 24014 minutes]:
syscall.Syscall6(0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/syscall/syscall_linux.go:91 +0x2c
os.(*Process).blockUntilWaitable(0x40000c0420)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/wait_waitid.go:32 +0x6c
os.(*Process).wait(0x40000c0420)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec_unix.go:22 +0x2c
os.(*Process).Wait(...)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec.go:134
os/exec.(*Cmd).Wait(0x40000f4160)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/os/exec/exec.go:890 +0x38
github.com/contribsys/faktory/storage.bootRedis.func2()
	/Users/mperham/src/github.com/contribsys/faktory/storage/redis.go:163 +0x30
created by github.com/contribsys/faktory/storage.bootRedis in goroutine 1
	/Users/mperham/src/github.com/contribsys/faktory/storage/redis.go:162 +0x7c8

goroutine 5 [select]:
github.com/contribsys/faktory/server.(*taskRunner).Run.func1()
	/Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:67 +0xc0
created by github.com/contribsys/faktory/server.(*taskRunner).Run in goroutine 1
	/Users/mperham/src/github.com/contribsys/faktory/server/task_runner.go:59 +0x70

goroutine 6 [sleep, 944 minutes]:
time.Sleep(0x4e94914f0000)
	/opt/homebrew/Cellar/go/1.21.1/libexec/src/runtime/time.go:195 +0x10c
main.verifyProductionLicense(0x4000180000)
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:264 +0xd4
created by main.main in goroutine 1
	/Users/mperham/src/github.com/mperham/faktory-comm/ent/cmd/daemon/main.go:93 +0x6ac

mperham commented 4 months ago

If you have a per-worker throttle which is heavily contended, I can see this happening. Interesting that no one else found this, the code has been stable for 3-4 years now. Working on the fix.

profsmallpine commented 4 months ago

@mperham thanks for jumping on this! Do you have a sense of when you'll cut a release? I'm planning the changes on our side but no rush.

mperham commented 4 months ago

I'm leaving on vacation end of next week so I'll probably ship it this time next week.

profsmallpine commented 2 months ago

Any updates here?

mperham commented 2 months ago

Let's get that release out! 😂 Expect 1.9.1 soon.

mperham commented 2 months ago

1.9.1 is out.

profsmallpine commented 2 months ago

I'm getting an error when I try to docker pull docker.contribsys.com/contribsys/faktory-ent:1.9.1: Error response from daemon: manifest for docker.contribsys.com/contribsys/faktory-ent:1.9.1 not found: manifest unknown: manifest unknown

mperham commented 2 months ago

I spent two hours trying to figure this out. No luck so far but I’m acknowledging that the issue exists. My “docker push” has no errors but the server does not show 1.9.1 on the filesystem.

profsmallpine commented 2 months ago

What a brined 🥒. Thanks for the acknowledgment; I'll keep an eye here for a fix 🤞

mperham commented 1 month ago

It took me two days to figure out that OrbStack had hijacked the local port and was somehow "stealing" the pushes to docker.contribsys.com. 1.9.1 should now be available.