contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Silently dropping Jobs with transaction discarded warnings #379

Closed by pbrisbin 1 year ago

pbrisbin commented 2 years ago

docker.contribsys.com/contribsys/faktory-ent:1.4.0

https://hackage.haskell.org/package/faktory-1.1.1.0

We had our staging Faktory instance silently dropping Jobs, while spamming the following warnings:

2021-09-07T17:12:35.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:35.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:30.244Z,Error running task Busy: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:30.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:30.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:25.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:25.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:20.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.
2021-09-07T17:12:20.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.

As this was staging, where we don't pay close attention to background jobs, we didn't notice, and the Faktory instance remained in this state for almost all of September.

I see nothing else in the logs at all (except for web client requests if/when we happen to be poking around), and no "previous error" I can point to as having triggered this state. Clients received no errors on enqueue and were in fact getting back Job IDs. Restarting the Faktory instance fixed it.

Have you seen this before?

I wouldn't be surprised if this isn't worth investigating, but I wonder if this message should be elevated to ERROR if it represents the instance silently not processing work.
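
Until the severity changes, one way to catch this state is to watch the server log for the message itself. A minimal sketch, assuming the log keeps the `timestamp,message` shape quoted above; the function names and the alert threshold are hypothetical and not part of Faktory:

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted above, e.g.
#   2021-09-07T17:12:35.244Z,Error running task Retries: EXECABORT ...
LINE_RE = re.compile(
    r"^(?P<ts>[^,]+),Error running task (?P<task>\w+): EXECABORT\b"
)


def count_execabort(lines):
    """Return a Counter mapping task name -> number of EXECABORT failures."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("task")] += 1
    return counts


def should_alert(counts, threshold=3):
    """Hypothetical policy: alert once any task has failed `threshold` times."""
    return any(n >= threshold for n in counts.values())


# Three of the lines from the report, used as sample input.
SAMPLE = [
    "2021-09-07T17:12:35.244Z,Error running task Retries: EXECABORT Transaction discarded because of previous errors.",
    "2021-09-07T17:12:35.244Z,Error running task Scheduled: EXECABORT Transaction discarded because of previous errors.",
    "2021-09-07T17:12:30.244Z,Error running task Busy: EXECABORT Transaction discarded because of previous errors.",
]

if __name__ == "__main__":
    counts = count_execabort(SAMPLE)
    print(dict(counts), "alert:", should_alert(counts, threshold=1))
```

Since the failures repeat every five seconds in the logs above, even a low threshold over a short window would have flagged the instance long before a month passed.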

mperham commented 2 years ago

That's a fatal Redis error. I would suspect it boils down to "you ran out of disk space".

Also, you're behind a few versions so issues like this might be fixed in newer versions.
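
For context on the message itself: Redis returns `EXECABORT` from `EXEC` when a command sent after `MULTI` was rejected at queue time (the Redis documentation gives a malformed command or an out-of-memory condition as examples), so the "previous error" belongs to an earlier command in the same transaction rather than to `EXEC`. The toy model below only illustrates that rule. It is not a Redis client, the class and its command table are hypothetical, and it says nothing about which command Faktory was actually queueing:

```python
class ToyTransaction:
    """Toy model of Redis MULTI/EXEC queueing rules (illustration only)."""

    # Simplified command table: name -> expected argument count.
    KNOWN = {"SET": 2, "INCR": 1, "ZADD": 3}

    def __init__(self):
        self.queue = []
        self.dirty = False  # set once any command fails to queue

    def enqueue(self, name, *args):
        """Queue a command; a rejected command poisons the transaction."""
        if self.KNOWN.get(name) != len(args):
            self.dirty = True
            return "ERR rejected at queue time"
        self.queue.append((name, args))
        return "QUEUED"

    def exec(self):
        """Run the queue, or discard it if any command was rejected."""
        queued, was_dirty = self.queue, self.dirty
        self.queue, self.dirty = [], False
        if was_dirty:
            return "EXECABORT Transaction discarded because of previous errors."
        return ["OK" for _ in queued]


if __name__ == "__main__":
    tx = ToyTransaction()
    print(tx.enqueue("SET", "job", "payload"))  # accepted into the queue
    print(tx.enqueue("INCR"))                   # wrong arity, rejected
    print(tx.exec())                            # whole transaction discarded
```

The point of the model is that the valid `SET` is thrown away along with the bad command, which matches the symptom of work being dropped while only the `EXEC` failure shows up in the log.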

pbrisbin commented 2 years ago

Probably not out of disk space since a restart fixed it, unless restarting somehow cleans something up. But if it's a fatal Redis error then that's even more reason the message should not come out at only warning level.