contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Jobs stuck in Busy state on panic() #300

Closed. saurori closed this issue 3 years ago.

saurori commented 4 years ago

While processing jobs (using the Go worker), if panic() is called, jobs get stuck as Busy forever. The stack trace below is from a panic inside the example worker, not Faktory itself.
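
For context, a minimal sketch of the kind of worker that reproduces this, assuming the faktory_worker_go Manager API (the "SomeJob" type and its handler are placeholders, and the exact Register signature may vary by version):

package main

import (
    "context"

    worker "github.com/contribsys/faktory_worker_go"
)

func main() {
    mgr := worker.NewManager()

    // A handler that panics mid-job: the job has been FETCHed but is never
    // ACKed, so it sits under the Busy tab on the Faktory dashboard.
    mgr.Register("SomeJob", func(ctx context.Context, args ...interface{}) error {
        panic("boom")
    })

    mgr.Run()
}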

mperham commented 4 years ago

Yep, I will put some time into this over the next week to improve the situation for 1.4.1.

mperham commented 4 years ago

Here's what I (re)discovered:

  1. If a worker process FETCHes a job, it has 30 minutes by default to ACK that job.
  2. If the process dies, Faktory doesn't know that. All Faktory knows within one minute is that the process has not sent a heartbeat. It might be alive (network dropped) or it might be dead (process crashed).
  3. Since the process might still be alive, Faktory sticks with (1) and does not immediately clear the Busy tab. The network can come back up, the worker process can ACK the job, and all will be well.
  4. After 30 minutes, Faktory will reclaim the job and put it into the Retry set so the job can be retried in the future. This "reservation reaper" runs every 15 seconds.

So my question is: did you actually wait for 30 minutes to determine they are Busy "forever"?

saurori commented 4 years ago

I did not wait 30 minutes, so yes, "forever" was a misrepresentation. That still seems like an excessive amount of time to wait for a reclaim. What is the reason for not reclaiming the job immediately when the heartbeat is not received, or for not using a shorter window (one that covers a few heartbeats)?

mperham commented 4 years ago

The reason is the network-drop use case. You can configure the reservation timeout with the "reserve_for" attribute in the job payload.
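
For example, the reservation window can be shortened per job when it is pushed. A minimal sketch using the Go client, assuming the Job struct's ReserveFor field (in seconds) maps to the "reserve_for" payload attribute:

package main

import (
    "log"

    faktory "github.com/contribsys/faktory/client"
)

func main() {
    client, err := faktory.Open()
    if err != nil {
        log.Fatal(err)
    }

    // "SomeJob" and its argument are placeholders for illustration.
    job := faktory.NewJob("SomeJob", 42)
    job.Queue = "default"

    // Reclaim the job after 10 minutes instead of the 30-minute default
    // if the worker never ACKs it.
    job.ReserveFor = 600

    if err := client.Push(job); err != nil {
        log.Fatal(err)
    }
}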

saurori commented 4 years ago

Makes sense, thanks for clarifying. I was using Faktory for some local job processing and experimentation, and I have used Sidekiq in the past. My jobs weren't long-running, so I never hit the 30-minute window.

I'm more familiar with RabbitMQ semantics, where messages on a channel are redelivered on network failure (a failed heartbeat on the connection) and a redelivered flag is set on the message to indicate that it may have been seen before.

mcabezas commented 4 years ago

(quoting mperham's explanation of the reservation timeout above)

This is awesome information; can you please move it into the documentation?

IMHO this kind of helpful information belongs in a FAQ or something like that.

mperham commented 4 years ago

The wiki is editable. You can put it in!

kumarsiva07 commented 4 years ago

@mperham How can I recover from the panic in middleware and move the job to the retry / dead set?

kumarsiva07 commented 4 years ago

(replying to mperham's "That happens automatically.")

No. Actually, my worker process got killed and the job got stuck in Busy when a panic happened inside my worker code.

mperham commented 4 years ago

The job will be put into the retry set after the reserve_for timeout has passed, by default that is 30 minutes.

kumarsiva07 commented 4 years ago

It would be better if you could help with middleware that solves this, instead of crashing the worker and moving the job to retry after 30 minutes.

mperham commented 4 years ago

Ok, I think I understand now.

  1. A panic will crash your worker process and all executing jobs will be retried after the reserve_for timeout.
  2. If your code panics, that's a bug for you to fix.
  3. FWG (faktory_worker_go) can recover to prevent the process from crashing, BUT now that goroutine is in an unknown state.

I'll investigate how FWG can gracefully recover and replace that goroutine.

kumarsiva07 commented 4 years ago

Yes, now you got my point. Sorry for the confusion.

For now, I added the middleware below:

import (
    "context"
    "fmt"
    "runtime/debug"

    faktory "github.com/contribsys/faktory/client"
)

// Recover catches a panic raised by the job handler, converts it into an
// error, and logs it so the worker process itself does not crash.
func Recover(ctx context.Context, job *faktory.Job, next func(ctx context.Context) error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            var ok bool
            err, ok = r.(error)
            if !ok {
                err = fmt.Errorf("%v", r)
            }
            fmt.Printf("Panic recovered in middleware - %s: %s\n", err, debug.Stack())
        }
    }()

    return next(ctx)
}
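
A middleware like this is typically attached to the worker manager before registering handlers. A minimal usage sketch, assuming faktory_worker_go's Manager.Use and a placeholder "SomeJob" handler:

package main

import (
    "context"

    worker "github.com/contribsys/faktory_worker_go"
)

func main() {
    mgr := worker.NewManager()

    // Recover is the middleware from the snippet above, assumed to be
    // defined in this package.
    mgr.Use(Recover)

    // With the middleware in place, a panic in this handler is returned as
    // an error, so the job is FAILed right away and lands in the retry set
    // instead of sitting in Busy until the reserve_for timeout.
    mgr.Register("SomeJob", func(ctx context.Context, args ...interface{}) error {
        panic("boom")
    })

    mgr.Run()
}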
mperham commented 3 years ago

I'm going to stick with a "let it crash and Faktory will retry it later" model.