acaloiaro / neoq

Queue-agnostic background job library for Go, with a pleasant API and powerful features.

Panic when moving to neoq_dead_jobs #98

Closed BillBuilt closed 1 year ago

BillBuilt commented 1 year ago

Thank you for the patch for https://github.com/acaloiaro/neoq/issues/85; however, it gets me to the next set of problems.

First, when a failed job exceeds its max retries, it should be moved to the neoq_dead_jobs table; however, a panic is generated there:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a53181]

goroutine 118 [running]:
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).moveToDeadQueue(0xc000774150?, {0x20bd150, 0xc000774150}, {0x20c4300, 0xc0000126a8}, 0xc0001f5680, {0x0, 0x0})
    /Users/billmatlock/go/pkg/mod/github.com/acaloiaro/neoq@v0.36.0/backends/postgres/postgres_backend.go:442 +0xe1
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).updateJob(0xc0003f4000, {0x20bd150, 0xc000774150}, {0x0, 0x0})
    /Users/billmatlock/go/pkg/mod/github.com/acaloiaro/neoq@v0.36.0/backends/postgres/postgres_backend.go:484 +0x2c5
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).handleJob(0xc0003f4000, {0x20bd188, 0xc00039c370}, {0x28a0a38, 0x1}, {0xc000412150, 0x8, 0x6fc23ac00, 0x0, {0x1dc5b27, ...}})
    /Users/billmatlock/go/pkg/mod/github.com/acaloiaro/neoq@v0.36.0/backends/postgres/postgres_backend.go:723 +0x4ce
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start.func2()
    /Users/billmatlock/go/pkg/mod/github.com/acaloiaro/neoq@v0.36.0/backends/postgres/postgres_backend.go:539 +0x1db
created by github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start in goroutine 5
    /Users/billmatlock/go/pkg/mod/github.com/acaloiaro/neoq@v0.36.0/backends/postgres/postgres_backend.go:532 +0x353

I believe this is because jobErr is not being populated, which causes the nil pointer dereference. However, the error message is also available on the job struct, so perhaps the fix is to drop the jobErr parameter and set the error column to the value of j.Error?

In /backends/postgres/postgres_backend.go, moveToDeadQueue() would go from:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
    j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, jobErr.Error(), j.Deadline)

To:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
    j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, j.Error, j.Deadline)
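
Alternatively, if the jobErr parameter needs to stay for some reason, a nil guard would also avoid the panic. This is just a sketch of the idea, not a patch I have tested:

// Sketch: guard against a nil jobErr before calling Error() on it.
errMsg := ""
if jobErr != nil {
    errMsg = jobErr.Error()
}

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
    j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, errMsg, j.Deadline)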

Secondly, if I patch the above as described to get past the panic, the next time the failed job runs it is handled by a different queue handler than the one assigned to it. Here is a snippet of my testing log showing that the first time it is handled by the correct handler (nqTestHandlerWithErr()), but on the retry it is not (nqTestHandler()). I do not know whether my change above is causing this, so I did not open a separate issue:

=== RUN   TestNeoq/TestAddJobs
2023/10/13 11:10:22 nqTestHandler()
2023/10/13 11:10:22 got job id: 1 queue: test_now messsage: [test_now] this is an instant job
2023/10/13 11:10:22 waiting 10 seconds for delayed job to run...
2023/10/13 11:10:27 nqTestHandler()
2023/10/13 11:10:27 got job id: 2 queue: test_delayed messsage: [test_delayed] this is a delayed job
2023/10/13 11:10:32 waiting 3 seconds for failed job to run...
2023/10/13 11:10:32 nqTestHandlerWithErr()
2023/10/13 11:10:32 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors
time=2023-10-13T11:10:32.495-04:00 level=ERROR msg="job failed" job_error="job failed to process: this is a test error"
2023/10/13 11:10:35 waiting 30 seconds for failed job to get retried...
2023/10/13 11:10:48 nqTestHandler() <<<<< should be `nqTestHandlerWithErr()`
2023/10/13 11:10:48 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors

And here are the test handlers:

var (
    ErrTest = errors.New("this is a test error")
)

func nqTestHandler(db *database.DB) handler.Func {
    return func(ctx context.Context) (err error) {
        var j *jobs.Job
        j, err = jobs.FromContext(ctx)
        if err != nil {
            return
        }
        log.Println("nqTestHandler()")
        log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
        return
    }
}

func nqTestHandlerWithErr() handler.Func {
    return func(ctx context.Context) (err error) {
        j, err := jobs.FromContext(ctx)
        if err != nil {
            log.Printf("ERROR in nqTestHandlerWithErr: %v", err)
            // return early so j is not dereferenced below when it is nil
            return
        }
        log.Println("nqTestHandlerWithErr()")
        log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
        return ErrTest
    }
}

Thank you!

acaloiaro commented 1 year ago

Thanks for the report @BillBuilt. I'll look into this soon.

acaloiaro commented 1 year ago

@BillBuilt this is fixed in v0.41.0. Thanks again.

BillBuilt commented 1 year ago

Thank you! I can confirm the panic is indeed gone and the correct handler is run on the retry. However, after the job is moved to the dead jobs table, the error column (neoq_dead_jobs.error) is empty when it should contain the error message.
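
To illustrate, this is the kind of check I mean (a sketch only; pool here would be a pgx connection pool and jobID the dead job's id from my test, neither of which is shown above):

// Sketch: inspect neoq_dead_jobs.error after retries are exhausted.
var errMsg string
row := pool.QueryRow(ctx, `SELECT error FROM neoq_dead_jobs WHERE id = $1`, jobID)
if err := row.Scan(&errMsg); err != nil {
    log.Fatal(err)
}
log.Printf("neoq_dead_jobs.error = %q", errMsg) // currently prints an empty string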

acaloiaro commented 1 year ago

Interesting -- I tested that as well, but must have made a change that affected it before cutting a release.

I'll get this sorted out as well. Thanks for the confirmation.

acaloiaro commented 1 year ago

Can you confirm whether you see an error-level log with the following form?

time=2023-10-16T09:54:30.678-06:00 level=ERROR msg="job failed" job_error="job failed to process: panic [/home/adriano/git/neoq/backends/postgres/postgres_backend_test.go:522]: no good"

If you see job_error="" instead of an actual error message, it means the stringification for your error is the empty string.
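
For illustration only, this is the kind of error value that stringifies to an empty string; an errors.New value like the one in your test handler stringifies normally:

package main

import "fmt"

// silentErr is a contrived error whose message is the empty string.
type silentErr struct{}

func (silentErr) Error() string { return "" }

func main() {
    var err error = silentErr{}
    // An error like this would surface in the log as job_error="".
    fmt.Printf("job_error=%q\n", err.Error())
}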

BillBuilt commented 1 year ago

Hello,

I spoke too soon. The error handler is NOT being run on retries. And if I bump max_retries from 1 to 2, things get stranger.

First, the results after running with max_retries=1:

2023/10/16 12:36:05 waiting 10 seconds for delayed job to run...
2023/10/16 12:36:05 nqTestHandler()
2023/10/16 12:36:05 got job id: 1 queue: test_now messsage: [test_now] this is an instant job
2023/10/16 12:36:10 nqTestHandler()
2023/10/16 12:36:10 got job id: 2 queue: test_delayed messsage: [test_delayed] this is a delayed job
2023/10/16 12:36:15 waiting 3 seconds for failed job to run...
2023/10/16 12:36:15 nqTestHandlerWithErr()
2023/10/16 12:36:15 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors retries 0 max retries 0xc0004191a0
time=2023-10-16T12:36:15.276-04:00 level=ERROR msg="job failed" job_error="job failed to process: this is a test error"
2023/10/16 12:36:18 waiting 60 seconds for failed job to get retried...
2023/10/16 12:36:31 nqTestHandler() <<<< should be nqTestHandlerWithErr()
2023/10/16 12:36:31 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors

And the resulting database tables:

[screenshot: resulting database tables, max_retries=1]

Things to note about this:

  1. The failed job gets retried as expected, but it does not use the correct handler
  2. The failed job gets moved to the neoq_dead_jobs table as expected
  3. The log shows job_error="job failed to process: this is a test error"

Next, the results after running with max_retries=2:

2023/10/16 12:47:59 waiting 10 seconds for delayed job to run...
2023/10/16 12:47:59 nqTestHandler()
2023/10/16 12:47:59 got job id: 1 queue: test_now messsage: [test_now] this is an instant job
2023/10/16 12:48:04 nqTestHandler()
2023/10/16 12:48:04 got job id: 2 queue: test_delayed messsage: [test_delayed] this is a delayed job
2023/10/16 12:48:09 waiting 3 seconds for failed job to run...
2023/10/16 12:48:09 nqTestHandlerWithErr()
2023/10/16 12:48:09 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors retries 0 max retries 0xc0003c83d0
time=2023-10-16T12:48:09.883-04:00 level=ERROR msg="job failed" job_error="job failed to process: this is a test error"
2023/10/16 12:48:12 waiting 60 seconds for failed job to get retried...
2023/10/16 12:48:25 nqTestHandler() <<<< should be nqTestHandlerWithErr() and the retry is setting its status as "processed" so it never gets to the neoq_dead_jobs table
2023/10/16 12:48:25 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors

And the resulting database tables:

[screenshot: resulting database tables, max_retries=2]

Things to note about this:

  1. The failed job gets retried as expected, but it does not use the correct handler, nor does it increment the retries field
  2. The failed job DOES NOT get moved to the neoq_dead_jobs table, because the retry runs under the incorrect handler and therefore ends up with a 'processed' status, so the third attempt never runs. Even with these discrepancies, I would expect retries to be 1 at this point.
  3. The log shows job_error="job failed to process: this is a test error"

Thank you


acaloiaro commented 1 year ago

Hi Bill, I’m traveling but will take a look at your data when I can.


acaloiaro commented 1 year ago

Hi @BillBuilt, I'm not sure that I can reproduce this one without any code. Can you supply example code that reproduces what you're reporting?

I've tried to reproduce your result, but jobs are ending up in the dead queue as expected for me.

Could you also call neoq.New with neoq.WithLogLevel(logging.LogLevelDebug) and share the debug logs as well?
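
Roughly like this; only the WithLogLevel option is the new part, and the rest is a minimal sketch assuming you keep whatever backend and connection options you already pass to neoq.New:

package main

import (
    "context"
    "log"

    "github.com/acaloiaro/neoq"
    "github.com/acaloiaro/neoq/logging"
)

func main() {
    ctx := context.Background()

    // Add WithLogLevel(logging.LogLevelDebug) alongside your existing options.
    nq, err := neoq.New(ctx,
        neoq.WithLogLevel(logging.LogLevelDebug),
        // ...your existing backend/connection options here...
    )
    if err != nil {
        log.Fatal(err)
    }
    defer nq.Shutdown(ctx)
}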

With reproduction code, I should be able to get any problems fixed up quickly now that I'm in one place.

acaloiaro commented 1 year ago

Update: I've reproduced it and will get started on a solution soon.

acaloiaro commented 1 year ago

@BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

BillBuilt commented 1 year ago

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

acaloiaro commented 1 year ago

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.
>
> Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

No worries -- if Elliot has the time to review by tomorrow, I'll wait it out. If not, I'll go ahead and merge tomorrow.

BillBuilt commented 1 year ago

No rush - thank you!

acaloiaro commented 1 year ago

@BillBuilt Release v0.43.0 has your fix. Cheers.

BillBuilt commented 11 months ago

@acaloiaro That fixed our issues! Thank you and sorry for the late response.

acaloiaro commented 11 months ago

Good to hear, Bill. Cheers!