acaloiaro / neoq

Queue-agnostic background job library for Go, with a pleasant API and powerful features.
MIT License

Allow overriding schedule jobs #109

Closed · pconstantinou closed this 8 months ago

pconstantinou commented 10 months ago

Here's a draft proposal for supporting overriding scheduled jobs.

This includes adding optional parameters when scheduling jobs and a TestSuite implementation that allows the development of tests that run against all the backends.

acaloiaro commented 10 months ago

Thanks for the proof of concept, Philip!

If you're interested in getting this done, like you suggested, it will need to support Redis as well.

The asynq library provides some functionality that will be of help here: https://pkg.go.dev/github.com/SQYY/Asynq#Inspector.CancelProcessing

I'll take a look at this once we nail down the API

With that said, I'm afraid that to implement this feature more elegantly, we need a breaking API change to Enqueue.

Ideally, since Override only changes the enqueue behavior, it is not so much a job-level property as a behavior change to the system. In my mind, the concept is more of a "Job Option", i.e. an option passed to the system at the time of enqueue to change behavior, which is not necessarily a property of the job itself.

I do actually have some appetite to make a breaking API change to the Neoq interface.

If you're interested in sticking with this PR/feature, I'd like to work it into a breaking API change.

pconstantinou commented 10 months ago

Cool, makes sense. I updated the proposal to use options, like you've done with the queue instantiation. This has the upside of not breaking the API for existing code.
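The "job option" idea discussed above maps naturally onto Go's functional-options pattern, which is also why existing call sites keep compiling. Here is a minimal sketch; the names (`EnqueueOption`, `WithOverride`) are illustrative, not neoq's actual API:

```go
package main

import "fmt"

// enqueueConfig holds behavior flags for a single Enqueue call; it is not
// part of the job itself.
type enqueueConfig struct {
	override bool
}

// EnqueueOption mutates enqueue behavior at call time.
type EnqueueOption func(*enqueueConfig)

// WithOverride is a hypothetical option: replace any matching scheduled job.
func WithOverride() EnqueueOption {
	return func(c *enqueueConfig) { c.override = true }
}

// Enqueue accepts zero or more options, so callers that pass none are
// unaffected by the new parameter.
func Enqueue(payload string, opts ...EnqueueOption) string {
	cfg := enqueueConfig{}
	for _, o := range opts {
		o(&cfg)
	}
	if cfg.override {
		return "override:" + payload
	}
	return "enqueue:" + payload
}

func main() {
	fmt.Println(Enqueue("job1"))                 // existing call sites unchanged
	fmt.Println(Enqueue("job1", WithOverride())) // new behavior opted into
}
```

Because the variadic options default to empty, this is a backward-compatible way to thread override behavior through without a breaking change.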

pconstantinou commented 10 months ago

@acaloiaro I'm getting in a little over my head on this. As I started working on unit tests, I realized that the cached jobs that are already in memory need to be purged and reloaded; the unit tests are failing because the overwritten job is still run. I'm not sure how we feel about my schema change: `ALTER TABLE neoq_jobs ADD CONSTRAINT neoq_jobs_fingerprint_constraint_idx UNIQUE (fingerprint, status, ran_at);`. It seems like it should give you what you want, but I see from the changelog that you've played around with this before.

It seems like the memory backend suffers from the same problem as the postgres backend. Even though the Store is updated, there's a channel that needs to be reset.

Let me know what you think about this effort. If it's helpful, I can keep working on it, but I've got a big learning curve on the code base. If the approach doesn't align with the direction you want to go, I can wait.

acaloiaro commented 10 months ago

On Fri Dec 22, 2023 at 12:01 PM MST, Philip Constantinou wrote:

> > Thanks for the proof of concept, Philip!
> >
> > If you're interested in getting this done, like you suggested, it will need to support Redis as well.
> >
> > The asynq library provides some functionality that will be of help here: https://pkg.go.dev/github.com/SQYY/Asynq#Inspector.CancelProcessing
> >
> > I'll take a look at this once we nail down the API
> >
> > With that said, I'm afraid that to implement this feature more elegantly, we need a breaking API change to Enqueue.
> >
> > Ideally, since Override only changes the enqueue behavior, it is not so much a job-level property as a behavior change to the system. In my mind, the concept is more of a "Job Option", i.e. an option passed to the system at the time of enqueue to change behavior, which is not necessarily a property of the job itself.
> >
> > I do actually have some appetite to make a breaking API change to the Neoq interface.
> >
> > If you're interested in sticking with this PR/feature, I'd like to work it into a breaking API change.
>
> Cool, makes sense. I updated the proposal to use options, like you've done with the queue instantiation. This has the upside of not breaking the API for existing code.

Good choice. I like what you've done here.

acaloiaro commented 10 months ago

> @acaloiaro I'm getting in a little over my head on this. As I started working on unit tests, I realized that the cached jobs that are already in memory need to be purged and reloaded. The unit tests are failing because the overwritten job is still run.

Like the postgres implementation, the memory backend will need the jobs atomically replaced with new job metadata. Do you want to continue attempting that work, or would you rather not? If not, we'll have to hold off on merging any changes, since one of the project's goals is to not forgo feature parity where parity is possible.
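The atomic replacement described above could be sketched as a mutex-guarded map keyed by fingerprint, so no worker can observe the old and new job at the same time. The `Job`/`Store` names are illustrative stand-ins, not neoq's actual memory-backend types:

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a minimal stand-in for a queued job's metadata.
type Job struct {
	Fingerprint string
	Payload     string
}

// Store keeps pending jobs keyed by fingerprint behind one lock.
type Store struct {
	mu   sync.Mutex
	jobs map[string]Job
}

func NewStore() *Store { return &Store{jobs: make(map[string]Job)} }

// Upsert replaces any pending job with the same fingerprint atomically and
// reports whether an existing job was replaced.
func (s *Store) Upsert(j Job) (replaced bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, replaced = s.jobs[j.Fingerprint]
	s.jobs[j.Fingerprint] = j
	return replaced
}

// Get returns the current job for a fingerprint, if any.
func (s *Store) Get(fp string) (Job, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	j, ok := s.jobs[fp]
	return j, ok
}

func main() {
	s := NewStore()
	s.Upsert(Job{Fingerprint: "abc", Payload: "v1"})
	replaced := s.Upsert(Job{Fingerprint: "abc", Payload: "v2"})
	j, _ := s.Get("abc")
	fmt.Println(replaced, j.Payload) // true v2
}
```

The real backend also has to deal with jobs already handed to a worker channel, which is the channel-reset problem raised below; the map swap alone only covers jobs still in the store.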

> I'm not sure how we feel about my schema change: `ALTER TABLE neoq_jobs ADD CONSTRAINT neoq_jobs_fingerprint_constraint_idx UNIQUE (fingerprint, status, ran_at);`. It seems like it should give you what you want, but I see from the changelog that you've played around with this before.

I believe this doesn't fully achieve job uniqueness, where "uniqueness" is defined as preventing any two unprocessed jobs with the same payload from being queued at the same time. The constraint in this PR relaxes that definition: it allows jobs with identical payloads in both the error state and the new state, since ran_at gets set whether a job fails (the error state) or succeeds. I'm not entirely sure that this relaxed definition shouldn't become neoq's uniqueness behavior, though. Let's keep your change as-is for now and I'll give it more thought.
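To make the uniqueness discussion concrete, a fingerprint is typically a hash over the fields that define "the same job". This is an illustrative sketch only; the exact fields and hash neoq uses may differ:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// fingerprint derives a stable identity from the queue name and payload, so
// two jobs with identical payloads on the same queue collide. Whether that
// collision is rejected then depends entirely on the unique constraint's
// other columns (status, ran_at, ...).
func fingerprint(queue string, payload []byte) string {
	h := md5.New()
	h.Write([]byte(queue))
	h.Write(payload)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := fingerprint("emails", []byte(`{"to":"x"}`))
	b := fingerprint("emails", []byte(`{"to":"x"}`))
	c := fingerprint("emails", []byte(`{"to":"y"}`))
	fmt.Println(a == b, a == c) // true false
}
```

Under `UNIQUE (fingerprint, status, ran_at)`, two rows with the same fingerprint can still coexist once their status or ran_at differ, which is exactly the relaxation described above.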

> It seems like the memory backend suffers from the same problem as the postgres backend. Even though the Store is updated, there's a channel that needs to be reset.

If you have a reproduction of the behavior described here, I'd like to take a look at it.

> Let me know what you think about this effort. If it's helpful, I can keep working on it, but I've got a big learning curve on the code base. If the approach doesn't align with the direction you want to go, I can wait.

The work is helpful and I appreciate the effort. Feel free to keep moving forward. I'm happy to jump in anywhere you feel out of your depth, and even happy to pair synchronously if you like.

Cheers

pconstantinou commented 9 months ago

@acaloiaro thanks for the suggestions. I've finished a complete draft implementation. When I run the postgres and memory tests one at a time, they work correctly, but I can't get `make test` to pass. I'm not clear on how that test framework works.

I think it may be beneficial to have a base class for tests so you can write one test that works on all the backends with the same expected results. I got started on it, but if you like the approach then I can do a bit more and perhaps use testify which may make the tests cleaner.
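A backend-agnostic suite like the one proposed could be sketched as a single set of expectations driven through a shared interface. The `Backend` interface and its methods below are hypothetical stand-ins, not neoq's real types, and the same structure could later be wrapped in testify's `suite` package:

```go
package main

import "fmt"

// Backend is a minimal stand-in for the operations the suite exercises.
type Backend interface {
	Enqueue(fingerprint, payload string) error
	Override(fingerprint, payload string) error
	Peek(fingerprint string) (string, bool)
}

// overrideSuite encodes the expected override semantics once; each backend
// (memory, postgres, redis) would be run through the same checks.
func overrideSuite(b Backend) error {
	if err := b.Enqueue("fp1", "v1"); err != nil {
		return err
	}
	if err := b.Override("fp1", "v2"); err != nil {
		return err
	}
	got, ok := b.Peek("fp1")
	if !ok || got != "v2" {
		return fmt.Errorf("expected overridden payload v2, got %q", got)
	}
	return nil
}

// memBackend is a toy in-memory backend used to exercise the suite.
type memBackend struct{ jobs map[string]string }

func (m *memBackend) Enqueue(fp, p string) error  { m.jobs[fp] = p; return nil }
func (m *memBackend) Override(fp, p string) error { m.jobs[fp] = p; return nil }
func (m *memBackend) Peek(fp string) (string, bool) {
	p, ok := m.jobs[fp]
	return p, ok
}

func main() {
	backends := map[string]Backend{
		"memory": &memBackend{jobs: map[string]string{}},
	}
	for name, b := range backends {
		if err := overrideSuite(b); err != nil {
			fmt.Println(name, "FAIL:", err)
			continue
		}
		fmt.Println(name, "ok")
	}
}
```

The payoff is that a behavior change like override gets specified once, and any backend that diverges fails the same test with the same message.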

I'm also unsure if I need to do more cleanup when I overwrite the job. It seems like `futureJobs` may need to be purged.

I coded up a solution for redis that should work based on the documentation, but I haven't been able to test it.

I'm happy to do some pairing on it and discuss how to do the unique constraints correctly.

No pressure to merge the code until it all works correctly.

acaloiaro commented 9 months ago

On Sun Dec 31, 2023 at 10:26 PM MST, Philip Constantinou wrote:

> @acaloiaro thanks for the suggestions. I've finished a complete draft implementation. When I run the postgres and memory tests one at a time, they work correctly, but I can't get `make test` to pass. I'm not clear on how that test framework works.

First of all, sorry for such a slow response. I've been traveling for both work and the holidays, and got sick in the middle of it all.

I believe `make test` is not passing because it parallelizes a lot of tests at once, and individual runs of the test suites on your machine are passing by chance. As you can see, the tests fail here on GitHub Actions as well.

I think I like the way you've reorganized this test into its own file. A single test file per backend has been feeling unwieldy to me.

In any case, I do believe your test is correctly failing.

The `ON CONFLICT` line appears wrong to me. I'm seeing:

`ON CONFLICT (fingerprint, status, ran_at)`

But I believe this should be:

`ON CONFLICT (queue, fingerprint, status, ran_at)`

as that is the compound index defined by the migration `backends/postgres/migrations/20240101191302_add_unique_on_fingerprint_constraint.up.sql`:

`ALTER TABLE neoq_jobs ADD CONSTRAINT neoq_jobs_fingerprint_constraint_idx UNIQUE (queue, fingerprint, status, ran_at);`

So that might explain at least one reason the tests fail.

I will say that I find the logic of the test somewhat difficult to follow. There are some concepts in the test handler like `foundKey` and `proceedKey` that don't intuitively add up for me.

> I think it may be beneficial to have a base class for tests so you can write one test that works on all the backends with the same expected results. I got started on it, but if you like the approach then I can do a bit more and perhaps use testify, which may make the tests cleaner.

I think tests can be greatly improved. If you think you can do that, I'm fully behind it and happy to help, although I recommend doing it in a separate pull request if you decide to move forward with it.

> I'm also unsure if I need to do more cleanup when I overwrite the job. It seems like `futureJobs` may need to be purged.

Future jobs will indeed need to be considered here. For example, examine how this code schedules goroutines for upcoming jobs: https://github.com/acaloiaro/neoq/blob/86f7869440ffada0b1a2d3a13aa6bbfe5a3478af/backends/memory/memory_backend.go#L281

> I coded up a solution for redis that should work based on the documentation, but I haven't been able to test it.

> I'm happy to do some pairing on it and discuss how to do the unique constraints correctly.

Feel free to join the gitter chat (https://app.gitter.im/#/room/%23neoq:gitter.im, or #neoq:gitter.im from any Matrix client) to chat and figure out a good time to pair on it.

> No pressure to merge the code until it all works correctly.

acaloiaro commented 8 months ago

FYI @pconstantinou I'm just waiting for a passing test suite before re-reviewing code.