davechallis / ocypod

Ocypod is a Redis-backed service for orchestrating background jobs. Clients/workers can be written in any language, using HTTP/JSON to queue/fetch jobs, store results, etc.
Apache License 2.0
193 stars 13 forks source link

Retrieve timed_out jobs in O(1) #21

Closed joe-at-startupmedia closed 2 years ago

joe-at-startupmedia commented 2 years ago

Hello Dave, great tool here. Here is my first issue. Jobs which have timed out, and no longer retried, are moved to the ended list. There is no remaining index (redis key) to retrieve jobs which have a) timed out and b) have zero retries left without querying each job with a timed_out status. Ideally it would be nice to achieve this in constant time not order of (n). The solution here would be to add another redis list called timed_out which would allow you to retrieve the jobs which have timed_out. There is the failed list but they do not end up there after the retries have been exhausted. Thoughts?

davechallis commented 2 years ago

Yup, I can see that would be useful. I'll have more of a think about this, as I'd like to improve on monitoring/tracking of jobs (including being able to e.g. find all timed out jobs), which would help here.

I've got notes somewhere on a few solutions I was considering (separate queues based on final status as you suggested was one of them, maintaining an index somewhere to allow O(1) retrieval was another), I'll see if I can dig those out sometime.

joe-at-startupmedia commented 2 years ago

This is how I solved it. Note that my branch is not merge-worthy due to my prefacing the key names with the namespace. https://github.com/joe-at-startupmedia/ocypod/commit/ed9ca25a8f1b07e8451f88029a2064992922c59e

joe-at-startupmedia commented 2 years ago

I've noticed however that in addition to simply adding the job id to the timedout list, several other scenarios need accounted for:

  1. Should we allow jobs that are timed_out to be expired from the system? The point here is that they've basically failed and I'd personally like to keep a persistent record of what failed without automatically purging based on the value specified for expiry_check_internal. This way I can later manually requeue the job if need be (as explained in #23)
  2. The delete_in_pipe function also needs to clean the job id from the timedout list. Called by both check_job_expiry and when deleting a job. https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L556
  3. The requeue function needs to clean the job id from the timedout list. https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L174
  4. The cancel function needs to clean the job id from the timedout list. Called by setting the status to cancelled. https://ocypod.readthedocs.io/en/latest/api/#patch-jobjob_id https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L199
  5. Adding the TIMEDOUT_KEY to the server_info function? Called by the GET/ info request. The problem here is that the timed_out job id would present in both the new timedout list as well as the ended so it's unnecessary. https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/manager.rs#L82
  6. Adding the TIMEDOUT_KEY to the job_ids function? Called by the GET /queue/{queue_name}/job_ids request. The problem here is that the timed_out job id would present in both the new timedout list as well as the ended list resulting in duplicates. https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/queue.rs#L192

Other concerns:

  1. The actual name of the key should take more thought as there is already a designated use for timed_out which is a job status. Further TIMEDOUT is two words and thus deserves an underscore separation.
  2. Should the job id be present in both ended and timedout. The problem with this is that it will result in duplicates being present upon calling job_ids or info as explained in items 5 and 6 above.

Example of duplicates:

curl localhost:8023/queue/[queue_name]/job_ids
{"running":[],"completed":[],"cancelled":[],"timed_out":[3,3],"queued":[],"failed":[]}