Closed: wdbaruni closed this 4 months ago
Parallelism in Housekeeping has several issues, one of which is the potential for panicking. I suggest adopting the workerpool package I mentioned instead of implementing our own.
Not sure about the several issues. You pointed to one issue with a wrong call to waitGroup.Done(), and that was fixed. You forgot to mention the workerpool package, but the semaphore seems to be working fine. The other requirement I have is that I don't want the housekeeper to start another iteration before all tasks are completed, and that's why I am using the waitGroup.
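The thread doesn't show the exact fix for the wrong waitGroup.Done() call, but the panic it can cause ("sync: negative WaitGroup counter") comes from calling Done more times than Add. A minimal sketch of the safe pattern, with illustrative names that are not taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runAndCount launches n tasks and returns how many completed.
// The safe pattern: call Add(1) before the goroutine starts, and
// pair it with exactly one deferred Done inside the goroutine.
// Done runs even if the task panics, and can never run twice.
func runAndCount(n int) int64 {
	var wg sync.WaitGroup
	var completed int64
	for i := 0; i < n; i++ {
		wg.Add(1) // before go, so Wait can't race past a late Add
		go func() {
			defer wg.Done() // exactly one Done per Add
			atomic.AddInt64(&completed, 1)
		}()
	}
	wg.Wait() // blocks until every task has signalled completion
	return atomic.LoadInt64(&completed)
}

func main() {
	fmt.Println("completed:", runAndCount(3)) // completed: 3
}
```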
Don't embed context.Context in the housekeeping structure.
Fixed
We need to be mocking time in our tests, especially as this area of the code base grows.
Answered inline
Here's the worker pool package: https://github.com/gammazero/workerpool
The several issues were the embedded context in the structure, and the potential for panic with the wait group. They look to be addressed now; I will give another review in a sec.
I've previously looked into the worker pool library you mentioned, and it doesn't enable waiting for housekeeping tasks before considering the iteration complete and starting another one. Our requirements are pretty simple, and I don't see a need for such a library at this point. The semaphore limits the number of concurrent tasks, which is what the workerpool library does, and the waitGroup prevents iterations from overlapping with each other, which the workerpool library is missing.
This work is a prerequisite for job queues, as the orchestrator currently fails jobs that take longer than ExecutionTimeout from the time they were submitted. ExecutionTimeout should only fail executions that have been running for too long, and not fail jobs. If a job hasn't started and has been sitting in the queue, this timeout should not take effect, as there will be other config to time out or fail a job that has been in the queue for too long or has been retried too many times.
A side effect of this PR is that timed-out executions will be retried by the scheduler on other nodes until one succeeds or there are no more nodes to retry on. This will be handled in the next job-queue-related PRs.