Closed jlledom closed 3 months ago
Great discovery work! It's worth filing a feature request against the async Ruby gem to allow clearing finished tasks from a `Barrier` without waiting. But as a fix here, that should do the job.
I'm concerned about introducing the `stop` calls, though. It would also be nice to add a comment explaining why we do the extra waiting; it's otherwise non-obvious.
I've been running the worker locally for hours and I see no lock-up. I'll merge this because it's a net improvement; we still need to figure out whether the lock-up is some environmental issue, which can be done in another PR.
One possible theory I can imagine is that the lock-up is caused by losing the connection to Redis. For some strange reason, we pass the Redis client from the worker to the `JobFetcher`. But then, on a connection error, the `JobFetcher` itself re-creates the connection. This is code duplication, unnecessary complexity, and a risk for refactoring. Anyway, maybe on certain Redis connection errors the `JobEnqueuer` will loop forever, creating new connections and failing.
Although, locally, when I stop Redis while the worker is running, I see the non-descriptive `ERROR 275551 [26/Mar/2024 20:11:04 UTC] EOFError` and then the worker quits. So I'm not sure what kind of connection errors the fetcher actually handles gracefully.
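As a hypothetical illustration of the suspected failure mode (a fetcher that re-creates connections forever and never surfaces the error), a bounded retry with backoff fails loudly instead. All names here are made up for the sketch; this is not apisonator's actual API, and `EOFError`/`IOError` merely stand in for Redis connection errors:

```ruby
# Hypothetical sketch: retry a connection-dependent block a bounded number
# of times with exponential backoff, instead of looping forever.
MAX_RECONNECTS = 5

def with_reconnects(max: MAX_RECONNECTS, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    yield
  rescue EOFError, IOError => e     # stand-ins for Redis connection errors
    attempts += 1
    raise e if attempts > max       # give up instead of looping forever
    sleeper.call(0.01 * (2**attempts))  # exponential backoff before retrying
    retry
  end
end

# Example: the block succeeds on the third attempt.
calls = 0
with_reconnects(sleeper: ->(_) {}) do
  calls += 1
  raise EOFError, "connection lost" if calls < 3
end
calls  # => 3
```

With an unbounded loop, a permanently dead Redis would keep the worker spinning silently, which matches the lock-up symptom described above.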
Apparently it would be beneficial to make `Worker.logger.notify` log errors in more detail. But when you observe lock-ups, do you see any errors reported to the console? Only the exception name is visible, but that's still something.
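For reference, a minimal sketch of what richer error reporting could look like, using only the Ruby stdlib. `notify_error` is a made-up helper for illustration, not apisonator's actual `Worker.logger.notify`; the point is logging the message and backtrace instead of the bare exception class shown in the `EOFError` line above:

```ruby
require "logger"
require "stringio"

# Hypothetical helper: log the exception's class, message, and the top of
# its backtrace, instead of only the class name.
def notify_error(logger, error)
  trace = Array(error.backtrace).first(3).join("\n  ")
  logger.error("#{error.class}: #{error.message}\n  #{trace}")
end

out = StringIO.new
logger = Logger.new(out)

begin
  raise EOFError, "end of file reached"
rescue EOFError => e
  notify_error(logger, e)
end

out.string  # contains "EOFError: end of file reached" plus a short backtrace
```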
According to some tests run by the SRE team, the last alpha contains a memory leak. Check this comment: https://github.com/3scale/3scale-saas/issues/754#issuecomment-1984490467
After an investigation, I found the problem was introduced by this commit: https://github.com/3scale/apisonator/commit/1d83cbd4728b5f4bbce0bdb1a727ecd7ddbe7437
And in particular, the leak is caused by the use of an `Async::Barrier` to force the `process_all` method to wait until all jobs are finished:

https://github.com/3scale/apisonator/blob/35731f9ba08e9cfb6d78ba14631fc851a66ff457/lib/3scale/backend/worker_async.rb#L61-L79
The `Async::Barrier` implementation is pretty simple: it's just an array `@tasks` of tasks, which gets a new entry on every call to `async`. However, tasks aren't removed from this array when they finish; instead, all of them are removed at once when calling `wait`. See the code here: https://github.com/socketry/async/blob/v1.31.0/lib/async/barrier.rb

Since we only call `wait` once, during shutdown, every one of the thousands of requests apisonator receives creates a task which is added to the barrier's array and never removed. This is what produces the memory leak.

If you check the old worker code (here) you can see how it tried to run the tasks in batches of `@max_concurrent_jobs` jobs: it called `schedule_jobs` to create `@max_concurrent_jobs` tasks and then called `@reactor.run` to run all of them. In my opinion, that was not working, because `@reactor.async { perform(job) }` actually runs the task, and `@reactor.run` does nothing. But I think the original idea of running tasks in batches of `@max_concurrent_jobs` can work.

This PR adds one line ensuring that when there are more than `@max_concurrent_jobs` tasks in the barrier, no more tasks are scheduled and the current ones are waited on until they finish, which removes them from the barrier and fixes the memory leak. This is how it worked before compared to now:
Before:
Now:
In my tests this doesn't have any performance impact; on the contrary, it seems to be faster now that it doesn't have to work with that huge array.
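To make the fix concrete, here is a runnable toy model of the behavior. `ToyBarrier` and `process_all` below are plain-Ruby stand-ins, not the real `Async::Barrier` or worker code; they assume the v1.31.0 semantics described above, where `async` appends to `@tasks` and only `wait` drains it:

```ruby
# Toy stand-in for Async::Barrier's bookkeeping (assumed from v1.31.0):
# #async appends a task, and only #wait drains the whole list.
class ToyBarrier
  def initialize
    @tasks = []
  end

  def async(&block)
    @tasks << block      # the real Barrier stores Async::Task objects here
  end

  def size
    @tasks.size
  end

  def wait
    @tasks.each(&:call)  # the real #wait awaits every task...
    @tasks.clear         # ...and only then empties the array
  end
end

MAX_CONCURRENT_JOBS = 10

# Simplified shape of the processing loop. With limit: nil (old behavior)
# the barrier grows by one entry per job; with a limit (this PR) it is
# drained after every full batch.
def process_all(jobs, peaks, limit: nil)
  barrier = ToyBarrier.new
  jobs.each do |job|
    barrier.async { job.call }
    peaks << barrier.size
    # The one-line fix: wait once a full batch is queued, which also
    # removes the finished tasks from the barrier.
    barrier.wait if limit && barrier.size >= limit
  end
  barrier.wait           # flush the final, possibly partial batch
end

jobs = Array.new(1_000) { proc {} }

old_peaks = []
process_all(jobs, old_peaks)                              # old behavior
new_peaks = []
process_all(jobs, new_peaks, limit: MAX_CONCURRENT_JOBS)  # with this PR

old_peaks.max  # => 1000: one retained task per job, i.e. the leak
new_peaks.max  # => 10:   bounded by MAX_CONCURRENT_JOBS
```

All jobs still run exactly once in both modes; the only difference is how many finished tasks the barrier retains at its peak.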
This batching code is one possible approach to fix the leak, but I'm open to suggestions. Ideas?