3scale / apisonator

Red Hat 3scale API Management Apisonator backend
https://3scale.net
Apache License 2.0

Fix memory leak on the async worker #376

Closed: jlledom closed this 3 months ago

jlledom commented 3 months ago

According to some tests run by the SRE team, the latest alpha contains a memory leak. See this comment: https://github.com/3scale/3scale-saas/issues/754#issuecomment-1984490467

After some investigation, I found the problem was introduced by this commit: https://github.com/3scale/apisonator/commit/1d83cbd4728b5f4bbce0bdb1a727ecd7ddbe7437

In particular, the leak is caused by the use of an Async::Barrier to force the process_all method to wait until all jobs have finished.

https://github.com/3scale/apisonator/blob/35731f9ba08e9cfb6d78ba14631fc851a66ff457/lib/3scale/backend/worker_async.rb#L61-L79

The Async::Barrier implementation is pretty simple: it's just an array of tasks, @tasks, which gets a new entry on every call to async. However, tasks aren't removed from that array when they finish; they are all removed at once when wait is called. See the code here: https://github.com/socketry/async/blob/v1.31.0/lib/async/barrier.rb
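For reference, the relevant part boils down to roughly the following (a condensed paraphrase of the linked barrier.rb, not the verbatim source):

```ruby
require 'async'

# Condensed paraphrase of Async::Barrier (async v1.31.0), not the verbatim source.
class Barrier
  def initialize(parent: nil)
    @tasks = []        # grows by one entry on every #async call
    @parent = parent
  end

  def async(*arguments, parent: (@parent || Async::Task.current), &block)
    task = parent.async(*arguments, &block)
    @tasks << task     # nothing removes the task here when it finishes...
    task
  end

  def wait
    while task = @tasks.shift   # ...entries are only dropped during #wait
      task.wait
    end
  end
end
```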

Since we only call wait once, during shutdown, every one of the thousands of requests apisonator receives creates a task that is added to the barrier's array and never removed. That is what produces the memory leak.

If you check the old worker code (here), you can see how it tried to run the tasks in batches of @max_concurrent_jobs jobs. It called schedule_jobs to create @max_concurrent_jobs tasks and then called @reactor.run to run all of them. In my opinion, that wasn't actually working, because @reactor.async { perform(job) } already runs the task and @reactor.run does nothing. But I think the original idea of running tasks in batches of @max_concurrent_jobs can work.

This PR adds one line that ensures that when there are more than @max_concurrent_jobs tasks in the barrier, no more tasks are scheduled and we wait for the current ones to finish, which removes them from the barrier and fixes the memory leak (see the sketch after the lists below).

This is how it worked before compared to now:

Before:

  1. Up to 20 tasks are scheduled
  2. One of them finishes
  3. Another one takes the slot
  4. There are always 20 tasks running concurrently

Now:

  1. Up to 20 tasks are scheduled
  2. Wait for all of them to finish
  3. Accept 20 new tasks
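
A minimal sketch of that batched loop (simplified, not the exact diff in this PR; fetch_job and perform stand in for the real worker code, and the guard relies on Async::Barrier#size as provided by async 1.x):

```ruby
require 'async'
require 'async/barrier'

# Simplified sketch of the batching idea, not the exact diff in this PR.
def process_all(max_concurrent_jobs)
  Async do
    barrier = Async::Barrier.new

    loop do
      # The added guard: once the barrier holds max_concurrent_jobs tasks,
      # wait for the whole batch, which also empties the barrier's task list.
      barrier.wait if barrier.size >= max_concurrent_jobs

      job = fetch_job      # stand-in for the real job fetching code
      break unless job

      barrier.async { perform(job) }
    end

    barrier.wait   # drain whatever is left before shutting down
  end
end
```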

In my tests this doesn't have any impact on performance; on the contrary, it seems to be faster now that it doesn't have to work with that huge array.

This batching code is one possible approach to fix the leak, but I'm open to suggestions. Ideas?

akostadinov commented 3 months ago

Great discovery work! It's worth filing a feature request in Async Ruby to allow clearing finished tasks from Barrier without waiting.

But as a fix here, that should do the job.

I'm concerned about introducing the stop calls, though. It would also be nice to add a comment explaining why we do the extra waiting; it's otherwise non-obvious.
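Something along these lines would make the intent clear (illustrative wording only, shown on the guard line from the sketch in the PR description above):

```ruby
# Not only a shutdown barrier: Async::Barrier only forgets tasks inside #wait,
# so without this periodic wait its internal task list (and memory) would grow
# with every job processed.
barrier.wait if barrier.size >= max_concurrent_jobs
```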

akostadinov commented 3 months ago

I've been running the worker locally for hours and I see no lock-up. I'll merge this because it is a net improvement, and we need to figure out whether the lock-up is some environmental issue. That can be done in another PR.

One possible cause of the lock-up I can imagine is losing the connection to Redis. For some strange reason, we pass the Redis client from the worker to the JobFetcher, but then, on a connection error, the JobFetcher itself re-creates the connection.

https://github.com/jlledom/apisonator/blob/9a4c16cb8a2d266c8229946957a7e55bb23ba6e9/lib/3scale/backend/job_fetcher.rb#L83-L86

This is code duplication, unnecessary complexity, and a risk when refactoring.

Anyway, maybe on certain Redis connection errors the JobFetcher will loop forever, creating new connections and failing.
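To illustrate the kind of failure mode I mean, here is a hypothetical sketch using the plain redis gem, not the actual JobFetcher code:

```ruby
require 'redis'

# Hypothetical sketch of the suspected failure mode: a rescue-and-reconnect
# loop with no backoff or retry limit spins forever while Redis stays
# unreachable, and the error never surfaces.
def fetch_jobs_forever(queue_key)
  redis = Redis.new

  loop do
    _queue, job = redis.blpop(queue_key, timeout: 1)
    puts "got job: #{job}" if job
  rescue Redis::BaseConnectionError
    # Re-create the connection and retry immediately; if Redis is down for
    # good, this loops forever without reporting anything useful.
    redis = Redis.new
  end
end
```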

Although, locally, when I stop Redis while the worker is running, I see the non-descriptive ERROR 275551 [26/Mar/2024 20:11:04 UTC] EOFError and then the worker quits. So I'm not sure what kinds of connection errors the fetcher actually handles gracefully.

Apparently it would be beneficial to make this method, Worker.logger.notify, log errors better. But if you observe lock-ups, do you see any errors reported to the console? Only the exception name is visible, but that's still something.