conda-incubator / conda-store

Data science environments, for collaboration. ✨
https://conda.store
BSD 3-Clause "New" or "Revised" License
137 stars 44 forks source link

restart workers after executing 10 tasks to mitigate memory leaks #840

Open Adam-D-Lewis opened 2 days ago

Adam-D-Lewis commented 2 days ago

Fixes #https://github.com/nebari-dev/nebari/issues/2418

Description

This pull request:

Pull request checklist

Additional information

How to test

@trallard edited from: https://github.com/conda-incubator/conda-store/pull/840#issuecomment-2207828243

To test, you could check out this branch, then run conda standalone conda.store/conda-store/how-tos/install-standalone or you can run it in [docker containers](docker-compose up --build -d as explained here). After you startup, you can visit localhost:8080/conda-store/admin and rebuild the environment it comes with a few times. If you watch the docker memory usage (docker stats) you'll see that the memory usage does not continue to climb with each new build as it does without this option set.

netlify[bot] commented 2 days ago

Deploy Preview for conda-store canceled.

Name Link
Latest commit 5a16a49cf91a4bca3f9f435aa46b8b616fe2f81f
Latest deploy log https://app.netlify.com/sites/conda-store/deploys/6684330c1d9aa80008a6830c
Adam-D-Lewis commented 2 days ago

@dcmcand @peytondmurray @kcpevey @trallard Could I get a review on this?

Adam-D-Lewis commented 2 days ago

I'm happy to make this configurable via a traitlet as well if desired, but the worker restarts seem happen quickly so I can't see too much down side to just applying it for all conda store deployments.

kcpevey commented 1 day ago

I can't speak to the technical aspect of this, but I'm very glad to see it. I see the memory leakage on all my deployments. As far as I can tell it leads to an erroneously failed build due to OOM - which also generally fails with errors that make no sense. All around, this is a big admin management and end user UX improvement.

Adam-D-Lewis commented 1 day ago

Huh, with this option it looks like celery will just automatically restart the worker after 10 tasks are completed. My one complaint with this is it doesn't strictly speaking fix nebari-dev/nebari#2418 as we are only treating a symptom not the memory usage itself.

I agree it does just treat the symptom without fixing the underlying issue. That said, I think this will work robustly.

But in any case from what I have read it seems like this is exactly the kind of use case the option was intended for, so I guess let's just use it. Is there an easy way to test this? If so it would be nice to ensure that it works.

To test, you could check out this branch, then run conda standalone https://conda.store/conda-store/how-tos/install-standalone or you can run it in [docker containers](docker-compose up --build -d as explained here). After you startup, you can visit localhost:8080/conda-store/admin and rebuild the environment it comes with a few times. If you watch the docker memory usage (docker stats) you'll see that the memory usage does not continue to climb with each new build as it does without this option set.

peytondmurray commented 1 day ago

To test, you could check out this branch, then run conda standalone https://conda.store/conda-store/how-tos/install-standalone or you can run it in [docker containers](docker-compose up --build -d as explained here). After you startup, you can visit localhost:8080/conda-store/admin and rebuild the environment it comes with a few times. If you watch the docker memory usage (docker stats) you'll see that the memory usage does not continue to climb with each new build as it does without this option set.

Ahh, what I meant was is there a simple way of adding automated testing?

trallard commented 1 day ago

Ahh, what I meant was is there a simple way of adding automated testing?

I am +1 with Peyton here. It is good that we can manually verify (once or so), but I would prefer having some sort of automated tests (akin to load tests) to ensure this is, in fact, a useful fix. Those tests could later be used to also do some benchmarking/memory checks. So while the fix itself seems fine I would suggest holding on merging until the tests have been added.