conda-incubator / conda-store

Data science environments, for collaboration. ✨
https://conda.store
BSD 3-Clause "New" or "Revised" License

[BUG] - Memory leaks during environment creation tasks #848

Closed: peytondmurray closed this issue 1 day ago

peytondmurray commented 2 months ago

Describe the bug

See https://github.com/nebari-dev/nebari/issues/2418 and #840 for context. TL;DR: action_add_conda_prefix_packages is leaking memory, causing problems on various Nebari deployments.

Edit by @trallard: it seems action_add_conda_prefix_packages is not the main, or at least not the sole, culprit of the memory leaks, so I adjusted the title.

Memray flamegraph:

(flamegraph image attached)
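For reference, a capture like the one above can be produced with memray's Python API. This is only a minimal sketch: build_environment is a hypothetical stand-in for the worker code path that ends up calling action_add_conda_prefix_packages, not conda-store's actual entry point.

```python
# Minimal memray capture sketch; build_environment is a hypothetical
# placeholder for the conda-store worker code that performs one build.
import memray

def build_environment(prefix: str) -> None:
    ...  # the real build / action_add_conda_prefix_packages call would go here

with memray.Tracker("conda_store_builds.bin"):
    for i in range(10):
        build_environment(f"/tmp/envs/test-{i}")

# Render the capture afterwards with:
#   memray flamegraph conda_store_builds.bin
```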

Expected behavior

No memory leaks.

How to Reproduce the problem?

See https://github.com/nebari-dev/nebari/issues/2418 for a description.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

Adam-D-Lewis commented 2 months ago

The conda-store worker Docker container's memory usage rises by ~18 MB with each new environment build according to docker compose stats, but the top of the flamegraph (see below) only shows an increase of about 7 MB per build (each spike is a new conda environment build). I then ran a quick test bypassing list_conda_prefix_packages, and memory still grew by about 11 MB per new environment. So there are likely multiple sources of memory growth (perhaps the rest is in one of the other subprocesses started by conda-store that I didn't track), but I'm not sure where the other increases come from yet.

(memory usage graph attached; each spike is one environment build)
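For anyone who wants to repeat this measurement outside Docker, here is a rough sketch of the same per-build RSS check using psutil instead of docker compose stats; build_environment is again a hypothetical placeholder for one environment build.

```python
# Rough per-build RSS measurement sketch; build_environment is a
# hypothetical placeholder for one conda-store environment build.
import os
import psutil

def build_environment(i: int) -> None:
    ...  # the real build would go here

proc = psutil.Process(os.getpid())
previous = proc.memory_info().rss

for i in range(20):
    build_environment(i)
    current = proc.memory_info().rss
    print(f"build {i}: RSS grew by {(current - previous) / 1e6:.1f} MB")
    previous = current
```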

trallard commented 2 months ago

Per @Adam-D-Lewis's comment above about seeing memory increases even after bypassing the initially flagged action, we need to do a more in-depth profiling analysis.

Adam-D-Lewis commented 2 months ago

I tried restricting the Docker container to 1 GiB of memory. The growth per build was then less than 18 MB, but memory still increased, and eventually the Celery worker was restarted due to memory usage. I'm wondering if the excess ~11 MB of growth per build that I saw with no memory limit specified was simply a consequence of Python being a garbage-collected language: memory from collected objects is not always returned to the OS.
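As an illustration of that hypothesis (not conda-store code): CPython's allocator and the underlying libc often keep freed memory inside the process heap, so RSS does not necessarily drop after objects are collected.

```python
# Illustrative only: allocate and free many small objects and watch RSS.
# RSS often stays well above the starting value even after gc.collect(),
# because freed memory is retained by the allocator rather than being
# returned to the OS.
import gc
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_mb() -> float:
    return proc.memory_info().rss / 1e6

print(f"start:            {rss_mb():.1f} MB")
data = [str(i).zfill(100) for i in range(1_000_000)]  # ~150 MB of distinct small strings
print(f"after allocation: {rss_mb():.1f} MB")

del data
gc.collect()
print(f"after collection: {rss_mb():.1f} MB")  # frequently still elevated
```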

That being said, I think it should still be possible to write a test showing that memory usage grows to the point where a Celery worker runs out of memory.
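Something along these lines could work as a regression test. This is only a sketch, and build_environment is again a hypothetical placeholder for the real worker task.

```python
# Sketch of a leak regression test: run many builds and fail if RSS keeps
# growing roughly linearly. build_environment is a hypothetical placeholder.
import os
import psutil

def build_environment(i: int) -> None:
    ...  # the real build would go here

def test_rss_growth_is_bounded():
    proc = psutil.Process(os.getpid())

    # Warm-up builds so caches and lazy imports are not counted as leaks.
    for i in range(5):
        build_environment(i)
    baseline = proc.memory_info().rss

    for i in range(5, 55):
        build_environment(i)

    growth_mb = (proc.memory_info().rss - baseline) / 1e6
    # The 100 MB threshold is arbitrary: at ~18 MB leaked per build, 50 builds
    # would exceed it many times over, while a leak-free worker should not.
    assert growth_mb < 100, f"RSS grew {growth_mb:.1f} MB over 50 builds"
```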

trallard commented 1 day ago

@peytondmurray is this still relevant? IIRC we could not demonstrate that a leak was in fact happening, but we will be doing some profiling soon anyway. Shall we close this issue?

peytondmurray commented 1 day ago

That sounds fine; closing now. Even if there is a leak, we now restart workers regularly, so the symptom is avoided in practice even though the underlying cause (if any) isn't fixed.
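For the record, that worker-recycling mitigation can be expressed with Celery's built-in settings; the values below are illustrative, not what conda-store actually ships.

```python
# Illustrative worker-recycling configuration; the numbers are examples,
# not conda-store's actual settings.
from celery import Celery

app = Celery("conda_store_worker")
app.conf.update(
    # Replace each worker process after this many tasks...
    worker_max_tasks_per_child=10,
    # ...or once its resident memory exceeds this many kilobytes (~512 MB),
    # whichever happens first.
    worker_max_memory_per_child=512_000,
)
```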