caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

memory leak #58

Closed wdwvt1 closed 8 years ago

wdwvt1 commented 8 years ago

There is a memory leak that has become apparent in long-running simulations. Python's memory usage while running _gibbs steadily increases even though there is no clear reason for it to do so: the memory for results storage is preallocated in gibbs_sampler.

I believe the error has to do with one of the following:

  1. numpy array copies that occur and are not garbage collected link1, link2.
  2. pandas dataframes not being correctly garbage collected (might be the same bug as 1) link1, link2, link3.
  3. Memory leak when ipyparallel is running link1.

Based on the threads I have read (those linked above), I am guessing that either many array copies are being made and not garbage collected, or that there is some interaction between the cluster and multiple sinks.
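
If it is the first of these, one quick check is to diff tracemalloc snapshots taken across iterations and see whether numpy allocations keep accumulating. A minimal sketch, using a contrived loop that stands in for a Gibbs pass (this is not the actual _gibbs code):

```python
import tracemalloc

import numpy as np

# Stand-in for one Gibbs pass; the explicit copy() mimics hypothesis 1,
# where array copies are kept alive and never collected.
def leaky_iteration(store):
    arr = np.random.rand(1000, 1000)
    store.append(arr.copy())

tracemalloc.start()
store = []
baseline = tracemalloc.take_snapshot()

for _ in range(5):
    leaky_iteration(store)

snapshot = tracemalloc.take_snapshot()

# Top source lines by net memory allocated since the baseline snapshot;
# a genuine leak shows the same lines growing iteration after iteration.
for stat in snapshot.compare_to(baseline, 'lineno')[:5]:
    print(stat)
```

If the same lines keep growing between snapshots while the preallocated results arrays stay flat, that points at hypothesis 1 or 2 rather than at ipyparallel.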

wdwvt1 commented 8 years ago

I think the issue has something to do with mapping different sink samples to different processors via ipyparallel. Below is a graph of a simple experiment.

The obscured black line is runtime vs memory consumption for a single sink, no cluster, 10 iterations. The blue line is runtime vs memory consumption for 2 sinks with a cluster of 2 nodes, 10 iterations. The red line is runtime vs memory consumption for 100 sinks with a cluster of 2 nodes, 10 iterations.

[image: figure_1]

Obviously not conclusive, but you can see the memory jump in steps for the red line. I see something very similar on longer runs with 100+ sinks, but it is significantly more pronounced in terms of memory usage.
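
For reference, curves like the ones above can be collected with a simple RSS sampler pointed at the running process. This is just a sketch of one way to do it (psutil and the one-second interval are my choices here, not part of SourceTracker2):

```python
import time

import psutil

def sample_rss(pid, interval=1.0, n_samples=600):
    """Record (elapsed seconds, resident set size in MB) for a process."""
    proc = psutil.Process(pid)
    start = time.time()
    trace = []
    for _ in range(n_samples):
        rss_mb = proc.memory_info().rss / 1e6
        trace.append((time.time() - start, rss_mb))
        time.sleep(interval)
    return trace

# e.g., point it at the PID of a running gibbs job and plot the result:
# trace = sample_rss(12345)
```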

ajaykshatriya commented 8 years ago

Very insightful. You are a supercluster of biome code.

Best, Ajay


wdwvt1 commented 8 years ago

My hunches were all wrong. After trying valgrind, memory_profiler, and the standard library's tracemalloc, I tracked it down by guess-and-check: removing the Sampler.seq_assignments_to_contingency_table call solved the problem. I'll update the function in a future commit.
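
For anyone chasing something similar: the guess-and-check step amounts to measuring peak memory with and without the suspect call. A rough sketch with memory_profiler (the sampler setup and run() call are placeholders, not the real API):

```python
from memory_profiler import memory_usage

def run_without_candidate(sampler):
    sampler.run()  # placeholder for the Gibbs loop

def run_with_candidate(sampler):
    sampler.run()
    # suspected leak; removing this call fixed the memory growth
    sampler.seq_assignments_to_contingency_table()

# memory_usage runs the callable and samples the process's memory at the
# given interval, e.g.:
# with_call = memory_usage((run_with_candidate, (sampler,)), interval=0.5)
# without_call = memory_usage((run_without_candidate, (sampler,)), interval=0.5)
# Comparing max(with_call) to max(without_call) points at the offending call.
```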