POSYDON-code / POSYDON

POSYDON is a next-generation single and binary-star population synthesis code incorporating full stellar structure and evolution modeling with the use of MESA.
BSD 3-Clause "New" or "Revised" License

BinaryPopulation: Out of Memory #224

Open maxbriel opened 8 months ago

maxbriel commented 8 months ago

When running a large population on a cluster, the BinaryPopulation goes out of memory even when mem_per_cpu and the maximum CPU memory in the population file are set to the same value.

My expectation is that some binaries take up more memory than expected, probably due to the "looping" issue described in Issue #194. As such, slurm kills the job before POSYDON can dump the binary into a file. However, I have not investigated this in detail.

@ka-rocha have you run into this issue? Or do you have suggestions on its possible origin?

maxbriel commented 8 months ago

I've performed a few tests to check what the memory consumption is with different options.

First, I performed 4 different runs:

  1. No binaries
  2. 10 binaries with dump_rate=10 while ram_per_cpu=None
  3. 10 binaries with a dump_rate=1 while ram_per_cpu=None
  4. 10 binaries with a dump_rate=1 while ram_per_cpu=4

[Figure: memory usage of the four test runs]

Both 2. and 4. keep all binaries in memory as expected from the code. [Is this wanted behaviour?]

However, 3. shows unexpected behaviour: its memory usage keeps increasing while binaries are written to file. The figure below shows the calls to save for this run between the blue brackets. It's clear that even though the file gets saved, the memory usage slowly keeps increasing.

[Figure: memory usage of run 3, with the calls to save marked between the blue brackets]
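For reference, a minimal sketch of how the memory consumption of such a test run could be measured with memory_profiler (the tool used for the local tests further below). The BinaryPopulation keyword names are assumptions based on the options listed above, not the exact POSYDON API:

```python
from memory_profiler import memory_usage

from posydon.popsyn.binarypopulation import BinaryPopulation

def run_test(n_binaries, dump_rate, ram_per_cpu):
    # Keyword names are assumed from the options above; adjust to the real API.
    pop = BinaryPopulation(number_of_binaries=n_binaries,
                           dump_rate=dump_rate,
                           ram_per_cpu=ram_per_cpu)
    pop.evolve()

# Sample the process memory every 0.5 s while run 3 (dump_rate=1, ram_per_cpu=None) executes.
usage = memory_usage((run_test, (10, 1, None)), interval=0.5)
print(f"peak memory: {max(usage):.1f} MiB")
```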

maxbriel commented 8 months ago

This Out-of-Memory error is difficult to track down. Here's a summary:

  1. It is not the binaries themselves. They only take up a small amount of space compared to the grids, etc.
  2. It is not the initial loading in of the grids and interpolators. This is about 1.2 GB.
  3. Three possible origins:
    1. The garbage collection does not function correctly on the cluster/yggdrasil?
    2. A dataset allocation in a package calling h5py, once in a while?
    3. IF interpolator?

When reproducing the issue on a local machine, the memory usage reported by memory_profiler initially increases and eventually drops significantly, probably due to swapping by the operating system. As such, it's difficult to recreate the problem locally.

On the cluster, however, the memory keeps increasing without any of it being released; see below. This is a multi-metallicity run with 2 job arrays, a total of 100 binaries, and a dump rate of 10 binaries. Both usage curves look the same: no memory gets released during the run.

[Figure: memory usage of the two job arrays in the multi-metallicity run]

Here's a longer run with many binaries, captured mid-run. The memory usage just slowly keeps creeping up.

[Figure: memory usage of a longer run, captured mid-run]

Manually calling gc.collect() in every loop iteration of BinaryPopulation.evolve() does not seem to resolve the issue, meaning that either references to these items remain, or they are small allocations that are earmarked as never to be released (as is standard for Python's allocator).

tracemalloc indicates 2 larger memory-usage areas inside the binary loop:

  1. When calling the interpolation in detached_step, grids get read from file. An h5py dependency allocates an increasing amount of memory.
  2. In the IFinterpolator, DataScalars are stored to scale single-track information. This results in additional memory usage.

There are also some smaller continuous increases per loop, where it is unclear if these ever get released or if they can even be released.
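A minimal sketch of the tracemalloc approach described above; `evolve_one` is a stand-in for the per-binary work inside BinaryPopulation.evolve(), not a real POSYDON function:

```python
import gc
import tracemalloc

def profile_loop(binaries, evolve_one, every=100, top=5):
    """Print the largest allocation growth (grouped by source line) every `every` binaries."""
    tracemalloc.start(25)                       # keep 25 frames so h5py/PyTables callers show up
    baseline = tracemalloc.take_snapshot()
    for i, binary in enumerate(binaries, start=1):
        evolve_one(binary)
        if i % every == 0:
            gc.collect()                        # rule out collectable garbage first
            snapshot = tracemalloc.take_snapshot()
            for stat in snapshot.compare_to(baseline, "lineno")[:top]:
                print(stat)                     # lines in h5py/tables dominating here are leak suspects
            baseline = snapshot
    tracemalloc.stop()
```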

maxbriel commented 8 months ago

I've been able to track down the origin of this.

It turns out to be specific to reading the binaries from an HDF5 file and (re-)evolving them. The plot below shows two runs of 100 binaries per metallicity: the first with randomly generated binaries, and the second with binaries read from an HDF5 file and then evolved. Although there is some increase in the random-generation run, it is not as extreme as with reading from the HDF5 file and might not be a problem in longer runs.

[Figure: memory usage of the random-generation run versus the HDF5-input run]

Here's the trace of the offending line:

Traceback

```
------trace-------
File "/Users/max/Documents/projects/memory_population_error/run_100_dump10_noram.py", line 45
    main(input_file='/Users/max/Documents/projects/memory_population_error/0.1_test.h5')
File "/Users/max/Documents/projects/memory_population_error/run_100_dump10_noram.py", line 40
    binary_population.evolve(tqdm=True)
File "/Users/max/Documents/POSYDON/posydon/popsyn/binarypopulation.py", line 213
    self._safe_evolve(**self.kwargs)
File "/Users/max/Documents/POSYDON/posydon/popsyn/binarypopulation.py", line 287
    binary = self.manager.from_hdf(index, restore=True).pop()
File "/Users/max/Documents/POSYDON/posydon/popsyn/binarypopulation.py", line 697
    hist = self.store.select(key='history', where=query_str)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/pandas/io/pytables.py", line 886
    s.infer_axes()
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/pandas/io/pytables.py", line 2809
    s = self.storable
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/pandas/io/pytables.py", line 3487
    return getattr(self.group, "table", None)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 798
    return self._f_get_child(name)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 685
    return self._v_file._get_node(childpath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 1550
    node = self._node_manager.get_node(nodepath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 411
    node = self.node_factory(key)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 1158
    return ChildClass(self, childname)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/table.py", line 808
    super().__init__(parentnode, name, new, filters, byteorder, _log,
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/leaf.py", line 264
    super().__init__(parentnode, name, _log)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/node.py", line 258
    self._g_post_init_hook()
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/table.py", line 845
    indexed = indexname in self._v_file
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 1997
    self.get_node(path)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 1603
    node = self._get_node(nodepath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 1550
    node = self._node_manager.get_node(nodepath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 411
    node = self.node_factory(key)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 1151
    return ChildClass(self, childname, new=False)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/index.py", line 381
    super().__init__(parentnode, name, title, new, filters)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 221
    super().__init__(parentnode, name, _log)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/node.py", line 258
    self._g_post_init_hook()
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/index.py", line 404
    indices = self.indices
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 798
    return self._f_get_child(name)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 685
    return self._v_file._get_node(childpath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 1550
    node = self._node_manager.get_node(nodepath)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/file.py", line 411
    node = self.node_factory(key)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/group.py", line 1158
    return ChildClass(self, childname)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/indexes.py", line 90
    super().__init__(
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/earray.py", line 143
    super().__init__(parentnode, name, atom, shape, title, filters,
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/carray.py", line 200
    super(Array, self).__init__(parentnode, name, new, filters,
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/leaf.py", line 264
    super().__init__(parentnode, name, _log)
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/node.py", line 251
    self._v_objectid = self._g_open()
File "/Users/max/anaconda3/envs/reversed_interference/lib/python3.11/site-packages/tables/array.py", line 221
    (oid, self.atom, self.shape, self._v_chunkshape) = self._open_array()
```

The trace comes specifically from reading the HDF5 file of the input binaries in table format (see this pandas issue). Since this input file stays open while the population is evolving, the memory slowly builds up.

Example of the build-up from multiple runs:

| Evolved binaries | Size |
| --- | --- |
| 10 | 550 KiB |
| 50 | 3151 KiB |
| 100 | 6349 KiB |
| 200 | 12.5 MiB |
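A sketch of the pattern that avoids this build-up, assuming the input binaries live under the 'history' key seen in the trace. Opening the store in a with-block means the underlying PyTables file (and the node cache it accumulates) is released after every read instead of staying open for the whole run:

```python
import pandas as pd

def load_history(path, index):
    # Open the input file, select one binary's history, and close the file again.
    with pd.HDFStore(path, mode="r") as store:
        return store.select(key="history", where=f"index == {index}")
```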
maxbriel commented 8 months ago

After doing a longer run with the fix, I still run into an out-of-memory issue with 4GB per CPU allocation. The run does continue for longer than previously; 5 metallicities compared to 3 originally. The run fails after approximately 13 hours. There must be another location where memory is leaked.

The run I'm doing is 1.000.000 binaries per metallicity for all 8 metallicities with a dump rate of 1.000 binaries. I am reading the input population from an HDF5 file one binary at a time.

Looking at the shorter runs in PR #299 and the previous exploration of the issue, this could be caused by the following.

It might be that switching between metallicities does not release/close some files and as such garbage collection is unable to clean up stuff. I believe that after each metallicity the BinaryPopulation is deleted, but the files related to the population (grids) are not explicitly closed. Could this lead to an increase of memory usage?

[Figure: memory usage of this run]
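A minimal sketch (not the actual POSYDON implementation) of the clean-up that could be tried between metallicities; `make_population` is a hypothetical factory standing in for however the per-metallicity population is constructed:

```python
import gc

def run_all_metallicities(metallicities, make_population):
    for met in metallicities:
        pop = make_population(met)   # build the BinaryPopulation for this metallicity
        pop.evolve()                 # evolve and dump as usual
        del pop                      # drop the reference before the next grids are loaded
        gc.collect()                 # give the collector a chance to reclaim interpolators/grids
```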
maxbriel commented 8 months ago

With the fix in PR #233, the memory issue deepens. Even when completely deleting a reference to the BinaryPopulation, the memory associated with the interpolators is not released.

Using gc.get_referrers and sys.getrefcount, I find that there are 2 references to the BinaryPopulation instead of the expected 1. (Note that sys.getrefcount will show 3, since that function creates another reference to the object.) This is also true when checking before entering the while loop in SyntheticPopulation.evolve().

A reference to the object remains and seems to be cyclical(?). Below is the output of gc.get_referrers(self.binary_population[i]) called inside the while loop.

Reference 1

```
{'__name__': '__main__', '__doc__': None, '__package__': None,
 '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x102a75350>,
 '__spec__': None, '__annotations__': {}, '__builtins__': ,
 '__file__': '/Users/max/Documents/projects/memory_population_error/multi_metallicity_run/run_population.py',
 '__cached__': None, 'PopulationRunner': , 'BinaryPopulation': , 'copy': , 'gc': , 'sys': ,
 'pop': , 'met': 2.0, 'i': , 'referrers': [{...}, ]}
```

(I removed the extra details printed by BinaryPopulation)

Reference 2

```
```

Note that in reference 1, Reference 2 is stated as a referrer to 1 under 'referrers'. Calling referrers of Reference 1 gives Reference 1 again.
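For completeness, a minimal, self-contained sketch of the reference check quoted above; `Dummy` stands in for the BinaryPopulation under inspection:

```python
import gc
import sys

class Dummy:
    """Stand-in for the BinaryPopulation being inspected."""

obj = Dummy()

# getrefcount reports one extra reference for its own argument, hence the -1.
print("reference count:", sys.getrefcount(obj) - 1)   # expected: 1

for ref in gc.get_referrers(obj):
    # For a module-level object, only the module's globals dict should show up here;
    # anything extra points at a lingering (possibly cyclic) reference.
    print(type(ref).__name__, repr(ref)[:120])
```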

maxbriel commented 8 months ago

With PRs #229, #233 and #234, I'm now hopefully able to do larger/longer population runs. Here is a short summary of the current state, because not all issues are completely resolved. I found 3 potential memory leaks:

  1. Reading input systems from HDF5 files (solved in PR #229)
  2. Multi-metallicity runs (hotfix in PR #234)
  3. Single-metallicity runs (potentially remains)

I believe 3 to still be present because a memory overflow occurred before the implementation of PR #233, which means that a binary run still overloaded with data from only 1 metallicity. This might be related to pandas reads on an open HDF5 file that doesn't get closed, as with the input binary reading. I suspect this since several grid files are kept open in the steps. This would only be an issue if the data is stored in table format.
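One way to probe this hypothesis is to list the HDF5 files PyTables still holds open after a few dump cycles. Note that `tables.file._open_files` is a private registry whose attributes may change between PyTables versions, so this is purely a debugging aid:

```python
import tables

def report_open_hdf5_files():
    registry = tables.file._open_files        # private PyTables registry of open files
    print(f"{len(registry)} HDF5 file handle(s) still open")
    for filename in registry.filenames:
        print("  ", filename)
```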

While 2 is solved, the implementation to do so is not great and we might want to think a bit more about how we're doing multi-metallicity runs.

maxbriel commented 2 months ago

During recent population synthesis runs, I've noticed some binary populations going out of memory again: i.e. 1.000.000 binaries per metallicity divided over 200 jobs, with a dump rate of 2.000 and a memory limit of 4 GB.

I don't fully understand how this can go over the limit. It seems to have gone out-of-memory after the first 2.000 binaries. Maybe 2.000 binaries is too close to the maximum memory limit.

astroJeff commented 3 weeks ago

Update (Aug 22): This is still an issue, and we should prioritize it. @maxbriel please provide some updates here.

maxbriel commented 3 weeks ago

I ran several tests near the end of July to check what is going on and where memory keeps being used in a binary population run at 1 metallicity with a single CPU.

Test 1

A normal population run.
Outcome: Out-of-Memory (4 GB limit)

The image below shows the memory usage over time (blue, in GB) and the writing to disk (orange, in MB). The monitoring wasn't able to capture the moment it went out of memory, but slurm killed the job with an Out-of-Memory code. This shows that:

  1. The binaries in memory are not the issue.
  2. There are random spikes in the memory usage.
  3. The initial increase comes from loading in the IF interpolators + other grid initialisation.
  4. The slower but clear increase afterwards comes from reading in specific single-star models.
  5. The slow later increase is the memory leak.

[Figure: blue = RSS memory usage (GB), orange = writes to disk (MB); the x-axis is "time-like"]

Test 2

ZAMS -> step_end.
Outcome: success (4 GB limit)
Walltime: 12:00:00

The plot below shows the memory usage and the writes to disk again.

  1. The read-in of the single-star models is not present here, since they're only read-in and stored when used.
  2. The slow increase in memory is still there.

[Figure: blue = RSS memory usage (GB), orange = writes to disk (MB); the x-axis is "time-like"]

Maybe the combination of the writes to file and the slow increase causes the issue by getting too close to 4 GB. However, I'm not sure how to find/address the slow increase.
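For future runs, a rough sketch of how curves like the ones above could be recorded from inside the job itself: poll the process RSS and the cumulative bytes written at a fixed interval in a background thread. psutil is an assumption here; the plots in this issue come from the cluster's own monitoring:

```python
import os
import threading
import time

import psutil

def monitor(interval=5.0, stop_event=None):
    proc = psutil.Process(os.getpid())
    while stop_event is None or not stop_event.is_set():
        rss_gb = proc.memory_info().rss / 1024**3               # "blue" curve
        written_mb = proc.io_counters().write_bytes / 1024**2   # "orange" curve
        print(f"t={time.time():.0f}s  RSS={rss_gb:.2f} GB  written={written_mb:.1f} MB", flush=True)
        time.sleep(interval)

# Start before BinaryPopulation.evolve() and set `stop` when the run finishes.
stop = threading.Event()
threading.Thread(target=monitor, kwargs={"stop_event": stop}, daemon=True).start()
```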