maxbriel opened 8 months ago
I've performed a few tests to check what the memory consumption is with different options.
First, I performed 4 different runs:

- `dump_rate=10` while `ram_per_cpu=None`
- `dump_rate=1` while `ram_per_cpu=None`
- `dump_rate=1` while `ram_per_cpu=4`
Both runs 2 and 4 keep all binaries in memory, as expected from the code. [Is this wanted behaviour?]
However, run 3 shows unexpected behaviour: its memory usage keeps increasing, even while binaries are written to file.
The figure below shows the calls to `save` for this run between the blue brackets. It's clear that even though the file gets saved, the memory usage slowly keeps increasing.
This Out-of-Memory error is difficult to track down. Here's a summary:
- Is the memory held by `h5py` and only released once in a while?
- When reproducing the issue on a local machine, `memory_profiler` shows the memory usage initially increasing and eventually dropping significantly, probably due to swapping by the operating system. As such, it's difficult to recreate the problem.
On the cluster, however, the memory keeps increasing without any of it being released; see below. This is a multi-metallicity run with 2 job arrays, a total of 100 binaries, and a dump rate of 10 binaries. Both usages look the same: no memory gets released during the run.
Here's a longer run, captured mid-run, with many binaries. The memory usage is just slowly creeping up.
Manually calling `gc.collect()` every loop in `BinaryPopulation.evolve()` does not seem to resolve this issue, meaning that references to these items remain, or they're small allocations that are earmarked as never to be released (as per the Python standard).
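For context, the manual collection looked roughly like this; a minimal sketch only, where `binaries`, `dump_rate`, and `save` stand in for POSYDON's actual population, dump rate, and save-to-file call:

```python
import gc

def evolve_with_forced_gc(binaries, dump_rate, save):
    """Sketch of the binary loop with a forced collection each iteration."""
    for i, binary in enumerate(binaries, start=1):
        binary.evolve()
        if i % dump_rate == 0:
            save()        # write the finished binaries to file
        gc.collect()      # force a full (cyclic) collection every iteration
        # Even with this call every loop, the RSS keeps growing, so the
        # remaining allocations are either still referenced somewhere or are
        # small blocks the CPython allocator never returns to the OS.
```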
The `tracemalloc` snapshots indicate 2 larger memory usage areas inside the binary loop. In the `detached_step`, grids get read from file, and an `h5py` dependency allocates an increasing amount of memory. There are also some smaller continuous increases per loop, where it is unclear if these ever get released or if they can even be released.
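For reference, the comparison was done along these lines; a minimal `tracemalloc` sketch, with the snapshot points chosen for illustration rather than matching the exact instrumentation:

```python
import tracemalloc

tracemalloc.start(25)                    # keep up to 25 frames per allocation
before = tracemalloc.take_snapshot()

# ... evolve one or a few binaries here ...

after = tracemalloc.take_snapshot()
# Largest net allocations between the two snapshots, grouped by source line;
# this is what points at the grid/HDF5 reads inside the loop.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```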
I've been able to track down the origin of this.
It turns out to be specific to reading the binaries from an HDF5 file and (re-)evolving them. The plot below shows two runs of 100 binaries per metallicity, where the first randomly generates the binaries and the second reads them from an HDF5 file and evolves them. Although there is some increase with the random generation, it is not as extreme as with reading from the HDF5 file and might not be a problem in longer runs.
Here's the trace of the offending line:
The trace comes specifically from reading the HDF5 file of the input binaries in `table` format (see this pandas issue). Since this input file stays open while evolving the population, the memory slowly builds up.
Example of the build-up from multiple runs:

| Evolved binaries | Size |
|---|---|
| 10 | 550 KiB |
| 50 | 3151 KiB |
| 100 | 6349 KiB |
| 200 | 12.5 MiB |
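One way to avoid this is to not keep the input store open for the whole run; a minimal sketch of that pattern (the key name and slicing are illustrative, not necessarily how the actual fix is implemented):

```python
import pandas as pd

def read_input_binaries(path, start, stop, key="oneline"):
    """Read a slice of the input population without keeping the file open.

    `key`, `start`, and `stop` are placeholders; the real reader may differ.
    """
    # The context manager closes the underlying PyTables handle as soon as
    # the slice has been read, so repeated reads do not accumulate the
    # allocations that an open table-format store keeps holding on to.
    with pd.HDFStore(path, mode="r") as store:
        return store.select(key, start=start, stop=stop)
```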
After doing a longer run with the fix, I still run into an out-of-memory issue with a 4 GB per CPU allocation. The run does continue for longer than previously: 5 metallicities compared to 3 originally. It fails after approximately 13 hours. There must be another location where memory is leaked.
The run I'm doing is 1,000,000 binaries per metallicity for all 8 metallicities with a dump rate of 1,000 binaries. I am reading the input population from an HDF5 file one binary at a time.
Looking at the shorter runs in PR #299 and the previous exploration of the issues, this could be caused by the switch between metallicities.
It might be that switching between metallicities does not release/close some files, and as such garbage collection is unable to clean them up. I believe that after each metallicity the `BinaryPopulation` is deleted, but the files related to the population (grids) are not explicitly closed. Could this lead to an increase in memory usage?
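If that hypothesis is right, an explicit cleanup between metallicities might help; a rough sketch only (POSYDON has no such teardown call that I know of, and the PyTables registry used below is private):

```python
import gc
import tables  # PyTables, the backend behind pandas' table-format HDF5 files

pop = object()    # placeholder for the per-metallicity BinaryPopulation

# ... evolve the population for this metallicity ...

# After finishing one metallicity:
pop = None        # drop the last reference to the population
gc.collect()      # collect any reference cycles it was part of

# Last resort / diagnostic: close every HDF5 file PyTables still has open.
# `_open_files` is a private PyTables registry, so this is a blunt tool, but
# it shows whether pandas/PyTables-backed files survive deleting the
# population (files opened directly through h5py need their own close()).
tables.file._open_files.close_all()
```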
With the fix in PR #233, the memory issue deepens. Even when completely deleting a reference to the `BinaryPopulation`, the memory associated with the interpolators is not released.
Using `gc.get_referrers` and `sys.getrefcount`, I find that there are 2 references to the `BinaryPopulation` instead of the expected 1. (Note that `sys.getrefcount` will show 3, since that function creates another reference to the object.) This is also true when checking this before entering the `while` loop in `SyntheticPopulation.evolve()`.
A reference to the object remains and seems to be cyclical(?).

`gc.get_referrers(self.binary_population[i])` called inside the `while` loop. Note that in Reference 1, Reference 2 is listed as a referrer under `'referrers'`. Calling `get_referrers` on Reference 1 gives Reference 1 again.
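For reference, the counts and referrers above were obtained with the standard library, roughly like this (a generic helper, not POSYDON code):

```python
import gc
import sys

def inspect_refs(obj):
    # sys.getrefcount reports one extra reference: the temporary created by
    # passing `obj` into getrefcount itself, hence 3 instead of 2 above.
    print("refcount:", sys.getrefcount(obj))

    # Objects that directly hold a reference to `obj`. If a referrer shows up
    # (directly or indirectly) among its own referrers, the chain is cyclical
    # and only the cycle collector, not plain reference counting, can free it.
    for i, ref in enumerate(gc.get_referrers(obj)):
        print(f"referrer {i}: {type(ref).__name__}")
```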
With PRs #229, #233, and #234, I'm now hopefully able to do larger/longer population runs. I will provide a short summary of the state of this, because not all issues are completely resolved. I found 3 potential memory leaks:
I believe leak 3 is still present, because a memory overflow already occurred before the implementation of PR #233, which means that a run with data from only 1 metallicity still overloaded. This might be related to `pandas` reads on an open HDF5 file that doesn't get closed, as with the input binary reading. I suspect this because several grid files are kept open in the steps. This would only be an issue if the table is stored in `table` format.
While leak 2 is solved, the implementation is not great, and we might want to think a bit more about how we're doing multi-metallicity runs.
During recent population synthesis runs, I've noticed some binary populations going out of memory again: i.e., a dump rate of 2,000, divided over 200 jobs, with 1,000,000 binaries per metallicity and a memory limit of 4 GB.
I don't fully understand how this can go over the limit. It seems to have gone out of memory after the first 2,000 binaries. Maybe 2,000 binaries is too close to the maximum memory limit.
Update (Aug 22): This is still an issue, and we should prioritize it. @maxbriel please provide some updates here.
I ran several tests near the end of July to check what is going on and where memory keeps being used in a binary population run at 1 metallicity with a single CPU.
A normal population run
Outcome: Out-of-Memory (4 GB limit)
The image below shows the memory usage over time in blue (in GB) and the writing to disk in orange (in MB). The monitoring wasn't able to capture the moment it went out of memory, but slurm killed the job with an Out-of-Memory code.
blue: the RSS memory usage (in GB). Orange: the writing to disk (in MB). x-axis is "time-like"
ZAMS -> step_end
Outcome: success (4 GB limit)
Walltime: 12:00:00
The plot below shows the memory usage and the writes to disk again.
blue: the RSS memory usage (in GB). Orange: the writing to disk (in MB). x-axis is "time-like"
Maybe the combination of writes to file and the slow increase causes the issue by coming too close to 4 GB. However, I'm not sure how to find/address the slow increase.
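As an aside, this kind of RSS trace can be reproduced by periodically sampling the process memory; a minimal sketch using `psutil` (the actual monitoring behind the plots above may differ):

```python
import os
import time

import psutil

proc = psutil.Process(os.getpid())  # or Process(<pid>) to watch another job
GIB = 1024 ** 3

# Sample the resident set size once a minute; logging this next to the bytes
# written to disk gives curves like the ones shown above.
for _ in range(720):                # ~12 hours of samples
    rss = proc.memory_info().rss / GIB
    print(f"{time.strftime('%H:%M:%S')}  RSS = {rss:.2f} GiB", flush=True)
    time.sleep(60)
```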
When running a large population on a cluster, the `BinaryPopulation` goes out of memory even when the slurm memory per CPU and the maximum RAM per CPU in the population file are set to the same value.
My expectation is that some binaries take up more memory than expected, probably due to the "looping" issue described in Issue #194. As such, slurm kills the job before POSYDON can dump the binary into a file. However, I have not investigated this in detail.
@ka-rocha have you run into this issue? Or do you have suggestions on the possible origin of this?