Memory leak issue with ESPEI

guannant commented 17 hours ago

Hi there,

I was running the Cu-Mg tutorial example on our HPC system and noticed that memory usage grew significantly as the iterations progressed. Specifically, it reached approximately 700 GB after 1,713 iterations (see the attached screenshot), and our system flagged the job for excessive memory consumption.

It appears that this high memory demand may stem from one or both of the following:

1. Retention of All Walker Positions: The emcee sampler in ESPEI incrementally retains references to all walker positions.
2. Accumulation of Intermediate Results: The storage of self.sampler.chain and self.sampler.lnprobability may contribute to the memory growth.
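
As a rough sanity check on the second point, the size of the stored arrays can be estimated from their shapes. This is only a back-of-envelope sketch: the 1,713 iterations come from the run above, while the parameter count (20) and walker count (40) are assumed values for illustration, not numbers taken from the Cu-Mg tutorial, and `chain_bytes` is a hypothetical helper.

```python
def chain_bytes(n_steps, n_walkers, n_params, itemsize=8):
    """Bytes needed for a float64 trace (steps x walkers x params)
    and the matching lnprob array (steps x walkers)."""
    trace = n_steps * n_walkers * n_params * itemsize
    lnprob = n_steps * n_walkers * itemsize
    return trace, lnprob


# 1,713 iterations is from the report; 40 walkers and 20 parameters are assumed.
trace, lnprob = chain_bytes(n_steps=1713, n_walkers=40, n_params=20)
print(f"trace: {trace / 1e6:.2f} MB, lnprob: {lnprob / 1e6:.2f} MB")
```

Plugging in the real walker and parameter counts from a run gives the share of total memory these two arrays could account for.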

To address this, I believe ESPEI could benefit from a mechanism to periodically save results to disk (e.g., every 100 iterations) and reset the emcee sampler to free memory.
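
A minimal sketch of that save-and-reset loop is below. It assumes a sampler object with the emcee 3 style methods `run_mcmc`, `get_chain`, `get_log_prob`, and `reset`; ESPEI may use a different emcee version or wrap the sampler differently, and `run_in_chunks` and `save_chunk` are hypothetical names used only for illustration.

```python
def run_in_chunks(sampler, initial_state, total_steps, chunk_size, save_chunk):
    """Run the sampler in blocks, handing each block to save_chunk and then
    clearing the sampler's stored chain so it does not keep growing.

    Assumes an emcee 3 style sampler (run_mcmc, get_chain, get_log_prob, reset).
    save_chunk(start_step, chain, log_prob) is a user-supplied callable.
    """
    state = initial_state
    done = 0
    while done < total_steps:
        nsteps = min(chunk_size, total_steps - done)
        # Continue from the last walker positions returned by the previous block.
        state = sampler.run_mcmc(state, nsteps)
        # Persist this block before discarding it from memory.
        save_chunk(done, sampler.get_chain(), sampler.get_log_prob())
        sampler.reset()
        done += nsteps
    return state
```

With `chunk_size=100`, the sampler would never hold more than 100 iterations at a time, and `save_chunk` decides where the data goes (for example, a file on disk).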

I am happy to contribute by developing an HDF5 output module for ESPEI to replace the current use of numpy.save(). This would enable periodic pruning of the emcee sampler and provide a more memory-efficient workflow.
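
To make the idea concrete, here is a rough sketch of appending blocks of samples to resizable HDF5 datasets with h5py. The dataset names `trace` and `lnprob`, the steps-first array layout, and the `append_chunk` helper are all assumptions for illustration; they are not ESPEI's existing file layout or API.

```python
import h5py
import numpy as np


def append_chunk(path, trace_chunk, lnprob_chunk):
    """Append one block of samples to resizable 'trace' and 'lnprob' datasets.

    Assumed layout: trace is (steps, walkers, params), lnprob is (steps, walkers).
    """
    blocks = {
        "trace": np.asarray(trace_chunk, dtype="f8"),
        "lnprob": np.asarray(lnprob_chunk, dtype="f8"),
    }
    with h5py.File(path, "a") as f:
        for name, block in blocks.items():
            if name not in f:
                # Start empty along the step axis and allow unlimited growth there.
                f.create_dataset(
                    name,
                    shape=(0,) + block.shape[1:],
                    maxshape=(None,) + block.shape[1:],
                    dtype="f8",
                    chunks=True,
                )
            dset = f[name]
            old = dset.shape[0]
            dset.resize(old + block.shape[0], axis=0)
            dset[old:] = block
```

Each call extends the file by one block, so the full trace lives on disk while only the current block needs to stay in memory.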

Let me know your thoughts on this!

bocklund commented 16 hours ago

Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: https://github.com/PhasesResearchLab/ESPEI/issues/230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.

That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the NumPy RNG state for reproducibility and smoother restarts.
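
For the RNG part, one possible approach is to serialize the generator state to text so it can sit alongside the arrays (for instance as a string attribute in the HDF5 file). The sketch below assumes a legacy `numpy.random.RandomState`; whether ESPEI's sampler actually uses that generator is an assumption, and `dump_rng_state` and `load_rng_state` are hypothetical helper names.

```python
import json

import numpy as np


def dump_rng_state(rng):
    """Serialize a legacy RandomState's state tuple to a JSON string."""
    name, keys, pos, has_gauss, cached_gaussian = rng.get_state()
    return json.dumps({
        "name": name,
        "keys": keys.tolist(),  # 624 uint32 values for MT19937
        "pos": int(pos),
        "has_gauss": int(has_gauss),
        "cached_gaussian": float(cached_gaussian),
    })


def load_rng_state(text):
    """Rebuild a RandomState that continues exactly where the saved one stopped."""
    s = json.loads(text)
    rng = np.random.RandomState()
    rng.set_state((
        s["name"],
        np.array(s["keys"], dtype=np.uint32),
        s["pos"],
        s["has_gauss"],
        s["cached_gaussian"],
    ))
    return rng
```

A restart would call `load_rng_state` on the stored string and pass the resulting generator back to the sampler, so the continued run draws the same random numbers as an uninterrupted one.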

guannant commented 11 hours ago

> Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: #230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.
>
> That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as archive the phase models, datasets, input YAML, and the NumPy RNG state for reproducibility and smoother restarts.

I see. What would be the recommended temporary fix here: restarting the scheduler or caching the symbols?

I can help out with the HDF5 output to combine at least the trace and lnprob arrays and make a pull request here. My project at ANL relies heavily on ESPEI, and we are also exploring the integration of different MCMC engines with ESPEI. Hopefully, this can be a good add-on feature to ESPEI in the future.

PhasesResearchLab / ESPEI

Memory leak issue with ESPEI #262