Open guannant opened 17 hours ago
Hey, thanks for reporting in and sorry you’re having trouble. I’ve documented some poking around on causes and mitigations to the memory leak in this issue: https://github.com/PhasesResearchLab/ESPEI/issues/230. Given what I’ve seen there, I don’t think emcee or numpy are responsible for the leak.
That said, I have been interested in exploring HDF5 I/O to bundle the trace and lnprob arrays, as well as to archive the phase models, datasets, input YAML, and the NumPy RNG state for reproducibility and smoother restarts.
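The restart part of that idea can be sketched independently of the container format. The snippet below is only an illustration, not ESPEI code: `save_checkpoint` and `load_checkpoint` are hypothetical names, it assumes NumPy's `Generator` API (`np.random.default_rng`) rather than the legacy global state, and it uses a single `.npz` file as a stand-in for an HDF5 file.

```python
import json
import os
import tempfile

import numpy as np


def save_checkpoint(path, trace, lnprob, rng):
    """Bundle trace, lnprob and the RNG state into one .npz file (hypothetical helper)."""
    # The bit generator state is a plain dict of ints and strings, so JSON can hold it.
    rng_state = json.dumps(rng.bit_generator.state)
    np.savez(path, trace=trace, lnprob=lnprob, rng_state=np.array(rng_state))


def load_checkpoint(path):
    """Read the arrays back and rebuild a generator that continues the same stream."""
    with np.load(path) as data:
        trace = data["trace"]
        lnprob = data["lnprob"]
        rng_state = json.loads(str(data["rng_state"]))
    rng = np.random.default_rng()
    rng.bit_generator.state = rng_state  # resume exactly where the run stopped
    return trace, lnprob, rng
```

A run restored this way draws the same random numbers it would have drawn without the interruption, which is what makes the restart reproducible.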
I see. What would be the recommended temporary fix here? Restarting the scheduler, or caching the symbols?
I can help out with the HDF5 output to combine at least the trace and lnprob arrays and make a pull request here. My project at ANL relies heavily on ESPEI, and we are also exploring the integration of different MCMC engines with ESPEI. Hopefully, this can be a good add-on feature to ESPEI in the future.
Hi there,
I was running the tutorial example for Cu-Mg on our HPC system and noticed that memory usage grew significantly as the iterations progressed. Specifically, it reached approximately 700 GB after 1,713 iterations (see the attached screenshot), which led our system to flag the job for excessive memory consumption.
It appears that this high memory demand may stem from one or both of the following:
To address this, I believe ESPEI could benefit from a mechanism to periodically save results to disk (e.g., every 100 iterations) and reset the emcee sampler to free memory.
I am happy to contribute by developing an HDF5 output module for ESPEI to replace the current use of numpy.save(). This would enable periodic pruning of the emcee sampler and provide a more memory-efficient workflow.
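To make the proposal concrete, here is a rough sketch of the flush-and-reset pattern using h5py resizable datasets. It is not ESPEI's or emcee's implementation: `create_store`, `append_block`, and `run_with_flush` are hypothetical names, the array layout `(walkers, iterations, parameters)` is an assumption, and a plain random walk stands in for the sampler so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def create_store(f, nwalkers, ndim, chunk_steps=100):
    """Create empty datasets that can grow along the iteration axis."""
    f.create_dataset("trace", shape=(nwalkers, 0, ndim),
                     maxshape=(nwalkers, None, ndim), dtype="f8",
                     chunks=(nwalkers, chunk_steps, ndim))
    f.create_dataset("lnprob", shape=(nwalkers, 0),
                     maxshape=(nwalkers, None), dtype="f8",
                     chunks=(nwalkers, chunk_steps))


def append_block(f, trace_block, lnprob_block):
    """Append one block of iterations to the file and flush it to disk."""
    trace, lnprob = f["trace"], f["lnprob"]
    start = trace.shape[1]
    stop = start + trace_block.shape[1]
    trace.resize(stop, axis=1)
    lnprob.resize(stop, axis=1)
    trace[:, start:stop, :] = trace_block
    lnprob[:, start:stop] = lnprob_block
    f.flush()


def run_with_flush(path, nwalkers=4, ndim=2, total_steps=250, flush_every=100, seed=0):
    """Toy loop: only `flush_every` iterations are ever held in memory."""
    rng = np.random.default_rng(seed)
    position = rng.normal(size=(nwalkers, ndim))
    with h5py.File(path, "w") as f:
        create_store(f, nwalkers, ndim, chunk_steps=flush_every)
        done = 0
        while done < total_steps:
            nsteps = min(flush_every, total_steps - done)
            trace_block = np.empty((nwalkers, nsteps, ndim))
            lnprob_block = np.empty((nwalkers, nsteps))
            for i in range(nsteps):
                # Placeholder random walk, NOT a real MCMC proposal/accept step.
                position = position + 0.1 * rng.normal(size=position.shape)
                trace_block[:, i, :] = position
                lnprob_block[:, i] = -0.5 * np.sum(position**2, axis=1)
            append_block(f, trace_block, lnprob_block)
            # With emcee, this is where the stored chain would be cleared
            # (e.g. sampler.reset()) before continuing from the last position.
            done += nsteps
```

The in-memory buffers are recreated for every block, so peak memory is bounded by `flush_every` rather than by the total number of iterations.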
Let me know your thoughts on this!