zcstarr opened 7 months ago
@emanuellima1 @danlessa, I have a proposal to fix some of cadCAD's perf issues. I debugged this back in 0.4.28 while working on a client project that involved a lot of simulation data. The problem is essentially that ProcessPool is being used incorrectly.
I think this will fix your issue @danlessa with #350. In the issue I've listed what I think the problems are, and the PR is the fix I implemented.
I wanted to know if this is something you're interested in fixing in cadCAD, and whether it would make sense to go the whole way: fix the lazy loading of the dataframe at the end and make writing the imm. files to disk optional?
Curious to know what you think.
Hey @zcstarr, that would be awesome! You could potentially use the additional_objs parameter for toggling it off or on. See PR #316 for an example.
@zcstarr going the whole way is definitely worth the time. It would also be nice to develop standardized benchmarks for overall simulation execution time and memory usage (both RAM and disk).
@danlessa, thanks for looking at these and for the quick response! Yeah, I think an option makes sense as well. I played around with a few things: I created a very small sim with 1 parameter and 1 state, made that state large (around 100 kB to 1 MB), then did multiple runs to see if I could spot a difference in memory performance with the memory profiler. There's a definite drop during the simulation itself, but reading the result back from disk is probably the biggest issue.
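To make the comparison concrete, here's roughly the kind of measurement I mean, as a sketch; the simulation body and state size below are placeholders for the real 1-param/1-state cadCAD model, not the actual benchmark code:

```python
# Rough sketch of the memory measurement described above (not the actual
# benchmark): one state variable padded to ~100 kB or ~1 MB, with peak
# memory sampled by memory_profiler across a run.
from memory_profiler import memory_usage

STATE_SIZE_BYTES = 100 * 1024  # toggle between ~100 kB and ~1 MB

def run_simulation():
    # Placeholder for a cadCAD run with a single large state variable;
    # in the real test this would invoke the Executor on the tiny model.
    big_state = {"payload": "x" * STATE_SIZE_BYTES}
    return [dict(big_state, timestep=t) for t in range(1000)]

# memory_usage samples RSS (in MiB) while the callable runs.
samples = memory_usage((run_simulation, (), {}), interval=0.1)
print(f"peak memory: {max(samples):.1f} MiB")
```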
I was able to sketch out and get working a fully incremental/lazy-load process for the simulation itself, everything except returning the results. Looking further down the road and thinking about easy_run and the executor, I could see wanting a way to handle simulations whose data is simply too large. The larger the simulation state and params, the more problematic it is to process the run data.
It's easy to run out of RAM when running a large enough simulation. I'm trying to think through whether it would make sense to have a serialization option that lets users lazily load/write large datasets. I'm not sure what that should be 🤔
In a semi-ideal world, you'd be able to write the results to disk in a way that can be lazy-loaded or computed on demand, so you don't have to pull the entire dataset into memory. I'm thinking about https://github.com/vaexio/vaex or maybe https://docs.dask.org/en/stable/dataframe.html.
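As a rough sketch of what I have in mind with dask (the results/ layout and the "timestep" column are assumptions, not cadCAD's current output format):

```python
# Sketch: write each run's results to Parquet, then lazy-load with dask.
import pandas as pd
import dask.dataframe as dd

def write_run(records, run_id):
    # records: list of per-timestep state dicts for one run
    pd.DataFrame.from_records(records).to_parquet(f"results/run_{run_id}.parquet")

# Nothing is read into memory here; dask only builds a task graph.
ddf = dd.read_parquet("results/*.parquet")

# Materialise only the slice you need instead of the full dataset.
tail = ddf[ddf["timestep"] > 900].compute()
print(tail.shape)
```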
Probably the way forward is to make this an additional_objs option for writing temp files to disk, and another for writing the results of the simulation to disk in some format.
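Something like the following is what I'd picture for the configuration surface, following the additional_objs pattern @danlessa pointed at in PR #316; the key names ("write_temp_to_disk", "lazy_results") are placeholders, not an agreed API:

```python
# Hypothetical toggles riding on additional_objs (key names are placeholders).
from cadCAD.engine import ExecutionMode, ExecutionContext, Executor

exec_context = ExecutionContext(
    context=ExecutionMode().local_mode,
    additional_objs={
        "write_temp_to_disk": True,  # spill intermediate results to temp files
        "lazy_results": True,        # hand back a lazily loadable handle, not a full DataFrame
    },
)
# run = Executor(exec_context=exec_context, configs=configs)
# raw_result, tensor_fields, sessions = run.execute()
```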
@danlessa @emanuellima1, just tagging an update here. I refactored things to lazily evaluate and was able to save a lot of runtime memory.
The basic gist, in order:
- before any changes (including the parallel processing change)
- the parallel processing change
- the lazy evaluation change
The simulation is in examples/documentation/headless_tools.py; I try simulating a 100-200 kB state + parameters simulation for a 10-year daily run over 2 years.
Memory usage goes from 1.1 GB currently to 117 MB with lazy evaluation; once loaded into a DataFrame it's 846 MB.
There is a segment of code that didn't make much sense to me; I assumed it was maybe left over from a few versions ago. Since this is new code, I thought I'd just support the most standard use cases, so feel free to let me know if those additional configurations or scenarios are still necessary.
Just let me know what you think. I also made lazy eval switchable, and it can only be enabled for local parallel processing.
Summary: When cadCAD runs in parallel mode, it underutilizes CPUs: somewhere around 75% of available CPUs go vacant. It's not that it doesn't try to use the CPUs, it's that it thrashes them by creating a new process pool for every config it wants to run in parallel. This in turn causes the process manager to thrash, because it constantly allocates and then frees memory.
Motivation: a cadCAD performance increase that makes it possible to utilize 100% of the CPU in a parallel run, which can save hours off a large simulation. It will also prevent too many process file handles from being opened during execution of a simulation, which I believe might be related to #350.
Solution: Refactor execution.py to use a single process pool, as the package authors intended: create the pool once and reuse workers as they become available, instead of creating a new pool per config. I also suggest including an option to write intermediate results to disk and to read them back without loading the entire dataset into memory.
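In simplified form, the shape of the change looks like this; simulate_config and configs are stand-ins for the real execution.py machinery, not the code in the PR:

```python
# Simplified illustration: one pool created up front and reused for every
# config, instead of a fresh pool per config.
from multiprocessing import Pool, cpu_count

def simulate_config(config):
    # stand-in for running one simulation config end to end
    ...

def run_all_before(configs):
    # Before: a new pool per config -> constant process spawn/teardown,
    # and most cores sit idle between pools.
    results = []
    for config in configs:
        with Pool(cpu_count()) as pool:
            results.append(pool.apply(simulate_config, (config,)))
    return results

def run_all_after(configs):
    # After: one pool for the whole batch; a worker picks up the next
    # config as soon as it frees up, so the CPUs stay saturated.
    with Pool(cpu_count()) as pool:
        return pool.map(simulate_config, configs)
```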
I found that once I increased the parallelization, it was really easy to run out of memory for a large simulation, simply because all intermediate results were being held in memory. My temporary solution is to write these intermediate results to temporary files on disk. This prevents processes from running out of memory in a highly parallel environment (think 16 cores or more). The downside is that when the simulation is complete, cadCAD currently requires you to load everything back into memory, which is memory intensive as well.
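The spill-to-disk idea, sketched under the assumption that each worker can just pickle its run and hand back a path (file format and naming are illustrative, not what the PR literally does):

```python
# Each worker writes its run's records to a temp file and returns only the
# path, so the parent process never holds every run in RAM at once.
import pickle
import tempfile

def spill_result(records):
    # records: per-timestep state dicts produced by one simulation run
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".pkl", delete=False) as tmp:
        pickle.dump(records, tmp, protocol=pickle.HIGHEST_PROTOCOL)
        return tmp.name  # a small string handed back instead of the data
```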
The solution here is to continue the refactor so the final data loads into memory iteratively rather than all at once. My initial experiments to fix this problem were with cadCAD 0.4.28. I'm hopeful that the datacopy enhancement would reduce the overall memory footprint, which might make this less likely to be an issue, but I think the real solution is to load it iteratively.
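Iterative loading could be as simple as a generator over the spilled files, so only one run is ever resident at a time; again a sketch, with paths and format matching the temp-file example above:

```python
# Stream runs back one at a time instead of concatenating everything up front.
import pickle

def iter_results(paths):
    for path in paths:
        with open(path, "rb") as f:
            yield pickle.load(f)

# Example: reduce over runs without building the full DataFrame.
# total_timesteps = sum(len(run) for run in iter_results(result_paths))
```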
I'd be happy to go down the route of making this configurable. The PR I've written automatically writes to and reads from disk; this should be a config option. I wanted to know if this is a direction worth going; if so, I can make it production-worthy.