cjyetman opened 1 year ago
After a lot of experimentation, I found that a significant contributor to the slow behavior of data.prep is memory fragmentation. Every time one of the very large datasets is loaded into memory, R tries to find space in RAM for it. When the object is removed, R "releases" the space in RAM that it was occupying, but after a while there are not enough contiguous blocks of memory to efficiently load another large dataset. Additionally, while R "releases" the memory, the OS does not reclaim it, so the process's memory footprint continually increases, likely leading to swapping.
To test mitigating this, I tried wrapping chunks of the data.prep process in `callr()` statements, which run the code in a separate R process that completely exits when it finishes, fully releasing its memory back to the OS. This seemed to give a significant performance advantage, even though there is some overhead in starting multiple R sub-processes.

With this in mind, I think it may be a good strategy to isolate various chunks of data.prep using `callr()`, but with a well thought-out strategy for when/where each chunk is run, with the aim of starting each `dataprep_connect_abcd_with_scenario()` run from an R environment unburdened by heavy memory usage.
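As a rough sketch of the isolation pattern described above (the function names, file paths, and chunk contents here are hypothetical placeholders, not the actual data.prep code): each heavy step runs inside `callr::r()`, so all memory used by that step is returned to the OS when the child R process exits, and the parent session stays lightweight.

```r
library(callr)

# Run one heavy prep chunk in a fresh R process. Data is exchanged via
# files on disk rather than in-memory objects, so the parent never holds
# the large dataset. `some_heavy_prep()` and the paths are illustrative.
out_path <- callr::r(
  function(input_path, output_path) {
    data <- readRDS(input_path)      # large dataset exists only in the child
    result <- some_heavy_prep(data)  # hypothetical prep step
    saveRDS(result, output_path)
    output_path                      # return a small value, not the data
  },
  args = list(input_path = "abcd.rds", output_path = "abcd_prepped.rds")
)
```

The key design choice is passing file paths between chunks instead of returning large objects: returning the result itself would copy it back into the parent process and reintroduce the fragmentation problem.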
Thanks for investigating. In those screenshots, the first is "as-is", and the second is with `callr`?
roughly, yes
@cjyetman I'm happy to move forward with the `callr` strategy that you proposed. What do you think the next steps would be? Shall we have a call to discuss what/where/how these `callr` chunks should be defined? (The call doesn't need to be soon/urgent, of course.)
I'm struggling to find the scripts I experimented with, which complicates things a bit, but... I think some serious strategizing of how to implement it would be good, like deciding in what order different chunks can/should be done.
If you'd like to schedule a call with @AlexAxthelm and me, or make use of a Tech Review to discuss this sometime in the next few weeks, I'm open to it!
Shall we call this closed by #240 ?
https://github.com/RMI-PACTA/workflow.data.preparation/pull/240? I don't think so. That doesn't seem to actually reduce the memory consumption of the function; it just prevents leakage (from objects in the top-level environment).
I think that in order to actually close this one, we need to not process the "big blocks" of financial/scenario data, and instead process individual elements ("on-demand").
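A minimal sketch of what "on-demand" processing could look like, under the assumption that the big financial/scenario block can be split into independent per-element files (the file layout and `process_element()` are hypothetical): each element is loaded, processed, and written out in its own short-lived R process, so at no point does the full dataset sit in memory.

```r
library(callr)

# Iterate over individual elements instead of loading the whole block.
# Each element is handled in an isolated child process that exits (and
# releases its memory to the OS) before the next one starts.
element_files <- list.files("scenario_data", pattern = "\\.rds$", full.names = TRUE)

for (f in element_files) {
  callr::r(
    function(path) {
      x <- readRDS(path)                            # one element only
      saveRDS(process_element(x),                   # hypothetical step
              sub("\\.rds$", "_out.rds", path))
    },
    args = list(path = f)
  )
}
```

Whether the real data can be partitioned this cleanly is the open question; if elements share lookup tables, those would need to be loaded (cheaply) inside each child.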
Yeah, this is still something worth doing, hypothetically anyway.
Sounds good!
`dataprep_connect_abcd_with_scenario()` is the elephant in the room. It's hundreds of lines of code, difficult to understand, and torturously long to run. There's got to be a better way.

AB#10867