RMI-PACTA / pacta.data.preparation

The goal of {pacta.data.preparation} is to prepare and format all input datasets required to run the PACTA for investors tools.
https://rmi-pacta.github.io/pacta.data.preparation/

refactor and optimize `dataprep_connect_abcd_with_scenario()` #7

Open cjyetman opened 1 year ago

cjyetman commented 1 year ago

`dataprep_connect_abcd_with_scenario()` is the elephant in the room. It's hundreds of lines of code, difficult to understand, and torturously long to run. There's got to be a better way.

AB#10867

cjyetman commented 2 months ago

After a lot of experimentation, I found that a significant contributor to the slow behavior of data.prep is memory fragmentation. Every time one of the very large datasets is loaded into memory, R tries to find space in RAM for it. When the object is removed, R "releases" the space in RAM that it was occupying, but after a while there are not enough contiguous blocks of memory to efficiently load another large dataset. Additionally, while R "releases" the memory, the OS does not reclaim it, so the session's memory footprint continually grows, likely leading to memory swapping.
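
A small illustration of that behaviour (illustrative only, not data.prep code): even after a large object is removed and a full garbage collection runs, the OS-level footprint of the R session typically stays elevated.

```r
# Illustrative only: allocate and drop a large object, watching R's own
# memory accounting via gc().
x <- matrix(rnorm(5e7), ncol = 100)  # roughly 400 MB of doubles
print(gc())                          # memory R reports in use while x exists
rm(x)
print(gc(full = TRUE))               # R frees the pages internally, but the
                                     # OS-level footprint of the session
                                     # (e.g. as shown by `top`) often remains
                                     # elevated
```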

To test mitigating this, I tried wrapping chunks of the data.prep process in callr calls, which run the code in a separate R process that exits completely when it finishes, fully releasing its memory to the OS. This seemed to give a significant performance advantage, even though there is some overhead in starting multiple R subprocesses.

With this in mind, I think it may be a good strategy to isolate the various chunks of data.prep using callr, but with a well-thought-out plan for when/where each chunk is run, with the aim of starting each `dataprep_connect_abcd_with_scenario()` run from an R environment unburdened by heavy memory usage. A rough sketch of the pattern is below.
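
For concreteness, a minimal sketch of that isolation pattern using `callr::r()`; the wrapper name and the file-path-based step are illustrative, not the actual data.prep code:

```r
library(callr)

# Hypothetical wrapper: run one heavy data.prep step in a fresh R process via
# callr::r(), so all memory the step allocates is returned to the OS when the
# subprocess exits.
run_isolated <- function(step_fun, args = list()) {
  callr::r(func = step_fun, args = args)
}

# Illustrative usage: pass file paths in and out instead of large in-memory
# objects, so the parent session never holds the heavy intermediate data.
input  <- tempfile(fileext = ".rds")
output <- tempfile(fileext = ".rds")
saveRDS(mtcars, input)

run_isolated(
  function(in_path, out_path) {
    dat <- readRDS(in_path)
    # ... the heavy connect-ABCD-with-scenario work would happen here ...
    saveRDS(dat, out_path)
    invisible(NULL)
  },
  args = list(in_path = input, out_path = output)
)
```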

(Two screenshots attached: 2024-05-06 at 16:20:31 and 2024-05-06 at 15:45:16.)
AlexAxthelm commented 2 months ago

Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

cjyetman commented 2 months ago

> Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

roughly, yes

jdhoffa commented 2 months ago

@cjyetman I'm happy to move forward with the callr strategy that you proposed; what do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?

(The call doesn't need to be soon/ urgent of course)

cjyetman commented 2 months ago

> @cjyetman I'm happy to move forward with the callr strategy that you proposed; what do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?
>
> (The call doesn't need to be soon/ urgent of course)

I'm struggling to find the scripts I experimented with, which complicates things a bit, but... I think some serious strategizing about how to implement this would be good, like deciding in what order the different chunks can/should be run.

jdhoffa commented 2 months ago

If you'd like to schedule a call with @AlexAxthelm and me, or make use of a Tech Review to discuss this sometime in the next few weeks, I'm open to it!

jdhoffa commented 3 weeks ago

Shall we call this closed by #240 ?

AlexAxthelm commented 3 weeks ago

https://github.com/RMI-PACTA/workflow.data.preparation/pull/240? I don't think so. That doesn't seem to actually reduce the memory consumption of the function; it just prevents leakage (from objects in the top-level environment).

I think that in order to actually close this one, we need to stop processing the "big blocks" of financial/scenario data all at once, and instead process individual elements ("on demand").
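
A rough sketch of what that could look like (the grouping variable, column names, and join keys here are assumptions for illustration, not the package's actual API): rather than joining the full ABCD and scenario tables in one shot, handle one sector at a time and write each result to disk, so only a single sector's worth of data is ever held in memory.

```r
# Hypothetical "on-demand" processing: join ABCD and scenario data one sector
# at a time and write each result out immediately.
connect_by_sector <- function(abcd, scenario, out_dir) {
  for (sec in unique(abcd$sector)) {
    abcd_sec     <- abcd[abcd$sector == sec, , drop = FALSE]
    scenario_sec <- scenario[scenario$sector == sec, , drop = FALSE]
    res <- merge(abcd_sec, scenario_sec, by = c("sector", "technology", "year"))
    saveRDS(res, file.path(out_dir, paste0("abcd_scenario_", sec, ".rds")))
    rm(abcd_sec, scenario_sec, res)
  }
  invisible(out_dir)
}
```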

cjyetman commented 3 weeks ago

Yeah, this is still something worth doing, hypothetically anyway.

jdhoffa commented 3 weeks ago

Sounds good!