camroach87 opened this issue 7 years ago
Yes. I've hit this and I know @jonocarroll has as well.
I note that `lm`, `glm` and other ML packages like `xgboost` and `ranger` have options to turn off some of the weighty pieces (see the sketch below). I think it's probably better to use those options where possible, to avoid a memory spike during processing that means we never even get to the pruning stage.
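For concreteness, a minimal sketch of what those fit-time options look like for the base-R modelling functions (the `lm()`/`glm()` arguments below are real; how much they actually save varies by model):

```r
# Fit-time slimming via built-in arguments: don't store the model frame,
# design matrix, or response on the returned object.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars,
              model = FALSE, x = FALSE, y = FALSE)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial,
               model = FALSE, x = FALSE, y = FALSE)

# Compare against the defaults to see what (if anything) was saved.
object.size(fit_lm)
object.size(lm(mpg ~ wt + hp, data = mtcars))
```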
Maybe there is some kind of wrapper function that could enable all said options? I'm thinking along the lines of this package: https://github.com/rbertolusso/intubate
Bonus evil Hadley: https://twitter.com/hadleywickham/status/759412516539600896
Okay, so after reading your links I see the options are not as effective as one might hope. I still reckon that general architecture might be the right way to go though. It could intercept the model object and set references to NULL as you suggest (something like the snippet below). We'd just have to somehow ensure garbage collection happens often enough.
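To illustrate the intercept-and-NULL idea (purely a sketch; which components are safe to drop depends on what you need downstream):

```r
# Drop heavy components after fitting, then nudge the garbage collector so
# the freed memory is actually returned. Dropping $qr will break predict()
# and other downstream functions that need the decomposition, so this is
# only safe when you know which parts of the object you still need.
fit <- lm(mpg ~ ., data = mtcars)
fit$model <- NULL
fit$qr <- NULL
gc()
```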
Ha, I like the Hadley comment :) and `intubate` looks interesting - going to look into it a bit more. Yeah, I've played around with some of the trim options for those packages and didn't have much luck :(
By general architecture do you mean that we have a function that first looks at the object type and then removes components based on that type? i.e., there will be a set of rules for an `lm` object, a different set of rules for an `xgboost` object, etc. (something like the sketch below)? So we'll end up with a list of supported packages. I quite like this as well and was considering it - it's definitely more robust than the other approach.
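In R that type-dispatched design maps naturally onto an S3 generic with one method per supported class. A rough sketch (the function name and the components dropped here are illustrative only):

```r
# One generic, one method per supported model class, each encoding which
# components are safe to remove for that class.
prune <- function(object, ...) UseMethod("prune")

prune.lm <- function(object, ...) {
  object$model <- NULL   # stored model frame
  object$x <- NULL       # design matrix, if it was kept
  object$y <- NULL       # response, if it was kept
  object
}

prune.default <- function(object, ...) {
  warning("No pruning rules defined for class: ",
          paste(class(object), collapse = "/"))
  object
}

slim <- prune(lm(mpg ~ wt, data = mtcars))
```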
I've been down this road a few times and (as the WinVector blog shows) there's lots of data that comes along for the ride and ends up in the final, very complex, model output object. I tend to find, though, that different analyses require different parts of that object, so I doubt all of the components are redundant and can be removed in general.
A few points I will note as I wait for my flight to BNE:
Actual copies of the data are, as @MilesMcBain has shown elsewhere, merely references to the original (thank you R and your copy-on-modify structure) -- try creating a `tibble` full of "copies" of an object, models of that object, and extractions of the same, and you'll find they all point to the same memory address. For a large model data object, it's the residuals, `qr`, and other "not copies" that make the model object large.
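That sharing is easy to check (a small sketch; it assumes the lobstr package for inspecting addresses, though base R's `tracemem()` would show the same thing):

```r
library(lobstr)  # obj_addr() reports where an object lives in memory

x <- rnorm(1e6)
copies <- list(a = x, b = x)   # "copies" held in a list / tibble column

# All three addresses are identical: nothing has been duplicated yet.
c(obj_addr(x), obj_addr(copies$a), obj_addr(copies$b))
```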
The best approach I've found, which also addresses the "naming things is hard" issue, is to do all of this within the `tibble`/`purrr` approach: `data`, `model`, and "thing I want" (extracted from the `model` column within the `tibble`) as `map`ped columns, then drop `model` (see the sketch below). Sure, it uses up a lot of memory while generating it, but you'd need that memory to create the standard model object anyway. This relies on you knowing what you want to keep, but that's somewhat part and parcel of doing this either way.
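Something along these lines (column names and the extracted quantity are illustrative; assumes dplyr, tidyr and purrr are available):

```r
library(dplyr)
library(tidyr)
library(purrr)

results <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                    # one `data` list-column row per group
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    coefs = map(model, coefficients)            # the "thing I want"
  ) %>%
  select(-model)                                # drop the heavy column before caching
```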
The last alternative would be to create a new `glm` (or whatever model) function which creates a leaner output (sketched below). This can be domain-specific, since your `prune` would drop things anyway. This way the object never becomes larger than you want, rather than getting too big and then pruning back.
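A hypothetical wrapper along those lines (the name `lean_glm` and the `keep` whitelist are made up for illustration; some downstream methods will complain about the missing pieces):

```r
# Fit with the built-in slimming arguments, then keep only a whitelist of
# components, so the object is never bigger than what you intend to cache.
lean_glm <- function(..., keep = c("coefficients", "family", "terms",
                                   "aic", "deviance", "rank", "xlevels")) {
  fit <- glm(..., model = FALSE, x = FALSE, y = FALSE)
  fit[setdiff(names(fit), keep)] <- NULL
  fit
}

small_fit <- lean_glm(am ~ wt, data = mtcars, family = binomial)
```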
Food for thought, but an interesting project for sure. I'll be keen to hear what comes of it.
I'm interested in creating a package to automatically prune large and unneeded components from objects to help with caching performance. Details and motivation in this gist: https://gist.github.com/camroach87/12b658afdd9f2d051721ad21311a960a
Thoughts, suggestions, package name improvements all welcome. Also - if someone has already created something like this please let me know because I will use it all the time :)