AU-BURGr / UnConf2017

Repository for Unconf Topics 2017

Project: prunr package #10

Open camroach87 opened 7 years ago

camroach87 commented 7 years ago

I'm interested in creating a package to automatically prune large and unneeded components from objects to help with caching performance. The details and motivation are in this gist:

https://gist.github.com/camroach87/12b658afdd9f2d051721ad21311a960a

Thoughts, suggestions, package name improvements all welcome. Also - if someone has already created something like this please let me know because I will use it all the time :)

MilesMcBain commented 7 years ago

Yes. I've hit this and I know @jonocarroll has as well.

I note that lm, glm, and ML packages like xgboost and ranger have options to turn off some of the weighty pieces. I think it's probably better to use those options where possible, to avoid a memory spike during fitting that means we never even get to the pruning stage.
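As a rough illustration of the built-in options, here is a minimal sketch using base R's lm, whose `model` argument controls whether the model frame is stored in the fit. The simulated data and variable names are made up for the example, and how much is saved will depend on the data.

```r
# Simulated data, purely for illustration
set.seed(1)
n <- 1e4
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

# Default fit keeps a copy of the model frame in fit_full$model
fit_full <- lm(y ~ x, data = df)

# model = FALSE asks lm() not to store the model frame
fit_lean <- lm(y ~ x, data = df, model = FALSE)

# Compare the sizes reported by object.size()
print(object.size(fit_full), units = "Kb")
print(object.size(fit_lean), units = "Kb")
```

Note that object.size() does not count environments referenced by the object (for example via the formula), so it can understate what actually ends up in a cache.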

Maybe there is some kind of wrapper function that could enable all said options? I'm thinking along the lines of this package: https://github.com/rbertolusso/intubate

Bonus evil Hadley: https://twitter.com/hadleywickham/status/759412516539600896

MilesMcBain commented 7 years ago

Okay, so after reading your links I see the options are not as effective as one might hope. I still reckon that general architecture might be the right way to go, though. It could intercept the model object and set references to NULL as you suggest. We'd just have to somehow ensure garbage collection happens often enough.
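A minimal sketch of the "set references to NULL" idea on an lm fit follows. The choice of components to drop is an assumption for illustration, not a vetted list: plain predict() on new data still works here, but things like summary() and residual plots will not work on the stripped object.

```r
# Simulated data, purely for illustration
set.seed(1)
n <- 1e4
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

fit <- lm(y ~ x, data = df)
size_before <- object.size(fit)

# Keep reference predictions so we can check nothing we need was lost
new_df <- data.frame(x = c(-1, 0, 1))
pred_before <- predict(fit, newdata = new_df)

# Assumed-droppable components (illustrative only)
fit$model <- NULL
fit$residuals <- NULL
fit$fitted.values <- NULL
fit$effects <- NULL

# Ask R to collect the memory that is no longer referenced
invisible(gc())

size_after <- object.size(fit)
pred_after <- predict(fit, newdata = new_df)

print(size_before, units = "Kb")
print(size_after, units = "Kb")
```

The full object still has to be built before it can be stripped, so this does not avoid the memory spike during fitting mentioned earlier.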

camroach87 commented 7 years ago

Ha, I like the Hadley comment :) and intubate looks interesting - going to look into it a bit more. Yeah, I've played around with some of the trim options for those packages and didn't have much luck :(

By general architecture do you mean that we have a function that first looks at the object type and then removes components based on the type? i.e., there will be a set of rules for an lm object, a different set of rules for an xgboost object, etc.? So we'll end up with a list of supported packages. I quite like this as well and was considering it - it's definitely more robust than the other approach.
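One way the per-type rules could be sketched is with an S3 generic. The name `prune()` and the list of components dropped for lm are hypothetical, chosen only to show the shape of the idea; supported packages would each get their own method.

```r
# Hypothetical generic: dispatches on the class of the object
prune <- function(object, ...) UseMethod("prune")

# Fallback for unsupported classes: warn and return the object untouched
prune.default <- function(object, ...) {
  warning("No pruning rules for class '", class(object)[1],
          "'; returning object unchanged.")
  object
}

# Rules for lm objects (illustrative list of components, not a vetted one)
prune.lm <- function(object, ...) {
  drop <- c("model", "residuals", "fitted.values", "effects")
  for (nm in drop) {
    object[[nm]] <- NULL
  }
  object
}

# Usage on simulated data
set.seed(1)
df <- data.frame(x = rnorm(1e4))
df$y <- 2 * df$x + rnorm(1e4)

fit <- lm(y ~ x, data = df)
pruned <- prune(fit)

print(object.size(fit), units = "Kb")
print(object.size(pruned), units = "Kb")
```

Because glm objects also inherit from "lm", they would fall through to `prune.lm()` here unless a dedicated `prune.glm()` method is added.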

jonocarroll commented 7 years ago

I've been down this road a few times and (as the WinVector blog shows) there's lots of data that comes along for the ride and ends up in the final, very complex, model output object. I tend to find, though, that different analyses require different parts of that object, so I doubt that all of the components are redundant and can be removed across the board.

A few points I will note as I wait for my flight to BNE:

Food for thought, but an interesting project for sure. I'll be keen to hear what comes of it.