drsimonj / twidlr

data.frame-based API for model and predict functions
Other
59 stars 9 forks source link

Gamlss #25

Closed erblast closed 7 years ago

erblast commented 7 years ago

added gamlss

drsimonj commented 7 years ago

Thanks again! I've left few comments on code (above) for corrections. Also, I built your functions and ran following, only to receive an error:

d <- datasets::mtcars
fit <- gamlss(d, vs ~ hp + wt, family = gamlss.dist::BI())
predict(fit, data = d)

#> Error in data.frame(data, source = namelist) : 
#>  arguments imply differing number of rows: 32, 33 

Does this replicate for you?

erblast commented 7 years ago

sorry for the sloppy code from the last commit, i thought everything went fine on my computer but i am just getting used to package building functionality of Rstudio and i must have missed something. Especially the bug you found really annoyed me, because i could not figure out what was going on. This stackoverflow post

http://stackoverflow.com/questions/39125600/r-error-in-data-framedata-source-namelist-arguments-imply-differing-num

leads me to believe that gamlss:::predic.gamlss() searches the local namespace for a variable with the same name as the data variable that was passed to gamlss() to create the model object. I now attached the data to the model object and then pass it to gamlss:::predic.gamlss() . And this fixes the issue for me.

I do not understand now why this travis CI build test is failing. Can you make something of it? Do you think attaching the data to the model will unnecessarily blow up memory usage? How does that relate to pipelearner where I can find the original data in every row for each cross validation sample. Does R recongnise that it is storing the same object and always point to the same location in memory?

drsimonj commented 7 years ago

No worries. Problems like these are exactly one of the reasons why I created twidlr! For other examples, see this gist I started as a resource for twidlr motivations (code here to be run without twidlr). I've just added this gamlss issue (after working out what is going on).

The problem is that gamlss:::predict.gamlss is calling eval(Call$data) when newdata is used without data. This searches for the original data in the global environment which, as far as I'm concerned, is not ideal.

I played with few ideas without digging much deeper into gamlss:::predict.gamlss, but think that your solution of attaching the data is best for the moment. It may be a necessary evil for reliable results!

I have some ideas for improvement, however. First, to save some space, rather than attaching the entire original data frame, just use the variables included in the model as follows:

original_data <- data[all.vars(formula)]

Then attach this as an attribute rather than a new object in the list as follows:

attr(object, "original_data") <- original_data

Then, in predict, recover this original data via:

original_data <- attr(object, "original_data")

and feed this into the gamlss:::predict function.

For example of using attributes, see here.

See how you go implementing this, but don't worry if it's not exact. I'm happy to clean any details after your next push.

drsimonj commented 7 years ago

Also, please add some notes in the documentation about what is going so users don't confuse data (for new data), with data in the original predict function.

drsimonj commented 7 years ago

Sorry, forgot to answer some other questions:

erblast commented 7 years ago

Thanks for your quick reply, I was not aware of the issues from other packages that you mention in your gist at all. Its not really intuitive for me that a built model should depend on the original data at all, especially for regression models. Also thanks for posting the threads about the ::: issue, this also cleared up some things for me. I will implement the changes you suggested.

erblast commented 7 years ago

I had to bring the original_dataframe and the new_dataframe to the same dimensions in order for predict.gamlss() to work.

I am not sure what you want me to add as documentation. If you accidently pass newdata to twidlr::predict.gamlss() it will throw an error. Sould this be mentioned in the documentation (the help page?) or as comments in the actual code?

erblast commented 7 years ago

done

drsimonj commented 7 years ago

Merged! Congrats and thanks again for another nice contribution :)

I cleaned the code in 3fc3d56d4a2b173c4a57008d51a02c357e09fce8 , which you can take a look at if you're planning to add anything else.