Crunch-io / crplyr

A 'dplyr' Interface for Crunch
https://crunch.io/r/crplyr/
GNU Lesser General Public License v3.0

Add a better error message when the user calls unimplemented functions like `mutate`, `rename`, etc. #5

Closed joewilliams-yg closed 5 years ago

joewilliams-yg commented 6 years ago

This stems from a discussion about using as.data.frame(..., force=TRUE).

The use case here is external weighting. I need to be able to manipulate a data.frame object in order to run a raking script on the dataset. I don't need or want to create variables in the actual client-facing dataset. If I use as.data.frame(..., force = TRUE) I get a data.frame object that I can manipulate. If I use as.data.frame(..., force = FALSE) I cannot manipulate the result to do common recodings.

Yet I have been told to use crplyr with as.data.frame(..., force = FALSE) to get the same functionality. That doesn't appear to be the case.

Should we expect as.data.frame(..., force=FALSE) to have the same level of functionality as force=TRUE?

dt <- as.data.frame(
    ds[c("identity", "gender", "age", "age4", "race4", "educ4",
         "presvote16x", "e14_presvote12", "pid3", "ideo3", "region",
         "votereg2", "app_dtrmp")],
    include.hidden = TRUE,
    force = FALSE
) %>%
    mutate(
        race3 = recode_factor(race4,
            'White' = 'White/Other',
            'Other' = 'White/Other',
            'Black' = 'Black',
            'Hispanic' = 'Hispanic'),
        educ3 = recode_factor(educ4,
            'HS or less' = 'HS or less',
            'Some college' = 'Some college',
            'College grad' = 'College degree',
            'Postgrad' = 'College degree'),
        educ2 = recode_factor(educ3,
            'HS or less' = 'No degree',
            'Some college' = 'No degree',
            'College degree' = 'College grad')
    )

Produces the following error:

Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "CrunchDataFrame"

gshotwell commented 6 years ago

Hi Joe,

Thanks for the report. You should not expect the two as.data.frame calls to behave the same way. With force = TRUE you get a regular R data frame, so everything in the tidyverse and other packages will work on the resulting object. With force = FALSE the result is a CrunchDataFrame, which has some data frame methods implemented, but not all of them. In this case we don't currently have mutate implemented in crplyr, so if you want to use that function you need force = TRUE to get the R data frame. One thing that makes this process a bit easier is the collect() function, which brings the data down from Crunch. So you can do something like:

ds %>% 
    select(var1, var2) %>%
    collect() %>% #Data is brought down from crunch at this point
    mutate(...) # Continue on using dplyr/tidyr on a local data frame. 

Our intention for crplyr was basically to make it easier to work with large datasets before pulling the data onto your local machine for further analysis. If you want your work to be reflected on the server, it's better to use rcrunch tools for that job. The way to think about it is that crplyr is good if you want to get the data out of Crunch and do something with it, while rcrunch is better if you want to manipulate the server-based Crunch dataset.
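To illustrate the local half of that workflow, here is a minimal sketch of the recodes from the original report running on an ordinary data frame. The `local_df` object is a hypothetical stand-in with made-up rows for whatever collect() or as.data.frame(..., force = TRUE) would return; only the column names and recode mappings come from the report.

```r
library(dplyr)

# Hypothetical stand-in for data already pulled down from Crunch.
# The rows are invented purely for illustration.
local_df <- data.frame(
  race4 = factor(c("White", "Other", "Black", "Hispanic")),
  educ4 = factor(c("HS or less", "Some college", "College grad", "Postgrad"))
)

# On a plain data.frame, mutate() and recode_factor() dispatch as usual.
recoded <- local_df %>%
  mutate(
    race3 = recode_factor(race4,
      "White" = "White/Other",
      "Other" = "White/Other",
      "Black" = "Black",
      "Hispanic" = "Hispanic"),
    educ3 = recode_factor(educ4,
      "HS or less" = "HS or less",
      "Some college" = "Some college",
      "College grad" = "College degree",
      "Postgrad" = "College degree")
  )
```

None of this touches the server-side dataset, which fits the external weighting use case: the recoded columns exist only in the local copy.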

joewilliams-yg commented 6 years ago

Ah, now I understand. I got confused because it was suggested that I modify my workflow to remove force=TRUE. I will try implementing select() %>% collect() %>% mutate(). Thanks!

gshotwell commented 6 years ago

Great. We still want to implement mutate, but in the meantime we should at least throw a better error for this case.
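One way such an error could look, as a rough base R sketch rather than the actual crplyr change: register an S3 method for the unsupported class that stops with an actionable message. The generic `mutate_demo` and the class `FakeCrunchDataFrame` are hypothetical stand-ins so the example runs without dplyr or a Crunch connection.

```r
# Toy generic standing in for dplyr::mutate (hypothetical name).
mutate_demo <- function(.data, ...) UseMethod("mutate_demo")

# Plain data frames: add each named argument as a column.
mutate_demo.data.frame <- function(.data, ...) {
  cols <- list(...)
  for (name in names(cols)) {
    .data[[name]] <- cols[[name]]
  }
  .data
}

# Stand-in for the server-backed class: instead of the generic
# "no applicable method" failure, explain what to do next.
mutate_demo.FakeCrunchDataFrame <- function(.data, ...) {
  stop(
    "`mutate()` is not implemented for ", class(.data)[1], " objects. ",
    "Call `collect()` or use `force = TRUE` to get a local data frame first.",
    call. = FALSE
  )
}

fake_cdf <- structure(list(), class = "FakeCrunchDataFrame")

# Capture the message instead of halting the script.
msg <- tryCatch(mutate_demo(fake_cdf, x = 1), error = conditionMessage)
print(msg)
```

The same pattern would apply to `rename` and the other unimplemented verbs named in the issue title.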