WinVector / replyr

Patches for using dplyr with Databases and Big Data
https://winvector.github.io/replyr/
Other
67 stars 12 forks source link

moveValuesToColumns using remote table returns Error #8

Closed asantucci closed 7 years ago

asantucci commented 7 years ago

I was under the impression Re Plyr offered reshaping for remote tables, but perhaps I am incorrect?

Consider moveValuesToColumns and moveValuesToRows. Each of these functions in turn calls checkColsFormUniqueKeys, and this function in turn calls nrow(data), which returns NA for remote tables. To extend support for remote tables, we'd need to (at the very least) replace this line in question with something like data %>% ungroup %>% summarise(n()).

Curious to hear your thoughts.

Edit: Looks like I should be using moveValuesToColumnsQ and that this is a misunderstanding on my part. Will follow up if I in fact continue to have any issue, but perhaps a more descriptive function name and or documentation would be in order to point readers to the right remote table counterparts. Thanks for all your hard work!

asantucci commented 7 years ago

Would be nice if there were a more clear example in the documentation of "what the appropriate query" would be to replicate behavior similar to spread or gather. Working this out now, although I currently have a workflow like:

dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise

and it's not clear how I can then feed this resulting dt object into the moveValuesToColumnsQ function call.

edit: looks like I should be using compute to store a temporary table in the DB and then pass the string of the resulting temporary table to moveValuesToColumnsQ...

JohnMount commented 7 years ago

I am developing more documentation as we go, sorry I am not caught up yet. I am also trying to improve the package error messages. I think you found the right issues, but I'll go over them.

Also, thank you for your patience. I know it is frustrating to see someone (myself) promote a package and then have trouble adapting it to a non-trivial workflow.

1) For remote sources you need to use one of moveValuesToColumnsQ() or moveValuesToRowsQ(). Without the Q you are actually calling the cdata wrappers for tidyr which only work on local data. The replyr versions are careful not to call nrow(), and do not check for unique keying, as that may be expensive on larger data. 2) moveValuesToColumnsQ or moveValuesToRowsQ() take table names. So, yes please land the table to a known name by a dplyr::compute(name = ) step and then pass that name to the function. It isn't very pipe-compatible right now, so you wil have to break your pipeline at that point. 3) I now have helper function that build control tables that are exactly gather or spread, you can find some text on that here. Basically moveValuesToColumnsQ() is like a spread/pivot, and and moveValuesToRowsQ() is gather/un-pivot.

For your example it would roughly be:

# set up db connection
db_con <- CONNECTION_TO_DB_OR_SPARK

# build a temp name generator to help delete stuff later (if we want)
tng <- makeTempNameGenerator('p')

# do our initial work
dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise %>% compute(name='dt')

# inspect the table (note using dplyr handle) to get the pivot control table
controlTable <- replyr::buildPivotControlTable(dt,
                                               columnToTakeKeysFrom = 'columnname1',
                                               columnToTakeValuesFrom = 'columnname2',
                                               sep = "_")
# use the control table to pivot based on the name of the table
dt <- moveValuesToColumnsQ(keyColumns = ADDITIONAL_COLUMNS_KEYING_ROWS,
                     controlTable = controlTable,
                     tallTableName = "dt",
                     my_db = db_con,
                     tempNameGenerator = tng)

You want the ADDITIONAL_COLUMNS_KEYING_ROWS plus columnToTakeKeysFrom to identify every data cell uniquely (not checked). The columnToTakeKeysFrom is where new column names are coming from and the columnToTakeValuesFrom is where values are coming from. The naming was actually an attempt to be mnemonic, but that seems to be working more for me than for you (sorry). Here is an essay on the naming conventions: link.

JohnMount commented 7 years ago

Your documentation suggestion is a good one. I've added some better basic examples to the documentation: replyr::moveValuesToRowsQ(), replyr::moveValuesToColumnsQ(). Hopefully this will help some.

asantucci commented 7 years ago

I really appreciate the work you're doing. I'm having a surprisingly hard time understanding your documentation. I think you may want to consider abstracting away some of the implementation details from your users. In particular, after my workfow:

# set up db connection
db_con <- CONNECTION_TO_DB_OR_SPARK

# build a temp name generator to help delete stuff later (if we want)
tng <- makeTempNameGenerator('p')

# do our initial work
dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise %>% compute(name='dt')

I'm left with a remote tibble which looks as follows:

person_id  month   cost
    <int>   <dbl> <dbl>
        3      1  750.2
# ... with more rows

I'm of course intending to reshape wide into the obvious format

person_id month_1 month_2 month_3 ... month_12
        3  750.2    ...

I.e. dcast(dt, person_id ~ month, value.var = 'cost', fill = NA) is the analogous reshape command.

I've tried using replyr::buildPivotControlTable as follows ctbl <- replyr::buildPivotControlTable(d = dt, columnToTakeKeysFrom = 'month', columnToTakeValuesFrom = 'cost', sep = '_')

# month med_cost
#     5 month_5
#    12 month_12
# ...

And when I try to apply your pivot operation

moveValuesToColumnsQ(keyColumns = 'person_id', controlTable = ctbl, tallTableName = 'dt', my_db = con, tempNameGenerator = tng)

I am encountered with the following error:

Error in moveValuesToColumnsQ(keyColumns = 'person_id', controlTable = ctbl, 
  all control table gorup ids must be valid R variable names.

But clearly one of my month identifiers is a month integer, and I've very clearly set a column-name prefix to be "month_#" in the controlTable. Please advise.

MZLABS commented 7 years ago

(@JohnMount here) Still working out how to teach it. It is fairly new material.

I need to do some live teaching to get more detailed feedback on how people pick this up. I appreciate your feedback, as it does help inform me how other people approach this material. I am making changes to reflect what you show me (so your time investment is not needlessly squandered). The dev-version of replyr now defaults to strict=FALSE (the advice given below). To install the dev version please try:

# install.packages("devtools")
devtools::install_github('WinVector/replyr')

That being said: the error you are now running into is due the fact the month column in the control table are integers (not valid un-quoted R variable names) and this version of replyr is a bit too strict. Please try adding the argument strict=FALSE to the moveValuesToColumnsQ() call. I am now setting strict FALSE as the new default- I think I am doing enough quoting in the SQL that we don't have to worry too much about structure of names.


library("replyr")
#> Loading required package: seplyr
#> Loading required package: wrapr
#> Loading required package: cdata
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# pivot / tidyr::spread example
# cdata version
d <- data.frame(meas= c(1L, 2L), val= c(0.6, 0.2))
cdata::moveValuesToColumns(d,
                           columnToTakeKeysFrom= 'meas',
                           columnToTakeValuesFrom= 'val',
                           rowKeyColumns= c())
#>     1   2
#> 1 0.6 0.2
#tidyr version

tidyr::spread(d, key = 'meas', value = 'val')
#>     1   2
#> 1 0.6 0.2

# replyr version
dR <- replyr_copy_to(my_db, d, 'dR',
                     overwrite = TRUE, temporary = TRUE)
cT <- buildPivotControlTable(dR,
                             columnToTakeKeysFrom= 'meas',
                             columnToTakeValuesFrom= 'val')
replyr::moveValuesToColumnsQ(keyColumns = NULL,
                             cT, 'dR', my_db, strict = FALSE)
#> # Source:   table<mvtcq_fa9cmifod44kok4tgmk2_0000000001> [?? x 2]
#> # Database: sqlite 3.19.3 [:memory:]
#>     `1`   `2`
#>   <dbl> <dbl>
#> 1   0.6   0.2

The above should work without the strict=FALSE flag in the new dev-verion of replyr.

Thanks for sticking with this, I hope it works for you.

asantucci commented 7 years ago

@MZLABS @JohnMount Thank you so much for your time!

Simply installing the development version of the package was enough to fix the error for me, and in fact the column names were generated (as I thought they might be) as valid R names. I.e. the code I wrote in my previous post yields a tibble of the form

person_id month_5 month_11 month_1 month_2 ...

I'll let you know if I have any further questions.

Thank you so much again for listening sincerely to my feedback and improving your package so quickly.

Edit: I guess the next task I find myself wanting to do is loop over the columns and replace NA's with 0's, so perhaps an argument (within moveValuesToColumnsQ that allows the user to specify what the default fill-in value should be if missing, defaulting to NA of course).

Edit2: Do you have a recommendation for cleanup of the temporary tables that are generated?

JohnMount commented 7 years ago

The fact the names came out the way you felt they should mean I fixed it while fixing something else (and did not remember). So I in some sense knew I was wrong. Sorry about that.

The bulk replacement of NA should probably be an option. I'll try to get to that soon. I'll leave this issue open and leave a comment (and close the issue) when I add that feature.

You are very welcome for the responses. Thank you again for taking the time to help (and doubly thanks for being patient and for saying thank-you; it helps a lot).

The cleanup of the temporary tables should be doable. The idea is to name all all temps using a single temporary name generator generated by replyr::makeTempNameGenerator(). Then at the end after it is safe to do so (and you are not kicking out supports needed by other derived tables) you get the names by calling the temp name generator with dumpList=TRUE and then call dplyr::db_drop_table() on each name (a for-loop).

I admit, the above the temp-name management is a bit kludgy in replyr, but all replyr functions accept a shared temp name generator argument to prevent name leaks. The issue is you often can not delete temps until much later than you think. However, replyr is at least trying to supply enough tools to make it solvable.

The idea is if you don't want to lose/leak names you need to never let any of the replyr functions create their own default temp name generator (their default behavior). As long as you always pass them your generator they will pass it functions they call and you will have complete control. The defaults are to build their own generators, to make the functions easier to use for light use when you are not worried about temp management (else the package would be even harder to learn).

Thanks for taking the time to help. replyr is a package I use with clients (so they get a lot more hands-on training), but I have been promoting it publicly- so it needs to be in a state that can be picked up by new users.

JohnMount commented 7 years ago

I could not resist. I have a "control default values" solution in now (same value all columns, but hey- at least it isn't always NA). Can you please give it a try and tell me if you have any issues. I can do a column name controlled map, but until somebody needs it that seems needlessly complicated.

library("replyr")
#> Loading required package: seplyr
#> Loading required package: wrapr
#> Loading required package: cdata
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

d <- data.frame(id = c(1, 1, 2),
                meas= c('AUC', 'R2', 'AUC'), 
                val= c(0.6, 0.2, 0.5),
                stringsAsFactors = FALSE)
# cdata::moveValuesToColumns(d,
#                            columnToTakeKeysFrom= 'meas',
#                            columnToTakeValuesFrom= 'val',
#                            rowKeyColumns= 'id')
dR <- replyr_copy_to(my_db, d, 'dR',
                     overwrite = TRUE, temporary = TRUE)
cT <- buildPivotControlTable(dR,
                             columnToTakeKeysFrom= 'meas',
                             columnToTakeValuesFrom= 'val')

# NA fills
replyr::moveValuesToColumnsQ(keyColumns = 'id',
                             controlTable = cT, 
                             tallTableName = 'dR',
                             my_db = my_db)
#> # Source:   table<mvtcq_k4phkttandcoaqyh3fsh_0000000001> [?? x 3]
#> # Database: sqlite 3.19.3 [:memory:]
#>      id   AUC    R2
#>   <dbl> <dbl> <dbl>
#> 1     1   0.6   0.2
#> 2     2   0.5    NA

# 0.0 fills
replyr::moveValuesToColumnsQ(keyColumns = 'id',
                             controlTable = cT, 
                             tallTableName = 'dR',
                             my_db = my_db,
                             defaultValue = 0.0)
#> # Source:   table<mvtcq_ncdyx91vpgn5nkkot3va_0000000001> [?? x 3]
#> # Database: sqlite 3.19.3 [:memory:]
#>      id   AUC    R2
#>   <dbl> <dbl> <dbl>
#> 1     1   0.6   0.2
#> 2     2   0.5   0.0