Closed asantucci closed 7 years ago
Would be nice if there were a more clear example in the documentation of "what the appropriate query" would be to replicate behavior similar to spread or gather. Working this out now, although I currently have a workflow like:
dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise
and it's not clear how I can then feed this resulting dt
object into the moveValuesToColumnsQ
function call.
edit: looks like I should be using compute
to store a temporary table in the DB and then pass the string of the resulting temporary table to moveValuesToColumnsQ
...
I am developing more documentation as we go, sorry I am not caught up yet. I am also trying to improve the package error messages. I think you found the right issues, but I'll go over them.
Also, thank you for your patience. I know it is frustrating to see someone (myself) promote a package and then have trouble adapting it to a non-trivial workflow.
1) For remote sources you need to use one of moveValuesToColumnsQ()
or moveValuesToRowsQ()
. Without the Q
you are actually calling the cdata
wrappers for tidyr
which only work on local data. The replyr
versions are careful not to call nrow()
, and do not check for unique keying, as that may be expensive on larger data.
2) moveValuesToColumnsQ
or moveValuesToRowsQ()
take table names. So, yes please land the table to a known name by a dplyr::compute(name = )
step and then pass that name to the function. It isn't very pipe-compatible right now, so you wil have to break your pipeline at that point.
3) I now have helper function that build control tables that are exactly gather or spread, you can find some text on that here. Basically moveValuesToColumnsQ()
is like a spread/pivot, and and moveValuesToRowsQ()
is gather/un-pivot.
For your example it would roughly be:
# set up db connection
db_con <- CONNECTION_TO_DB_OR_SPARK
# build a temp name generator to help delete stuff later (if we want)
tng <- makeTempNameGenerator('p')
# do our initial work
dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise %>% compute(name='dt')
# inspect the table (note using dplyr handle) to get the pivot control table
controlTable <- replyr::buildPivotControlTable(dt,
columnToTakeKeysFrom = 'columnname1',
columnToTakeValuesFrom = 'columnname2',
sep = "_")
# use the control table to pivot based on the name of the table
dt <- moveValuesToColumnsQ(keyColumns = ADDITIONAL_COLUMNS_KEYING_ROWS,
controlTable = controlTable,
tallTableName = "dt",
my_db = db_con,
tempNameGenerator = tng)
You want the ADDITIONAL_COLUMNS_KEYING_ROWS
plus columnToTakeKeysFrom
to identify every data cell uniquely (not checked). The columnToTakeKeysFrom
is where new column names are coming from and the columnToTakeValuesFrom
is where values are coming from. The naming was actually an attempt to be mnemonic, but that seems to be working more for me than for you (sorry). Here is an essay on the naming conventions: link.
Your documentation suggestion is a good one. I've added some better basic examples to the documentation: replyr::moveValuesToRowsQ(), replyr::moveValuesToColumnsQ(). Hopefully this will help some.
I really appreciate the work you're doing. I'm having a surprisingly hard time understanding your documentation. I think you may want to consider abstracting away some of the implementation details from your users. In particular, after my workfow:
# set up db connection
db_con <- CONNECTION_TO_DB_OR_SPARK
# build a temp name generator to help delete stuff later (if we want)
tng <- makeTempNameGenerator('p')
# do our initial work
dt <- tbl(db_con, tbl_name) %>% select %>% filter %>% mutate %>% group_by %>% summarise %>% compute(name='dt')
I'm left with a remote tibble which looks as follows:
person_id month cost
<int> <dbl> <dbl>
3 1 750.2
# ... with more rows
I'm of course intending to reshape wide into the obvious format
person_id month_1 month_2 month_3 ... month_12
3 750.2 ...
I.e. dcast(dt, person_id ~ month, value.var = 'cost', fill = NA)
is the analogous reshape command.
I've tried using replyr::buildPivotControlTable
as follows
ctbl <- replyr::buildPivotControlTable(d = dt, columnToTakeKeysFrom = 'month', columnToTakeValuesFrom = 'cost', sep = '_')
# month med_cost
# 5 month_5
# 12 month_12
# ...
And when I try to apply your pivot operation
moveValuesToColumnsQ(keyColumns = 'person_id', controlTable = ctbl, tallTableName = 'dt', my_db = con, tempNameGenerator = tng)
I am encountered with the following error:
Error in moveValuesToColumnsQ(keyColumns = 'person_id', controlTable = ctbl,
all control table gorup ids must be valid R variable names.
But clearly one of my month identifiers is a month integer, and I've very clearly set a column-name prefix to be "month_#" in the controlTable. Please advise.
(@JohnMount here) Still working out how to teach it. It is fairly new material.
I need to do some live teaching to get more detailed feedback on how people pick this up. I appreciate your feedback, as it does help inform me how other people approach this material. I am making changes to reflect what you show me (so your time investment is not needlessly squandered). The dev-version of replyr
now defaults to strict=FALSE
(the advice given below). To install the dev version please try:
# install.packages("devtools")
devtools::install_github('WinVector/replyr')
That being said: the error you are now running into is due the fact the month column in the control table are integers (not valid un-quoted R variable names) and this version of replyr
is a bit too strict. Please try adding the argument strict=FALSE
to the moveValuesToColumnsQ()
call. I am now setting strict FALSE
as the new default- I think I am doing enough quoting in the SQL
that we don't have to worry too much about structure of names.
library("replyr")
#> Loading required package: seplyr
#> Loading required package: wrapr
#> Loading required package: cdata
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# pivot / tidyr::spread example
# cdata version
d <- data.frame(meas= c(1L, 2L), val= c(0.6, 0.2))
cdata::moveValuesToColumns(d,
columnToTakeKeysFrom= 'meas',
columnToTakeValuesFrom= 'val',
rowKeyColumns= c())
#> 1 2
#> 1 0.6 0.2
#tidyr version
tidyr::spread(d, key = 'meas', value = 'val')
#> 1 2
#> 1 0.6 0.2
# replyr version
dR <- replyr_copy_to(my_db, d, 'dR',
overwrite = TRUE, temporary = TRUE)
cT <- buildPivotControlTable(dR,
columnToTakeKeysFrom= 'meas',
columnToTakeValuesFrom= 'val')
replyr::moveValuesToColumnsQ(keyColumns = NULL,
cT, 'dR', my_db, strict = FALSE)
#> # Source: table<mvtcq_fa9cmifod44kok4tgmk2_0000000001> [?? x 2]
#> # Database: sqlite 3.19.3 [:memory:]
#> `1` `2`
#> <dbl> <dbl>
#> 1 0.6 0.2
The above should work without the strict=FALSE
flag in the new dev-verion of replyr
.
Thanks for sticking with this, I hope it works for you.
@MZLABS @JohnMount Thank you so much for your time!
Simply installing the development version of the package was enough to fix the error for me, and in fact the column names were generated (as I thought they might be) as valid R names. I.e. the code I wrote in my previous post yields a tibble of the form
person_id month_5 month_11 month_1 month_2 ...
I'll let you know if I have any further questions.
Thank you so much again for listening sincerely to my feedback and improving your package so quickly.
Edit: I guess the next task I find myself wanting to do is loop over the columns and replace NA's with 0's, so perhaps an argument (within moveValuesToColumnsQ
that allows the user to specify what the default fill-in value should be if missing, defaulting to NA of course).
Edit2: Do you have a recommendation for cleanup of the temporary tables that are generated?
The fact the names came out the way you felt they should mean I fixed it while fixing something else (and did not remember). So I in some sense knew I was wrong. Sorry about that.
The bulk replacement of NA
should probably be an option. I'll try to get to that soon. I'll leave this issue open and leave a comment (and close the issue) when I add that feature.
You are very welcome for the responses. Thank you again for taking the time to help (and doubly thanks for being patient and for saying thank-you; it helps a lot).
The cleanup of the temporary tables should be doable. The idea is to name all all temps using a single temporary name generator generated by replyr::makeTempNameGenerator()
. Then at the end after it is safe to do so (and you are not kicking out supports needed by other derived tables) you get the names by calling the temp name generator with dumpList=TRUE
and then call dplyr::db_drop_table()
on each name (a for-loop).
I admit, the above the temp-name management is a bit kludgy in replyr
, but all replyr
functions accept a shared temp name generator argument to prevent name leaks. The issue is you often can not delete temps until much later than you think. However, replyr
is at least trying to supply enough tools to make it solvable.
The idea is if you don't want to lose/leak names you need to never let any of the replyr
functions create their own default temp name generator (their default behavior). As long as you always pass them your generator they will pass it functions they call and you will have complete control. The defaults are to build their own generators, to make the functions easier to use for light use when you are not worried about temp management (else the package would be even harder to learn).
Thanks for taking the time to help. replyr
is a package I use with clients (so they get a lot more hands-on training), but I have been promoting it publicly- so it needs to be in a state that can be picked up by new users.
I could not resist. I have a "control default values" solution in now (same value all columns, but hey- at least it isn't always NA
). Can you please give it a try and tell me if you have any issues. I can do a column name controlled map, but until somebody needs it that seems needlessly complicated.
library("replyr")
#> Loading required package: seplyr
#> Loading required package: wrapr
#> Loading required package: cdata
my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(id = c(1, 1, 2),
meas= c('AUC', 'R2', 'AUC'),
val= c(0.6, 0.2, 0.5),
stringsAsFactors = FALSE)
# cdata::moveValuesToColumns(d,
# columnToTakeKeysFrom= 'meas',
# columnToTakeValuesFrom= 'val',
# rowKeyColumns= 'id')
dR <- replyr_copy_to(my_db, d, 'dR',
overwrite = TRUE, temporary = TRUE)
cT <- buildPivotControlTable(dR,
columnToTakeKeysFrom= 'meas',
columnToTakeValuesFrom= 'val')
# NA fills
replyr::moveValuesToColumnsQ(keyColumns = 'id',
controlTable = cT,
tallTableName = 'dR',
my_db = my_db)
#> # Source: table<mvtcq_k4phkttandcoaqyh3fsh_0000000001> [?? x 3]
#> # Database: sqlite 3.19.3 [:memory:]
#> id AUC R2
#> <dbl> <dbl> <dbl>
#> 1 1 0.6 0.2
#> 2 2 0.5 NA
# 0.0 fills
replyr::moveValuesToColumnsQ(keyColumns = 'id',
controlTable = cT,
tallTableName = 'dR',
my_db = my_db,
defaultValue = 0.0)
#> # Source: table<mvtcq_ncdyx91vpgn5nkkot3va_0000000001> [?? x 3]
#> # Database: sqlite 3.19.3 [:memory:]
#> id AUC R2
#> <dbl> <dbl> <dbl>
#> 1 1 0.6 0.2
#> 2 2 0.5 0.0
I was under the impression Re Plyr offered reshaping for remote tables, but perhaps I am incorrect?
Consider
moveValuesToColumns
andmoveValuesToRows
. Each of these functions in turn callscheckColsFormUniqueKeys
, and this function in turn callsnrow(data)
, which returnsNA
for remote tables. To extend support for remote tables, we'd need to (at the very least) replace this line in question with something likedata %>% ungroup %>% summarise(n())
.Curious to hear your thoughts.
Edit: Looks like I should be using
moveValuesToColumnsQ
and that this is a misunderstanding on my part. Will follow up if I in fact continue to have any issue, but perhaps a more descriptive function name and or documentation would be in order to point readers to the rightremote table
counterparts. Thanks for all your hard work!