Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.6k stars 982 forks source link

Optionally order columns of multiple value.var in dcast() by RHS of formula #2601

Open UweBlock opened 6 years ago

UweBlock commented 6 years ago

When reshaping multiple value.vars from long to wide format with dcast(), the columns are ordered such that the value.vars are grouped together, e.g.,

# sample data
library(data.table)
long <- structure(list(id = c(1L, 1L, 2L, 2L), year = structure(c(1L, 
2L, 2L, 3L), class = "factor", .Label = c("2007", "2008", "2009"
)), X1 = c(12007.1, 12008.1, 22008.1, 22009.1), X2 = c("12007_2", 
"12008_2", "22008_2", "22009_2")), .Names = c("id", "year", "X1", 
"X2"), class = c("data.table", "data.frame"), row.names = c(NA, -4L))
long
   id year      X1      X2
1:  1 2007 12007.1 12007_2
2:  1 2008 12008.1 12008_2
3:  2 2008 22008.1 22008_2
4:  2 2009 22009.1 22009_2
# reshape from to long to wide format
cols <- c("X1", "X2")
dcast(long, id ~ year, value.var = cols)
   id X1_2007 X1_2008 X1_2009 X2_2007 X2_2008 X2_2009
1:  1 12007.1 12008.1      NA 12007_2 12008_2      NA
2:  2      NA 22008.1 22009.1      NA 22008_2 22009_2

There are some questions on SO which expect the order of columns to be grouped by the RHS: Reshape data within groups - groups in a single row How to reshape tabular data to one row per group

There is no option in dcast() yet which allows to specify the order of reshaped columns.

Workaround

As a workaround, the column order can be rearranged using setcolorder() but this requires a lot of manual coding:

# reorder columns to group by RHS
wide <- dcast(long, id ~ year, value.var = cols)
new_col_order <- CJ(unique(long$year), cols)[, paste(V2, V1, sep = "_")]
setcolorder(wide, c(setdiff(names(wide), new_col_order), new_col_order))
wide
   id X1_2007 X2_2007 X1_2008 X2_2008 X1_2009 X2_2009
1:  1 12007.1 12007_2 12008.1 12008_2      NA      NA
2:  2      NA      NA 22008.1 22008_2 22009.1 22009_2

Now, the columns are order by the RHS (year). However, every change to the formula requires to amend the code to reorder the columns:

wide <- dcast(long, id ~ rowid(id), value.var = cols)
new_col_order <- CJ(seq_len(uniqueN(long$id)), cols)[, paste(V2, V1, sep = "_")]
setcolorder(wide, c(setdiff(names(wide), new_col_order), new_col_order))
wide
   id    X1_1    X2_1    X1_2    X2_2
1:  1 12007.1 12007_2 12008.1 12008_2
2:  2 22008.1 22008_2 22009.1 22009_2

Also, with a list of functions:

wide <- dcast(long, id ~ year, fun.aggregate = list(mean, sd), value.var = "X1")
new_col_order <- CJ(unique(long$year), c("mean", "sd"))[, paste("X1", V2, V1, sep = "_")]
setcolorder(wide, c(setdiff(names(wide), new_col_order), new_col_order))
wide
   id X1_mean_2007 X1_sd_2007 X1_mean_2008 X1_sd_2008 X1_mean_2009 X1_sd_2009
1:  1      12007.1         NA      12008.1         NA          NaN         NA
2:  2          NaN         NA      22008.1         NA      22009.1         NA

Related Issues

There are related issues which deal with naming of output columns in first place but not with order https://github.com/Rdatatable/data.table/issues/1153 https://github.com/Rdatatable/data.table/issues/1951 (Note, I am happy with the current naming convention)

Output of sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2018-01-23 05:02:02 UTC; appveyor
UweBlock commented 3 years ago

There is a recent question on SO Automatically order a wide data.table: dcast columns in a specific order/setcolorder based on a pattern with numbers.

This brings me the idea to use patterns in addition to colorder = "value.var" (the default) or colorder = "rhs". So, we might have something like
colorder = pattern(..., regex)
where regex is a regular expression with groups (in parantheses) and ... giving the order to sort the columns.

Perhaps, this feature might better become an enhancment of setcolorder()?

Giving a second thought, something like colorder = forderv(c("year", "X2", "X1")) or colorder = forderv(c("year", "value.var")) could work as well, e.g.,

cols <- c("X1", "X2")
wide <- dcast(long, id ~ year, value.var = cols, colorder = forderv(c("year", cols))  

would return

   id X1_2007 X2_2007 X1_2008 X2_2008 X1_2009 X2_2009
1:  1 12007.1 12007_2 12008.1 12008_2      NA      NA
2:  2      NA      NA 22008.1 22008_2 22009.1 22009_2