Open MichaelChirico opened 8 years ago
Seconding @MichaelChirico here. But I imagine that it's somewhat cumbersome to pull off, since `tstrsplit` works both within and outside of a `data.table`.
In reworking some of the functions for "splitstackshape", I've written the following wrapper functions (both of which will probably use `copy` in the end instead of modifying the original `data.table`):
## vectorized equivalent of `listCol_w`
flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  # the widest list element per column determines how many new columns are needed
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  # transpose each list column and add the pieces by reference
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = (cols)]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}
and
## vectorized equivalent of `listCol_l`
flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)  # columns used to order the result
  x <- flatten(indt, cols, TRUE)
  # for each original column, the positions of its flattened "<col>_<n>" columns
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
With @MichaelChirico's example, the approach would be:
flatten(copy(DT)[, x := strsplit(x, "/", TRUE)], "x", drop = TRUE)
#    y z x_1 x_2 x_3 x_4
# 1: 1 4   a   b   c   d
# 2: 2 5   a   b   c  NA
# 3: 3 6   a   b  NA  NA
(Or, of course, just using `cSplit`.)
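For readers following along without the original example data, here is a minimal sketch of the two-step approach that works today: split first, then build the column names from the number of pieces. The contents of `DT` are an assumption reconstructed from the printed output in this thread, and the `x_` prefix simply mirrors the naming used by `flatten`.

```r
library(data.table)

# Assumed example data, reconstructed from the printed output in this thread
DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3, z = 4:6)

# Split outside of `:=` so the number of resulting columns is known up front
parts <- tstrsplit(DT$x, "/", fixed = TRUE)

# Build one name per piece, then assign all pieces by reference
new_cols <- paste0("x_", seq_along(parts))
DT[, (new_cols) := parts]

print(DT)
```

Unlike the `flatten` call above, this keeps the original `x` column; dropping it afterwards is a separate `DT[, x := NULL]`.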
@mrdwab thanks for the feedback! Indeed I had thought about how to implement this a bit but was stymied -- the only other function I know of that can assign things is `:=`, and that doesn't work outside `j`. But I do occasionally use `tstrsplit` on a non-`data.table` object -- it's a nice feature.
Of course if this is possible this is the way to go:
if (assign) {
  # behave like `:=`
} else {
  # behave like now
}
If not, an alternative would be to add a `keep.n` argument to `transpose` (I see in the `C` code this is just a matter of returning `maxlen`), which works like:
if (keep.n) {
  # return length-2 list --
  #   [[1]]: "list" = as before
  #   [[2]]: "v.name" = paste0("V", 1:maxlen)
} else {
  # as before
}
And used like
DT[ , (x <- tstrsplit(x, "/", keep.n = TRUE))$v.name := x$list]
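The `keep.n` idea can be approximated in user code without touching `transpose`. The sketch below uses a hypothetical helper, `tstrsplit_named` (not part of data.table), that returns the split pieces together with generated names, mirroring the length-2 list described above. The `list` and `v.name` element names follow that description; the `prefix` argument and the example data are assumptions.

```r
library(data.table)

# Hypothetical helper (NOT a data.table function): split, then pair the
# pieces with one generated name per piece
tstrsplit_named <- function(x, ..., prefix = "V") {
  pieces <- tstrsplit(x, ...)
  list(list = pieces, v.name = paste0(prefix, seq_along(pieces)))
}

# Assumed example data
DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3, z = 4:6)

# Works on a plain character vector, so it can be called outside `j`
res <- tstrsplit_named(DT$x, "/", fixed = TRUE)
DT[, (res$v.name) := res$list]

print(DT)
```

If the default `V1`, `V2`, ... names are acceptable, `tstrsplit` also has a `names` argument: with `names = TRUE` it returns a named list, which can then be assigned with `DT[, (names(parts)) := parts]`.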
I think this should be a general feature of `:=`. If no names were provided, then it should automatically generate them, e.g. for the data provided by @MichaelChirico, this should look like
DT[, `:=`(tstrsplit(x, "/", fixed = TRUE))]
DT
#          x y z x_1 x_2 x_3 x_4
# 1: a/b/c/d 1 4   a   b   c   d
# 2:   a/b/c 2 5   a   b   c  NA
# 3:     a/b 3 6   a   b  NA  NA
This could be a truly awesome feature that will make our lives much easier when the number of columns is unknown. It will also make the code much more concise.
This could be generalized to other `data.table` (or any) functions too, for instance
DT[, `:=`(shift(y, 1:.N))]
DT
#          x y z x_1 x_2 x_3 x_4 y_1 y_2 y_3
# 1: a/b/c/d 1 4   a   b   c   d  NA  NA  NA
# 2:   a/b/c 2 5   a   b   c  NA   1  NA  NA
# 3:     a/b 3 6   a   b  NA  NA   2   1  NA
The only caveat I see here is that it could overwrite existing columns. In that case it should either return an error or generate slightly different column names; that part needs to be figured out.
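One way to handle the name-collision caveat in user code today is to generate candidate names and de-duplicate them against the existing columns before assigning. The sketch below does this with base R's `make.unique`; the example data and the `y_` prefix are assumptions chosen to match the output shown above, and suffixing is only one possible policy (raising an error is the other option mentioned).

```r
library(data.table)

# Assumed example data, plus a pre-existing column that would clash
DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3, z = 4:6)
DT[, y_1 := "already here"]

# One lagged copy of y per requested lag (a list of 3 vectors)
lags <- shift(DT$y, seq_len(nrow(DT)))
candidates <- paste0("y_", seq_along(lags))

# make.unique() leaves first occurrences alone and suffixes later duplicates,
# so existing column names are never changed
all_names <- make.unique(c(names(DT), candidates))
new_cols <- all_names[-seq_along(names(DT))]

DT[, (new_cols) := lags]
print(names(DT))
```

To fail instead of renaming, replace the `make.unique` step with a check such as `if (any(candidates %in% names(DT))) stop("column names already exist")`.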
@DavidArenburg good point. Do you happen to know if there's an outstanding FR for that? Seems like an obvious one, after all... If not, I'll change the title of this one.
Maybe update this SO question if this is implemented.
Another SO post, asking to do it by group (so the workaround "answer" for the last one I linked doesn't work): https://stackoverflow.com/q/51861038
# reproducible example (since the link doesn't have one)
library(data.table)
dt <- data.table(x = 1 : 4, id = c(1,1,2,2))
ff = function(x) list(a = x + 1, b = 2)
# desired syntax
dt[, `:=`(res <- names(ff(x)), res), by=id]
# or
dt[, `:=`(ff(x)), by=id]
I know there's a warning about the inefficiency of returning named lists in `j` for grouped operations, but maybe there's some good way to handle their use-case.
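Until something like the desired syntax exists, the grouped case can be handled by looking up the names once, outside the grouped call, and passing them on the left-hand side of `:=`. This is a sketch under the assumption that `ff` returns the same names for every group and can safely be called on a zero-length input; neither holds for arbitrary functions.

```r
library(data.table)

dt <- data.table(x = 1:4, id = c(1, 1, 2, 2))
ff <- function(x) list(a = x + 1, b = 2)

# Probe the function once with an empty input to learn the output names
out_names <- names(ff(dt$x[0L]))

# Assign by group; the length-1 `b` is recycled within each group
dt[, (out_names) := ff(x), by = id]

print(dt)
```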
Following up on @DavidArenburg's suggested approach, the more natural R way to accomplish this would, I think, be through `do.call`, but ``do.call(`:=`, list_of_column_assignments)`` won't work with the current API for `:=`, which is NSE-based. A way around this would be to define a proxy function like `list_set` or `list_assign` which does the same thing as `:=` as a function, including auto-naming un-named components: `do.call(list_set, tstrsplit(some_string, some_sep))`.
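A rough sketch of what such a proxy could look like, built on `set()`. It differs from the proposal in one respect: an ordinary function has no access to the table being modified inside `j`, so this version takes the data.table as its first argument and is called directly rather than through `do.call` inside `[`. The name `list_set` comes from the comment above; the signature, the `prefix` argument and the example data are assumptions.

```r
library(data.table)

# Hypothetical helper, NOT part of data.table
list_set <- function(DT, values, prefix = "V") {
  stopifnot(is.data.table(DT), is.list(values))
  nm <- names(values)
  if (is.null(nm)) nm <- rep("", length(values))
  # auto-name any component that arrived without a name
  unnamed <- is.na(nm) | !nzchar(nm)
  nm[unnamed] <- paste0(prefix, which(unnamed))
  # add (or overwrite) one column at a time, by reference
  for (k in seq_along(values)) {
    set(DT, j = nm[k], value = values[[k]])
  }
  invisible(DT)
}

# Assumed example data
DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3, z = 4:6)
list_set(DT, tstrsplit(DT$x, "/", fixed = TRUE), prefix = "x_")

print(DT)
```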
This is a bit redundant -- there's a reason we can't use `:=` as a function:
print(data.table:::`:=`)
function(...) stop('Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").')
And `list_set` is redundant to that...
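The stub can be observed directly: calling `:=` as an ordinary function, for example through `do.call`, only reaches the `stop()` shown above. A small sketch that captures the error instead of letting it abort (the exact wording of the message may differ between data.table versions):

```r
library(data.table)

# Call the exported `:=` stub outside of `[.data.table` and keep the message
msg <- tryCatch(
  do.call(data.table::`:=`, list(a = 1, b = 2)),
  error = function(e) conditionMessage(e)
)

print(msg)
```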
Another SO post here: https://stackoverflow.com/questions/66917673/set-multiple-columns-in-r-data-table-with-a-named-list-and
@MichaelChirico Is it really not possible to hack in a way for `do.call(":=", list())` to work? Using the base R syntax is a more appealing solution to me than adding another function or changing the default behavior of `:=`.
Since `j` already gets substituted, we could just add a switch for when `do.call(":=",` or ``do.call(`:=`,`` is detected.
I think it could be doable but I'm not a big fan... (1) added NSE maintenance overhead and (2) I usually avoid `do.call` like the plague myself. I would lean towards a more elegant solution if possible...
Yeah, as someone familiarizing myself with the `[.data.table` code I'll agree on that point--the amount of NSE is pretty overwhelming already.
Throwing in my 2c here, could the `do.call` and other `:=` tricks be avoided if there was a new keyword?
dt[, .NEW := tstrsplit(col, ...)]
(or something better named than `.NEW`)
This would assign using the names of the output without modifying the normal flow.
It's something of a nuisance to assign ragged columns with `tstrsplit` if we don't necessarily know how many columns will result ex ante. Is it possible to add an `assign` argument which would use `:=` or `set` to add the result as extra columns to the table?
Workaround (ugly)
Inspired by this SO question