Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

FR: auto-infer LHS of := when absent #1543

Open MichaelChirico opened 8 years ago

MichaelChirico commented 8 years ago

It's something of a nuisance to assign ragged columns with tstrsplit, if we don't necessarily know how many columns will result ex ante. Is it possible to add an assign argument which would use := or set to add the result as extra columns to the table?

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"),
                 y = 1:3, z = 4:6)
DT[ , tstrsplit(x, "/")]
#    V1 V2 V3 V4 #`x`, `y`, and `z` are irrecoverable
#1:  a  b  c  d
#2:  a  b  c NA
#3:  a  b NA NA

Workaround (ugly)

DT[ , paste0("V", 1:max(sapply(spl <- strsplit(x, "/"), length))) := transpose(spl)][]
#          x y z V1 V2 V3 V4
#1: a/b/c/d 1 4  a  b  c  d
#2:   a/b/c 2 5  a  b  c NA
#3:     a/b 3 6  a  b NA NA

Inspired by this SO question
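For comparison, a less compact form of the workaround computes the split first and derives the column names from its length. This is only a minimal sketch using the existing API; the names `parts` and `new_cols` are illustrative and not part of data.table.

```r
library(data.table)

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"),
                 y = 1:3, z = 4:6)

# Split outside of `[` so the number of resulting pieces is known up front
parts <- tstrsplit(DT$x, "/", fixed = TRUE)

# One name per piece, mirroring the default V1, V2, ... naming
new_cols <- paste0("V", seq_along(parts))

# Assign all pieces by reference; x, y and z are kept
DT[, (new_cols) := parts]
```

The cost is two extra statements and two temporary objects, which is exactly the overhead this request is trying to remove.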

mrdwab commented 8 years ago

Seconding @MichaelChirico here. But I imagine that it's somewhat cumbersome to pull off since tstrsplit works both within and outside of a data.table.

In reworking some of the functions for "splitstackshape", I've written the following wrapper functions (both of which will probably use copy in the end instead of modifying the original data.table):

## vectorized equivalent of `listCol_w`
flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = (cols)]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}

and

## vectorized equivalent of `listCol_l`
flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)
  x <- flatten(indt, cols, TRUE)
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}

With @MichaelChirico's example, the approach would be:

flatten(copy(DT)[, x := strsplit(x, "/", TRUE)], "x", drop = TRUE)
#    y z x_1 x_2 x_3 x_4
# 1: 1 4   a   b   c   d
# 2: 2 5   a   b   c  NA
# 3: 3 6   a   b  NA  NA

(Or, of course just using cSplit.)

MichaelChirico commented 8 years ago

@mrdwab thanks for the feedback! Indeed I had thought a bit about how to implement this but was stymied -- the only other function I know of that can assign things is :=, and that doesn't work outside j. But I do occasionally use tstrsplit on a non-data.table object -- it's a nice feature.

Of course if this is possible this is the way to go:

if (assign){
  # behave like `:=`
}else{
  # behave like now
}

If not, an alternative would be to add a keep.n argument to transpose (I see in the C code this is just a matter of returning maxlen), which works like:

if (keep.n){
  # return length-2 list -- 
  # [[1]]: "list" = as before
  # [[2]]: "v.name" = paste0("V", 1:maxlen)
}else{
  # as before
}

And used like

DT[ , (x <- tstrsplit(x, "/", keep.n = TRUE))$v.name := x$list]

DavidArenburg commented 8 years ago

I think this should be a general feature of `:=`(). If no names were provided, then it should automatically generate them, e.g. for the data provided by @MichaelChirico, this should look like

DT[, `:=`(tstrsplit(x, "/", fixed = TRUE))]
DT
#          x y z x_1 x_2 x_3 x_4
# 1: a/b/c/d 1 4   a   b   c   d
# 2:   a/b/c 2 5   a   b   c  NA
# 3:     a/b 3 6   a   b  NA  NA

This could be a truly awesome feature that will make our lives much easier when the number of columns is unknown. It will also make the code much more concise.

This could be generalized to other data.table (or any) functions too, for instance

DT[, `:=`(shift(y, 1:.N))]
DT
#          x y z x_1 x_2 x_3 x_4 y_1 y_2 y_3
# 1: a/b/c/d 1 4   a   b   c   d  NA  NA  NA
# 2:   a/b/c 2 5   a   b   c  NA   1  NA  NA
# 3:     a/b 3 6   a   b  NA  NA   2   1  NA

The only caveat I see here is that it could overwrite existing columns. In that case it should either return an error or generate slightly different column names; that part needs to be figured out.
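To make the proposed behaviour concrete, here is a rough sketch of it as an ordinary function built on `set()`. The helper `assign_auto` and its `prefix` argument are hypothetical, not part of data.table; it takes the "return an error" branch of the caveat above when a generated name already exists.

```r
library(data.table)

# Hypothetical helper (not a data.table function): adds each element of
# `values` as a new column named <prefix>_1, <prefix>_2, ...
assign_auto <- function(dt, values, prefix) {
  stopifnot(is.data.table(dt), is.list(values))
  new_names <- paste(prefix, seq_along(values), sep = "_")
  clash <- intersect(new_names, names(dt))
  if (length(clash) > 0L) {
    stop("would overwrite existing columns: ", paste(clash, collapse = ", "))
  }
  set(dt, j = new_names, value = values)  # adds the columns by reference
  invisible(dt)
}

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"),
                 y = 1:3, z = 4:6)

assign_auto(DT, tstrsplit(DT$x, "/", fixed = TRUE), "x")  # adds x_1 .. x_4
assign_auto(DT, shift(DT$y, 1:3), "y")                    # adds y_1 .. y_3
```

Unlike the proposed `:=` form, the prefix has to be passed explicitly here, because a plain function cannot see which column the values were derived from.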

MichaelChirico commented 8 years ago

@DavidArenburg good point. do you happen to know if there's an outstanding FR for that? seems like an obvious one, after all... if not, I'll change the title of this one.

franknarf1 commented 8 years ago

Maybe update this SO question if this is implemented.

franknarf1 commented 6 years ago

Another SO post, asking to do it by group (so the workaround "answer" for the last one I linked doesn't work): https://stackoverflow.com/q/51861038

# reproducible example (since the link doesn't have one)
library(data.table)
dt <- data.table(x = 1 : 4, id = c(1,1,2,2))
ff = function(x) list(a = x + 1, b = 2)

# desired syntax    
dt[, `:=`(res <- names(ff(x)), res), by=id]
# or 
dt[, `:=`(ff(x)), by=id]

I know there's a warning about the inefficiency of returning named lists in j for grouped operations, but maybe there's some good way to handle their use-case.
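One way to approximate this today, assuming `ff` returns the same names for every group, is to probe the names once on a small input and then use the ordinary grouped `:=`. A sketch only; `new_names` is an illustrative variable name.

```r
library(data.table)

dt <- data.table(x = 1:4, id = c(1, 1, 2, 2))
ff <- function(x) list(a = x + 1, b = 2)

# Call ff once on a single value just to learn the output names
# (assumes the names do not depend on the data)
new_names <- names(ff(dt$x[1L]))

# Grouped assignment by reference; the length-1 element b is recycled
# to the size of each group
dt[, (new_names) := ff(x), by = id]
```

The extra call to `ff` is the price of not knowing the names ahead of time, and it would be wasteful if `ff` were expensive.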

MichaelChirico commented 5 years ago

Following up on @DavidArenburg 's suggested approach, the more natural R way to accomplish this would I think be through do.call, but do.call(`:=`, list_of_column_assignments) won't work with the current API for := which is NSE-based. A way around this would be to define a proxy function like list_set or list_assign which does the same thing as := as a function, including auto-naming un-named components: do.call(list_set, tstrsplit(some_string, some_sep)).

This is a bit redundant -- there's a reason we can't use := as a function:

print(data.table:::`:=`)
function(...) stop('Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").')

And list_set is redundant to that...
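For illustration, a proxy along these lines could be sketched on top of `set()`. The function `list_set` below is hypothetical (it only borrows the name floated above), and naming un-named components `V1`, `V2`, ... by position is an assumption rather than settled behaviour.

```r
library(data.table)

# Hypothetical proxy (not part of data.table): the first argument is the
# table, every further argument becomes a column; un-named ones get V<i>
list_set <- function(dt, ...) {
  cols <- list(...)
  nm <- names(cols)
  if (is.null(nm)) nm <- rep("", length(cols))
  unnamed <- !nzchar(nm)
  nm[unnamed] <- paste0("V", which(unnamed))
  set(dt, j = nm, value = cols)  # assigns by reference
  invisible(dt)
}

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3)

# do.call spreads the list returned by tstrsplit over the ... arguments
do.call(list_set, c(list(DT), tstrsplit(DT$x, "/", fixed = TRUE)))
```

Because it is a regular function, it needs the table passed in explicitly and cannot be used inside `j` with `by`, which is part of why it feels redundant next to `:=`.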

myoung3 commented 3 years ago

Another SO post here: https://stackoverflow.com/questions/66917673/set-multiple-columns-in-r-data-table-with-a-named-list-and

@MichaelChirico Is it really not possible to hack in a way for do.call(":=", list()) to work? Using the base R syntax is a more appealing solution to me than adding another function or changing the default behavior of :=

myoung3 commented 3 years ago

Since j already gets substituted, we could just add a switch for when do.call(":=", or do.call(`:=`, is detected

MichaelChirico commented 3 years ago

I think it could be doable but I'm not a big fan... (1) added NSE maintenance overhead and (2) i usually avoid do.call like the plague myself. I would lean towards a more elegant solution if possible...

myoung3 commented 3 years ago

Yeah as someone familiarizing myself with the [.data.table code I'll agree on that point--the amount of NSE is pretty overwhelming already.

shapenaji commented 3 years ago

Throwing in my 2c here, could the do.call and other := tricks be avoided if there was a new keyword?

dt[, .NEW := tstrsplit(col, ...)]

(or something better named than .NEW) This would assign using the names of the output without modifying the normal flow.

jangorecki commented 3 years ago

another https://stackoverflow.com/q/67914827/2490497