Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

format_col and nested 1-column frames #6592

Open r2evans opened 3 days ago

r2evans commented 3 days ago

(This may be related to https://github.com/Rdatatable/data.table/issues/5948, since both involve nested frames.)

mt <- as.data.table(mtcars)[1:3,]
mt$frm <- list(data.frame(a=1), data.frame(a=1), data.frame(a=1))
mt
# Error in vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_character_, USE.NAMES = use.names) : 
#   values must be type 'character',
#  but FUN(X[[1]]) result is type 'list'
mt[,-12]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1

Stepping through format_col.default in the debugger:

debug(data.table:::format_col.default)
mt
#### 'c'ontinue through the first 11 columns, then
x
# [[1]]
#   a
# 1 1
# [[2]]
#   a
# 1 1
# [[3]]
#   a
# 1 1
vapply_1c(x, format_list_item, ...)
# Error in vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_character_, USE.NAMES = use.names) : 
#   values must be type 'character',
#  but FUN(X[[1]]) result is type 'list'
format_list_item(x[[1]])
#   a
# 1 1

vapply_1c assumes that every call returns a single string, but here format_list_item(x[[1]]) returns the formatted data.frame itself, which is a list.
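The same failure can be reproduced in base R without data.table. This is a minimal sketch that assumes vapply_1c behaves like vapply() with FUN.VALUE = NA_character_ (as the error message above suggests); the names items and err are illustrative only.

```r
# Two one-column frames, like the elements of the list column above.
items <- list(data.frame(a = 1), data.frame(a = 1))

# format() on a data.frame returns a data.frame (type 'list'), not a string,
# so vapply() rejects it against the character template.
err <- tryCatch(
  vapply(items, function(x) format(x), FUN.VALUE = NA_character_),
  error = function(e) e
)

# The captured condition carries the "values must be type 'character'" message.
conditionMessage(err)
```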

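Note that this does not fail when the nested frames have more than one column (or none), since format_list_item.default perhaps-naively checks length(format(..)) == 1. For a data.frame, length() counts columns, so only a one-column frame passes that check and is returned as-is.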

mt$frm <- list(data.frame(a=1,b=1), data.frame(), data.frame(a=1,b=1))
mt
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb               frm
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>            <list>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4 <data.frame[1x2]>
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4 <data.frame[0x0]>
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1 <data.frame[1x2]>
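The column-count behaviour behind that check can be seen directly in base R. A small sketch (the variable names are illustrative, not from data.table):

```r
one_col <- data.frame(a = 1)
two_col <- data.frame(a = 1, b = 1)
no_col  <- data.frame()

# format() keeps the data.frame structure, and length() of a data.frame
# is its number of columns, not the number of strings produced.
len_one  <- length(format(one_col))   # 1: passes the `== 1` check
len_two  <- length(format(two_col))   # 2: falls through to the "<class[dims]>" branch
len_none <- length(format(no_col))    # 0: also falls through

c(len_one, len_two, len_none)
# [1] 1 2 0
```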

I don't know whether it makes sense to define a format_list_item.data.frame method as well to preclude this, or whether there is a better approach:

format_list_item.data.frame <- function(x, ...) "<multi-column>"
vapply_1c(x, format_list_item.data.frame, ...)
# [1] "<multi-column>" "<multi-column>" "<multi-column>"

This test was done with data.table_1.16.2, but it also fails with data.table_1.15.2, so it's not a recent breakage.

packageVersion("data.table")
# [1] ‘1.15.2’
mt <- as.data.table(mtcars)[1:3,]
mt$frm <- list(data.frame(a=1), data.frame(a=1), data.frame(a=1))
mt
# Error in vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_character_, USE.NAMES = use.names) :
#   values must be type 'character',
#  but FUN(X[[1]]) result is type 'list'
sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 24.04.1 LTS
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
# LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
# locale:
#  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
# time zone: America/New_York
# tzcode source: system (glibc)
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# other attached packages:
# [1] data.table_1.16.2 r2_0.11.0        
# loaded via a namespace (and not attached):
#  [1] compiler_4.3.3    clipr_0.8.0       fastmap_1.2.0     cli_3.6.2         tools_4.3.3       htmltools_0.5.8.1 rmarkdown_2.26    knitr_1.45       
#  [9] xfun_0.42         digest_0.6.34     rlang_1.1.3       evaluate_0.23    

r2evans commented 3 days ago

If we change the current internal function to the following, it works:

format_list_item2 <- function(x, ...) {
    if (is.null(x)) 
        "[NULL]"
    else if (is.atomic(x) || inherits(x, "formula")) 
        paste(c(format(head(x, 6L), ...), if (length(x) > 6L) "..."), 
            collapse = ",")
    else if (!inherits(x, "data.frame") && has_format_method(x) && length(formatted <- format(x, ...)) == 1L) {
        formatted
    }
    else {
        paste0("<", class(x)[1L], paste_dims(x), ">")
    }
}
vapply_1c(x, format_list_item2, ...)
# [1] "<data.frame[1x1]>" "<data.frame[1x1]>" "<data.frame[1x1]>"

Is that simple enough? I'm happy to submit a PR to that effect. (I'd change the original format_list_item.default, not add the above renamed function. The only change is the addition of !inherits(..).)

MichaelChirico commented 2 days ago

I think registering format_list_item.data.frame is the right approach here.
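A rough sketch of what a dedicated data.frame method could look like. To stay self-contained it uses a hypothetical stand-in generic, fmt_item, rather than data.table's own format_list_item, and builds the dimension string with dim() instead of the internal paste_dims; the actual method in a PR may well differ.

```r
# Hypothetical stand-in for the format_list_item generic (not data.table code).
fmt_item <- function(x, ...) UseMethod("fmt_item")

# Simplified default mirroring the length-1 check discussed above.
fmt_item.default <- function(x, ...) {
  formatted <- format(x, ...)
  if (length(formatted) == 1L) formatted else paste0("<", class(x)[1L], ">")
}

# Dedicated method: data.frames never reach the length-1 check,
# so the column count no longer matters.
fmt_item.data.frame <- function(x, ...) {
  paste0("<", class(x)[1L], "[", paste(dim(x), collapse = "x"), "]>")
}

items <- list(data.frame(a = 1), data.frame(a = 1, b = 1), data.frame())
out <- vapply(items, fmt_item, FUN.VALUE = NA_character_)
out
# [1] "<data.frame[1x1]>" "<data.frame[1x2]>" "<data.frame[0x0]>"
```

Compared with adding !inherits(x, "data.frame") to the default method, a separate method keeps the default untouched and lets classes that extend data.frame supply their own formatting.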