Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.52k stars 967 forks source link

[BUG] Cannot print a data.table with column of type `rvar` from posterior package #5981

Open SteveBronder opened 4 months ago

SteveBronder commented 4 months ago

Summary

When printing a data.table that contains an rvar type from the posterior package an empty data.table is printed instead. See the mrp below for a full case. Looking into it more it appears dim.data.table returns a value of zero for the rows where there should be a nonzero value. dim returning a zero causes line 55 of print.data.table to go to the empty data.table message.

I tried looking into the code for dim:

https://github.com/Rdatatable/data.table/blob/46ee52571214b135b645e367ebacd02de02aff52/src/wrappers.c#L101-L121

The data.table has a length(x) equal to 1 so the last branch is chosen.

Calling dim.data.frame with the data.table gives the correct output below of (4, 1). I think this is because .row_names_info used in dim.data.frame is just asking for the length of the row names via .Internal(shortRowNames(x, type)).

Minimal reproducible example

library(data.table)
library(posterior)
n <- 4   # length of output vector
x_rvar <- rvar(array(rnorm(n*n, mean = 1, sd = 1), dim = c(n, n)))
x_rvar
# rvar<4000>[4] mean ± sd:
#   [1] 0.98 ± 1.00  1.00 ± 1.02  1.00 ± 0.99  0.99 ± 0.99 

# does not work :(
ex_dt = data.table(ex = x_rvar)
ex_dt
# Empty data.table (0 rows and 1 cols): ex

ex_df = data.frame(ex = x_rvar)

# dim for data.frame has wrong rows
dim(ex_dt)
# [1] 0 1
dim(ex_df)
# [1] 4 1

dim.data.frame(ex_dt)
# [1] 4 1

# But `ex` does exist?
ex_dt[, ex]
# rvar<4>[4] mean ± sd:
# [1] 0.025 ± 0.58  0.759 ± 0.69  1.292 ± 1.14  1.601 ± 0.53

ex_dt[, mean(ex)]
[1] 0.02508674 0.75898825 1.29194188 1.60076280

# printing with print.data.frame works
print.data.frame(ex_dt)
# ex
# 1 0.025 ± 0.58
# 2 0.759 ± 0.69
# 3 1.292 ± 1.14
# 4 1.601 ± 0.53

dput(ex_dt)
# structure(list(ex = 
#  structure(list(), draws = structure(
#   c(-0.228122437024597, -0.665473091749799, 0.433053985537569, 0.56088850510324,  
#      1.57671830971838,  -0.104841641013198, 0.817921294855514, 0.746155054023318, 
#      0.778298953156937,  2.20224060437554, -0.078692684532738, 2.26592066385111,
#      1.21605524439755,   1.59716805846671,  2.34552924058062,  1.24429863711927), 
#      dim = c(4L, 4L), dimnames = list(c("1", "2", "3", "4"), NULL)), 
#      nchains = 1L, cache = <environment>, class = c("rvar",  "vctrs_vctr"))),
#   row.names = c(NA, -4L), class = c("data.table", "data.frame"), 
#   .internal.selfref = <pointer: 0x55cc4ae60f60>)

rownames(ex_dt)
# [1] "1" "2" "3" "4"

Looking over the open/closed issues for data.table I could not find anything similar. I don't know enough about R's internals to know why length(VECTOR_ELT(x, 0)) is giving a value of zero, though I found the source for VECTOR_ELThere

# Output of sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.9 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: FlexiBLAS /MNT/SW/NIX/STORE/LKQ7SR37F820WRZCLZ0VC9N6FG5ZB3GD-OPENBLAS-0.3.26/LIB/LIBOPENBLAS.SO;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] posterior_1.5.0   data.table_1.15.2 cmdstanr_0.7.1    testthat_3.2.1   

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5          cli_3.6.2            knitr_1.42           rlang_1.1.3          xfun_0.39            processx_3.8.3       pkgload_1.3.4        generics_0.1.3       tensorA_0.36.2.1    
[10] jsonlite_1.8.8       glue_1.7.0           backports_1.4.1      rprojroot_2.0.3      distributional_0.4.0 ps_1.7.6             brio_1.1.3           fansi_1.0.6          tibble_3.2.1        
[19] abind_1.4-5          lifecycle_1.0.4      compiler_4.3.2       dplyr_1.1.2          waldo_0.5.2          pkgconfig_2.0.3      rstudioapi_0.14      R6_2.5.1             tidyselect_1.2.0    
[28] utf8_1.2.4           pillar_1.9.0         magrittr_2.0.3       checkmate_2.3.1      withr_3.0.0          tools_4.3.2          matrixStats_1.2.0    desc_1.4.2    
MichaelChirico commented 4 months ago

At root the issue is we don't do length() S3 dispatch.

Apparently (#1433) we wrote our own dim in the first place for performance reasons with [[.data.frame.

Maybe we should implement [[.data.table instead to just wrap VECTOR_ELT(x, i-1), then length() can dispatch correctly. Otherwise we need to do length() dispatch ourselves from C, right?

SteveBronder commented 4 months ago

I have a PR at the link below to do the length dispatch in C. It's been a long time since I've written with R's C API so it may not be that nice. I also ran into another bug where rownames(toprint) was only giving back a length 1 vector so line 80 of print.data.table was failing

  if (isTRUE(row.names)) {
   # rownames was NULL and the rhs was length 4 so this threw an error
    rownames(toprint) = paste0(format(rn,right=TRUE,scientific=FALSE),":")
  } else {
    rownames(toprint) = rep.int("", nrow(toprint))
  }

Since vectors of rvar types have a dimension format_col.default was choosing the first branch and just returning "<multi-column>". imo I think that if condition should be kicked to the second to the last of the if else statements so that package writers have an opportunity to write the proper overwrite for format

https://github.com/Rdatatable/data.table/compare/master...SteveBronder:data.table:fix/print-rvar?expand=1

SteveBronder commented 4 months ago

Actually @MichaelChirico I think we should close this PR / issue. See the discussion in the comments here. Essentially this is another case of #4415 (specifically the comment here).

At first I thought this was a data.table issue but now I think it's rvar not conforming to data.table

MichaelChirico commented 4 months ago

Thanks for the reminder. Indeed I'd checked if rvar is S4, but it's just S3 which I thought should be easier to accommodate in general. And indeed I wouldn't expect generic computations on rvar columns in data.table to work.

But storing non-atomic columns in a data.table is still very useful for some high-level use cases, and something as basic as print() not working is surprising to me. If we can get a fix working that's fairly general and not terribly involved/inefficient, I think we should go for it.

SteveBronder commented 4 months ago

Thanks for the reminder. Indeed I'd checked if rvar is S4, but it's just S3 which I thought should be easier to accommodate in general. And indeed I wouldn't expect generic computations on rvar columns in data.table to work.

Yes I just was messing with rbindlist.c and I think this is a bigger project than I want to take on atm.

But storing non-atomic columns in a data.table is still very useful for some high-level use cases, and something as basic as print() not working is surprising to me. If we can get a fix working that's fairly general and not terribly involved/inefficient, I think we should go for it.

Would you like me to open up a PR for the fix branch I posted above?

MichaelChirico commented 4 months ago

Would you like me to open up a PR for the fix branch I posted above?

that'd be great!