Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

column types not inferred for rbind. #3654

Closed ocschwar closed 4 years ago

ocschwar commented 5 years ago

Here's a simple test case:

library(data.table)
library(circular)
X <- data.table(Time='2013-01-01 05:00:00', speed=5.0, direction = circular::circular(22))
Y <- data.table(Time='2013-01-01 17:00:00', speed=4.5, direction = 172)
rbind(X,Y)

The manual page for data.table::rbindlist continues to say this:

 If column ‘i’ does not have the same type in each of the list
 items; e.g, the column is ‘integer’ in item 1 while others are
 ‘numeric’, they are coerced to the highest type.

But that is no longer the behavior I see. Could type coercion be done at least optionally?

`> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] graphics grDevices datasets utils stats methods base

other attached packages:
[1] REhedge_0.2.42 REdata_0.0.6 REveal_1.10 REviz_0.1
[5] circular_0.4-93 bindrcpp_0.2.2 data.table_1.12.2 maps_3.3.0
[9] odbc_1.1.6 devtools_1.13.6 testthat_2.1.1

loaded via a namespace (and not attached):
[1] readxl_1.2.0 backports_1.1.2 spam_2.1-4
[4] sm_2.2-5.6 plyr_1.8.4 lazyeval_0.2.1
[7] sp_1.3-1 rclipboard_0.1 crosstalk_1.0.0
[10] leaflet_2.0.2 ggplot2_3.1.0 digest_0.6.19
[13] htmltools_0.3.6 gdata_2.18.0 magrittr_1.5
[16] memoise_1.1.0 cluster_2.0.6 aws.signature_0.4.4
[19] openxlsx_4.1.0.1 Metrics_0.1.4 readr_1.3.1
[22] xts_0.11-2 jpeg_0.1-8 waterfalls_0.1.2
[25] colorspace_1.3-2 blob_1.1.1 dplyr_0.7.8
[28] rgdal_1.3-6 tcltk_3.4.4 callr_3.2.0
[31] crayon_1.3.4 RCurl_1.95-4.11 jsonlite_1.6
[34] hexbin_1.27.2 roxygen2_6.1.1 bindr_0.1.1
[37] zoo_1.8-4 glue_1.3.0 gtable_0.2.0
[40] DEoptimR_1.0-8 abind_1.4-5 scales_1.0.0
[43] padr_0.5.0 mvtnorm_1.0-7 DBI_1.0.0
[46] Rcpp_1.0.0 plotrix_3.7-4 xtable_1.8-3
[49] foreign_0.8-69 bit_1.1-14 mapproj_1.2.6
[52] dotCall64_0.9-5.2 RJDBC_0.2-7.1 htmlwidgets_1.3
[55] httr_1.4.0 RColorBrewer_1.1-2 geosphere_1.5-7
[58] pkgconfig_2.0.2 XML_3.98-1.16 rJava_0.9-10
[61] dbplyr_1.2.2 RJSONIO_1.3-1.1 tidyselect_0.2.5
[64] rlang_0.3.4 reshape2_1.4.3 later_0.7.5
[67] munsell_0.5.0 cellranger_1.1.0 tools_3.4.4
[70] cli_1.0.1 xgboost_0.71.2 moments_0.14
[73] RSQLite_2.1.1 english_1.2-3 ggmap_2.6.1
[76] ggridges_0.5.1 aws.s3_0.3.12 stringr_1.3.1
[79] processx_3.3.0 bit64_0.9-7 zip_2.0.2
[82] robustbase_0.93-3 purrr_0.2.5 ncdf4_1.16.1
[85] RgoogleMaps_1.4.3 nlme_3.1-131 mime_0.6
[88] xml2_1.2.0 compiler_3.4.4 rstudioapi_0.10
[91] png_0.1-7 ggjoy_0.4.1 RPostgreSQL_0.6-2
[94] tibble_2.1.1 stringi_1.2.4 ps_1.3.0
[97] desc_1.2.0 fields_9.6 lattice_0.20-35
[100] Matrix_1.2-12 commonmark_1.6 shinyjs_1.0
[103] stringdist_0.9.5.1 pillar_1.3.1 bitops_1.0-6
[106] maptools_0.9-4 grImport_0.9-2 raster_2.8-4
[109] httpuv_1.4.5 R6_2.4.0 latticeExtra_0.6-28
[112] promises_1.0.1 gridExtra_2.3 openair_2.6-5
[115] codetools_0.2-15 boot_1.3-20 MASS_7.3-49
[118] gtools_3.5.0 assertthat_0.2.0 proto_1.0.0
[121] rprojroot_1.3-2 rjson_0.2.18 shinyWidgets_0.4.4
[124] withr_2.1.2 mgcv_1.8-23 parallel_3.4.4
[127] hms_0.4.2 grid_3.4.4 timeDate_3043.102
[130] tidyr_0.8.3 git2r_0.25.2 shiny_1.2.0
[133] lubridate_1.7.4 base64enc_0.1-3 dygraphs_1.1.1.6
`

jangorecki commented 5 years ago

The type of those columns is identical:

> typeof(X$direction)
[1] "double"
> typeof(Y$direction)
[1] "double"

What differs is the class. The documentation refers to type, not class. My suggestion is to be explicit about data types and classes, and to coerce to the expected type when desired. This helps readability, and a different class can sometimes cause some optimizations to be switched off. Closing for now; please re-open if you disagree, ideally including the expected answer.
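
To illustrate the type-versus-class distinction and the "be explicit" suggestion, here is a minimal sketch. The tables `A` and `B` and the class names `"a"` and `"b"` are hypothetical and chosen only for illustration. Dropping the class with `unclass()` is one possible way to align the columns, not necessarily the right one for every class.

```r
library(data.table)

# Hypothetical tables: column y has the same type (integer) but a different class.
A <- data.table(x = 1L, y = structure(1L, class = "a"))
B <- data.table(x = 2L, y = structure(2L, class = "b"))

typeof(A$y) == typeof(B$y)          # compares the underlying storage types
identical(class(A$y), class(B$y))   # compares the class attributes

# Make the class explicit before binding: here both columns are reduced
# to plain integers by dropping the class attribute with unclass().
A_plain <- data.table(x = A$x, y = unclass(A$y))
B_plain <- data.table(x = B$x, y = unclass(B$y))

res <- rbindlist(list(A_plain, B_plain))
class(res$y)
```

The opposite direction, giving both columns the same class before binding, follows the same idea. Which direction is appropriate depends on what the class means for the data.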

ocschwar commented 5 years ago

This change in behavior from 1.11.6 came as a surprise. I could find nothing in the changelogs to indicate it. It is arguably a change for the better.

jangorecki commented 5 years ago

Sounds like a breaking change to the API, so it is worth considering. Thanks for the feedback!

jangorecki commented 4 years ago

Minimal example

rbindlist(list(
  data.table(x=1L, y=structure(1L, class="a")),
  data.table(x=2L, y=structure(2L, class="b"))
))
#Error in rbindlist(list(data.table(x = 1L, y = structure(1L, class = "a")),  : 
#  Class attribute on column 2 of item 2 does not match with column 2 of item 1.

jangorecki commented 4 years ago

@ocschwar I tried your code on 1.11.4 and it raises an error for me as well. Could you provide the exact code that used to work and now raises an error?

X <- data.table(Time='2013-01-01 05:00:00', speed=5.0, direction = circular::circular(22))
Y <- data.table(Time='2013-01-01 17:00:00', speed=4.5, direction = 172)
rbind(X,Y)
#Error in rbindlist(l, use.names, fill, idcol) : 
#  Class attributes at column 3 of input list at position 2 does not match with column 3 of input list at position 1. Coercion of objects of class 'factor' alone is handled internally by rbind/rbindlist at the moment.

jangorecki commented 4 years ago

Closing, as this is not a regression and works as documented. @ocschwar please let us know if you can provide the required code chunk so we can re-open this issue.