Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

rbindlist messes up the bounding box of spatial `sf` objects #5352

Open rafapereirabr opened 2 years ago

rafapereirabr commented 2 years ago

I've found that `data.table::rbindlist()` somehow changes the bounding box of spatial sf objects. This is related to issue #2273 here, and I've linked to issues on the geobr and tmap packages as well.

Minimal reproducible example

devtools::install_github("ipeaGIT/geobr", subdir = "r-package")
library(geobr)
library(sf)
library(data.table)
library(waldo)

# download sf data
rr <- read_state(code_state = 'RR')
rs <- read_state(code_state = 'RS')

test_list <- list(rr, rs)

# row bind with rbindlist
t1_list <- data.table::rbindlist(test_list, fill = TRUE)
t1 <- sf::st_sf(t1_list) 
plot(t1['code_state'])

# base row bind
t2 <- rbind(rr,rs)
plot(t2['code_state'])

# compare
waldo::compare(t1, t2)

> `class(old)`: "sf" "data.table" "data.frame"
> `class(new)`: "sf"              "data.frame"
> 
> `attr(old$geom, 'bbox')`: "((-64.82525,-1.580633),(-58.88688,5.271841))"
> `attr(new$geom, 'bbox')`: "((-64.82525,-33.75208),(-49.69146,5.271841))"

For some reason, though, the problem goes away when I run a simple subset that removes a row that does not exist in the data.

t3 <- subset(t1, abbrev_state  != "xx")
waldo::compare(t2, t3)

> `class(old)`: "sf" "data.frame"             
> `class(new)`: "sf" "data.table" "data.frame"

sessionInfo()

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] waldo_0.4.0       data.table_1.14.2 sf_1.0-7          geobr_1.6.6      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3       rstudioapi_0.13    rematch2_2.1.2    
 [4] magrittr_2.0.2     units_0.8-0        tidyselect_1.1.2  
 [7] R6_2.5.1           rlang_1.0.2        fansi_1.0.2       
[10] httr_1.4.2         dplyr_1.0.8        tools_4.1.1       
[13] grid_4.1.1         utf8_1.2.2         KernSmooth_2.23-20
[16] cli_3.1.1          e1071_1.7-9        DBI_1.1.2         
[19] ellipsis_0.3.2     class_7.3-19       assertthat_0.2.1  
[22] tibble_3.1.6       lifecycle_1.0.1    crayon_1.5.0      
[25] purrr_0.3.4        vctrs_0.3.8        curl_4.3.2        
[28] glue_1.6.2         proxy_0.4-26       diffobj_0.3.5     
[31] compiler_4.1.1     pillar_1.7.0       generics_0.1.2    
[34] classInt_0.4-3     pkgconfig_2.0.3   

tlapak commented 2 years ago

This is a known limitation of data.table. See #4415 for a discussion of why this occurs. The gist is the following: as you are aware, bbox is stored as an attribute of geom and depends on the values of that vector. No data.table function ever touches these attributes, so after rbindlist() the column still carries the bbox of the first object only. When you call subset(), it actually ends up falling back on the data.frame method, which calls c(), which in turn recomputes the bbox. That's why it fixes it.
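The stale attribute can be inspected directly. The following is only a minimal sketch on toy data (two hypothetical single-point layers standing in for the geobr states), assuming sf and data.table behave as in the report above. It reads the cached `bbox` attribute after `rbindlist()` and compares it with the one sf computes when the geometry column is rebuilt by subsetting.

```r
library(sf)
library(data.table)

# Toy stand-ins for the two state layers (hypothetical data, not from geobr)
a <- st_sf(id = "a", geom = st_sfc(st_point(c(0, 0)), crs = 4326))
b <- st_sf(id = "b", geom = st_sfc(st_point(c(10, 10)), crs = 4326))

bound <- st_sf(rbindlist(list(a, b)))

# Cached bbox attribute on the geometry column; rbindlist() does not update it
cached <- attr(bound$geom, "bbox")

# Subsetting the column goes through sf's own `[` method, which rebuilds the
# sfc object and its bbox, even though every element is kept
rebuilt <- attr(bound$geom[seq_len(nrow(bound))], "bbox")

print(cached)
print(rebuilt)
```

If the two printed boxes differ, the cached one is the leftover from the first input, which matches the behaviour described in the report.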

jkaucic commented 1 year ago

Sorry for the question: I also use data.table's rbindlist to bind two multipolygon sf objects together. Does this mean the bounding box of the new layer is always messed up after this, and that I have to run the subset command as a fix? Thank you!
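One possible alternative to the subset trick, sketched here under assumptions rather than as an endorsed fix: recompute the bounding box explicitly after binding. `rbindlist_sf` is a hypothetical helper name, not a function in data.table or sf, and the sketch assumes an sf version in which `st_sfc()` accepts the `recompute_bbox` argument. The toy inputs are made up for illustration.

```r
library(sf)
library(data.table)

# Hypothetical helper (not part of data.table or sf): bind with rbindlist(),
# then ask sf to recompute the bbox of the geometry column.
# Returns a plain sf data.frame, not a data.table.
rbindlist_sf <- function(l, ...) {
  out <- st_sf(as.data.frame(rbindlist(l, ...)))
  # recompute_bbox = TRUE is assumed to be available in the installed sf
  st_geometry(out) <- st_sfc(st_geometry(out), crs = st_crs(out),
                             recompute_bbox = TRUE)
  out
}

# Toy inputs (hypothetical data)
a <- st_sf(id = "a", geom = st_sfc(st_point(c(0, 0)), crs = 4326))
b <- st_sf(id = "b", geom = st_sfc(st_point(c(10, 10)), crs = 4326))

fixed <- rbindlist_sf(list(a, b), fill = TRUE)
print(st_bbox(fixed))
```

Base `rbind()` on the sf objects, as in the original example, avoids the problem altogether when the speed of `rbindlist()` is not needed.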