apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.29k stars 3.47k forks source link

[R] existing_data_behavior = "overwrite" does not work as expected in arrow::write_dataset #37760

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

It seems that the default for existing_data_behavior='overwrite' does not actually overwrite existing data. @jonkeane

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

path <- file.path(tempdir(), "mtcars-data")

mtcars %>% 
  group_by(cyl) %>% 
  arrow::write_dataset(path)

list.dirs(path, full.names = F)
#> [1] ""      "cyl=4" "cyl=6" "cyl=8"
fs::file_size(path)
#> 160

mtcars %>% 
  group_by(vs) %>% 
  arrow::write_dataset(path, existing_data_behavior = "overwrite")

list.dirs(path, full.names = F)
#> [1] ""      "cyl=4" "cyl=6" "cyl=8" "vs=0"  "vs=1"
fs::file_size(path)
#> 224

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.3
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.3       cli_3.6.1         knitr_1.44        rlang_1.1.1      
#>  [5] xfun_0.40         purrr_1.0.2       generics_0.1.3    assertthat_0.2.1 
#>  [9] bit_4.0.5         glue_1.6.2        htmltools_0.5.6   fansi_1.0.4      
#> [13] rmarkdown_2.24    evaluate_0.21     tibble_3.2.1      fastmap_1.1.1    
#> [17] yaml_2.3.7        lifecycle_1.0.3   compiler_4.3.1    fs_1.6.3         
#> [21] pkgconfig_2.0.3   rstudioapi_0.15.0 digest_0.6.33     R6_2.5.1         
#> [25] reprex_2.0.2      tidyselect_1.2.0  utf8_1.2.3        pillar_1.9.0     
#> [29] magrittr_2.0.3    bit64_4.0.5       tools_4.3.1       withr_2.5.0      
#> [33] arrow_13.0.0

Created on 2023-09-17 with reprex v2.0.2

Component(s)

R

jonkeane commented 1 year ago

(sorry for the incorrect link up there, too many tabs open, apparently!)

Ah, I see what's going on here. In the overwrite mode we anticipated someone was overwriting a dataset with the same partitioning. If I remember correctly, the way that we delete files in overwrite mode, the C++ looks for the folders the partitions would create and then deletes the files inside of those. But if you write with a new partitioning, it doesn't look in other folders that might exist.

There are a few things we could do (and these aren't necessarily mutually exclusive):