[R] existing_data_behavior = "overwrite" does not work as expected in arrow::write_dataset #37760

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

It seems that existing_data_behavior = "overwrite" (the default) does not actually overwrite existing data. @jonkeane

library(dplyr)  # provides %>% and group_by()
path <- file.path(tempdir(), "mtcars-data")

mtcars %>% 
  group_by(cyl) %>% 
  arrow::write_dataset(path)

list.dirs(path, full.names = F)
#> [1] ""      "cyl=4" "cyl=6" "cyl=8"
#> 160

mtcars %>% 
  group_by(vs) %>% 
  arrow::write_dataset(path, existing_data_behavior = "overwrite")

list.dirs(path, full.names = F)
#> [1] ""      "cyl=4" "cyl=6" "cyl=8" "vs=0"  "vs=1"
#> 224

jonkeane commented 1 year ago

(sorry for the incorrect link up there, too many tabs open, apparently!)

Ah, I see what's going on here. In overwrite mode we anticipated that someone would be overwriting a dataset with the same partitioning. If I remember correctly, when deleting files in overwrite mode, the C++ code looks for the folders that the new partitions would create and then deletes the files inside those. But if you write with a different partitioning, it doesn't look in other folders that might already exist.
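For anyone hitting this in the meantime, one possible workaround is to remove the dataset directory yourself before writing with a new partitioning. This is only a minimal sketch, assuming the directory holds nothing except the dataset and that losing the old files is acceptable; the `mtcars-data-clean` path is a made-up name for illustration, not something from the report above.

```r
library(arrow)
library(dplyr)

# Hypothetical path used only for this sketch
path <- file.path(tempdir(), "mtcars-data-clean")

# First write, partitioned by cyl (creates cyl=4, cyl=6, cyl=8)
mtcars %>%
  group_by(cyl) %>%
  write_dataset(path)

# Assumption: everything under `path` belongs to the dataset, so it is safe
# to delete the whole directory and leave no stale partition folders behind
unlink(path, recursive = TRUE)

# Second write, partitioned by vs, into a now-empty location
mtcars %>%
  group_by(vs) %>%
  write_dataset(path)

# Only the partition folders from the second write should remain
dirs <- list.dirs(path, full.names = FALSE, recursive = FALSE)
print(dirs)
```

Deleting up front sidesteps the question of which folders overwrite mode inspects, at the cost of a window in which no dataset exists at that path.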

There are a few things we could do (and these aren't necessarily mutually exclusive):