apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] passing a schema causes open_dataset to fail on hive-partitioned csv files #31312

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Consider this reprex.

Create a dataset with hive partitions in csv format with write_dataset() (so cool!):

library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format = "csv")

## but pass a schema, and things fail
df <- open_dataset(path, format = "csv", schema = ds$schema, skip_rows = 1)
df %>% collect()

In the first call to open_dataset, we don't pass a schema and things work as expected.

However, csv files often need a schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, confuses open_dataset, because the grouping (partition) column isn't present in the individual files even though it is mentioned in the schema!

Nor can we simply omit the grouping column from the schema, since then it is effectively lost from the data.
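For context, the kind of explicit schema one would want to pass looks like this (a sketch; the column names follow mtcars, and the chosen arrow types are assumptions, not taken from the report):

```r
library(arrow)

# A hand-written schema for the mtcars csv dataset, including the
# partition column 'gear'. Ideally open_dataset() would accept this
# and resolve 'gear' from the hive partition path rather than
# expecting it inside each csv file.
mtcars_schema <- schema(
  mpg  = float64(),
  cyl  = int64(),
  disp = float64(),
  hp   = int64(),
  drat = float64(),
  wt   = float64(),
  qsec = float64(),
  vs   = int64(),
  am   = int64(),
  carb = int64(),
  gear = int64()
)

# This is the call that fails as reported above:
# open_dataset(path, format = "csv", schema = mtcars_schema, skip_rows = 1)
```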

Reporter: Carl Boettiger / @cboettig

PRs and other links:

Note: This issue was originally created as ARROW-15879. Please see the migration documentation for further details.

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: It's not all that intuitive, but if you skip the partitioning column I think it works!


library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")

# skip the partitioning columns and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#>  1  26       4 120.     91  4.43  2.14  16.7     0     1     2
#>  2  30.4     4  95.1   113  3.77  1.51  16.9     1     1     2
#>  3  15.8     8 351     264  4.22  3.17  14.5     0     1     4
#>  4  19.7     6 145     175  3.62  2.77  15.5     0     1     6
#>  5  15       8 301     335  3.54  3.57  14.6     0     1     8
#>  6  21.4     6 258     110  3.08  3.22  19.4     1     0     1
#>  7  18.7     8 360     175  3.15  3.44  17.0     0     0     2
#>  8  18.1     6 225     105  2.76  3.46  20.2     1     0     1
#>  9  14.3     8 360     245  3.21  3.57  15.8     0     0     4
#> 10  16.4     8 276.    180  3.07  4.07  17.4     0     0     3
#> # … with 22 more rows
asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: Right! I see what you mean...we lose 'gear' here. Flagging @thisisnic again just in case there's something I missed with respect to the CSV reader here.

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Sorry, my minimal example was too minimal. Yes, I had noticed that dropping the partition column works, but I cannot then filter() on the partition column before collect(). Continuing from your reprex, try:


> df %>% filter(gear < 3) %>% collect()
Error in lapply(args, function(x) { : object 'gear' not found 

I thought the primary incentive to hive-partition was to benefit from arrow's ability to skip parsing the files excluded by the filter. (Admittedly, hive partitioning is more of a parquet concept, I guess; I was initially very pleasantly surprised that write_dataset() would even partition this way with format="csv", so very cool!)
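For comparison, here is a sketch of the same reprex with format = "parquet", where the partition column stays queryable and can be filtered before collect() (the directory name tmp_parquet is made up for illustration):

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write the same hive-partitioned dataset, but as parquet.
pq_path <- fs::dir_create("tmp_parquet")  # hypothetical path
mtcars %>% group_by(gear) %>% write_dataset(pq_path, format = "parquet")

# With parquet, 'gear' is available for filtering, and the filter on
# the partition column lets arrow skip the non-matching gear=...
# directories entirely.
ds_pq <- open_dataset(pq_path, format = "parquet")
ds_pq %>% filter(gear < 3) %>% collect()
```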

asfimport commented 2 years ago

Sam Albers / @boshek: I did some digging, to the extent that I added a test that captures this failure here: https://github.com/apache/arrow/pull/12831

I can confirm that this does not happen when format = 'parquet'. The error message is coming from here, but that is about as far as I got. I think this is also related to ARROW-14743.

asfimport commented 2 years ago

Neal Richardson / @nealrichardson: Confirmed that this is still an issue in 8.0.0