Open asfimport opened 2 years ago
Dewey Dunnington / @paleolimbot: It's not all that intuitive, but if you skip the partitioning column I think it works!
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")
# skip the partitioning columns and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#> mpg cyl disp hp drat wt qsec vs am carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 2
#> 2 30.4 4 95.1 113 3.77 1.51 16.9 1 1 2
#> 3 15.8 8 351 264 4.22 3.17 14.5 0 1 4
#> 4 19.7 6 145 175 3.62 2.77 15.5 0 1 6
#> 5 15 8 301 335 3.54 3.57 14.6 0 1 8
#> 6 21.4 6 258 110 3.08 3.22 19.4 1 0 1
#> 7 18.7 8 360 175 3.15 3.44 17.0 0 0 2
#> 8 18.1 6 225 105 2.76 3.46 20.2 1 0 1
#> 9 14.3 8 360 245 3.21 3.57 15.8 0 0 4
#> 10 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
#> # … with 22 more rows
Dewey Dunnington / @paleolimbot: Right! I see what you mean... we lose 'gear' here. Flagging @thisisnic again just in case there's something I missed with respect to the CSV reader here.
Carl Boettiger / @cboettig:
Sorry, my minimal example was too minimal. Yes, I had noticed that dropping the partition works, but I cannot then filter() on the partition column before collect. Continuing from your reprex, try:
> df %>% filter(gear < 3) %>% collect()
Error in lapply(args, function(x) { : object 'gear' not found
The primary incentive to hive-partition, I thought, was to benefit from arrow's ability to skip parsing the files excluded by the filter altogether. (Though admittedly hive-partitioning is more of a parquet concept, I guess. I was initially very pleasantly surprised that write_dataset() would even partition in this way with format="csv", so very cool!)
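For comparison, a minimal sketch of the expected behaviour when no schema is passed. It assumes mtcars and a temporary directory as stand-ins for real data, and it only checks that the filter on the partition column is accepted and that `gear` comes back in the result; it does not verify that any files are actually skipped.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: mtcars and a temporary directory stand in for real data
path <- tempfile("hive_csv_")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# No schema passed: `gear` is recovered from the gear=<value> directory names
ds <- open_dataset(path, format = "csv")

# Filtering on the partition column before collect() works in this case
filtered <- ds %>% filter(gear == 4) %>% collect()
nrow(filtered)
```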
Sam Albers / @boshek: I did some digging, to the extent that I added a test that captures this failure here: https://github.com/apache/arrow/pull/12831
I can confirm that this does not happen when format = 'parquet'. The error message is coming from here but that is about as far as I got. I think this is also related to ARROW-14743
Neal Richardson / @nealrichardson: Confirmed that this is still an issue in 8.0.0
Consider this reprex:
Create a dataset with hive partitions in csv format with write_dataset() (so cool!):
In the first call to open_dataset, we don't pass a schema and things work as expected.
However, csv files often need a schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, confuses open_dataset(), because the grouping column (partition column) isn't found in the individual files even though it is mentioned in the schema!
Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data.
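The three cases described above can be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the exact code from the report: mtcars and a temporary directory are used as example data, the schema is simply reused from the schema-free read, and `try_collect()` is a hypothetical helper that returns the error message instead of stopping, since the outcome of cases 2 and 3 may depend on the installed arrow version.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: mtcars and a temporary directory stand in for real data
path <- tempfile("hive_csv_")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# Hypothetical helper: evaluate an expression, return the error message on failure
try_collect <- function(expr) {
  tryCatch(expr, error = function(e) conditionMessage(e))
}

# 1. No schema: column types are inferred and `gear` comes from the directory names
ds <- open_dataset(path, format = "csv")
names(ds)

# 2. Full schema, including the partition column (the case reported as failing)
with_gear <- try_collect(
  open_dataset(path, format = "csv", schema = ds$schema, skip_rows = 1) %>%
    collect()
)

# 3. Schema without the partition column (reported in this thread to read,
#    but without `gear` in the result)
no_gear_schema <- ds$schema[setdiff(names(ds), "gear")]
without_gear <- try_collect(
  open_dataset(path, format = "csv", schema = no_gear_schema, skip_rows = 1) %>%
    collect()
)
```

Inspecting `with_gear` and `without_gear` afterwards shows either the collected data or the error message for the arrow version in use.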
Reporter: Carl Boettiger / @cboettig
Note: This issue was originally created as ARROW-15879. Please see the migration documentation for further details.