apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] col_types of open_delim_dataset() does not work as described #39811

Open joelnitta opened 10 months ago

joelnitta commented 10 months ago

Describe the bug, including details regarding any error messages, version, and platform.

(originally posted as a comment on #38903, but suggested by @thisisnic to file as its own issue)

The current documentation of open_delim_dataset() says that a "compact string representation" of column types can be used for the col_types argument. This is nearly identical to the wording for the col_types argument of {readr}, but no additional explanation is provided. So I assumed that's what it meant, but this does not seem to work:

``` r
library(readr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#>    mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#>  2 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
#>  3 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#>  4 21.4  6     258   110   3.08  3.215 19.44 1     0     3     1    
#>  5 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2    
#>  6 18.1  6     225   105   2.76  3.46  20.22 1     0     3     1    
#>  7 14.3  8     360   245   3.21  3.57  15.84 0     0     3     4    
#>  8 24.4  4     146.7 62    3.69  3.19  20    1     0     4     2    
#>  9 22.8  4     140.8 95    3.92  3.15  22.9  1     0     4     2    
#> 10 19.2  6     167.6 123   3.92  3.44  18.3  1     0     4     4    
#> # ℹ 22 more rows

# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#>      ▆
#>   1. └─arrow (local) `<fn>`(...)
#>   2.   └─arrow::open_dataset(...)
#>   3.     └─DatasetFactory$create(...)
#>   4.       └─FileFormat$create(...)
#>   5.         └─CsvFileFormat$create(...)
#>   6.           └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#>   7.             ├─base::do.call(csv_file_format_convert_opts, args)
#>   8.             └─arrow (local) `<fn>`(...)
#>   9.               ├─base::do.call(csv_convert_options, opts)
#>  10.               └─arrow (local) `<fn>`(...)
#>  11.                 └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))
```

Created on 2024-01-24 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.1.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    UTF-8
#>  tz       Asia/Tokyo
#>  date     2024-01-24
#>  pandoc   3.1.2 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
#>  cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
#>  crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
#>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.23     2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.3.0)
#>  hms           1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
#>  knitr         1.45     2023-10-30 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.3   2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4    2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.2    2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.25     2023-09-18 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2   2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom         1.6.5    2023-12-05 [1] CRAN (R 4.3.1)
#>  withr         2.5.2    2023-10-30 [1] CRAN (R 4.3.1)
#>  xfun          0.41     2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml          2.3.8    2023-12-11 [1] CRAN (R 4.3.1)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Component(s)

R

thisisnic commented 10 months ago

Thanks for reporting this @joelnitta! I can replicate this, and this is a bug.

thisisnic commented 10 months ago

From the perspective of fixing this, I had a look, and I think what we need to do here is one of:

a) set up this schema manually if we need to. It's probably a change that needs to be made in the body of check_csv_file_format_args(), where we are checking options for validity and setting up the various options classes for reading in datasets.

b) call readr_to_csv_parse_options() in check_csv_file_format_args(), though I'm not convinced this is the right path here, as open_csv_dataset() is just a wrapper around open_dataset(format = "csv"). The original function open_dataset() supports more options than open_csv_dataset() and so we might break things if we do this.
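
To make option (a) a bit more concrete, here is a rough base-R sketch of the translation step: expanding a readr-style compact string into one type name per column. The helper name `compact_to_type_names()` is hypothetical and not part of arrow, and the code-to-type mapping is an assumption that should be checked against the arrow documentation before relying on it. Only the compact codes themselves (`c`, `i`, `n`, `d`, `l`, `D`, `T`, `t`, `?`) come from {readr}.

``` r
# Hypothetical helper (not part of arrow): expand a readr-style compact
# string into a named character vector of Arrow type names, one per column.
compact_to_type_names <- function(col_types, col_names) {
  # Assumed mapping from compact codes to Arrow type names.
  lookup <- c(
    c = "string", i = "int32", n = "float64", d = "float64",
    l = "boolean", D = "date32", T = "timestamp", t = "time32"
  )

  # Split "ccd" into c("c", "c", "d")
  codes <- strsplit(col_types, "", fixed = TRUE)[[1]]

  if (length(codes) != length(col_names)) {
    stop("`col_types` must have one character per column.", call. = FALSE)
  }

  unknown <- setdiff(codes, c(names(lookup), "?"))
  if (length(unknown) > 0) {
    stop(
      "Unsupported compact code(s): ", paste(unknown, collapse = ", "),
      call. = FALSE
    )
  }

  # "?" means "guess", so those columns are left out and stay inferred
  keep <- codes != "?"
  out <- unname(lookup[codes[keep]])
  names(out) <- col_names[keep]
  out
}

# Example: first column as string, second as double, third left to inference
compact_to_type_names("cd?", c("mpg", "cyl", "disp"))
```

In arrow itself, the type names would presumably be replaced by the corresponding type constructors and combined into a Schema, since a Schema is what the error message says `col_types` accepts. That step is left out so the sketch runs without arrow installed.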