apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R][Docs] Improve documentation of `col_types` #38903

Open assignUser opened 7 months ago

assignUser commented 7 months ago

Describe the enhancement requested

In a recent SO question about using partial schemas in open_dataset (which is possible using col_types), even a seasoned arrow user did not know about the proper solution.

The docs for open_dataset hide a lot of the more specialized options behind a ... and it's not obvious how to find those, as the linked dataset factory page also doesn't show all the possibilities. Some are explained in the specialized wrapper functions like https://arrow.apache.org/docs/r/reference/open_delim_dataset.html or https://arrow.apache.org/docs/r/reference/csv_convert_options.html but even there col_types is not described in a way that makes it obvious that it is to be used to pass in partial schemas.

At a minimum, the doc strings for col_types should make the intended use case clear. Ideally, we should link to the detailed descriptions from open_dataset or find another way to document the possible options more visibly.
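
For reference, a minimal sketch of the partial-schema usage described above. The temporary CSV and its column names (`id`, `value`, `n`) are made up for illustration and are not taken from the SO question; the sketch assumes a recent arrow release in which `open_dataset()` forwards `col_types` to the CSV convert options.

```r
library(arrow)

# Write a small CSV into a temporary directory (illustrative data only)
tmp <- tempfile()
dir.create(tmp)
writeLines(
  c("id,value,n", "1,1.5,10", "2,2.5,20"),
  file.path(tmp, "part-0.csv")
)

# Without col_types, every column type is inferred from the data
inferred <- open_dataset(tmp, format = "csv")

# Partial schema: override only `id`; `value` and `n` are still inferred
partial <- open_dataset(tmp, format = "csv", col_types = schema(id = string()))

inferred$schema
partial$schema
```

Comparing the two printed schemas shows that only the column named in `col_types` changes type, which is the behaviour the docs currently do not spell out.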

Component(s)

Documentation, R

ShaiviAgarwal2 commented 5 months ago

@assignUser Is this issue resolved? If not, I want to contribute to it!!

assignUser commented 5 months ago

Nope, and afaik no one is working on it, so feel free to take it on!

ShaiviAgarwal2 commented 5 months ago

@assignUser To solve the issue of unclear documentation while working with partial schemas in the open_dataset function using col_types, we'll take a few steps to make things clearer for users.

First, we will update the doc strings for col_types so that they clearly explain that col_types is used for passing partial schemas in open_dataset.

Next, we will add a direct link in the open_dataset documentation that leads to the detailed descriptions of the possible options, including col_types. Or we could find another way to make these options more visible in the documentation. Maybe by creating a separate section or even a dedicated page for these specialized options.

ShaiviAgarwal2 commented 5 months ago

@assignUser Am I thinking in the right direction and are you satisfied with my answer?

ShaiviAgarwal2 commented 5 months ago

Could you please assign this task to me, I want to contribute to it!!

assignUser commented 5 months ago

> Am I thinking in the right direction and are you satisfied with my answer?

I have assigned the issue to you. You can also comment "/take" on an issue and a bot will assign it to you :)

joelnitta commented 5 months ago

I would add that the current documentation says that a "compact string representation" of column types is allowable. This is very similar to the wording of {readr}, so without additional explanation I assumed that's what it meant, but this does not seem to work:

library(readr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#>    mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#>  2 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
#>  3 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#>  4 21.4  6     258   110   3.08  3.215 19.44 1     0     3     1    
#>  5 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2    
#>  6 18.1  6     225   105   2.76  3.46  20.22 1     0     3     1    
#>  7 14.3  8     360   245   3.21  3.57  15.84 0     0     3     4    
#>  8 24.4  4     146.7 62    3.69  3.19  20    1     0     4     2    
#>  9 22.8  4     140.8 95    3.92  3.15  22.9  1     0     4     2    
#> 10 19.2  6     167.6 123   3.92  3.44  18.3  1     0     4     4    
#> # ℹ 22 more rows

# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#>      ▆
#>   1. └─arrow (local) `<fn>`(...)
#>   2.   └─arrow::open_dataset(...)
#>   3.     └─DatasetFactory$create(...)
#>   4.       └─FileFormat$create(...)
#>   5.         └─CsvFileFormat$create(...)
#>   6.           └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#>   7.             ├─base::do.call(csv_file_format_convert_opts, args)
#>   8.             └─arrow (local) `<fn>`(...)
#>   9.               ├─base::do.call(csv_convert_options, opts)
#>  10.               └─arrow (local) `<fn>`(...)
#>  11.                 └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))

Created on 2024-01-24 with reprex v2.0.2
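
For comparison, a sketch of what the error message points to: passing a `<Schema>` instead of the compact string. To stay self-contained it writes base R's `mtcars` to a temporary file rather than using `readr_example()`; the names `path`, `col_names` and `all_chr` are only illustrative, and the sketch assumes an arrow version that provides `open_csv_dataset()`.

```r
library(arrow)

# Stand-in for readr_example("mtcars.csv"): write mtcars to a temporary CSV
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE, quote = FALSE)

# Take the column names from the inferred schema
col_names <- open_csv_dataset(path)$schema$names

# Build a schema with one string() type per column, the Schema counterpart
# of readr's compact "ccccccccccc" specification
types <- setNames(rep(list(string()), length(col_names)), col_names)
all_chr <- do.call(schema, types)

ds_chr <- open_csv_dataset(path, col_types = all_chr)
ds_chr$schema
```

Because `col_types` also accepts a partial schema, `schema(cyl = string())` alone should be enough when only some columns need overriding.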

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.1.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    UTF-8
#>  tz       Asia/Tokyo
#>  date     2024-01-24
#>  pandoc   3.1.2 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
#>  cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
#>  crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
#>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.23     2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.3.0)
#>  hms           1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
#>  knitr         1.45     2023-10-30 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.3   2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4    2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.2    2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.25     2023-09-18 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2   2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom         1.6.5    2023-12-05 [1] CRAN (R 4.3.1)
#>  withr         2.5.2    2023-10-30 [1] CRAN (R 4.3.1)
#>  xfun          0.41     2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml          2.3.8    2023-12-11 [1] CRAN (R 4.3.1)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```