Open assignUser opened 7 months ago
@assignUser Is this issue resolved? If not, I want to contribute to it!!
Nope, and afaik no one is working on it, so feel free to take it on!
@assignUser
To solve the issue of unclear documentation when working with partial schemas in the open_dataset function using col_types, we'll take a few steps to make things clearer for users.
First, we will update the doc strings for col_types to clearly explain that col_types is used for passing partial schemas in open_dataset.
Next, we will add a direct link in the open_dataset documentation that leads to the detailed descriptions of the possible options, including col_types.
Or we could find another way to make these options more visible in the documentation. Maybe by creating a separate section or even a dedicated page for these specialized options.
@assignUser Am I thinking in the right direction and are you satisfied with my answer?
Could you please assign this task to me? I want to contribute to it!
Am I thinking in the right direction and are you satisfied with my answer?
I have assigned the issue to you. You can also comment "/take" on an issue and a bot will assign it to you :)
I would add that the current documentation says that a "compact string representation" of column types is allowable. This is very similar to the wording of {readr}, so without additional explanation I assumed that's what it meant, but this does not seem to work:
library(readr)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
#> 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#> ▆
#> 1. └─arrow (local) `<fn>`(...)
#> 2. └─arrow::open_dataset(...)
#> 3. └─DatasetFactory$create(...)
#> 4. └─FileFormat$create(...)
#> 5. └─CsvFileFormat$create(...)
#> 6. └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#> 7. ├─base::do.call(csv_file_format_convert_opts, args)
#> 8. └─arrow (local) `<fn>`(...)
#> 9. ├─base::do.call(csv_convert_options, opts)
#> 10. └─arrow (local) `<fn>`(...)
#> 11. └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))
Created on 2024-01-24 with reprex v2.0.2
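For contrast, here is a minimal sketch of the form the error message asks for, a Schema passed as col_types. It assumes a partial schema (naming only some columns) is accepted, as the issue description states, and it writes mtcars to a temporary file (the csv_path name is just for illustration) so it does not depend on readr:

```r
library(arrow)

# Write mtcars to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Partial schema: only `cyl` is given a type here; the remaining
# columns are assumed to be left to arrow's type inference.
ds <- open_csv_dataset(csv_path, col_types = schema(cyl = string()))

# Inspect the resulting schema to check how each column was typed.
print(ds$schema)
```

If this behaves as described, cyl should come back as a string column while the other columns keep their inferred types.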
Describe the enhancement requested
In a recent SO question about using partial schemas in open_dataset (which is possible using col_types), even a seasoned arrow user did not know about the proper solution.
The docs for open_dataset hide a lot of more specialized options behind a ... and it's not obvious how to find those, as the linked dataset factory page also doesn't show all possibilities. Some are explained in the specialized wrapper functions like https://arrow.apache.org/docs/r/reference/open_delim_dataset.html or https://arrow.apache.org/docs/r/reference/csv_convert_options.html but even there col_types is not described in a way that makes it obvious that it is to be used to pass in partial schemas.
At the minimum the doc strings for col_types should make the intended use case clear; ideally we should link to the detailed descriptions from open_dataset or find another way to document the possible options more visibly.
Component(s)
Documentation, R