Open annakrystalli opened 1 year ago
Is this perhaps related to the need to skip the header row when specifying `schema` in `read_csv_arrow()`?
https://arrow.apache.org/docs/r/reference/read_delim_arrow.html#ref-examples
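For reference, here is a minimal sketch of the pattern described on that documentation page. The file and its two columns (`x`, `y`) are made up for illustration. It assumes the schema lists every column in the file, since a full schema replaces both the column names and the column types.

```r
library(arrow)

# Hypothetical two-column CSV written with a header row.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tf, row.names = FALSE)

# With an explicit schema the header line is no longer used for column names,
# so it is skipped here to keep it from being parsed as a data row.
tbl <- read_csv_arrow(
  tf,
  schema = schema(x = int32(), y = utf8()),
  skip = 1
)
```

A schema that names only a subset of the columns would still be expected to fail with a column-count error, which is the situation in the `mtcars.csv` examples in this thread.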
For CSV files, `col_types` should be used to change the type of a particular column.
readr::readr_example("mtcars.csv") |>
arrow::read_csv_arrow(schema = arrow::schema(cyl = arrow::utf8()))
#> Error:
#> ! Invalid: CSV parse error: Expected 1 columns, got 11: "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> Backtrace:
#> ▆
#> 1. └─arrow (local) `<fn>`(...)
#> 2. └─base::tryCatch(...)
#> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 5. └─value[[3L]](cond)
#> 6. └─arrow:::augment_io_error_msg(e, call, schema = schema)
#> 7. └─rlang::abort(msg, call = call)
readr::readr_example("mtcars.csv") |>
arrow::open_dataset(schema = arrow::schema(cyl = arrow::utf8()), format = "csv") |>
dplyr::collect()
#> Error in `compute.Dataset()`:
#> ! Invalid: Could not open CSV input source '/usr/local/lib/R/site-library/readr/extdata/mtcars.csv': Invalid: CSV parse error: Row #1: Expected 1 columns, got 11: "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> Backtrace:
#> ▆
#> 1. ├─dplyr::collect(...)
#> 2. └─arrow:::collect.Dataset(...)
#> 3. ├─arrow:::collect.ArrowTabular(compute.Dataset(x), as_data_frame)
#> 4. │ └─base::as.data.frame(x, ...)
#> 5. └─arrow:::compute.Dataset(x)
#> 6. └─base::tryCatch(...)
#> 7. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 8. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 9. └─value[[3L]](cond)
#> 10. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 11. └─rlang::abort(msg, call = call)
readr::readr_example("mtcars.csv") |>
arrow::read_csv_arrow(col_types = arrow::schema(cyl = arrow::utf8()))
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
readr::readr_example("mtcars.csv") |>
arrow::open_dataset(col_types = arrow::schema(cyl = arrow::utf8()), format = "csv") |>
dplyr::collect()
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
Created on 2023-03-16 with reprex v2.0.2
If the `schema` option is used for Parquet files, only the columns named in the schema will be read.
readr::readr_example("mtcars.csv") |>
readr::read_csv(show_col_types = FALSE) |>
arrow::write_parquet("test.parquet")
arrow::open_dataset("test.parquet", schema = arrow::schema(cyl = arrow::utf8())) |>
dplyr::collect()
#> # A tibble: 32 × 1
#> cyl
#> <chr>
#> 1 6
#> 2 6
#> 3 4
#> 4 6
#> 5 8
#> 6 6
#> 7 8
#> 8 4
#> 9 4
#> 10 6
#> # … with 22 more rows
Created on 2023-03-16 with reprex v2.0.2
There are multiple things going on here, and I'm currently trying to work through a more complete answer, but part of the issue here is that the partitioning column is included in the schema, which works fine for the Parquet dataset but not the CSV dataset.
Thanks, @thisisnic, that's really helpful. If you have further advice on how best to approach this, it would be more than welcome, but I've got something to work with for now.
I think there is also another issue with providing both a schema and a partition variable in CSV datasets; I've opened #34640 to look into this.
Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I originally asked this as a question on Stack Overflow, but the more I thought about it, and given that I got no answers, the more I felt it might be a bug, so I thought I'd report it here too.
I'm including the full context of what I'm trying to do, but ultimately the problem seems to be that `open_dataset()` is not recognising/reading csv files when an explicit schema is provided.
Full context
I'm trying to open a `FileSystemDataset` using `arrow::open_dataset()` from a directory that contains two different file formats (csv & parquet). The single parquet file also has an additional field (`age_group`). The approach needs to be generalisable as the field names as well as file formats might change between projects.
My initial plan for dealing with more than one file format was to create a `FileSystemDataset` for each format and then open a single `UnionDataset` from all `FileSystemDataset`s.
However, this approach errors because one of the fields (`horizon`) is parsed as `int64()` in the csv `FileSystemDataset` and `int32()` in the parquet `FileSystemDataset`, which doesn't allow the schema to be unified.
To get around this in a flexible and general way, I created a unified schema by keeping the schema from the first `FileSystemDataset` (csv) and adding any additional fields from other `FileSystemDataset`s. I then used that to create appropriate schema subsets for each `FileSystemDataset`. I tried to replace each dataset's schema through assignment but that threw the same initial error.
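A rough sketch of the schema-merging step described above, not the original reprex. The two schemas are written out by hand here instead of being read from datasets, and the `value` field is a made-up placeholder; only `horizon` and `age_group` come from the description.

```r
library(arrow)

# Hand-written stand-ins for the schemas of the two datasets (assumed).
csv_schema <- schema(horizon = int64(), value = float64())
parquet_schema <- schema(horizon = int32(), value = float64(), age_group = utf8())

# Helper (hypothetical): turn a Schema into a named list of data types.
field_types <- function(sch) {
  types <- lapply(names(sch), function(nm) sch$GetFieldByName(nm)$type)
  stats::setNames(types, names(sch))
}

# Keep every csv field as-is, then append fields that only exist elsewhere.
types <- field_types(csv_schema)
extra <- setdiff(names(parquet_schema), names(csv_schema))
types[extra] <- field_types(parquet_schema)[extra]
unified <- do.call(schema, types)

# Per-format subsets of the unified schema, matched by field name.
csv_subset <- do.call(schema, types[names(csv_schema)])
parquet_subset <- do.call(schema, types[names(parquet_schema)])
```

In this sketch `horizon` keeps the csv type (`int64()`) in every subset, which is the point of taking the first schema as the reference.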
Finally I tried to reopen the FileSystemDatasets using the appropriate schema for each format but now in the csv FileSystemDataset, 0 csv files are read. I'm really confused as the schema in the original csv FileSystemDataset is exactly the same as the one created from the unified schema as well as the FileSystemDataset opened by explicitly specifying the schema.
Not sure what I'm doing wrong. Very open to more elegant approaches to tackling the overall problem also.
Reprex
Created on 2023-03-16 with reprex v2.0.2
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Ventura 13.2.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Athens
#>  date     2023-03-16
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow         11.0.0.3 2023-03-08 [1] CRAN (R 4.2.0)
#>  askpass       1.1      2019-01-13 [1] CRAN (R 4.2.0)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.2.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.6.0    2023-01-09 [1] CRAN (R 4.2.0)
#>  crayon        1.5.2    2022-09-29 [1] CRAN (R 4.2.0)
#>  credentials   1.3.2    2021-11-29 [1] CRAN (R 4.2.1)
#>  curl          5.0.0    2023-01-12 [1] CRAN (R 4.2.0)
#>  digest        0.6.31   2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.0    2023-01-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4    2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.2.0)
#>  fs            1.6.1    2023-02-06 [1] CRAN (R 4.2.0)
#>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.2.1)
#>  gert          1.9.2    2022-12-05 [1] CRAN (R 4.2.0)
#>  gh            1.3.1    2022-09-08 [1] CRAN (R 4.2.1)
#>  gitcreds      0.1.2    2022-09-08 [1] CRAN (R 4.2.1)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.4    2022-12-07 [1] CRAN (R 4.2.0)
#>  httr          1.4.5    2023-02-24 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.4    2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.2.0)
#>  openssl       2.0.5    2022-12-06 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1    2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         1.0.1    2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0   2022-06-28 [1] CRAN (R 4.2.0)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2    2022-08-17 [3] CRAN (R 4.2.0)
#>  rlang         1.0.6    2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.20     2023-01-19 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3    2022-04-02 [1] CRAN (R 4.2.1)
#>  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2    2021-12-06 [3] CRAN (R 4.2.0)
#>  styler        1.7.0    2022-03-13 [1] CRAN (R 4.2.0)
#>  sys           3.4.1    2022-10-18 [1] CRAN (R 4.2.0)
#>  tibble        3.2.0    2023-03-08 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.2.0)
#>  usethis       2.1.6    2022-05-25 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3    2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs         0.5.2    2023-01-23 [1] CRAN (R 4.2.0)
#>  withr         2.5.0    2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.37     2023-01-31 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.0)
#>
#>  [1] /Users/Anna/Library/R/arm64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/site-library
#>  [3] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```
Component(s)
R