apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Support encoding options for CSVs in open_dataset #31415

Open asfimport opened 2 years ago

asfimport commented 2 years ago

The encoding options are passed when a single file is read with read_delim_arrow, but not when opening a folder with open_dataset.

read_delim_arrow creates a reader using CsvTableReader$create (which is what is tested in the package's tests).

open_dataset creates a factory and I'm unable to follow what happens when $Finish() is called.

 

Also, the documentation ("CsvReadOptions" page) lists the "encoding" option under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".

 


library(dplyr)
library(arrow)
# Opens one file just fine:
one_file <- arrow::read_delim_arrow(
  "test/Test1.txt", 
  as_data_frame = FALSE,
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_file)
 
# Opening the folder that contains "Test1.txt" does not work properly: Column2 ends up typed as binary
one_folder <- arrow::open_dataset(
  "test", 
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_folder)
 
# Even when specifying the schema
one_folder_w_schema <- arrow::open_dataset(
  "test",
  schema = Schema$create(Column1 = string(), Column2 = string()),
  format = FileFormat$create(
    "text",
    skip_rows = 1L,
    delimiter = ";",
    column_names = c("Column1", "Column2"),
    read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
  )
)
collect(one_folder_w_schema)

 

Reporter: Gregoire Leleu

Note: This issue was originally created as ARROW-15992. Please see the migration documentation for further details.

asfimport commented 2 years ago

Nicola Crane / @thisisnic: Thanks for reporting this, [~gregleleu]. I don't think this is currently supported. I've opened ticket ARROW-16000 to ask for this to be implemented in C++; once it has been, we should be able to expose this functionality in R.

asfimport commented 2 years ago

Nicola Crane / @thisisnic: I'm leaving this ticket as a bug for now: until C++ has the functionality to allow this, we should give users better error messages than they get at the moment.

jweickm commented 7 months ago

Are there any updates on this? I am trying to read a CSV dataset consisting of multiple files in the ISO-8859-1 encoding, but I keep encountering the error "CSV conversion error to string: invalid UTF8 data", despite setting the encoding with:

arrow::open_delim_dataset(
  files,
  delim = ";",
  convert_options = arrow::csv_convert_options(decimal_point = ","),
  read_options = arrow::csv_read_options(encoding = "ISO-8859-1")
)
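
While the encoding option is not honoured for datasets, one possible workaround is to transcode the files to UTF-8 with base R first and then open the converted folder. This is only a sketch and was not proposed by anyone in this thread. transcode_dir_to_utf8 is a hypothetical helper, not part of arrow; it assumes a flat folder of text files that all share one known source encoding and that fit in memory one file at a time.

```r
# Hypothetical helper (not part of arrow): copy every file in `src_dir`
# into `dest_dir`, converting its contents from `from` to UTF-8.
transcode_dir_to_utf8 <- function(src_dir, dest_dir, from = "ISO-8859-1") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # Non-recursive listings include subdirectories, so drop them explicitly
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]

  for (src in files) {
    # Read the lines as-is; no re-encoding happens at this step
    lines <- readLines(src, warn = FALSE)
    # Convert from the assumed source encoding to UTF-8
    utf8_lines <- iconv(lines, from = from, to = "UTF-8")
    # useBytes = TRUE writes the UTF-8 bytes without translating them again
    writeLines(utf8_lines, file.path(dest_dir, basename(src)), useBytes = TRUE)
  }

  invisible(dest_dir)
}
```

The converted folder could then be opened without any encoding option, for example with arrow::open_delim_dataset(transcode_dir_to_utf8("test", "test_utf8"), delim = ";"), where "test_utf8" is an arbitrary output folder name chosen for illustration. The cost is a second copy of the data on disk.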