apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them? #34965

Open etiennebacher opened 1 year ago

etiennebacher commented 1 year ago

Describe the enhancement requested

Hello, I'm just getting started with arrow so I hope this request makes sense.

I have a list of CSV files that have some duplicated column names. When I use readr::read_csv(), it automatically "repairs" those names to make them unique, but arrow::open_csv_dataset() errors because of those duplicated column names:

packageVersion("arrow")
#> [1] '11.0.0.3'

file_location <- tempfile(fileext = ".csv")

test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location)

readr::read_csv(file_location)
#> New names:
#> • `` -> `...1`
#> • `x` -> `x...2`
#> • `x` -> `x...3`
#> Rows: 1 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (3): ...1, x...2, x...3
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#>    ...1 x...2 x...3
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2

arrow::open_csv_dataset(file_location)
#> Error in `open_dataset()`:
#> ! Invalid: Error creating dataset. Could not read schema from 'C:/Users/etienne/AppData/Local/Temp/RtmpYLHyDe/file3c7c3a407a21.csv': Could not open CSV input source 'C:/Users/etienne/AppData/Local/Temp/RtmpYLHyDe/file3c7c3a407a21.csv': Invalid: CSV file contained multiple columns named x. Is this a 'csv' file?
#> Backtrace:
#>     ▆
#>  1. └─arrow (local) `<fn>`(sources = file_location, delim = ",")
#>  2.   └─arrow::open_dataset(...)
#>  3.     └─base::tryCatch(...)
#>  4.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  5.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  6.           └─value[[3L]](cond)
#>  7.             └─arrow:::augment_io_error_msg(e, call, format = format)
#>  8.               └─rlang::abort(msg, call = call)

I could ignore those names with col_select = -c(<colnames to ignore>). However, those columns are not necessarily present in all CSV files, and when a file doesn't contain them, open_csv_dataset() errors as well.
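For illustration, the call I tried looks roughly like this (dup_col_1 and dup_col_2 are placeholder names, not the real ones):

# errors as soon as one of the files doesn't contain these columns
arrow::open_csv_dataset(file_location, col_select = -c(dup_col_1, dup_col_2))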

Would it be possible to add an argument to repair the column names or to ignore them (i.e. not load them)? Or am I missing something obvious?

Thanks for your work!

Component(s)

R

thisisnic commented 1 year ago

Thanks for reporting this @etiennebacher! I can confirm that this is reproducible on the dev version of Arrow. You're not missing something obvious; I believe Arrow Dataset objects don't allow duplicated column names. That error message isn't the most helpful, so we could probably improve it and/or add code that fixes this.

As a temporary workaround, you could manually supply a schema to the data with the corrected column names. I've added a brief example below; let me know if this works for your specific case. If it's still tricky, there'll be other workarounds we can try.

library(arrow)

file_location <- tempfile(fileext = ".csv")

test <- data.frame(x = 1, x = 2, check.names = FALSE)

write.csv(test, file_location, row.names = FALSE)

# works fine with readr
readr::read_csv(file_location)
#> New names:
#> • `x` -> `x...1`
#> • `x` -> `x...2`
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): x...1, x...2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>   x...1 x...2
#>   <dbl> <dbl>
#> 1     1     2

# read in the file as an Arrow Table
file <- read_csv_arrow(file_location, as_data_frame = FALSE)

# extract the schema from the table
my_schema <- file$schema

# we can see the duplicated names here
my_schema
#> Schema
#> x: int64
#> x: int64

# update the second field in the schema to be called "y" instead
my_schema[[2]] <- field("y", int64())

# open the dataset, specifying the new schema
# we have to include "skip" to skip the first row of the file
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
dplyr::collect(ds)
#> # A tibble: 1 × 2
#>       x     y
#>   <int> <int>
#> 1     1     2
etiennebacher commented 1 year ago

Thank you for your answer @thisisnic. The workaround you provided works in this very simple case because there are only 2 columns, but I have tens or hundreds of them in my scenario. I improved it a bit to detect the duplicated names, repair them by adding a random suffix, and plug them back in:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
packageVersion("arrow")
#> [1] '11.0.0.3'

file_location <- tempfile(fileext = ".csv")

test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location)

file <- read_csv_arrow(file_location, as_data_frame = FALSE)

# extract the schema from the table
my_schema <- file$schema

# find the positions of the duplicated names
dupes <- which(duplicated(names(my_schema)))

for (i in dupes) {

  # get original variable name and add a random suffix (so that the new name
  # is not a duplicate of another one)
  orig <- names(my_schema)[i]
  set.seed(i)
  suffix <- paste(sample(letters, 8), collapse = "")

  new_var <- paste0(orig, "_", suffix)

  # get the variable's type directly from the schema
  orig_type <- my_schema[[i]]$type

  # update the variable
  my_schema[[i]] <- field(new_var, orig_type)

  cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))

}
#> Old variable name: x 
#> New variable name: x_elgdhkvj

# open the dataset, specifying the new schema
# we have to include "skip" to skip the first row of the file
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
dplyr::collect(ds)
#> # A tibble: 1 × 3
#>      ``     x x_elgdhkvj
#>   <int> <int>      <int>
#> 1     1     1          2

(Note that I didn't check that this worked with more than 2 duplicated names.)
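A deterministic alternative I haven't tested would be base R's make.unique(), which should handle any number of duplicates without needing random suffixes:

# untested sketch: make.unique() appends _1, _2, ... to repeated names
new_names <- make.unique(names(my_schema), sep = "_")
for (i in which(new_names != names(my_schema))) {
  my_schema[[i]] <- field(new_names[i], my_schema[[i]]$type)
}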

Also, while this workaround is fast for small files, the original read_csv_arrow() takes some time with bigger ones. Nothing crazy, but extended to dozens of files this can pile up and lead to a significant delay. Maybe there's a faster way to do this?

thisisnic commented 1 year ago

In terms of quicker workarounds, it's hard to say without more context. If you're expecting random column names to be duplicated, are any of the column names meaningful at all? If not, you could just autogenerate them all, e.g.:

arrow::open_csv_dataset(
  file_location,
  read_options = list(autogenerate_column_names = TRUE)
)

etiennebacher commented 1 year ago

Yes, the column names are useful. To be a bit more specific, this is some census data that appears to have one or two duplicated columns. The problem is that these duplicated columns are not always in the same position, so any hardcoded index won't help here. Using autogenerate_column_names = TRUE is not an option either, since I would lose all information about column names.

I tried to make a small example with 200 variables and 1,000,000 rows but can't reproduce the time that read_csv_arrow() took in my "real" example.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(tictoc)
packageVersion("arrow")
#> [1] '11.0.0.3'

file_location <- tempfile(fileext = ".csv")

# make a fake "big" dataset
tmp <- list()
for (i in 1:200) {
  set.seed(i)
  tmp[[paste0("var_", i)]] <- sample(1:100, 1e6, TRUE)
}
test <- list2DF(tmp)

# make a duplicated column name
names(test)[62] <- "var_1"

readr::write_csv(test, file_location)

tictoc::tic()
file <- read_csv_arrow(file_location, as_data_frame = FALSE)
tictoc::toc()
#> 11.82 sec elapsed

# extract the schema from the table
my_schema <- file$schema

# find the positions of the duplicated names
dupes <- which(duplicated(names(my_schema)))

for (i in dupes) {

  # get original variable name and add a random suffix (so that the new name
  # is not a duplicate of another one)
  orig <- names(my_schema)[i]
  set.seed(i)
  suffix <- paste(sample(letters, 8), collapse = "")

  new_var <- paste0(orig, "_", suffix)

  # get the variable's type directly from the schema
  orig_type <- my_schema[[i]]$type

  # update the variable
  my_schema[[i]] <- field(new_var, orig_type)

  cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))

}
#> Old variable name: var_1 
#> New variable name: var_1_vkeoxuwd

# open the dataset, specifying the new schema
# we have to include "skip" to skip the first row of the file
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
out <- dplyr::collect(ds)

This is not an urgent issue for me, but I think having a way to automatically repair duplicated column names would be useful. Is it feasible to implement? (You proposed a better error message and a workaround, so I'd just like to clarify whether this could become a feature later.)

thisisnic commented 1 year ago

Is it feasible to implement?

We can absolutely open up a C++ ticket, and see if anyone has time to work on it.

Another thing which is tangentially related but may help is #35035, which would make it easier to rename all fields in a schema in one go. It doesn't solve the problem, but it could lead to a faster workaround once you have the existing schema.

A faster way of extracting the existing schema would be to use readr::read_csv(), which can limit the number of lines to read and works very quickly, e.g.

one_row <- readr::read_csv(file_location, n_max = 1) %>%
  arrow_table()

one_row$schema

The file is being read in via readr and then converted to Arrow, rather than read in via Arrow, so theoretically you might end up with a different schema, but in most cases I don't think you would.
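Putting the pieces together, a rough (untested) sketch of the whole workaround could look like this:

library(arrow)

# readr repairs the duplicated names itself while reading only the header row
one_row <- readr::read_csv(file_location, n_max = 1, show_col_types = FALSE)
my_schema <- arrow_table(one_row)$schema

# open the full dataset with the repaired schema, skipping the header row
ds <- open_csv_dataset(file_location, schema = my_schema, skip = 1)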

thisisnic commented 1 year ago

Found an existing open ticket relating to this: [C++] Decide on duplicate column handling in scanner, add more tests