apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Column names that are empty strings #37762

Open csgillespie opened 1 year ago

csgillespie commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

The following fails:

library(arrow)
library(dplyr)
write.csv(sleep, "sleep.csv", row.names = TRUE)
open_dataset("sleep.csv", format = "csv") |>
  mutate(group = group + 1) |>
  collect()
# Error in env_bind0(env, data) : attempt to use zero-length variable name

This is because the first column's name is an empty string: write.csv() with row.names = TRUE writes the row names under an empty header.

open_dataset("sleep.csv", format = "csv") |>
   head() |>
   collect()
# A tibble: 6 × 4
     `` extra group    ID
  <int> <dbl> <int> <int>
1     1   0.7     1     1
2     2  -1.6     1     2
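To see where the empty name comes from, it can help to look at the raw header line that write.csv() produces. The following is a minimal sketch using base R only; the temporary file path and the variable names are illustrative and not part of the original report.

```r
# Write the built-in `sleep` data set with row names, as in the report.
csv_path <- tempfile(fileext = ".csv")  # illustrative path, not from the report
write.csv(sleep, csv_path, row.names = TRUE)

# Inspect the first line of the file: the row-name column gets an empty header.
header <- readLines(csv_path, n = 1)
cat(header, "\n")
# "","extra","group","ID"

# Split the header into fields and strip the quotes to get the column names.
col_names <- gsub('"', "", strsplit(header, ",", fixed = TRUE)[[1]])
print(nchar(col_names))
# [1] 0 5 5 2
```

The zero-length first entry is the name that later ends up as `` in the collected tibble.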

Component(s)

R

thisisnic commented 1 year ago

Thanks for reporting this @csgillespie!

nealrichardson commented 1 year ago

FWIW, if you use read.csv() or readr::read_csv() on that file, both will fill in a non-empty name for the first column ("X" and "...1", respectively). I'm not saying we should copy that, but it is one reason they would not error if you tried the same on a data.frame version of this.

I'm not sure where exactly we should check for this, since an empty column name is technically not invalid in Arrow. Unfortunately, it's also not trivial to fix once you've read the data in: dplyr::rename() doesn't seem to let you rename an empty name, and names<-.Dataset is not implemented, though it could be. You can do names(ds$schema)[1] <- "not_empty", and that does work, though it is clearly suboptimal.
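Until this is handled in arrow itself, one possible way around it for data that fits in memory is to rely on the read.csv() behaviour described above: read the file into a data.frame (which names the first column "X"), then hand it to arrow. This is only a sketch under that assumption. It bypasses open_dataset() entirely, so it is not a fix for larger-than-memory datasets, and the file path and variable names are illustrative.

```r
library(arrow)
library(dplyr)

# Recreate the file from the report in a temporary location (illustrative path).
csv_path <- tempfile(fileext = ".csv")
write.csv(sleep, csv_path, row.names = TRUE)

# read.csv() replaces the empty header with "X", so no column name is empty.
df <- read.csv(csv_path)
print(names(df))
# [1] "X"     "extra" "group" "ID"

# Convert to an Arrow Table and run the query that failed on the Dataset.
result <- arrow_table(df) |>
  mutate(group = group + 1) |>
  collect()

print(head(result, 2))
```

Because every column now has a non-empty name, the mutate() and collect() steps no longer hit the zero-length variable name error.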