gesistsa / rio

🐟 A Swiss-Army Knife for Data I/O
http://gesistsa.github.io/rio/

Conflicting `which` for compressed formats with multiple sheets #412

Closed chainsawriot closed 2 months ago

chainsawriot commented 2 months ago

Of course, one can ask why anyone would use compressed formats with multiple sheets in the first place, e.g. xlsx.zip. But a bug is a bug.

The issue is that the `which` parameter of `import()` is used twice: first to select a file in the archive, and then again to select a sheet.

https://github.com/gesistsa/rio/blob/c86db70174bb9da81b7c4b6ee3f22dd9cbdb1c1e/R/import.R#L131

https://github.com/gesistsa/rio/blob/c86db70174bb9da81b7c4b6ee3f22dd9cbdb1c1e/R/import.R#L156

To avoid making things more complicated (such as introducing new parameters for such an edge case), my suggestion is simply to define some precedence rules.
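To make the idea concrete, here is a minimal sketch of one possible set of precedence rules. It is only an illustration: `resolve_which()` and the `inner_which` field are hypothetical names, not part of rio, and the rules shown are an assumption rather than whatever ends up being implemented. It uses exact name matching instead of the `grep()` lookup in the current code.

```r
# Hypothetical helper, NOT part of rio: decide what `which` refers to
# when importing from an archive whose members are listed in `file_list`.
resolve_which <- function(file_list, which = NULL) {
  stopifnot(is.character(file_list), length(file_list) >= 1L)

  # No `which`: first archive member, nothing forwarded to the importer.
  if (is.null(which)) {
    return(list(file = file_list[1], inner_which = NULL))
  }
  stopifnot(length(which) == 1L)

  # Rule 1: a character `which` that exactly names an archive member
  # always selects that member.
  if (is.character(which) && which %in% file_list) {
    return(list(file = which, inner_which = NULL))
  }

  # Rule 2: with a single member there is nothing to choose between,
  # so `which` is forwarded to the format-specific importer (e.g. a sheet).
  if (length(file_list) == 1L) {
    return(list(file = file_list, inner_which = which))
  }

  # Rule 3: with several members, a numeric `which` is a member index.
  if (is.numeric(which) && which >= 1 && which <= length(file_list)) {
    return(list(file = file_list[which], inner_which = NULL))
  }

  stop("requested file not found in the archive: ", which)
}

# Single-member archive: 2 is forwarded as the sheet selector.
resolve_which("data.xlsx", which = 2)
# Member name given explicitly: selects the file, default sheet.
resolve_which("data.xlsx", which = "data.xlsx")
# Multi-member archive: 2 selects the second member.
resolve_which(c("a.csv", "b.csv"), which = 2)
```

Under these assumed rules, both failing calls in the reprex below would resolve unambiguously: `which = 2` on a single-file archive would mean the second sheet, and `which = raw_file` would mean the archived file itself.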

zip_file <- tempfile(fileext = ".xlsx.zip")

rio::export(head(iris), zip_file)

raw_file <- utils::unzip(zip_file, list = TRUE)$Name[1]

rio::import(zip_file)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

## this is fine-ish, I guess?
rio::import(zip_file, which = "aaaa.xlsx")
#> Warning in extract_func(file, files = file_list[grep(which2, file_list)[1]], :
#> requested file not found in the zip file
#> Error: `path` does not exist: '/tmp/RtmpH9K6ta/file831fb50f53589/aaaa.xlsx'

rio::import(zip_file, which = raw_file)
#> Error: Sheet 'file831fb5a3e85e.xlsx' not found

## a more illustrative example

zip_file2 <- tempfile(fileext = ".xlsx.zip")

rio::export(list(first_sheet = head(iris), second_sheet = tail(iris)), zip_file2)

xlsx_file <- tempfile(fileext = ".xlsx")

rio::export(list(first_sheet = head(iris), second_sheet = tail(iris)), xlsx_file)

rio::import(zip_file2, which = 1)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
rio::import(zip_file2, which = 2)
#> Warning in extract_func(file, files = file_list[which], exdir = d): requested
#> file not found in the zip file
#> Error: 'file' has no extension

rio::import(xlsx_file, which = 1)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
rio::import(xlsx_file, which = 2)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1          6.7         3.3          5.7         2.5 virginica
#> 2          6.7         3.0          5.2         2.3 virginica
#> 3          6.3         2.5          5.0         1.9 virginica
#> 4          6.5         3.0          5.2         2.0 virginica
#> 5          6.2         3.4          5.4         2.3 virginica
#> 6          5.9         3.0          5.1         1.8 virginica

Created on 2024-05-14 with reprex v2.1.0

chainsawriot commented 2 months ago

ref #400