gesistsa / rio

🐟 A Swiss-Army Knife for Data I/O
http://gesistsa.github.io/rio/
599 stars 76 forks source link

`get_ext` is mispecified #350

Closed chainsawriot closed 12 months ago

chainsawriot commented 12 months ago

https://github.com/gesistsa/rio/blob/f0198a386db58335adee1a435cc39c5eba6d9103/R/utils.R#L11-L30

The function actually is not actually doing what #169 and the documentation ("A characters string containing a file type recognized by rio.") said and what @billdenney requested. Actually, "file type" is confusing. Let's call this format (as in the argument of import). The main issue is that get_type() is just running inside get_ext() without assigning it to anything.

To give an example:

rio::get_ext("hello.arff")
#> [1] "arff"
rio::get_ext("hello.matlab")
#> [1] "matlab"

rio::get_ext("hello.yaml")
#> [1] "yaml"
rio::get_ext("hello.yml")
#> [1] "yml"

Created on 2023-09-07 with reprex v2.0.2

It is almost the same as tools::file_ext, just with an additional care of URL.

Also a consequence to this, get_ext("twotables.htm") actually return "htm". And therefore, this bunch of check for html wont work

https://github.com/gesistsa/rio/blob/f0198a386db58335adee1a435cc39c5eba6d9103/R/import_list.R#L115-L130

temp_htm <- tempfile(fileext = ".htm")
download.file("https://github.com/gesistsa/rio/blob/main/tests/testdata/twotables.html", temp_htm, quiet = TRUE)

dat <- rio::import_list(temp_htm) ## only one table is imported

## failed
testthat::expect_true(identical(dim(dat[[2]]), dim(iris)))
#> Error in dat[[2]]: subscript out of bounds

Created on 2023-09-08 with reprex v2.0.2

May I ask you @billdenney : What did you really want in the first place? You want it to be like tools::file_ext() (I get confused by the function name get_ext()) or to return format.

billdenney commented 12 months ago

@chainsawriot, That's a good question, and I understand the complexity you're asking about.

My main goal was the slight extension of what is done in tools::file_ext(). But, if we had something that could detect the real file format, from your example, returning html for an htm file extension, I see a benefit of that, too. I think that your idea to normalize file extension with expected file format ("htm" becomes "html", "yml" becomes "yaml") would be more beneficial.

chainsawriot commented 12 months ago

TODO

Need to coordinate with #313 #351