gesistsa / rio

🐟 A Swiss-Army Knife for Data I/O
http://gesistsa.github.io/rio/
598 stars 76 forks source link

`...` and remapping behaviors for v1.0.0 #326

Closed chainsawriot closed 1 year ago

chainsawriot commented 1 year ago

This is a meta discussion about the ... behavior for import() and export(). It is related to the following issues

I closed them all so that we can focus the discussion here.

The ... and the undocumented remapping behaviors have been a major source of confusion in the past; and I think we need to rectify this in the upcoming v1.0.0, the major release where the API should be considered "stable" and probably the only chance in the near future to introduce breaking changes.

As I argue in #280 , in my opinion passing unused arguments to the underlying reading/writing function should be an error, rather than a warning. But I really want to know how you all (users and developers) think about this.

Also, we also have two choices regarding the undocumented remapping behaviors (mostly for xlsx, ods, and csv): (1) document them all in v1.0.0; or (2) abandon them all and let ... be ... that gets passed to the underlying function.

My initial think was to abandon them all to reduce the complexity of rio. But I don't have a grip on how many of you wrote code in the past like this:

tempods <- tempfile(fileext = ".ods")
export(mtcars, tempods)
import(tempods, header = TRUE) ## expecting header will get passed as `col_names`

And certainly, I don't want to break your existing code. If you do have code that was written like that, please let me know. It would be taken into account when I think about the question of these undocumented remapping behaviors.

chainsawriot commented 1 year ago

Additional information to this #248 and also this question on Stackoverflow. Sometimes ... must be evaluated in the correct call stack aka context.

This is one more reason rio should not mess with ....

chainsawriot commented 1 year ago

Another option, which is less breaking, is just to remap the following three

No less, no more.

It also means header should be an official argument.

All of these three remapping must be documented. Also, we need to make sure all underlying functions with these three parameters must get remapped also. And both remapped and non-remapped executions should work, i.e.

import("whatever.ods", which = "Sheet1") # work
import("whatever.ods", sheet = "Sheet1") # work

And then all ... is just ....

chainsawriot commented 1 year ago

Even without a concensus on this, the remapping should be documented. Maybe as a vignette.

schochastics commented 1 year ago

My opinion on this should probably not matter too much but I really like the remapping approach for a fixed set of important parameters and let ... be .... I always feel like ... should be something for people who "know what they are doing" and not control potential standard behavior.

jsonbecker commented 1 year ago

Although breaking changes are acceptable for a 1.0, I don’t think we should remove the remapping and cause those breaking changes to code. However, we should just pass without remapping. IMO, it is within scope of rio to smooth out common differences between various ways to read data. That’s kind of the point. So standardizing some arguments such as header, skip, file makes sense to me for serving the core purpose— I don’t actually have to know the underlying functions used to read or write this data very well, because I have one consistent interface to it. is the escape valve so that there’s little or no reason to stop using rio if you start there.

chainsawriot commented 1 year ago

@schochastics @jsonbecker Thank you very much for the comments. I compiled the following table. There are good news (other than I did not get insane after compiling this). If we are going to target only file, which and header (let's call this "the base convention"), we covered almost all of the them except the read.DIF odd case and fst (the parameter columns functions differently).

format function file which header
csv, dat, txt, csv, csvy, psv, clipboard data.table::fread file which header
fwf readr::read_fwf file - 1
fwf utils::read.table file - header
r dget file - -
dump source file 2 -
rds readRDS file 3 -
rdata, rda load file 2 -
feather feather::read_feather path 3 -
fst fst::read.fst path 3 4
matlab rmatio::read.mat filename 3 -
dta (Stata) haven::read_dta file - -
dta (Stata) foreign::read.dta file - -
dbf (xbase) foreign::read.dbf file - -
dif utils::read.DIF file - 5
sav, zsav (SPSS) haven::read_sav file - -
sav foreign::read.spss file - -
spss haven::read_por file - -
sas7bdat haven::read_sas file - -
xpt haven::read_xpt file - -
mtp foreign::read.mtp file - -
syd foreign::read.systat file - -
json jsonlite::fromJSON txt - -
rec (Epiinfo) foreign::read.epiinfo file - -
arff foreign::read.arff file - -
xls readxl::read_xls path sheet col_names
xlsx readxl::read_xlsx path sheet col_names
xlsx openxlsx::read.xlsx xlsxFile sheet colNames
fortran (see read.fwf) utils::read.fortran file - -
ods readODS::read_ods path sheet col_names
xml xml2::read_xml x 3 -
html xml2::read_xml x 2 -
yaml yaml::read_yaml file - -
eviews hexView::readEViews file - -
pzfx pzfx::read_pzfx path table -
parquet arrow::read_parquet file - -
  1. col_names has different meanings
  2. We overloaded which to select object from the multi-item object, e.g. envionment, tables in an html file
  3. Has which in the method, but doesn't get passed. The underlying function does not have this concept
  4. columns
  5. The underlying function has this concept, but not available in the method
chainsawriot commented 1 year ago

Actually, we only need to deploy arg_reconcile or a simplified version of it in the following cases: xlsx, xls, ods, and pzfx.

schochastics commented 1 year ago

I wanted to go through arg_reconcile for #334 but I will leave it for fixing it with this issue

chainsawriot commented 1 year ago

So many issues related to ..., arg_reconcile and friends.

Short-term solution

tempdtazip <- tempfile(fileext = ".dta.zip")
rio::export(mtcars, tempdtazip, format = "dta")
rio::import(tempdtazip)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

rio::import_list(tempdtazip) ## the warning message, hello arg_concile
#> Warning in arg_reconcile(haven::read_dta, file = file, ..., .docall = TRUE, : The following arguments were ignored for haven::read_dta:
#> which
#> $file1fe5255d64f8fb.dta
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Created on 2023-09-11 with reprex v2.0.2