get_file directly into environment with user-specified file format

kuriwaki commented 4 years ago

What the issue is about:

[x] a suggested code or documentation change, improvement to the code, or feature request

Issue: I think most users who want to get data from the R dataverse package want to start working with the data in their R environment right away. However, get_file returns only a raw (binary) vector, which is not usable on its own.

Proposal: The help page shows how to write the object of class raw to a temp file and read it back in. The proposed feature is to add an optional argument to get_file, or a new function, that does this write-out / read-back-in process automatically. Users supply the function that is used to read the tempfile. An example function that does this is below.

How does this sound?

# hide my key

library(dataverse)

# function ----

# @param file to be passed on to get_file
# @param dataset to be passed on to get_file
# @param read_function If supplied a function object, this will write the 
#   raw file to a tempfile and read it back in with the supplied function. This
#   is useful when you want to start working with the data right away in the R
#   environment
get_file_addon <- function(file,
                            dataset = NULL,
                            read_function = NULL,
                            ...) {

  # extra arguments in ... are passed on to get_file
  raw_file <- get_file(file, dataset, ...)

  # default of get_file: return the raw vector unchanged
  if (is.null(read_function))
    return(raw_file)

  # save to a tempfile that keeps the original extension (if any), then read
  # it back in with the supplied function
  ext <- tools::file_ext(file)
  tmp <- tempfile(fileext = if (nzchar(ext)) paste0(".", ext) else "")
  writeBin(raw_file, tmp)
  read_function(tmp)
}

# read in two non-tab ingested files ----
cces_dta <- get_file_addon(file = "cumulative_2006_2018.dta", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = haven::read_dta)
cces_rds <- get_file_addon(file = "cumulative_2006_2018.Rds", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = readr::read_rds)
class(cces_dta)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(cces_rds)
#> [1] "tbl_df"     "tbl"        "data.frame"
dim(cces_dta)
#> [1] 452755     73
dim(cces_rds)
#> [1] 452755     73

Created on 2019-12-16 by the reprex package (v0.3.0)

wibeasley commented 4 years ago

@kuriwaki,

I like this idea. I agree that it's a step that can reasonably be automated, and doing so will remove a (small) barrier encountered in almost all use cases.
I'm wondering if it's best to offer the data.frame conversion only for ingested datasets? I'm guessing that's the majority of what most users would consider converting to a data.frame. It also relieves us of the responsibility of guessing correctly for ambiguous files (like a csv with a 'txt' extension, or a csv file that's actually separated with semicolons). I'd rather rely on Dataverse's own ingesting logic. They'll do a better job initially, and they're more likely to be better about maintaining that logic over time.
But I'm happy to be convinced otherwise. If the package does assume this responsibility, maybe the mime can help with that decision logic.
If only ingested datasets are returned as data.frames, I'm guessing it makes sense to use only the available rds, and not to convert the tab to rds, for three reasons:
1. it's less for us to develop & maintain
2. the rds is smaller, and therefore should travel the internet faster than the plain-text tab file.
3. our csv-to-rds process may repair column names differently than the Dataverse ingestion process. For example, the tab file has a subject id variable. Some parsing procedures repair that name automatically (e.g., subject_id, subject id) and some don't. Therefore the user code and documentation might not use the same variable name, depending on how the csv was converted to an rds.
@pdurbin, if we go this route, I might need help identifying the ingestion code that creates the rds. The R package should probably mimic the ingestion process as closely as possible. My search isn't turning up anything I recognize as this part.

What are your thoughts? Would this restriction (i.e., only ingested datasets are returned as data.frames) be too limiting?
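
To make reason 3 concrete, here is a minimal base-R sketch. The tab-delimited text is made up for illustration, and this is not Dataverse's ingestion code; it only shows that two common import settings give different names for the same header.

```r
# Toy tab-delimited text with a space in one header (made-up data).
tab_text <- "subject id\tscore\n1\t10\n2\t20\n"

# check.names = TRUE (the read.delim default) repairs names via make.names().
repaired <- read.delim(text = tab_text, check.names = TRUE)

# check.names = FALSE keeps the header exactly as written.
verbatim <- read.delim(text = tab_text, check.names = FALSE)

names(repaired)
#> [1] "subject.id" "score"
names(verbatim)
#> [1] "subject id" "score"
```

Code written against one of these data frames would fail on the other, which is the mismatch between user code and documentation described above.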

pdurbin commented 4 years ago

4. the ingestion code that creates the rds

Well, here's a lead: "Confirming what Phil said - if the original ingested file was Stata (.dta) or SPSS (.sav or *.por), we use R package "foreign" to directly convert that saved original file to an .RData dataframe. For all the other supported formats, the dataframe is generated by R from the tab-delimited file and the variable metadata in the database." -- https://groups.google.com/d/msg/dataverse-community/QDRnM6ztbt8/AYynuwocBAAJ

Let me dig a bit.

Update. I'm pretty sure this R code is called: https://github.com/IQSS/dataverse/blob/v4.18.1/src/main/java/edu/harvard/iq/dataverse/rserve/scripts/dataverse_r_functions.R

From this Java code: https://github.com/IQSS/dataverse/blob/v4.18.1/src/main/java/edu/harvard/iq/dataverse/rserve/RemoteDataFrameService.java#L125

wibeasley commented 4 years ago

@pdurbin, that helped a lot

@kuriwaki, this shows how inexperienced I still am with Dataverse. I didn't realize they really meant "RData", instead of "Rds".

So unless Dataverse also offers Rds files soon, I totally support your proposal.

In addition, what do you think about a function that always returns a data.frame for an ingested tab file? In that case, it never passes through the rds stage. Something like readr::read_delim() converts the plain text to a tibble and returns the tibble to the caller. Isn't this the most frequent use case? I really don't know: do/would many people use an R package to download a Stata/Spss/Whatever file?

For those who don't know, RData saves the equivalent of an environment/workspace, not necessarily a single rectangular dataset. When it's restored, all the variables saved by the developer populate the client's environment. The user is forced to (at least initially) use the old names. Besides the naming complication, multiple variables can be contained, which can lead to more confusion.

Excerpt from Efficient R programming

(RData) is the most widely used. It uses the save function which takes any number of R objects and writes them to a file, which must be specified by the file = argument. save is like save.image, which saves all the objects currently loaded in R.

The second method is slightly less used but we recommend it. Apart from being slightly more concise for saving single R objects, the readRDS function is more flexible: as shown in the subsequent line, the resulting object can be assigned to any name. In this case we called it df_co2_rds (which we show to be identical to df_co2, loaded with the load command) but we could have called it anything or simply printed it to the console.

Using saveRDS is good practice because it forces you to specify object names. If you use save without care, you could forget the names of the objects you saved and accidentally overwrite objects that already existed.
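
The difference the excerpt describes can be reproduced with a small base-R sketch. The data frame and its name `toy` are made up for illustration; the only functions used are `save`, `load`, `saveRDS`, and `readRDS`.

```r
# Toy object (made-up data) saved both ways.
toy <- data.frame(id = 1:3, score = c(10, 20, 30))
rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")
save(toy, file = rdata_path)
saveRDS(toy, file = rds_path)

# load() restores objects under the names they were saved with. Loading into
# a fresh environment shows which names arrive without touching the workspace.
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)
loaded_names
#> [1] "toy"

# readRDS() returns the object, so the caller picks the name.
my_data <- readRDS(rds_path)
identical(my_data, restored$toy)
#> [1] TRUE
```

Loading into a separate environment, as done here, is one way to avoid the accidental overwriting that the excerpt warns about.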

kuriwaki commented 4 years ago

Thank you.

My intention with the read_function argument (with no default provided) is to leave it up to the user to discern what function could be used with the data. Sometimes, several commands should work fine (e.g. foreign::read.dta vs. haven::read_dta, or readr::read_delim vs. read.delim); more often, only certain functions will work.

As for ingested datasets, my sense is that get_file will always return the original, not the ingested format. For example, constructionData.tab, used as an example in the get_file help page, is a Stata dta ingested into a tab, but get_file returns a raw file that can only be reasonably read with read.dta/read_dta.

Re:

would many people use an R package to download a Stata/Spss/Whatever file?

If the replication file in a dataset comes from Stata/SPSS and the person analyzing the replication is an R user, then the R user will have no choice but to read the Stata/SPSS file into R. Even if the ingestion works, sometimes important metadata (like variable and value labels) are stripped off in the ingest.
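
As a rough analogy for the metadata point (this is not Dataverse's ingest, just a plain-text round trip in base R with made-up data), a variable label stored as an attribute does not survive a tab-delimited file, while a format that stores R attributes keeps it:

```r
# Toy data with a variable label stored in the "label" attribute, which is
# where haven keeps Stata/SPSS variable labels (made-up data).
survey <- data.frame(educ = c(1L, 2L, 2L))
attr(survey$educ, "label") <- "Highest education level"

# Round trip through a tab-delimited text file.
tab_path <- tempfile(fileext = ".tab")
write.table(survey, tab_path, sep = "\t", row.names = FALSE, quote = FALSE)
from_tab <- read.delim(tab_path)

# Round trip through a format that stores R attributes.
rds_path <- tempfile(fileext = ".rds")
saveRDS(survey, rds_path)
from_rds <- readRDS(rds_path)

attr(from_tab$educ, "label")
#> NULL
attr(from_rds$educ, "label")
#> [1] "Highest education level"
```

The values are the same in both cases; only the label is lost, which is why reading the original file can matter even when the ingested version loads cleanly.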

pdurbin commented 4 years ago

So unless Dataverse also offers Rds files soon

This just in. A request for RDS support in Dataverse from @reikoch at https://github.com/IQSS/dataverse/issues/6678

@wibeasley @kuriwaki please feel free to comment on that issue! You both know way more about R than I do! 😄

kuriwaki commented 4 years ago

Reviewing this thread, I think it's worth clarifying whether we want this functionality to extend to options where !format %in% c("original"), i.e. "RData", "prep", and "bundle", which are the current options (and maybe "Rds" in the future).

My original thought was it was ok to limit functionality so that it can only read files in their original format (i.e. get_file(file, format = "original")). R's package ecosystem is pretty good at reading in files of different file formats. It can certainly read all the file types that dataverse will ingest (Stata, SPSS, Excel, tsv, csv).

kuriwaki commented 3 years ago

This functionality is now called get_dataframe_* in #66.

I reread this conversation after implementing that PR. Re the above comment (https://github.com/IQSS/dataverse-client-r/issues/35#issuecomment-567594825) by @wibeasley:

Re your bullet point 2, I think no, we want get_dataframe_* to be able to read in data files that are not ingested (e.g. nlsw88_rds-export.rds). Then it'll just be up to the user to specify the correct function.
I don't quite understand "not to convert the tab to rds" in point 4. Either we use read_tsv to read the ingested version of the ingested file (original = FALSE), OR we ask the user to find the appropriate function to read the original version of the ingested or non-ingested file (original = TRUE). We never need to use Download Options > RData Format for any of this.
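
The rule in the last point can be sketched offline in base R. `read_downloaded` is a hypothetical helper written for this illustration, not the package's get_dataframe_* implementation, and `read.delim` stands in for read_tsv; the raw vectors are made-up stand-ins for downloaded files.

```r
# Hypothetical helper, not part of the dataverse package: turns downloaded
# raw bytes into a data frame following the rule described above.
read_downloaded <- function(raw_bytes, original = FALSE, .f = NULL) {
  if (is.null(.f)) {
    if (original) {
      stop("original = TRUE needs a reader function in `.f`")
    }
    # Ingested files are tab-delimited, so a TSV reader is the default.
    .f <- function(path) utils::read.delim(path, check.names = FALSE)
  }
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_bytes, tmp)
  .f(tmp)
}

# Stand-ins for downloads (made-up data): an ingested .tab and an original .rds.
ingested_raw <- charToRaw("id\tscore\n1\t10\n2\t20\n")
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:2, score = c(10, 20)), rds_path)
original_raw <- readBin(rds_path, what = "raw", n = file.size(rds_path))

from_ingested <- read_downloaded(ingested_raw)
from_original <- read_downloaded(original_raw, original = TRUE, .f = readRDS)
dim(from_ingested)
#> [1] 2 2
dim(from_original)
#> [1] 2 2
```

Failing early when original = TRUE comes without a reader is one possible design choice; it keeps the helper from guessing a format, which is the responsibility the thread agreed to leave with the user.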
