IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

How to download an RData file? #127

Closed paulgronke closed 2 months ago

paulgronke commented 11 months ago


In the download vignette, there is a section titled "Retrieving Custom Data Formats (RDS, Stata, SPSS)" that works as described.

But there is no description of how to download a file in RData format. Is this possible?

The code below successfully downloads the .sav file, but I cannot figure out how to load an RData file directly, which would avoid the extra step of using haven.

## load package
library("dataverse")

## code goes here
#  This code works to obtain a SAV file
#
SPAE22 <- get_dataframe_by_name(
  filename = "MITU0042_OUTPUT_0120.tab",
  dataset = "10.7910/DVN/SPU2XP",
  server = "dataverse.harvard.edu",
  original = TRUE,
  .f = haven::read_sav
)

# This code does not work to obtain an RData file

SPAE22 <- get_dataframe_by_name(
  filename = "MITU0042_OUTPUT_0120.tab",
  dataset = "10.7910/DVN/SPU2XP",
  server = "dataverse.harvard.edu",
  original = TRUE,
    .f = function(x) load(x)
)

## session info for your system
sessionInfo()

pdurbin commented 11 months ago

I'm not sure if the R library supports it but it should work on the backend just fine. Here's an example:

wget --content-disposition 'https://dataverse.unc.edu/api/access/datafile/7527436?format=RData'

Please see https://guides.dataverse.org/en/5.13/api/dataaccess.html#basic-file-access

Danny-dK commented 10 months ago

On the site, when I try to download the 7103004 file in RData format manually in a web browser, I receive a 404 error (screenshot).

I receive the same 404 with the httr equivalent of the curl command:

require(httr)

params = list(
  `format` = "RData"
)

res <- httr::GET(url = "https://dataverse.harvard.edu/api/access/datafile/7103004", query = params)

whereas Philip's example works fine. Could this be an issue with this specific publication? (The other formats download fine.)

If you don't care about variable labels, this will do:

SPAE22 <- get_dataframe_by_name(
  filename = "MITU0042_OUTPUT_0120.tab",
  dataset = "10.7910/DVN/SPU2XP",
  server = "dataverse.harvard.edu")

pdurbin commented 10 months ago

@Danny-dK huh, you're right, when I do either of these...

wget --content-disposition 'https://dataverse.harvard.edu/api/access/datafile/7103004?format=RData'

curl 'https://dataverse.harvard.edu/api/access/datafile/7103004?format=RData'

... I get 404 and {"status":"ERROR","code":404,"message":"datafile access error: requested optional service (image scaling, format conversion, etc.) could not be performed on this datafile."}

It's strange because when I go to https://dataverse.harvard.edu/file.xhtml?fileId=7103004 it offers RData as a download format:

[Screenshot, 2023-09-06: download options on the file page]

Perhaps there's a problem with the file? @Danny-dK please feel free to email support@dataverse.harvard.edu if you'd like someone at Harvard Dataverse to investigate.

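One more thing I should mention is that even offering RData as a file format is somewhat controversial these days. Some people think it's obsolete:

IQSS/dataverse#6678 https://github.com/IQSS/dataverse/issues/6678
IQSS/dataverse#7249 https://github.com/IQSS/dataverse/issues/7249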
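The wget and curl requests in this comment follow one URL pattern, so the check can be sketched in base R. This is only an illustration: `access_url()` and `fetch_or_na()` are hypothetical helper names, not part of the dataverse package, and the download step is defined but not called, so nothing here touches the network.

```r
# Build a Data Access API URL of the form used in the comments above.
# (access_url is a hypothetical helper, not a dataverse package function.)
access_url <- function(server, file_id, format = NULL) {
  url <- sprintf("https://%s/api/access/datafile/%d", server, as.integer(file_id))
  if (!is.null(format)) url <- paste0(url, "?format=", format)
  url
}

# Try the download; return the local path on success, NA on any failure
# (for example the 404 reported for file 7103004).
fetch_or_na <- function(url, dest = tempfile(fileext = ".RData")) {
  ok <- tryCatch(
    {
      utils::download.file(url, dest, mode = "wb", quiet = TRUE)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
  if (ok) dest else NA_character_
}

rdata_url <- access_url("dataverse.harvard.edu", 7103004, format = "RData")
# fetch_or_na(rdata_url)   # uncomment to try it against the live server
```

Returning `NA` instead of stopping makes it easy to fall back to another format, such as the original .sav file, when the RData conversion is unavailable.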

paulgronke commented 10 months ago

I asked this very question on a Slack workspace and got the answer. Downloading an RData file isn’t as simple as SPSS and Stata, but is feasible. It’s not documented in the package documentation, but is documented here in the GitHub development space.

Here is the snippet from the documentation. Note that you will need the numerical dataverse entry number for the file. For my own part, I simply went back to using the SPSS version since it was read into R just fine.

3. RData files are read in by base::load() but cannot be assigned to an object name. The following shows two possible ways to read in such files.

First, without relying on get_dataframe_*, write as a binary file:

as_binary <- get_file_by_doi(
  filedoi = "doi:10.70122/FK2/PPIAXE/5VPXKE",
  server = "demo.dataverse.org")

temp <- tempdir()
writeBin(as_binary, path(temp, "county.RData"))
load(path(temp, "county.RData"))

If you are certain each RData contains only one object, one could define a custom function used in https://stackoverflow.com/a/34926943

load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}

https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/X2FC5V

as_rda <- get_dataframe_by_id(
  file = 1939003,
  server = "demo.dataverse.org",
  .f = load_object,
  original = TRUE)

https://iqss.github.io/dataverse-client-r/reference/get_dataframe.html#examples
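The point that load() "cannot be assigned to an object name" can be checked locally without any Dataverse call. The sketch below uses a made-up data frame and a temporary file (both assumptions for illustration) to show what load() actually returns and how a load_object()-style wrapper gets around it.

```r
# Made-up data saved to a temporary .RData file (no network needed).
demo_df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
rdata_path <- tempfile(fileext = ".RData")
save(demo_df, file = rdata_path)

# load() returns the *names* of the restored objects, not the objects,
# so assigning its result does not give you the data frame.
scratch <- new.env()
returned <- load(rdata_path, envir = scratch)

# Wrapper in the spirit of the documentation snippet: load into a private
# environment and return the first (ideally only) object found there.
load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}
restored <- load_object(rdata_path)
```

Here `returned` holds the character string naming the object, while `restored` is the data frame itself, which is why a wrapper of this kind suits the `.f` argument.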




Danny-dK commented 10 months ago

Don't think this solves it though. That help documentation is somewhat incomplete or incorrect.

The 5VPXKE file is not found by the function (I receive "file information not found on Dataverse API"), but I can find the name and type of the file in version 2 of https://demo.dataverse.org/file.xhtml?fileId=1939003&version=3.0. That file is nlsw88_rda-export.rda; it is an rda file and thus already an R-native file. The original question is why the tab file cannot be downloaded in RData format (more on that below). The help file also specifies

writeBin(as_binary, path(temp, "county.RData"))
load(path(temp, "county.RData"))

but there is no path() function (I assume this should be file.path()), and the rda file in question is not named "county" (at least I'm not seeing that). Since this is already an R-formatted file, the following simply works, with no need for any of the writeBin steps (X2FC5V is the same file, but in version 3 of that demo publication):

get_dataframe_by_doi(
  filedoi = "10.70122/FK2/PPIAXE/X2FC5V",
  server = "demo.dataverse.org",
  original = TRUE,
  .f = function(x) load(x, envir = .GlobalEnv))
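For cases where the raw-bytes route is still wanted, the corrected writeBin step can be sketched locally. The raw vector here is read from a temporary file as a stand-in (an assumption) for what get_file_by_doi() would return, the data are made up, and the objects are loaded into a dedicated environment rather than the global one.

```r
# Stand-in for a downloaded raw vector: save made-up data locally and read
# the bytes back. In real use, as_binary would come from get_file_by_doi().
county <- data.frame(fips = c("41051", "41067"), pop = c(800000, 600000))
src <- tempfile(fileext = ".RData")
save(county, file = src)
as_binary <- readBin(src, what = "raw", n = file.size(src))

# Write the bytes out using file.path() from base R instead of path().
dest <- file.path(tempdir(), "county_copy.RData")
writeBin(as_binary, dest)

# Load into a dedicated environment so nothing in the workspace is overwritten.
loaded <- new.env()
load(dest, envir = loaded)
ls(loaded)
```

Loading into `loaded` rather than `.GlobalEnv` is a design choice for scripts; interactively, the `.GlobalEnv` version above is shorter.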

The original question was why a tab file can be downloaded in RData format from the Dataverse website but not through the R functions. The https://doi.org/10.7910/DVN/SPU2XP MITU0042_OUTPUT_0120.tab file cannot be downloaded that way from the website because of the 404 error described earlier. I found other files with the same error (for example, https://doi.org/10.7910/DVN/ONZOPT gets the same error when I try to download it in RData format from the website). The https://doi.org/10.7910/DVN/NKN0E8/Y2HP2J "Data Set I.tab" file, however, downloads fine in RData format from the website and can be loaded into R. But trying the same thing with the R dataverse code does not work and produces this error:

Error in load(x, envir = .GlobalEnv) : 
  bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning messages:
1: In readChar(con, 5L, useBytes = TRUE) :
  truncating string with embedded nuls
2: file ‘foo498c108a300’ has magic number 's'
  Use of save versions prior to 2 is deprecated

A quick Google search on the last error message suggests a save-format mismatch: R 3.5.0 introduced version 3 of the RData serialization format (the default for save() since R 3.6.0), and older versions of R, which write version 2, cannot read version 3 files. (Examples: https://stackoverflow.com/questions/12463583/the-cause-of-bad-magic-number-error-when-loading-a-workspace-and-how-to-avoid and https://stackoverflow.com/questions/57242296/workspace-cannot-be-loaded-in-server-file-has-magic-number-rdx3)

I'll contact support@dataverse to see whether they find anything odd with the RData downloads on offer (perhaps they are generated with an older version of R) and why some files show the 404 error.
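One way to examine a "bad restore file magic number" error is to read the leading bytes of the file, much as load() does. The sketch below is an illustration on locally created files; `rdata_magic()` is a hypothetical helper name. It may help distinguish a format-version mismatch from a file that is not an RData stream at all. A one-character magic number such as 's' would hint at the latter, though that is a guess the thread does not confirm.

```r
# Read the first five characters of the (decompressed) file.
# (rdata_magic is a hypothetical helper, not part of any package.)
rdata_magic <- function(path) {
  con <- gzfile(path, "rb")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  readChar(con, 5L, useBytes = TRUE)
}

x <- 1:3
v2_file <- tempfile(fileext = ".RData")
v3_file <- tempfile(fileext = ".RData")
save(x, file = v2_file, version = 2)
save(x, file = v3_file, version = 3)  # requires R >= 3.5.0

magic_v2 <- rdata_magic(v2_file)
magic_v3 <- rdata_magic(v3_file)
```

For binary saves these should come back as "RDX2" and "RDX3", each followed by a newline. Anything else suggests the file is not something load() can restore.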

Danny-dK commented 10 months ago

Ah, this makes sense:

https://github.com/IQSS/dataverse/issues/9490#issuecomment-1492640510

Unfortunately, this download-as-RData support, that uses a remote R instance via Rserve, is just flaky and unreliable. The whole subsystem is rather obsolete by now, and we are seriously considering retiring it. There's some lively debate (including in the Dataverse users group right now) about whether this "download-as-RData" functionality is actually providing any useful value. (If the ingested original was a Stata file, any R user can easily download the .dta file and import it into R - which has excellent support for Stata via the package "foreign"; if the original was RData... the whole point is moot; etc. etc.) Originally posted by @landreev in https://github.com/IQSS/dataverse/issues/9490#issuecomment-1492640510

https://github.com/IQSS/dataverse/issues/8711#issue-1239788342

The option to download tabular data in RData format should not appear in the dropdown menu if a Dataverse installation has not been configured to handle RData

Aside from updating the help documentation to load a published rda file, this issue here could pretty much be closed I guess.

kuriwaki commented 10 months ago

@Danny-dK that method load(x, envir = .GlobalEnv) is better -- thanks. It is in dev now (#107). And yes, path should have been file.path or fs::path.

As for @paulgronke's original dataset, as an R user I don't see the advantage of loading an SPSS file like MITU0042_OUTPUT_0120.sav as an RData object rather than as a sav file or an ingested plain-text file. Paul's first example with haven::read_sav seems superior in all respects.

Dataverse aside, I don't see how a binary/sav/text file can be loaded as an rda file without first relying on sav/text. So I think it's fine that Danny's example with MITU0042_OUTPUT_0120 is not working. I guess Rserve does some transformations that make it happen, but I don't know that system.

pdurbin commented 10 months ago

Yes, Dataverse uses RServe to create an RData file out of the tab-separated version.

Danny-dK commented 10 months ago

@kuriwaki Thanks!

Indeed, I don't particularly see the need for conversion to RData / rda through Dataverse. R and its various libraries are perfectly capable of reading in various formats themselves. I agree with the discussions in the GitHub issues linked above. Thanks for the work!