IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Implement solution to #128 #129

Closed JBGruber closed 4 months ago

JBGruber commented 1 year ago

description

As noted in #128, I believe that it makes sense to have the option to make the get_file_by_* functions return a URL so larger files can be downloaded using other packages or software. Here is a quick demo:

library(dataverse)
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
ds <- get_dataset("doi:10.7910/DVN/ZY3RV7")

DE <- which(ds$files$label == "ParlEE_DE_plenary_speeches.csv")

get_file(ds$files$id[DE], 
         dataset = "doi:10.7910/DVN/ZY3RV7",
         return_url = TRUE)
#> [1] "https://dataverse.harvard.edu/api/access/datafile/6435504"

get_file_by_name("ParlEE_DE_plenary_speeches.csv", 
                 dataset = "doi:10.7910/DVN/ZY3RV7",
                 return_url = TRUE)
#> [1] "https://dataverse.harvard.edu/api/access/datafile/6435504"

get_file_by_id(ds$files$id[DE], 
               dataset = "doi:10.7910/DVN/ZY3RV7",
               return_url = TRUE)
#> [1] "https://dataverse.harvard.edu/api/access/datafile/6435504"

get_file_by_doi(filedoi  = "10.70122/FK2/PPIAXE/MHDB0O",
                server   = "demo.dataverse.org",
                return_url = TRUE)
#> [1] "https://demo.dataverse.org/api/access/datafile/:persistentId?persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O"

Created on 2023-09-14 with reprex v2.0.2

I then like to use:

curl::curl_download("https://dataverse.harvard.edu/api/access/datafile/6435504", 
                    destfile = "ParlEE_DE_plenary_speeches.csv")

since it is fast and reliable. But it's up to you :grin:

input needed

What is left to do is to decide what happens to the get_dataframe_by_* functions. I would suggest that they should error when the return_url parameter is set, since returning a URL makes little sense for them, I believe.
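
A rough sketch of what that could look like, assuming a small guard that each get_dataframe_by_* function calls first. The helper name check_no_return_url() is hypothetical and not part of the package:

```r
# Hypothetical guard, not part of the dataverse package: a get_dataframe_by_*
# function could call this first and fail early if return_url is requested.
check_no_return_url <- function(return_url = FALSE) {
  if (isTRUE(return_url)) {
    stop("`return_url = TRUE` is not supported by get_dataframe_by_*(); ",
         "use the get_file_by_*() functions to obtain a URL.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Passes silently when the option is not set
check_no_return_url(FALSE)

# Errors when the option is set; caught here only to show the message
tryCatch(check_no_return_url(TRUE), error = function(e) conditionMessage(e))
```

Failing early keeps the data frame functions from silently returning something that is not a data frame.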

kuriwaki commented 1 year ago

Thanks. What about making a new set of functions, get_url_by_*, instead of adding an option? Putting it in get_file seems to muddy the function a bit, as it would in get_dataframe_*.

Danny-dK commented 1 year ago

@kuriwaki Maybe a dumb suggestion, but if the point of returning the URL is to have a second function download the file from Dataverse, why not build that into the existing dataverse package? You could create a separate function like download_file_by_url(). But I could also imagine building it into the current functions, with options like download = TRUE|FALSE and download_path = 'path_to_file' instead of return_url = TRUE|FALSE. The end of the get_file_by_id function would then look like this instead:

u <- paste0(api_url(server), u_part, fileid)

# Chained with else so the raw content is the value of the last expression
if (isFALSE(download)) {
  r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key),
                 query = query, httr::progress(type = "down"), ...)
  httr::stop_for_status(r, task = httr::content(r)$message)
  httr::content(r, as = "raw")
} else if (isTRUE(download)) {
  # Write the response body straight to download_path
  httr::GET(u, httr::add_headers(`X-Dataverse-key` = key),
            query = query, httr::progress(type = "down"),
            httr::write_disk(download_path, overwrite = TRUE), ...)
}

(note that I removed the if for progress that is currently in that part of the function as I think it is always useful to see progress of what is happening, even if it may take some overhead)

I did not test whether the query and add_headers parts work when using write_disk, but I assume they would.

Ignore the above if that is not the purpose of the suggestion.
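
For the first option, a separate download_file_by_url() could be sketched roughly as below. This is untested against a live server; the function name comes from the suggestion above, while the argument names and the DATAVERSE_KEY default are assumptions:

```r
# Hypothetical standalone helper (not part of the dataverse package):
# stream a file from a Dataverse access URL to disk with httr.
download_file_by_url <- function(url, download_path,
                                 key = Sys.getenv("DATAVERSE_KEY"),
                                 overwrite = TRUE) {
  # Validate inputs before any request is made
  stopifnot(is.character(url), length(url) == 1L,
            is.character(download_path), length(download_path) == 1L)

  r <- httr::GET(url,
                 httr::add_headers(`X-Dataverse-key` = key),
                 httr::progress(type = "down"),
                 httr::write_disk(download_path, overwrite = overwrite))

  # Turn HTTP errors (e.g. 403, 404) into R errors
  httr::stop_for_status(r)

  invisible(download_path)
}
```

Because write_disk() streams to the target path, the file never has to fit in memory, which is the main concern with larger files.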

kuriwaki commented 12 months ago

I was thinking something similar to the first option of @Danny-dK -- a function separate from the get_dataframe_* family. Unless others think otherwise, I will try to implement this standalone return_url function in the next CRAN fix.

kuriwaki commented 4 months ago

@JBGruber @Danny-dK I've opted to make a

Here is the help page as a reprex

library(dataverse)

# get URLs
get_url_by_name(
  filename = "nlsw88.tab",
  dataset  = "10.70122/FK2/PPIAXE",
  server   = "demo.dataverse.org"
)
#> [1] "https://demo.dataverse.org/api/access/datafile/1734017?format=original"
# https://demo.dataverse.org/api/access/datafile/1734017?format=original

# For ingested, tab-delimited files
get_url_by_name(
  filename = "nlsw88.tab",
  dataset  = "10.70122/FK2/PPIAXE",
  original = FALSE,
  server   = "demo.dataverse.org"
)
#> [1] "https://demo.dataverse.org/api/access/datafile/1734017"
# https://demo.dataverse.org/api/access/datafile/1734017

# To download to local directory
curl::curl_download(
  "https://demo.dataverse.org/api/access/datafile/1734017?format=original",
  destfile = "nlsw88.dta")

Created on 2024-05-12 with reprex v2.0.2
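
For illustration only, the two URL shapes printed above can be rebuilt from a server name and a file id. The helper build_access_url() is hypothetical and is not the package's internal code:

```r
# Hypothetical helper (not part of the dataverse package): rebuild the access
# URLs shown in the reprex from a server name and a numeric file id.
build_access_url <- function(fileid, server, original = TRUE) {
  u <- paste0("https://", server, "/api/access/datafile/", fileid)
  # format=original asks for the file as uploaded rather than the ingested version
  if (isTRUE(original)) u <- paste0(u, "?format=original")
  u
}

build_access_url(1734017L, "demo.dataverse.org")
build_access_url(1734017L, "demo.dataverse.org", original = FALSE)
```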

kuriwaki commented 4 months ago

Updated the tests and edited the previous post above to get the correct reprex

JBGruber commented 3 months ago

Hey, thanks for carrying this over the finish line, @kuriwaki, and sorry for being unresponsive to the requests to change things. I had to use this again just now and it worked perfectly!

kuriwaki commented 3 months ago

@JBGruber great to hear.