IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Problem downloading larger files #128

Closed JBGruber closed 2 months ago

JBGruber commented 10 months ago
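library(dataverse)

# Point the client at the Harvard Dataverse server
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
ds <- get_dataset("doi:10.7910/DVN/ZY3RV7")

# Locate the large CSV file in the dataset's file listing
DE <- which(ds$files$label == "ParlEE_DE_plenary_speeches.csv")

# Download the file; its contents are held in memory
get_file_by_id(ds$files$id[DE],
               dataset = "doi:10.7910/DVN/ZY3RV7")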

The problem

Downloaded files are loaded into memory before being written to disk. This is not a problem for smaller files, but I keep running into trouble with larger ones, such as the 1.8 GB CSV file from this dataset. The code above does not always throw an error; it fails only on machines with limited resources or a slow internet connection.

What I would suggest is to let the user download files in a different way. For example, a simple argument such as return_url could return the download URL and skip the download itself. The user could then pass that URL to an external function like curl::curl_download, or copy the link into an external application. I have a pull request ready if there is interest.
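To illustrate the idea, here is a minimal sketch of what a user could do with such a URL. It is not the package's implementation: `file_download_url()` and `download_to_disk()` are hypothetical helpers, and the URL pattern assumes the file is public and reachable through the Dataverse Data Access API endpoint `/api/access/datafile/{id}`. The point is only that `curl::curl_download()` writes the response to disk as it arrives, rather than holding the whole file in memory first.

```r
# Hypothetical helper (not part of the dataverse package): build a download
# URL for a file from its numeric ID. Assumes a public file served via the
# Data Access API endpoint /api/access/datafile/{id}.
file_download_url <- function(file_id,
                              server = Sys.getenv("DATAVERSE_SERVER")) {
  # format() avoids scientific notation for large numeric IDs (e.g. "4e+06")
  id <- format(file_id, scientific = FALSE, trim = TRUE)
  paste0("https://", server, "/api/access/datafile/", id)
}

# Hypothetical helper: write the file to `destfile` as it is received,
# so the full contents never have to fit in memory at once.
download_to_disk <- function(url, destfile) {
  curl::curl_download(url, destfile = destfile, quiet = FALSE)
  invisible(destfile)
}

# Usage with the objects from the example above (not run here):
# url <- file_download_url(ds$files$id[DE])
# download_to_disk(url, "ParlEE_DE_plenary_speeches.csv")
```

With something like return_url in the package itself, the first helper would not be needed; the user would take the URL returned by the package and hand it to the downloader of their choice.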

kuriwaki commented 10 months ago

I'm definitely intrigued to try out a PR. Trouble with large files has been a clear issue.