IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

404 errors in vignette - get_file() #33

Closed wibeasley closed 3 years ago

wibeasley commented 4 years ago

You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.

Part 1: out of the box

remotes::install_github("iqss/dataverse-client-r")
#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#>   Use `force = TRUE` to force installation
library("dataverse")
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("doi:10.7910/DVN/ARKOTI")
# Dataset (75170): 
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
#                        label version      id                  contentType
# 1               alpl2013.tab       2 2692294    text/tab-separated-values
# 2                BPchap7.tab       2 2692295    text/tab-separated-values
# 3                chapter01.R       2 2692202 text/plain; charset=US-ASCII
# ...
# 16          drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
# ...

# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).

# Retrieve files by name & doi
object.size(get_file("alpl2013.tab"     , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file("BPchap7.tab"      , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R"      , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

# Taken straight from https://cran.r-project.org/web/packages/dataverse/vignettes/C-retrieval.html
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

Created on 2019-12-06 by the reprex package (v0.3.0)

Part 2: digging

Using debug(dataverse::get_file), the error-throwing line is in get_file():

r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are

Browse[2]> query
$format
[1] "original"

Browse[2]> u
[1] "https://dataverse.harvard.edu/api/access/datafile/2692233"

The r value returned is

Response [https://dataverse.harvard.edu/api/access/datafile/2692233?format=original]
  Date: 2019-12-07 05:13
  Status: 404
  Content-Type: application/json
  Size: 201 B

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /, and when I added one, the response looks good.

Browse[2]> u2 <- paste0("https://dataverse.harvard.edu/api/access/datafile/2692233", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response [https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/ARKOTI/14e66408488-c678717f7c4d?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27drugCoverage.csv&response-content-type=text%2Fplain%3B%20charset%3DUS-ASCII&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191207T051632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20191207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c1b13a7d3ea2a53c1c1e70c18a762ae0e4ae14eb41fae7d79c71fce26a9b354f]
  Date: 2019-12-07 05:16
  Status: 200
  Content-Type: text/plain; charset=US-ASCII
  Size: 4.06 kB
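
The workaround above can be sketched as a standalone snippet. This is only a sketch of what worked in the debugging session, not a recommended API usage; it assumes `key` holds a valid Dataverse API token, and the trailing-slash behavior may depend on the server version.

```r
# Build the access URL with a trailing slash, which avoided the 404 above.
base <- "https://dataverse.harvard.edu/api/access/datafile"
u2   <- paste0(base, "/", 2692233, "/")
u2
#> [1] "https://dataverse.harvard.edu/api/access/datafile/2692233/"

# Dropping the `format=original` query entirely also succeeded in the
# debugging session; uncomment to try (requires network access and `key`):
# r <- httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key))
# httr::status_code(r)  # returned 200 in the session above
```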

Part 3: Questions

  1. I assume this error is fairly new. Some change in Dataverse? If not, maybe it's related to the change in curl that was released 4 days ago?

  2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.

    Response [https://dataverse.harvard.edu/api/access/datafile/2692294?format=original]
    Date: 2019-12-07 05:35
    Status: 200
    Content-Type: application/x-stata; name="alpl2013.dta"
    Size: 211 kB
    <BINARY BODY>

    This is probably related to @EdJeeOnGitHub's recent issue #31. Notice he mentions problems with certain file formats.

  3. Is this related at all to https://github.com/IQSS/dataverse/issues/3130, https://github.com/IQSS/dataverse/issues/2559, or https://github.com/IQSS/dataverse/issues/4196? You can see that my knowledge of the web side of this is limited; I don't understand those issues that well.

devtools::session_info()
- Session info ---------------------------------------------------------------
 setting  value
 version  R version 3.6.1 Patched (2019-08-12 r76979)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/Chicago
 date     2019-12-06
- Packages -------------------------------------------------------------------
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
 callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 clipr         0.7.0   2019-07-23 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 curl          4.3     2019-12-02 [1] CRAN (R 3.6.1)
 dataverse   * 0.2.1   2019-12-07 [1] Github (iqss/dataverse-client-r@bac89f4)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
 digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.1)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
 httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.0)
 knitr         1.26    2019-11-12 [1] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 packrat       0.5.0   2018-11-14 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
 processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
 reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.0)
 rlang         0.4.2   2019-11-23 [1] CRAN (R 3.6.1)
 rmarkdown     1.18    2019-11-27 [1] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 testthat      2.3.1   2019-12-01 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 whisker       0.4     2019-08-28 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 xfun          0.11    2019-11-12 [1] CRAN (R 3.6.1)
 xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.1)
[Screenshot of the same request in Postman]
EdJeeOnGitHub commented 4 years ago

While I was looking around, @wibeasley, I noticed that line 104 in get_file.R never gets called, because length(query) is always at least 1. The query = query argument in the if branch above it was often causing issues - the else {} block would often work, but it never runs.

The relevant code is copied below.

if (length(query)) {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), query = query, ...)
} else {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
}
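
A minimal sketch of why that else branch is unreachable (reconstructed from the dput(query) output later in this thread, not from the package internals): query always arrives as a length-1 list, and if() treats a nonzero count as TRUE.

```r
# query as get_file() builds it: always length 1.
query <- list(format = "original")
length(query)
#> [1] 1

# if() coerces the nonzero length to TRUE, so the first branch always wins
# and the else branch never runs.
branch <- if (length(query)) "GET with query" else "GET without query"
branch
#> [1] "GET with query"
```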
pdurbin commented 4 years ago

You guys might start regretting inviting me to be a maintainer.

Never! Are you coming to the 2020 Dataverse Community Meeting? :smile:

  1. I assume this error is fairly new. Some change in Dataverse? If not, maybe it's related to the change in curl that was released 4 days ago?

My money is on a change in Dataverse, not curl. :smile:

I'll try to dig in more on this during the work week. Have a good weekend!

pdurbin commented 4 years ago

2. Why are csv & R files affected, but not tab files? As I step through a tab file

Is this related to the fact that passing "format=original" only works for tabular files? Please see https://github.com/IQSS/dataverse/issues/6408

@wibeasley to be honest, I'm a little lost in this issue, probably because I'm not much of an R hacker. Please keep the questions coming. Please let me know how I can help. 😄

kuriwaki commented 4 years ago

I'd appreciate a fix or workaround for this. I currently cannot read non-ingested datasets, nor ingested Stata datasets that originate from Stata v14+ files. Here are three examples from the CCES, where the first one works but the other two do not.

library(dataverse)
# hide my key

# tab files in CCES 2017 (Stata v12 dataset) WORKS
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3STEZY
cc17 <- get_file("Common Content Data.tab", "doi:10.7910/DVN/3STEZY")
writeBin(cc17,  "Common Content Data.dta")
cc17_dta <- foreign::read.dta("Common Content Data.dta")
cc17_dta <- haven::read_dta("Common Content Data.dta")

# tab files in CCES 2018 (Stata v14+ dataset) DOES NOT WORK
# possibly because of a Stata version issue
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZSBZ7K
cc18 <- get_file("cces18_common_vv.tab", "doi:10.7910/DVN/ZSBZ7K")
writeBin(cc18,  "cces18_common_vv.dta")
cc18_dta <- foreign::read.dta("cces18_common_vv.dta")
#> Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
cc18_dta <- haven::read_dta("cces18_common_vv.dta")
#> Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): 
#> Failed to parse /private/var/folders/gy/sd6ddp895s7dyqbdh2432fwm0000gn/T/
#> RtmpoYWXRS/reprex14f97d9c5c8b/cces18_common_vv.dta:
#> This version of the file format is not supported.

# Cumulative common content .dta, not ingested as tabular, DOES NOT WORK
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/II2DB6
ccc_d <- get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).
ccc_r <- get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).

Created on 2019-12-09 by the reprex package (v0.3.0)

pdurbin commented 4 years ago

Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file

It looks like @kuriwaki has opened #34 about this.

pdurbin commented 4 years ago

I just spent a couple minutes thinking about how I might confirm this bug and decided to try this repo in https://mybinder.org like in the screenshots below.

[Screenshots of the mybinder.org session]

So far so good but I didn't want the screenshots to include my API token. 😄

I'm also thinking it might be good to try to reproduce this bug using Sid: https://www.iq.harvard.edu/roadmap-sid

pdurbin commented 4 years ago

I agree that it's weird that one file id in the listing works...

object.size(dataverse::get_file(2692294)) #works

... but another doesn't...

object.size(dataverse::get_file(2692210)) # 404

... especially because I can get to the file landing pages for both files by visiting the following URLs in my browser:

I'll attach some screenshots.

[Screenshots of the file landing pages and download dialogs]

Downloading 2692210 from the GUI seems to work fine. Here are some more screenshots.

[Screenshots of downloading 2692210 from the GUI]

@wibeasley @kuriwaki is this helping? 😄

kuriwaki commented 4 years ago

Thanks. It's good to know the data is there. As was originally pointed out, when I remove the query argument in httr::GET, everything goes through fine (testable via devtools::install_github("kuriwaki/dataverse-client-r")).

Since dput(query) gives only this length-1 object, list(format = "original"), at least in my case, is there any need to keep that argument at all?
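
One possible direction, sketched under my own assumptions (this is not the actual patch, and build_query is a hypothetical helper name): attach format = "original" only for ingested tabular (.tab) files, where the format parameter applies, and send no query otherwise.

```r
# Hypothetical helper: only ingested tabular files get the format query.
build_query <- function(filename, format = "original") {
  is_tabular <- grepl("\\.tab$", filename)
  if (is_tabular) list(format = format) else list()
}

build_query("alpl2013.tab")
#> $format
#> [1] "original"

build_query("drugCoverage.csv")
#> list()
```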

kuriwaki commented 4 years ago

@wibeasley: the changes I've made in the fork seem to fix get_file so it can read any single-file object. However, I'm not familiar enough with the Dataverse API or httr to assess whether it is stable or whether what I'm doing is recommended. If I submitted a PR, would you (or @pdurbin) be able to review / discuss it?

For example, here are results from @EdJeeOnGitHub's #31:

library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")

dv_files <- get_dataset("doi:10.7910/DVN/JGLOZF")$files

# for each file in data
for (f in 1:nrow(dv_files)) {
  data_bytes <- as.integer(object.size(dataverse::get_file(dv_files$id[f])))
  data_mb <- round(measurements::conv_unit(data_bytes, "byte", "MB"), 3)

  metadata_mb <- round(measurements::conv_unit(dv_files$filesize[f], "byte", "MB"), 3)

  print(glue::glue("{dv_files$filename[f]}, {metadata_mb} MB in metadata, {data_mb} MB when downloaded"))
}
#> finalusingindices_anon.tab, 11.251 MB in metadata, 11.263 MB when downloaded
#> ReadMe with Codebook.docx, 0.036 MB in metadata, 0.036 MB when downloaded
#> The Hunger Project Dataverse Files.zip, 1.98 MB in metadata, 1.98 MB when downloaded
#> THPawareness_HH_anon.tab, 0.585 MB in metadata, 0.586 MB when downloaded

Created on 2019-12-16 by the reprex package (v0.3.0)

wibeasley commented 4 years ago

@kuriwaki, the examples I posted initially are now working. Thanks so much for figuring it out and fixing it. I made one small addition in the commit referenced above; basically, it catches the case where the file is already specified as a number/id. Was that your intention, or am I misunderstanding something?

kuriwaki commented 4 years ago

@wibeasley yes, forgot to commit that. thank you for catching and merging!

mayeulk commented 4 years ago

Hi, we are having an issue in an R package that uses the R dataverse client. It is discussed here: https://github.com/andybega/icews/issues/51#issuecomment-571313416 (mentioned just above by andybega). If there is a fix or patch, I'd be happy to test the icews R package with the fixed R dataverse client. Advice on how to install that version is welcome. I have this:

print(sessionInfo())
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10
[...]
other attached packages:
[1] dataverse_0.2.0
kuriwaki commented 4 years ago

That part runs with the current dev version of the package (not the CRAN version) -- is it possible to use that for the moment? (Install it with devtools::install_github("IQSS/dataverse-client-r").)

library("icews")
#> Options not set, consider running 'setup_icews()'
#> data_dir: NULL
#> use_db: NULL
#> keep_files: NULL
library("dataverse")
packageVersion("dataverse")
#> [1] '0.2.1'
file_binary <- dataverse::get_file(2711073, dataset = get_doi()$historic)
str(file_binary)
#>  raw [1:1283024] 25 50 44 46 ...

Created on 2020-01-23 by the reprex package (v0.3.0)

mayeulk commented 4 years ago

Thanks for swift reply. I did this:

.libPaths() # make sure to remove all dataverse in all places
remove.packages("dataverse")
# restart R
devtools::install_github("IQSS/dataverse-client-r")
#Installing package into ‘/home/mk/R/x86_64-pc-linux-gnu-library/3.6’
#(as ‘lib’ is unspecified)
#* installing *source* package ‘dataverse’
library("icews")
library("DBI")
library("dplyr")
library("usethis")
print(sessionInfo())
#loaded via a namespace (and not attached):
#[...]
#[25] glue_1.3.1             dataverse_0.2.1.9001   RSQLite_2.2.0

setup_icews(data_dir = "~/temp_icews", use_db = TRUE, keep_files = TRUE,
            r_profile = TRUE)
# this will give instructions for what to add to .Rprofile so that settings
# persist between R sessions

update_icews(dryrun = TRUE) # Should list proposed downloads, ingests, etc.
update_icews(dryrun = FALSE) # Wait until all is done; like 45 minutes or more the first time around

The last two commands (update_icews) exit with this error: Error in value[3L] :

  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
'server' is missing with no default set in DATAVERSE_SERVER environment variable.

Whereas with the CRAN dataverse package (0.2.0) I had none of these errors. For instance, update_icews(dryrun = TRUE) correctly lists the files to download with the CRAN version. Have some dataverse functions changed such that they need to be called differently? Mayeul

mayeulk commented 4 years ago

Note that:

Sys.getenv("DATAVERSE_SERVER")
[1] ""

There are some Harvard dataverse URLs and pointers (DOIs) at https://github.com/andybega/icews/search?q=harvard&unscoped_q=harvard. I'm trying to play with that environment variable (though this was not needed with the CRAN version of dataverse).

kuriwaki commented 4 years ago

Ok -- I have, for example,

> Sys.getenv("DATAVERSE_SERVER")
[1] "dataverse.harvard.edu"

and ?dataverse::get_user_key in the dev version has more info on what to use for that variable. So maybe that is part of the issue.

I can't tell from your update_icews code whether this is an issue with get_file. Is it possible to track down where it errors out?

mayeulk commented 4 years ago

Simply setting Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") got R to download the files:

update_icews(dryrun = FALSE)

Downloading 'events.1995.20150313082510.tab.zip'
Ingesting records from 'events.1995.20150313082510.tab'
Downloading 'events.1996.20150313082528.tab.zip'
Ingesting records from 'events.1996.20150313082528.tab'
Downloading 'events.1997.20150313082554.tab.zip'

I killed R and checked that these 3 files were indeed pushed to the SQLite database, which they were. I'll check with the full datasets (it might take hours), but for me this is fixed, provided we use dataverse_0.2.1.9001 in the icews R package and add Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu").

Maybe andybega (the icews R package maintainer) would prefer DOIs to a hard-coded server URL, but that is another story! Thank you very much for your help! I'll mention this in the icews R package thread. Cheers, Mayeul
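
For anyone landing here later, a sketch of making that setting stick for a session (the key value is the placeholder token from earlier in this thread, not a real one; to persist across sessions, the same two variables can go in ~/.Renviron):

```r
# Session-level equivalent of putting DATAVERSE_SERVER / DATAVERSE_KEY
# in ~/.Renviron (the key shown is a placeholder, not a real token):
Sys.setenv(
  DATAVERSE_SERVER = "dataverse.harvard.edu",
  DATAVERSE_KEY    = "examplekey12345"
)
Sys.getenv("DATAVERSE_SERVER")
#> [1] "dataverse.harvard.edu"
```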