404 errors in vignette - get_file() #33

Closed wibeasley closed 3 years ago

wibeasley commented 4 years ago

You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.

Part 1: out of the box

#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#>   Use `force = TRUE` to force installation
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "")
# Dataset (75170): 
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
# label version      id                  contentType
# 1               2 2692294    text/tab-separated-values
# 2                2 2692295    text/tab-separated-values
# 3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
# ...
# 16             drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
# ...

# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).

# Retrieve files by name & doi
object.size(get_file(""     , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file(""      , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R"      , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

# Taken straight from
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

Part 2: digging.

Using debug(dataverse::get_file), the error-throwing line is in get_file():

r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are

Browse[2]> query
[1] "original"

Browse[2]> u
[1] ""

The r value returned is

Response []
  Date: 2019-12-07 05:13
  Status: 404
  Content-Type: application/json
  Size: 201 B

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /. When I added that, the response appears good.

Browse[2]> u2 <- paste0("", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response []
  Date: 2019-12-07 05:16
  Status: 200
  Content-Type: text/plain; charset=US-ASCII
  Size: 4.06 kB
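
As a rough illustration of the trailing-slash workaround above, here is a minimal base-R sketch that appends the slash only when it is missing. The helper name with_trailing_slash() and the example.org URL are made up for illustration (the real u value is not reproduced here), and this is not necessarily how the package should fix the problem.

```r
# Hypothetical helper (not part of the dataverse package):
# append "/" to a URL unless it already ends with one.
with_trailing_slash <- function(u) {
  ifelse(endsWith(u, "/"), u, paste0(u, "/"))
}

# Placeholder URL, assumed for illustration only.
u  <- "https://dataverse.example.org/api/access/datafile/2692233"
u2 <- with_trailing_slash(u)
u2
```

Calling the helper on a URL that already ends in "/" returns it unchanged, so it is safe to apply more than once.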

Part 3: Questions

  1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?

  2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash. I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.

    Response []
    Date: 2019-12-07 05:35
    Status: 200
    Content-Type: application/x-stata; name="alpl2013.dta"
    Size: 211 kB

    This is probably related to @EdJeeOnGitHub's recent issue #31. Notice he mentions problems with certain file formats.

  3. Is this related at all to,, or? You can see that my knowledge of the web side of this is limited; I don't understand them that well.

(screenshot of Postman)

EdJeeOnGitHub commented 4 years ago

Whilst I was looking around, @wibeasley, I noticed that line 104 in get_file.R never gets called, because length(query) is always at least 1. However, the query = query argument in the if branch was often what caused the issues; the else {} block will often work, but it's never run.

The relevant code is copied below.

if (length(query)) {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), query = query, ...)
} else {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
}

pdurbin commented 4 years ago

Never! Are you coming to the 2020 Dataverse Community Meeting? 😄

  1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?

My money is on a change in Dataverse, not curl. 😄

I'll try to dig in more on this during the work week. Have a good weekend!

pdurbin commented 4 years ago

2. Why are csv & R files affected, but not tab files? As I step through a tab file

Is this related to the fact that passing "format=original" only works for tabular files? Please see

@wibeasley to be honest, I'm a little lost in this issue, probably because I'm not much of an R hacker. Please keep the questions coming. Please let me know how I can help. 😄
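
To make the "format=original only works for tabular files" idea concrete, here is a minimal base-R sketch. It assumes that an ingested tabular file can be recognised by the text/tab-separated-values content type shown in the file listing; the helper name build_query() is hypothetical, and this is not necessarily what the package does or should do.

```r
# Hypothetical helper (not part of the dataverse package):
# only request the original format for ingested tabular files.
build_query <- function(content_type, format = "original") {
  if (startsWith(content_type, "text/tab-separated-values")) {
    list(format = format)
  } else {
    list()  # nothing to send, so a caller could skip `query` entirely
  }
}

tab_query   <- build_query("text/tab-separated-values")
plain_query <- build_query("text/plain; charset=US-ASCII")

length(tab_query)    # 1: the query would be sent
length(plain_query)  # 0: the query would be dropped
```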

kuriwaki commented 4 years ago

I'd appreciate a fix or workaround for this. I currently cannot read non-ingested datasets, nor ingested Stata datasets that originate from Stata v14+ files. Here are three examples from the CCES, where the first one works but the other two do not.

# hide my key

# tab files in CCES 2017 (Stata v12 dataset) WORKS
cc17 <- get_file("Common Content", "doi:10.7910/DVN/3STEZY")
writeBin(cc17,  "Common Content Data.dta")
cc17_dta <- foreign::read.dta("Common Content Data.dta")
cc17_dta <- haven::read_dta("Common Content Data.dta")

# tab files in CCES 2018 (Stata v14+ dataset) DOES NOT WORK
# possibly because of Stata version issue
cc18 <- get_file("", "doi:10.7910/DVN/ZSBZ7K")
writeBin(cc18,  "cces18_common_vv.dta")
cc18_dta <- foreign::read.dta("cces18_common_vv.dta")
#> Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
cc18_dta <- haven::read_dta("cces18_common_vv.dta")
#> Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): 
#> Failed to parse /private/var/folders/gy/sd6ddp895s7dyqbdh2432fwm0000gn/T/
#> RtmpoYWXRS/reprex14f97d9c5c8b/cces18_common_vv.dta:
#> This version of the file format is not supported.

# Cumulative common content dta, not tabulated, DOES NOT WORK
ccc_d <- get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).
ccc_r <- get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).

pdurbin commented 4 years ago

Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file

It looks like @kuriwaki has opened #34 about this.

pdurbin commented 4 years ago

I just spent a couple of minutes thinking about how I might confirm this bug and decided to try this repo, as in the screenshots below.

(three screenshots, 2019-12-10, 12:49 PM)

So far so good but I didn't want the screenshots to include my API token. 😄

I'm also thinking it might be good to try to reproduce this bug using Sid:

pdurbin commented 4 years ago

I agree that it's weird that one file id in the listing works...

object.size(dataverse::get_file(2692294)) #works

... but another doesn't...

object.size(dataverse::get_file(2692210)) # 404

... especially because I can get to the file landing pages for both files by visiting the following URLs in my browser:

I'll attach some screenshots.

(six screenshots, 2019-12-10, 12:56 PM to 1:00 PM)

Downloading 2692210 from the GUI seems to work fine. Here are some more screenshots.

(two screenshots, 2019-12-10, 1:04 PM)

@wibeasley @kuriwaki is this helping? 😄

kuriwaki commented 4 years ago

Thanks. It's good to know the data is there. As was originally pointed out, when I remove the query argument in httr::GET everything goes through fine. (Testable at devtools::install_github("kuriwaki/dataverse-client-r")).

Since dput(query) gives only this length-1 object, list(format = "original"), at least in my case, is there any need to keep that argument at all?
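
A small base-R sketch of why a length-1 query always takes the first branch: if (length(query)) treats any non-zero length as TRUE, so only an empty list reaches the else branch. The branch labels here are just illustrative strings, not package code.

```r
# The query reported above: a length-1 list.
query <- list(format = "original")
length(query)  # 1

# Any non-zero length counts as TRUE in if().
branch <- if (length(query)) "with query" else "without query"
branch  # "with query"

# Only an empty list falls through to the else branch.
empty_branch <- if (length(list())) "with query" else "without query"
empty_branch  # "without query"
```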

kuriwaki commented 4 years ago

@wibeasley: the changes I've made in the fork seem to fix get_file so that it reads any single-file object. However, I'm not familiar enough with the Dataverse API or httr to assess whether it is stable or whether what I'm doing is recommended. If I submitted a PR, would you (or @pdurbin) be able to review and discuss it?

For example, here are the results for @EdJeeOnGitHub's #31:

library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")

dv_files <- get_dataset("doi:10.7910/DVN/JGLOZF")$files

# for each file in the dataset, compare the downloaded size with the metadata size
for (f in seq_len(nrow(dv_files))) {
  data_bytes <- as.integer(object.size(dataverse::get_file(dv_files$id[f])))
  data_mb <- round(measurements::conv_unit(data_bytes, "byte", "MB"), 3)

  metadata_mb <- round(measurements::conv_unit(dv_files$filesize[f], "byte", "MB"), 3)

  print(glue::glue("{dv_files$filename[f]}, {metadata_mb} MB in metadata, {data_mb} MB when downloaded"))
}
#>, 11.251 MB in metadata, 11.263 MB when downloaded
#> ReadMe with Codebook.docx, 0.036 MB in metadata, 0.036 MB when downloaded
#> The Hunger Project Dataverse, 1.98 MB in metadata, 1.98 MB when downloaded
#>, 0.585 MB in metadata, 0.586 MB when downloaded

wibeasley commented 4 years ago

@kuriwaki the examples I posted initially are now working. Thanks so much for figuring it out and fixing. I made one small addition in the commit referenced above. Basically, it catches the case if the file is already specified as a number/id. Was that your intention, or am I misunderstanding something?

kuriwaki commented 4 years ago

@wibeasley yes, forgot to commit that. thank you for catching and merging!

mayeulk commented 4 years ago

Hi, we are having an issue in an R package which uses the R dataverse client. This is discussed here: (mentioned just above by andybega). If there is a fix or patch, I'd be happy to test the icews R package with the fixed R dataverse client. Advice on how to install that version is welcome. I have this:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10
other attached packages:
[1] dataverse_0.2.0

kuriwaki commented 4 years ago

That part runs with the current dev version of the package (not the CRAN version) -- is it possible to use that for the moment? (Install it with devtools::install_github("IQSS/dataverse-client-r").)

#> Options not set, consider running 'setup_icews()'
#> data_dir: NULL
#> use_db: NULL
#> keep_files: NULL
#> [1] '0.2.1'
file_binary <- dataverse::get_file(2711073, dataset = get_doi()$historic)
#>  raw [1:1283024] 25 50 44 46 ...

mayeulk commented 4 years ago

Thanks for the swift reply. I did this:

.libPaths() # make sure to remove all dataverse in all places
# restart R
#Installing package into ‘/home/mk/R/x86_64-pc-linux-gnu-library/3.6’
#(as ‘lib’ is unspecified)
#* installing *source* package ‘dataverse’
#loaded via a namespace (and not attached):
#[25] glue_1.3.1             dataverse_0.2.1.9001   RSQLite_2.2.0

setup_icews(data_dir = "~/temp_icews", use_db = TRUE, keep_files = TRUE,
            r_profile = TRUE)
# this will give instructions for what to add to .Rprofile so that settings
# persist between R sessions

update_icews(dryrun = TRUE) # Should list proposed downloads, ingests, etc.
update_icews(dryrun = FALSE) # Wait until all is done; like 45 minutes or more the first time around

The last two commands (update_icews) exit with this error:

Error in value[3L] :
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
'server' is missing with no default set in DATAVERSE_SERVER environment variable.

With the CRAN dataverse package (0.2.0), I had none of these errors. For instance, update_icews(dryrun = TRUE) correctly lists the files to download with the CRAN version. Have some dataverse functions changed so that they need to be called differently? Mayeul

mayeulk commented 4 years ago

Note that:

[1] ""

There are some Harvard Dataverse URLs and pointers (DOIs) at:

I'm trying to play with that environment variable (but this was not needed with the CRAN version of dataverse).

kuriwaki commented 4 years ago

Ok -- I have, for example,

> Sys.getenv("DATAVERSE_SERVER")
[1] ""

and ?dataverse::get_user_key in the dev version has more info on what to use for that variable. So maybe that is part of the issue.

I can't tell from your update_icews code whether this is an issue with get_file. Is it possible to track down where it errors out?
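
One way to narrow this down is to check the environment variable up front. Below is a minimal base-R sketch; the helper name require_dataverse_server() and the host name are assumptions for illustration, not part of the dataverse package, so substitute the server you actually use.

```r
# Hypothetical helper: fail early, with a clear message, when the
# server variable is unset or empty.
require_dataverse_server <- function() {
  server <- Sys.getenv("DATAVERSE_SERVER")
  if (!nzchar(server)) {
    stop("DATAVERSE_SERVER is not set; call Sys.setenv(DATAVERSE_SERVER = ...) first.",
         call. = FALSE)
  }
  server
}

# Placeholder host name, assumed for illustration only.
Sys.setenv(DATAVERSE_SERVER = "dataverse.example.org")
server <- require_dataverse_server()
server
```

If the variable is unset, the helper stops before any download is attempted, which makes it easier to tell a configuration problem apart from a get_file problem.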

mayeulk commented 4 years ago

Simply setting Sys.setenv("DATAVERSE_SERVER" = "") got R to download the files:

update_icews(dryrun = FALSE)

Downloading ''
Ingesting records from ''
Downloading ''
Ingesting records from ''
Downloading ''

I killed R and checked that these 3 files had indeed been pushed to the SQLite database, and they had. I'll check with the full datasets (which might take hours), but for me this is fixed, provided we use dataverse_0.2.1.9001 in the icews R package and add Sys.setenv("DATAVERSE_SERVER" = "").

Maybe andybega (the icews R package maintainer) would prefer DOIs to a hard-coded server URL, but that is another story! Thank you very much for your help! I'll mention this in the icews R package thread.

Cheers, Mayeul