Closed wibeasley closed 3 years ago
While I was looking around, @wibeasley, I noticed that line 104 in get_file.R never gets called, because length(query) is always at least 1. However, the query = query argument in the if branch above was often causing issues - the else {} block would often work, but it never runs.
The relevant code is copied below.
if (length(query)) {
r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), query = query, ...)
} else {
r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
}
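A minimal sketch of why the else branch is effectively unreachable (assuming query is always built as the length-1 list seen later in this thread, list(format = "original")): length() of a non-empty list is at least 1, which is truthy in an if condition.

```r
# length() of a non-empty named list is >= 1, so the if branch always wins
query <- list(format = "original")
length(query)                                   # 1
if (length(query)) "query branch" else "else branch"   # "query branch"

# the else branch could only run if query were empty or NULL
length(list())                                  # 0 -- falsy, would take the else branch
```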
You guys might start regretting inviting me to be a maintainer.
Never! Are you coming to the 2020 Dataverse Community Meeting? :smile:
- I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to the change in curl that was released 4 days ago?
My money is on a change in Dataverse, not curl. :smile:
I'll try to dig in more on this during the work week. Have a good weekend!
2. Why are csv & R files affected, but not tab files? As I step through a tab file
Is this related to the fact that passing "format=original" only works for tabular files? Please see https://github.com/IQSS/dataverse/issues/6408
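If so, here is a rough sketch of the two request shapes, using httr and the access-endpoint URL pattern that appears later in this thread (the file ids are just examples from this discussion, and I haven't verified the non-tabular behavior myself):

```r
library(httr)
base <- "https://dataverse.harvard.edu/api/access/datafile"

# tabular (ingested) file: format=original requests the originally uploaded file
r_tab <- GET(paste0(base, "/2692294"), query = list(format = "original"))

# non-tabular file: if format=original only applies to tabular files
# (per IQSS/dataverse#6408), the plain request with no query may be the safer shape
r_other <- GET(paste0(base, "/2692233"))
```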
@wibeasley to be honest, I'm a little lost in this issue, probably because I'm not much of an R hacker. Please keep the questions coming. Please let me know how I can help. 😄
I'd appreciate a fix/workaround for this. I currently cannot read non-ingested datasets, or ingested Stata datasets that originate from Stata v14+ files. Here are three examples from the CCES, where the first one works but the other two do not.
library(dataverse)
# hide my key
# tab files in CCES 2017 (Stata v12 dataset) WORKS
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3STEZY
cc17 <- get_file("Common Content Data.tab", "doi:10.7910/DVN/3STEZY")
writeBin(cc17, "Common Content Data.dta")
cc17_dta <- foreign::read.dta("Common Content Data.dta")
cc17_dta <- haven::read_dta("Common Content Data.dta")
# tab files in CCES 2018 (Stata v14+ dataset) DOES NOT WORK
# possibly because of Stata version issue
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZSBZ7K
cc18 <- get_file("cces18_common_vv.tab", "doi:10.7910/DVN/ZSBZ7K")
writeBin(cc18, "cces18_common_vv.dta")
cc18_dta <- foreign::read.dta("cces18_common_vv.dta")
#> Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
cc18_dta <- haven::read_dta("cces18_common_vv.dta")
#> Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair):
#> Failed to parse /private/var/folders/gy/sd6ddp895s7dyqbdh2432fwm0000gn/T/
#> RtmpoYWXRS/reprex14f97d9c5c8b/cces18_common_vv.dta:
#> This version of the file format is not supported.
# Cumulative common content dta, not tabulated, DOES NOT WORK
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/II2DB6
ccc_d <- get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6"):
#> Not Found (HTTP 404).
ccc_r <- get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6"):
#> Not Found (HTTP 404).
Created on 2019-12-09 by the reprex package (v0.3.0)
Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
It looks like @kuriwaki has opened #34 about this.
I just spent a couple minutes thinking about how I might confirm this bug and decided to try this repo in https://mybinder.org like in the screenshots below.
So far so good but I didn't want the screenshots to include my API token. 😄
I'm also thinking it might be good to try to reproduce this bug using Sid: https://www.iq.harvard.edu/roadmap-sid
I agree that it's weird that one file id in the listing works...
object.size(dataverse::get_file(2692294)) #works
... but another doesn't...
object.size(dataverse::get_file(2692210)) # 404
... especially because I can get to the file landing pages for both files by visiting the following URLs in my browser:
I'll attach some screenshots.
Downloading 2692210 from the GUI seems to work fine. Here are some more screenshots.
@wibeasley @kuriwaki is this helping? 😄
Thanks. It's good to know the data is there. As was originally pointed out, when I remove the query argument in httr::GET, everything goes through fine. (Testable via devtools::install_github("kuriwaki/dataverse-client-r").)
Since dput(query) gives only this length-1 object, list(format = "original"), at least in my case, is there any need to keep that argument at all?
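One possible direction (a sketch of the idea only, not the actual patch in the fork; the helper name and the .tab heuristic are my assumptions): attach format = "original" only when the file looks like an ingested/tabular one, and otherwise send no query at all.

```r
# hypothetical helper: build the query from the filename; httr::GET() with
# query = NULL sends no query string at all
build_query <- function(filename, original = TRUE) {
  is_tabular <- grepl("\\.tab$", filename)
  if (is_tabular && original) list(format = "original") else NULL
}

build_query("cces18_common_vv.tab")      # list(format = "original")
build_query("cumulative_2006_2018.dta")  # NULL
```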
@wibeasley: the changes I've made in the fork seem to fix get_file to read any single-file object. However, I'm not familiar enough with the Dataverse API or httr to assess whether it is stable or whether what I'm doing is recommended. If I submitted a PR, would you (or @pdurbin) be able to review / discuss it?
For example, here are results from @EdJeeOnGitHub's #31:
library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")
dv_files <- get_dataset("doi:10.7910/DVN/JGLOZF")$files
# for each file in data
for (f in seq_len(nrow(dv_files))) {
data_bytes <- as.integer(object.size(dataverse::get_file(dv_files$id[f])))
data_mb <- round(measurements::conv_unit(data_bytes, "byte", "MB"), 3)
metadata_mb <- round(measurements::conv_unit(dv_files$filesize[f], "byte", "MB"), 3)
print(glue::glue("{dv_files$filename[f]}, {metadata_mb} MB in metadata, {data_mb} MB when downloaded"))
}
#> finalusingindices_anon.tab, 11.251 MB in metadata, 11.263 MB when downloaded
#> ReadMe with Codebook.docx, 0.036 MB in metadata, 0.036 MB when downloaded
#> The Hunger Project Dataverse Files.zip, 1.98 MB in metadata, 1.98 MB when downloaded
#> THPawareness_HH_anon.tab, 0.585 MB in metadata, 0.586 MB when downloaded
Created on 2019-12-16 by the reprex package (v0.3.0)
@kuriwaki, the examples I posted initially are now working. Thanks so much for figuring it out and fixing it. I made one small addition in the commit referenced above; basically, it catches the case where the file is already specified as a number/id. Was that your intention, or am I misunderstanding something?
@wibeasley yes, forgot to commit that. thank you for catching and merging!
Hi, we are having an issue in an R package which uses the R dataverse client. This is discussed here: https://github.com/andybega/icews/issues/51#issuecomment-571313416 (mentioned just above by andybega). If there is a fix or patch, I'd be happy to test the icews R package with the fixed R dataverse client. Advice on how to install that version is welcome. I have this:
print(sessionInfo())
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10
[...]
other attached packages:
[1] dataverse_0.2.0
That part runs with the current dev version of the package (not the CRAN version) -- is it possible to use that for the moment? (Install via devtools::install_github("IQSS/dataverse-client-r").)
library("icews")
#> Options not set, consider running 'setup_icews()'
#> data_dir: NULL
#> use_db: NULL
#> keep_files: NULL
library("dataverse")
packageVersion("dataverse")
#> [1] '0.2.1'
file_binary <- dataverse::get_file(2711073, dataset = get_doi()$historic)
str(file_binary)
#> raw [1:1283024] 25 50 44 46 ...
Created on 2020-01-23 by the reprex package (v0.3.0)
Thanks for the swift reply. I did this:
.libPaths() # make sure to remove all dataverse in all places
remove.packages("dataverse")
# restart R
devtools::install_github("IQSS/dataverse-client-r")
#Installing package into ‘/home/mk/R/x86_64-pc-linux-gnu-library/3.6’
#(as ‘lib’ is unspecified)
#* installing *source* package ‘dataverse’
library("icews")
library("DBI")
library("dplyr")
library("usethis")
print(sessionInfo())
#loaded via a namespace (and not attached):
#[...]
#[25] glue_1.3.1 dataverse_0.2.1.9001 RSQLite_2.2.0
setup_icews(data_dir = "~/temp_icews", use_db = TRUE, keep_files = TRUE,
r_profile = TRUE)
# this will give instructions for what to add to .Rprofile so that settings
# persist between R sessions
update_icews(dryrun = TRUE) # Should list proposed downloads, ingests, etc.
update_icews(dryrun = FALSE) # Wait until all is done; like 45 minutes or more the first time around
The last two commands (update_icews) exit with this error: Error in value[3L] :
Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
'server' is missing with no default set in DATAVERSE_SERVER environment variable.
Whereas with the CRAN dataverse package (0.2.0) I had none of these errors. For instance, update_icews(dryrun = TRUE) correctly lists the files to download (with the CRAN version). Have some dataverse functions changed so that they need to be called differently? Mayeul
Note that:
Sys.getenv("DATAVERSE_SERVER")
[1] ""
There are some Harvard dataverse URLs and pointers (DOIs) at: https://github.com/andybega/icews/search?q=harvard&unscoped_q=harvard. I'm trying to play with that environment variable (but this was not needed with the CRAN version of dataverse).
Ok -- I have, for example,
> Sys.getenv("DATAVERSE_SERVER")
[1] "dataverse.harvard.edu"
and ?dataverse::get_user_key in the dev version has more info on what to use for that variable. So maybe that is part of the issue.
I can't tell from your update_icews code that this is an issue with get_file. Is it possible to track down where it errors out?
Simply setting this
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
got R to download files:
update_icews(dryrun = FALSE)
Downloading 'events.1995.20150313082510.tab.zip'
Ingesting records from 'events.1995.20150313082510.tab'
Downloading 'events.1996.20150313082528.tab.zip'
Ingesting records from 'events.1996.20150313082528.tab'
Downloading 'events.1997.20150313082554.tab.zip'
I killed R and checked that these 3 files were indeed pushed to the SQLite database, which they were. I'll check with the full datasets (might take hours), but for me this is fixed, provided we use dataverse_0.2.1.9001 in the icews R package and add
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
Maybe andybega (the icews R package maintainer) would prefer DOIs to a hard-coded server URL, but that is another story! Thank you very much for your help! I'll mention this in the icews R package thread. Cheers, Mayeul
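As an alternative to calling Sys.setenv() in every session, the variable can also be set persistently in the user's ~/.Renviron (a config sketch; the variable name comes from the error message above):

```
# ~/.Renviron -- read by R at startup
DATAVERSE_SERVER=dataverse.harvard.edu
```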
You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.
Part 1: out of the box
Created on 2019-12-06 by the reprex package (v0.3.0)
Part 2: digging.
Using debug(dataverse::get_file), the error-throwing line is in get_file():

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are:

The r value returned is:

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /. When I added that, the response appears good.

Part 3: Questions
1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to the change in curl that was released 4 days ago?
2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.
This is probably related to @EdJeeOnGitHub's recent issue #31. Notice he mentions problems with certain file formats.
3. Is this related at all to https://github.com/IQSS/dataverse/issues/3130, https://github.com/IQSS/dataverse/issues/2559, or https://github.com/IQSS/dataverse/issues/4196? You can see that my knowledge of the web side of this is limited; I don't understand them that well.
devtools::session_info()
- Session info ---------------------------------------------------------------------------
 setting  value
 version  R version 3.6.1 Patched (2019-08-12 r76979)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/Chicago
 date     2019-12-06
- Packages -------------------------------------------------------------------------------
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
 callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 clipr         0.7.0   2019-07-23 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 curl          4.3     2019-12-02 [1] CRAN (R 3.6.1)
 dataverse   * 0.2.1   2019-12-07 [1] Github (iqss/dataverse-client-r@bac89f4)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
 digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.1)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
 httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.0)
 knitr         1.26    2019-11-12 [1] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 packrat       0.5.0   2018-11-14 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
 processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
 reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.0)
 rlang         0.4.2   2019-11-23 [1] CRAN (R 3.6.1)
 rmarkdown     1.18    2019-11-27 [1] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 testthat      2.3.1   2019-12-01 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 whisker       0.4     2019-08-28 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 xfun          0.11    2019-11-12 [1] CRAN (R 3.6.1)
 xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.1)

screenshot of postman