adam3smith closed this 3 years ago
If anything, it's probably a bug or at least a weirdness in the Dataverse API, which shows "description" twice. Here's a screenshot from https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/WOT075
@adam3smith I'd encourage you to create an issue at https://github.com/IQSS/dataverse/issues, but I'm afraid that if we delete one of the "description" fields from the Dataverse API, an integration would break. It's probably better to think of this as a wart in the Dataverse API, something to fix in v2 or whatever. 😄
The columns also get duplicated when binding here (both have the description column name).
In my fork (https://github.com/kuriwaki/dataverse-client-r/commit/49fd9e5186daad58da7bb57aa62f5ae8f1900bf9), I've removed the duplicate and it works:
Sys.setenv("DATAVERSE_KEY" = "5b514e42-1260-4b78-b395-e27de83d3115")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
library(tibble)
library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")
# metadata about each file in the dataset
obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
any(duplicated(colnames(obrien_files)))
#> [1] FALSE
# non-duplicated column names make conversion to a tibble possible
as_tibble(obrien_files)
#> # A tibble: 6 x 22
#> label restricted version datasetVersionId categories id persistentId
#> <chr> <lgl> <int> <int> <list> <int> <chr>
#> 1 Geog… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> 2 Land… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> 3 Land… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> 4 Prop… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> 5 Road… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> 6 Road… FALSE 1 178559 <chr [1]> 3.64e6 doi:10.7910…
#> # … with 16 more variables: pidURL <chr>, filename <chr>, contentType <chr>,
#> # filesize <int>, description <chr>, storageIdentifier <chr>,
#> # rootDataFileId <int>, md5 <chr>, checksum$type <chr>, $value <chr>,
#> # creationDate <chr>, originalFileFormat <chr>, originalFormatLabel <chr>,
#> # originalFileSize <int>, UNF <chr>, tabularTags <list>
Created on 2019-12-16 by the reprex package (v0.3.0)
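For readers who want to see the failure mode without calling the API, here is a minimal base-R sketch of how a bind can produce two columns with the same name. The two toy data frames are hypothetical stand-ins, not real Dataverse output, and the sketch assumes a cbind()-style bind, which may differ from what the package does internally.

```r
# Hypothetical stand-ins for two pieces of file metadata that both
# carry a "description" field (not real Dataverse output).
file_info <- data.frame(label = c("a.tab", "b.tab"),
                        description = c("first", "second"),
                        stringsAsFactors = FALSE)
data_file <- data.frame(id = c(1L, 2L),
                        description = c("first", "second"),
                        stringsAsFactors = FALSE)

# cbind() on data frames does not repair names, so "description"
# ends up in the result twice.
bound <- cbind(file_info, data_file)

# Detect the problem the same way as in the reprex above.
has_dups  <- any(duplicated(colnames(bound)))
dup_names <- colnames(bound)[duplicated(colnames(bound))]

print(colnames(bound))
print(has_dups)
print(dup_names)
```

With names like these, tibble's default name repair should refuse the conversion, which is consistent with the note above that as_tibble() only became possible once the duplicate was removed.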
The duplicate column was manually removed after the fact in PR #39, in commit https://github.com/kuriwaki/dataverse-client-r/commit/49fd9e5186daad58da7bb57aa62f5ae8f1900bf9
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)
#> [1] "label" "restricted" "version"
#> [4] "datasetVersionId" "categories" "id"
#> [7] "persistentId" "pidURL" "filename"
#> [10] "contentType" "filesize" "description"
#> [13] "storageIdentifier" "rootDataFileId" "md5"
#> [16] "checksum" "creationDate" "originalFileFormat"
#> [19] "originalFormatLabel" "originalFileSize" "originalFileName"
#> [22] "UNF" "tabularTags"
any(duplicated(colnames(obrien_files)))
#> [1] FALSE
Created on 2020-12-28 by the reprex package (v0.3.0)
The "description" for files is repeated, resulting in a duplicate data.frame column name, which causes all sorts of issues. Not sure if this is a problem with the API or the R package, but figured I'd start here. CC @pdurbin
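As a user-side stopgap on a package version that still returns the duplicate, something like the following base-R sketch could be applied to the data frame. The `files` object is a hypothetical toy example, not real API output, and neither option is the fix adopted in the package.

```r
# Hypothetical data frame with a repeated "description" column;
# check.names = FALSE keeps the duplicate name as-is.
files <- data.frame(label = "a.tab", description = "x",
                    id = 1L, description = "x",
                    check.names = FALSE, stringsAsFactors = FALSE)

# Option 1: keep only the first occurrence of each column name.
deduped <- files[, !duplicated(colnames(files)), drop = FALSE]

# Option 2: keep every column but make the names unique.
renamed <- files
colnames(renamed) <- make.unique(colnames(renamed))

print(colnames(deduped))
print(colnames(renamed))
```

Option 1 is only safe when the repeated columns hold the same values, as they appear to here; Option 2 keeps both columns so they can be compared before one is dropped.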