IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Duplicate file description #32

Closed by adam3smith 3 years ago

adam3smith commented 4 years ago

The "description" for files is repeated, resulting in a duplicate data.frame column name which causes all sorts of issues. Not sure if this is a problem with the API or the R-package, but figured I'd start here. CC @pdurbin

## load package
library("dataverse")

## set the server and fetch the file metadata
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)

 [1] "description"         "label"               "restricted"         
 [4] "version"             "datasetVersionId"    "categories"         
 [7] "id"                  "persistentId"        "pidURL"             
[10] "filename"            "contentType"         "filesize"           
[13] "description"         "storageIdentifier"   "rootDataFileId"     
[16] "md5"                 "checksum"            "creationDate"       
[19] "originalFileFormat"  "originalFormatLabel" "originalFileSize"   
[22] "UNF"                 "tabularTags"
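
For anyone hitting this before a fix lands, here is a minimal base R sketch of two possible workarounds on the user side. The `files` data frame below is a made-up stand-in for the real `get_dataset(...)[['files']]` result, and the names `files_dedup` and `files_unique` are only illustrative.

```r
# Hypothetical stand-in for the `files` data frame; the values are invented.
files <- data.frame(
  description = c("a", "b"),
  label = c("f1.tab", "f2.tab"),
  description = c("a", "b"),
  check.names = FALSE,        # keep the duplicate name, as in the report
  stringsAsFactors = FALSE
)

# Flag every column whose name has already appeared further left.
dup <- duplicated(colnames(files))
colnames(files)[dup]

# Option 1: keep only the first occurrence of each column name.
files_dedup <- files[, !dup, drop = FALSE]
colnames(files_dedup)

# Option 2: keep every column but make the names unique.
files_unique <- files
colnames(files_unique) <- make.unique(colnames(files_unique))
colnames(files_unique)
```

Option 1 assumes the repeated columns hold the same values, which seems to be the case here but is worth checking before dropping anything.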

## session info for your system
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.0.1   dataverse_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        rstudioapi_0.10   xml2_1.2.0        magrittr_1.5     
 [5] tidyselect_0.2.5  R6_2.4.0          rlang_0.3.4       httr_1.4.1       
 [9] tools_3.4.3       pkgbuild_1.0.2    cli_1.1.0         withr_2.1.2      
[13] remotes_2.1.0     assertthat_0.2.1  rprojroot_1.3-2   tibble_2.1.1     
[17] crayon_1.3.4      processx_3.3.0    purrr_0.3.2       callr_3.1.1      
[21] ps_1.3.0          curl_3.3          glue_1.3.1        pillar_1.4.2     
[25] compiler_3.4.3    backports_1.1.4   prettyunits_1.0.2 jsonlite_1.6     
[29] pkgconfig_2.0.2  

pdurbin commented 4 years ago

If anything, it's probably a bug or at least a weirdness in the Dataverse API, which shows "description" twice. Here's a screenshot from https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/WOT075

[Screenshot: Screen Shot 2019-12-04 at 10 49 56 PM]

@adam3smith I'd encourage you to create an issue at https://github.com/IQSS/dataverse/issues, but I'd be afraid that if we delete one of the "description" fields from the Dataverse API, an integration would break. It's probably better to think of this as a wart in the Dataverse API, something to fix in v2 or whatever. 😄

kuriwaki commented 4 years ago

The columns also get duplicated when binding here (both have the description column name).

https://github.com/IQSS/dataverse-client-r/blob/ac67f0f6c8b5e2903ccdce79f96a1b7231ab5884/R/utils.R#L137
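
To illustrate the binding problem, here is a rough base R sketch. It is not the package's actual code: `file_level` and `data_file` are hypothetical stand-ins for the two pieces being bound, and dropping the shared columns from the first piece is just one possible way to avoid the clash.

```r
# Hypothetical inputs: both pieces carry a `description` column.
file_level <- data.frame(description = "x", label = "f.tab",
                         stringsAsFactors = FALSE)
data_file <- data.frame(id = 1L, description = "x",
                        stringsAsFactors = FALSE)

# cbind() on data frames does not repair names, so the duplicate survives.
bound <- cbind(file_level, data_file)
any(duplicated(colnames(bound)))
#> [1] TRUE

# One possible fix: drop the shared columns from one side before binding.
shared <- intersect(colnames(file_level), colnames(data_file))
keep <- !colnames(file_level) %in% shared
bound_fixed <- cbind(file_level[, keep, drop = FALSE], data_file)
colnames(bound_fixed)
#> [1] "label"       "id"          "description"
```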

In my fork (https://github.com/kuriwaki/dataverse-client-r/commit/49fd9e5186daad58da7bb57aa62f5ae8f1900bf9), I've removed the duplicate and it works:

Sys.setenv("DATAVERSE_KEY" = "5b514e42-1260-4b78-b395-e27de83d3115")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

library(tibble)
library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")

# file-level metadata for the dataset
obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
any(duplicated(colnames(obrien_files)))
#> [1] FALSE

# non-duplicated column names make conversion to a tibble possible
as_tibble(obrien_files)
#> # A tibble: 6 x 22
#>   label restricted version datasetVersionId categories     id persistentId
#>   <chr> <lgl>        <int>            <int> <list>      <int> <chr>       
#> 1 Geog… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 2 Land… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 3 Land… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 4 Prop… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 5 Road… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 6 Road… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> # … with 16 more variables: pidURL <chr>, filename <chr>, contentType <chr>,
#> #   filesize <int>, description <chr>, storageIdentifier <chr>,
#> #   rootDataFileId <int>, md5 <chr>, checksum$type <chr>, $value <chr>,
#> #   creationDate <chr>, originalFileFormat <chr>, originalFormatLabel <chr>,
#> #   originalFileSize <int>, UNF <chr>, tabularTags <list>

Created on 2019-12-16 by the reprex package (v0.3.0)

kuriwaki commented 3 years ago

Duplicate column was manually removed after the fact in PR #39, in commit https://github.com/kuriwaki/dataverse-client-r/commit/49fd9e5186daad58da7bb57aa62f5ae8f1900bf9

library("dataverse")

## set the server and check the column names again
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)
#>  [1] "label"               "restricted"          "version"            
#>  [4] "datasetVersionId"    "categories"          "id"                 
#>  [7] "persistentId"        "pidURL"              "filename"           
#> [10] "contentType"         "filesize"            "description"        
#> [13] "storageIdentifier"   "rootDataFileId"      "md5"                
#> [16] "checksum"            "creationDate"        "originalFileFormat" 
#> [19] "originalFormatLabel" "originalFileSize"    "originalFileName"   
#> [22] "UNF"                 "tabularTags"

any(duplicated(colnames(obrien_files)))
#> [1] FALSE

Created on 2020-12-28 by the reprex package (v0.3.0)