IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Unable to retrieve an unpublished data file. #115

Closed: famuvie closed this issue 2 years ago

famuvie commented 2 years ago


I'm having an issue while trying to access an unpublished data file, which requires an API token. Unfortunately, this makes the following code not reproducible. Let me know if there is a way of making a reproducible example in this case.

The problem is that I cannot use any of the get_dataframe_by_* functions, due to an issue with is_ingested(), which seems unable to find the target file. However, if I work around is_ingested(), I can retrieve the data, as shown in the example below.

library(dataverse)
packageVersion("dataverse")
#> [1] '0.3.10'

get_dataframe_by_id(
  fileid = 12930,
  dataset = "https://doi.org/10.18167/DVN1/8Z1ZI9"
)
#> Error in is_ingested(fileid, ...): File information not found on Dataverse API

# A successful read
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
fileid <- 12930
query <- list(format = "original")
u_part <- "access/datafile/"
u <- paste0(dataverse:::api_url(server), u_part, fileid)
r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query)
httr::content(r)
#> Rows: 1347 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): ;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,347 × 1
#>    `;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus;H_scupe…
#>    <chr>                                                                        
#>  1 1;200532 2248;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  2 2;200531 7320;26/02/2020;CASTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  3 3;200530 9555;26/02/2020;ZALANA;;;1;0;0;0;0;0;0;0;1                          
#>  4 4;200532 8365;26/02/2020;MOLTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  5 5;200533 1185;26/02/2020;PIEDIGRIGGIO;;;0;0;0;2;0;0;0;0;2                    
#>  6 6;200532 3312;26/02/2020;CORTE;;;0;0;0;0;0;0;0;0;0                           
#>  7 7;200531 0907;26/02/2020;BORGO;;;0;0;0;0;0;0;0;0;0                           
#>  8 8;200532 2246;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  9 9;200530 8506;26/02/2020;CORTE;;;1;0;0;0;0;0;0;0;1                           
#> 10 10;200532 2245;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                      
#> # … with 1,337 more rows
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 20.1
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13 knitr_1.37      magrittr_2.0.2  rlang_1.0.0    
#>  [5] fastmap_1.1.0   fansi_0.5.0     stringr_1.4.0   styler_1.4.1   
#>  [9] highr_0.9       tools_4.1.2     xfun_0.29       utf8_1.2.2     
#> [13] cli_3.1.0       withr_2.4.2     htmltools_0.5.2 ellipsis_0.3.2 
#> [17] yaml_2.2.2      digest_0.6.29   tibble_3.1.6    lifecycle_1.0.1
#> [21] crayon_1.4.2    purrr_0.3.4     vctrs_0.3.8     fs_1.5.2       
#> [25] glue_1.6.1      evaluate_0.14   rmarkdown_2.11  reprex_2.0.1   
#> [29] stringi_1.7.6   compiler_4.1.2  pillar_1.6.4    backports_1.4.1
#> [33] pkgconfig_2.0.3
famuvie commented 2 years ago

Ultimately, the problem in is_ingested() boils down to dataverse_search() not finding the file:

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930", type = "file", server = server, key = key)
#> 0 of 0 results retrieved
#> list()

Created on 2022-02-04 by the reprex package (v2.0.1)

It is worth noting that I can find the file using keywords on the web interface, and that dataverse_search() does correctly find published files.

pdurbin commented 2 years ago

Hmm, because the file is in draft, I bet _draft would need to be appended like this:

id = "datafile_12930_draft"

@famuvie do you want to see if you can find your draft file that way with curl? You'll have to pass your API token. Docs on this at https://guides.dataverse.org/en/5.9/api/search.html

@kuriwaki this might also work:

entityId:12930

An example: https://dataverse.harvard.edu/api/search?q=entityId:3371438

(I'm not sure why I suggested id instead of entityId at https://github.com/IQSS/dataverse-client-r/issues/113#issuecomment-1011208445. The id changes (_draft is dropped on publish), but entityId stays the same.)

famuvie commented 2 years ago

I'm not sure how to pass the API token with curl, but it works with dataverse_search():

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930_draft", type = "file", server = server, key = key)
#> 1 of 1 result retrieved
#>                   name type
#> 1 Bovine_2020_2021.tab file
#>                                                    url file_id
#> 1 https://dataverse.cirad.fr/api/access/datafile/12930   12930
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              description
#> 1 On this document, there is only the tick data collection between 2020 and 2021.\n\nSome information about variable :\n\n- Vache : Identifier of cow (10 digits)\n- Date : date of slaughterhouse visit\n- Commune : origin cow municipality\n- Eleveur : origin cow breeder\n- Troupeau : municipality and breeder\n- H_marginatus : number of *H_marginatus* collected\n- R_bursa : number of *R_bursa* collected\n- I_ricinus : number of *I_ricinus* collected\n- H_scupense : number of *H_scupense* collected\n- B_annulatus : number of *B_annulatus* collected\n- R_sanguineus : number of *R_sanguineus* collected\n- H_punctata : number of *H_punctata* collected.\n- D_marginatus : number of *D_marginatus* collected\n- Tiques ? : sum of ticks collected
#>       file_type         file_content_type size_in_bytes
#> 1 Tab-Delimited text/tab-separated-values         69648
#>                                md5 checksum.type
#> 1 688c6fc5f92e6526a3cd158854027e8b           MD5
#>                     checksum.value                            unf dataset_name
#> 1 688c6fc5f92e6526a3cd158854027e8b UNF:6:lt7ZJ1diuShhMCd8UWq5zQ==       Bovine
#>   dataset_id    dataset_persistent_id
#> 1      12928 doi:10.18167/DVN1/8Z1ZI9
#>                                                                                                                                         dataset_citation
#> 1 Bartholomee, Colombine, 2022, "Bovine", https://doi.org/10.18167/DVN1/8Z1ZI9, CIRAD Dataverse, DRAFT VERSION, UNF:6:ov2odYXNktsIbiuwc2MDJQ== [fileUNF]

Created on 2022-02-04 by the reprex package (v2.0.1)

pdurbin commented 2 years ago

> I'm not sure how to pass the API token with curl. I'll check.

You can pass it as a header or a query parameter. Please see https://guides.dataverse.org/en/5.9/api/auth.html
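
For reference, the same two options can be sketched in R with httr (the server URL is a placeholder and the file id is taken from this thread; both calls need network access and a valid token):

```r
library(httr)

server <- "https://dataverse.cirad.fr"   # placeholder: your Dataverse installation
key    <- Sys.getenv("DATAVERSE_KEY")

# Option 1: pass the token as a request header (preferred; it stays out of URLs and logs)
r1 <- GET(paste0(server, "/api/search"),
          add_headers(`X-Dataverse-key` = key),
          query = list(q = "id:datafile_12930_draft", type = "file"))

# Option 2: pass the token as the `key` query parameter
r2 <- GET(paste0(server, "/api/search"),
          query = list(q = "id:datafile_12930_draft", type = "file", key = key))
```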

famuvie commented 2 years ago

Sorry, I made a mistake in the previous example and have just corrected it. It actually works!

famuvie commented 2 years ago

Still, I can't find even a hacky way to add "_draft" to the file id from user code. I guess that needs to be fixed in the package :)

kuriwaki commented 2 years ago

Thanks @famuvie for creating an issue. A partial fix is now on dev. @pdurbin, thanks for pointing out entityId. I implemented it on dev as there seems to be no downside.

I created a test dataset on the demo Dataverse that is intentionally unpublished. The get commands seem to work, except that my unpublished test file does not have a UNF under the Search API even though it does under the Files API. Have you seen this before?

Proper UNF detection matters because that is how the package currently determines whether a file is ingested.

> str(dataset_files(dataset = "10.70122/FK2/4XHVAP", server = "demo.dataverse.org")[[1]]$dataFile)
List of 16
 $ id                 : int 1951382
 $ persistentId       : chr ""
 $ pidURL             : chr ""
 $ filename           : chr "mtcars.tab"
 $ contentType        : chr "text/tab-separated-values"
 $ filesize           : int 1713
 $ storageIdentifier  : chr "s3://demo-dataverse-org:17f75571af3-60325bcbb1f1"
 $ originalFileFormat : chr "text/csv"
 $ originalFormatLabel: chr "Comma Separated Values"
 $ originalFileSize   : int 1700
 $ originalFileName   : chr "mtcars.csv"
 $ UNF                : chr "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="
 $ rootDataFileId     : int -1
 $ md5                : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum           :List of 2
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ creationDate       : chr "2022-03-10"
> str(dataverse_search(entityId = 1951382, server = "demo.dataverse.org", key = Sys.getenv("DATAVERSE_KEY")))
1 of 1 result retrieved
'data.frame':   1 obs. of  13 variables:
 $ name                 : chr "mtcars.csv"
 $ type                 : chr "file"
 $ url                  : chr "https://demo.dataverse.org/api/access/datafile/1951382"
 $ file_id              : chr "1951382"
 $ file_type            : chr "Comma Separated Values"
 $ file_content_type    : chr "text/csv"
 $ size_in_bytes        : int 1700
 $ md5                  : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum             :'data.frame':  1 obs. of  2 variables:
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ dataset_name         : chr "Permanent draft dataset for testing"
 $ dataset_id           : chr "1951381"
 $ dataset_persistent_id: chr "doi:10.70122/FK2/4XHVAP"
 $ dataset_citation     : chr "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Datav"| __truncated__
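
The outputs above suggest a sketch of the check involved (a hypothetical helper, not the actual is_ingested() internals; it assumes the Search API record layout shown above and requires network access plus a token):

```r
library(dataverse)

# Hypothetical sketch: a tabular file that has been ingested carries a UNF
# fingerprint, so its presence in the Search API record can stand in for
# "is this file ingested?". A draft file that has not been reindexed after
# ingest will lack the "unf" field, which is the discrepancy shown above.
appears_ingested <- function(fileid, server, key = Sys.getenv("DATAVERSE_KEY")) {
  res <- dataverse_search(entityId = fileid, type = "file",
                          server = server, key = key)
  is.data.frame(res) && "unf" %in% names(res) && !is.na(res$unf[1])
}
```
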
pdurbin commented 2 years ago

> The get commands seem to work, except that my unpublished test file does not have a UNF under the Search API even though it does under the Files API.

Huh. This is news to me but I see what you mean.

No UNF from the Search API when I look at your unpublished file...

curl -H X-Dataverse-key:$API_TOKEN https://demo.dataverse.org/api/search?q=id:datafile_1951382_draft

{
  "status": "OK",
  "data": {
    "q": "id:datafile_1951382_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "mtcars.csv",
        "type": "file",
        "url": "https://demo.dataverse.org/api/access/datafile/1951382",
        "file_id": "1951382",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 1700,
        "md5": "c502359c26a0931eef53b2207b2344f9",
        "checksum": {
          "type": "MD5",
          "value": "c502359c26a0931eef53b2207b2344f9"
        },
        "dataset_name": "Permanent draft dataset for testing",
        "dataset_id": "1951381",
        "dataset_persistent_id": "doi:10.70122/FK2/4XHVAP",
        "dataset_citation": "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Dataverse, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but when I look at a published file (different server but shouldn't matter), I do see a UNF:

curl https://dataverse.harvard.edu/api/search?q=id:datafile_3371438
{
  "status": "OK",
  "data": {
    "q": "id:datafile_3371438",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2019-02-25.tab",
        "type": "file",
        "url": "https://dataverse.harvard.edu/api/access/datafile/3371438",
        "file_id": "3371438",
        "description": "",
        "published_at": "2019-02-26T03:03:13Z",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 17232,
        "md5": "9bd94d028049c9a53bca9bb19d4fb57e",
        "checksum": {
          "type": "MD5",
          "value": "9bd94d028049c9a53bca9bb19d4fb57e"
        },
        "unf": "UNF:6:2MMoV8KKO8R7sb27Q5GXtA==",
        "file_persistent_id": "doi:10.7910/DVN/TJCLKP/3VSTKY",
        "dataset_name": "Open Source at Harvard",
        "dataset_id": "3035124",
        "dataset_persistent_id": "doi:10.7910/DVN/TJCLKP",
        "dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\", https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, DRAFT VERSION, UNF:6:2MMoV8KKO8R7sb27Q5GXtA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Perhaps we don't reindex the file after ingest is complete? I'm not sure. You could test this by making a change to your draft dataset metadata (add a keyword or something). This will reindex the dataset and its files.
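
One way to trigger that reindex from R, sketched with httr against the Native API's editMetadata endpoint (endpoint name per the Dataverse API guides; the server, DOI, and metadata.json contents here are placeholders):

```r
library(httr)

server <- "https://demo.dataverse.org"  # placeholder
pid    <- "doi:10.70122/FK2/4XHVAP"     # placeholder: the draft dataset's DOI
key    <- Sys.getenv("DATAVERSE_KEY")

# metadata.json contains the field(s) to add or update, e.g. a keyword or a
# description, in the Native API's JSON format.
r <- PUT(paste0(server, "/api/datasets/:persistentId/editMetadata"),
         add_headers(`X-Dataverse-key` = key),
         query = list(persistentId = pid, replace = "true"),
         body = upload_file("metadata.json"))
content(r)
```
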

kuriwaki commented 2 years ago

Yes! It was sufficient to add a data description to the draft dataset, and the search record updated. Thank you.

pdurbin commented 2 years ago

@kuriwaki hmm, I can replicate this on "develop" on my laptop (around 0d853b74e9). When I first upload a file to a draft, the UNF does not appear in search results...

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.csv",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 58690,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "dataset_name": "zzz",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzz\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but if I edit the metadata of the draft dataset (forcing the file to be reindexed), the UNF appears:

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.tab",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 59208,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "unf": "UNF:6:6YVg+pUWsYD52stDkZuzUA==",
        "dataset_name": "zzzyyy",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzzyyy\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION, UNF:6:6YVg+pUWsYD52stDkZuzUA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Please feel free to open an issue about this at https://github.com/IQSS/dataverse/issues if you'd like.

kuriwaki commented 2 years ago

I will add a tip about this to the dataverse download vignette. I think this limitation might be common for people who try to download draft datasets, but the current workaround of editing the dataset metadata seems not too onerous.

kuriwaki commented 2 years ago

Addressed in 0.3.11.