Unable to retrieve an unpublished data file. #115

Closed famuvie closed 2 years ago

famuvie commented 2 years ago

I'm having an issue while trying to access an unpublished data file, which requires an API token. Unfortunately, this makes the following code not reproducible. Let me know if there is a way of making a reproducible example in this case.

The problem is that I cannot use any of the get_dataframe_by_* functions because is_ingested() seems unable to find the target file. However, if I work around is_ingested(), I can retrieve the data, as shown in the example below.

#> [1] '0.3.10'

  fileid = 12930,
  dataset = ""
#> Error in is_ingested(fileid, ...): File information not found on Dataverse API

# A successful read, bypassing is_ingested()
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
fileid <- 12930
# Request the file in its originally uploaded format
query <- list(format = "original")
u_part <- "access/datafile/"
# api_url() is an internal (unexported) function of the dataverse package
u <- paste0(dataverse:::api_url(server), u_part, fileid)
r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query)
#> Rows: 1347 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): ;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus...
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,347 × 1
#>    `;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus;H_scupe…
#>    <chr>                                                                        
#>  1 1;200532 2248;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  2 2;200531 7320;26/02/2020;CASTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  3 3;200530 9555;26/02/2020;ZALANA;;;1;0;0;0;0;0;0;0;1                          
#>  4 4;200532 8365;26/02/2020;MOLTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  5 5;200533 1185;26/02/2020;PIEDIGRIGGIO;;;0;0;0;2;0;0;0;0;2                    
#>  6 6;200532 3312;26/02/2020;CORTE;;;0;0;0;0;0;0;0;0;0                           
#>  7 7;200531 0907;26/02/2020;BORGO;;;0;0;0;0;0;0;0;0;0                           
#>  8 8;200532 2246;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  9 9;200530 8506;26/02/2020;CORTE;;;1;0;0;0;0;0;0;0;1                           
#> 10 10;200532 2245;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                      
#> # … with 1,337 more rows

famuvie commented 2 years ago

Ultimately, the problem in is_ingested() boils down to dataverse_search() not finding the file:

server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930", type = "file", server = server, key = key)
#> 0 of 0 results retrieved
#> list()

Note that I can find the file by searching for keywords in the web interface, and that dataverse_search() does correctly find a published file.

pdurbin commented 2 years ago

Hmm, because the file is in draft, I bet _draft would need to be appended like this:

id = "datafile_12930_draft"

@famuvie do you want to see if you can find your draft file that way with curl? You'll have to pass your API token. Docs on this at

@kuriwaki this might also work:


An example:

(I'm not sure why I suggested id instead of entityId at . The id changes (_draft is dropped on publish) but entityId stays the same.)
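
The suffix convention discussed above can be sketched as a small helper. This is only an illustration: `search_doc_id()` is a hypothetical name that does not exist in the dataverse package, and it simply assembles the `datafile_<id>` string used in this thread, appending `_draft` for unpublished files.

```r
# Hypothetical helper (not part of the dataverse package): build the
# search document id used in this thread for a given numeric file id.
search_doc_id <- function(fileid, draft = FALSE) {
  # sprintf("%d") avoids scientific notation for ids such as 100000
  id <- sprintf("datafile_%d", as.integer(fileid))
  # Per the discussion above, draft files carry a "_draft" suffix
  # that is dropped once the dataset is published.
  if (draft) paste0(id, "_draft") else id
}

search_doc_id(12930)                # "datafile_12930"
search_doc_id(12930, draft = TRUE)  # "datafile_12930_draft"

# Assumed usage, mirroring the calls shown in this thread:
# dataverse_search(id = search_doc_id(12930, draft = TRUE), type = "file",
#                  server = server, key = key)
```

Searching by entityId would avoid the suffix altogether, since that value stays the same before and after publication.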

famuvie commented 2 years ago

Not sure how to pass the API token with curl, but it works with dataverse_search():

server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930_draft", type = "file", server = server, key = key)
#> 1 of 1 result retrieved
#>                   name type
#> 1 file
#>                                                    url file_id
#> 1   12930
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              description
#> 1 On this document, there is only the tick data collection between 2020 and 2021.\n\nSome information about variable :\n\n- Vache : Identifier of cow (10 digits)\n- Date : date of slaughterhouse visit\n- Commune : origin cow municipality\n- Eleveur : origin cow breeder\n- Troupeau : municipality and breeder\n- H_marginatus : number of *H_marginatus* collected\n- R_bursa : number of *R_bursa* collected\n- I_ricinus : number of *I_ricinus* collected\n- H_scupense : number of *H_scupense* collected\n- B_annulatus : number of *B_annulatus* collected\n- R_sanguineus : number of *R_sanguineus* collected\n- H_punctata : number of *H_punctata* collected.\n- D_marginatus : number of *D_marginatus* collected\n- Tiques ? : sum of ticks collected
#>       file_type         file_content_type size_in_bytes
#> 1 Tab-Delimited text/tab-separated-values         69648
#>                                md5 checksum.type
#> 1 688c6fc5f92e6526a3cd158854027e8b           MD5
#>                     checksum.value                            unf dataset_name
#> 1 688c6fc5f92e6526a3cd158854027e8b UNF:6:lt7ZJ1diuShhMCd8UWq5zQ==       Bovine
#>   dataset_id    dataset_persistent_id
#> 1      12928 doi:10.18167/DVN1/8Z1ZI9
#>                                                                                                                                         dataset_citation
#> 1 Bartholomee, Colombine, 2022, "Bovine",, CIRAD Dataverse, DRAFT VERSION, UNF:6:ov2odYXNktsIbiuwc2MDJQ== [fileUNF]

pdurbin commented 2 years ago

I'm not sure how to pass the API token with curl. I'll check.

You can pass it as a header or a query parameter. Please see
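
For R users, the two options might look roughly like the sketch below. The helper names `token_header()` and `token_query()` are made up for illustration, and the query parameter name `key` is an assumption based on the Dataverse API guide; the header name `X-Dataverse-key` is the one used elsewhere in this thread.

```r
# Hypothetical helpers (illustrative names, not part of any package).

# Option 1: send the token in the X-Dataverse-key request header.
token_header <- function(key) c("X-Dataverse-key" = key)

# Option 2: send the token as the `key` query parameter
# (parameter name assumed from the Dataverse API guide).
token_query <- function(key) list(key = key)

# Assumed usage with httr, where `u` is the full URL of the API request:
# key <- Sys.getenv("DATAVERSE_KEY")
# httr::GET(u, httr::add_headers(.headers = token_header(key)))
# httr::GET(u, query = token_query(key))
```

The header form is probably the safer default, because query strings tend to end up in server logs and shell history.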

famuvie commented 2 years ago

Sorry, I made a mistake in the previous example and have just corrected it. It actually works!

famuvie commented 2 years ago

Still, I can't find a hacky way to add "_draft" to the file id. I guess that needs to be fixed in the package :)

kuriwaki commented 2 years ago

Thanks @famuvie for creating an issue. A partial fix is now on dev. @pdurbin, thanks for pointing out entityId. I implemented it on dev as there seems to be no downside.

I created a test dataset on demo dataverse that is intentionally unpublished. The get commands seem to go ok except for my unpublished test file does not have a UNF under the SEARCH API even though it does with the File API. Have you seen this before?

Proper UNF detection becomes necessary since that's how is_ingested() currently determines whether a file is ingested or not.
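
To make the idea concrete, here is a minimal sketch of a UNF-based check. It is not the package's actual is_ingested() code: `has_unf()` is a hypothetical helper, and the toy data frames only imitate the shape of a dataverse_search() result.

```r
# Hypothetical sketch, not the package's actual is_ingested() code:
# treat a file as ingested when the search result has a non-empty UNF.
has_unf <- function(res) {
  if (!is.data.frame(res) || nrow(res) == 0 || !"unf" %in% names(res)) {
    return(FALSE)
  }
  unf <- as.character(res$unf[1])
  !is.na(unf) && nzchar(unf)
}

# Toy data frames standing in for dataverse_search() results.
no_unf <- data.frame(name = "mtcars.csv", file_id = "1951382",
                     stringsAsFactors = FALSE)
with_unf <- data.frame(name = "mtcars.csv", file_id = "1951382",
                       unf = "UNF:6:KRE/AItWGJWd5tJ+bboN7A==",
                       stringsAsFactors = FALSE)

has_unf(no_unf)    # FALSE
has_unf(with_unf)  # TRUE
```

A check like this reports "not ingested" whenever the search index omits the UNF, which is exactly the situation shown in the output below.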

> str(dataset_files(dataset = "10.70122/FK2/4XHVAP", server = "")[[1]]$dataFile)
List of 16
 $ id                 : int 1951382
 $ persistentId       : chr ""
 $ pidURL             : chr ""
 $ filename           : chr ""
 $ contentType        : chr "text/tab-separated-values"
 $ filesize           : int 1713
 $ storageIdentifier  : chr "s3://demo-dataverse-org:17f75571af3-60325bcbb1f1"
 $ originalFileFormat : chr "text/csv"
 $ originalFormatLabel: chr "Comma Separated Values"
 $ originalFileSize   : int 1700
 $ originalFileName   : chr "mtcars.csv"
 $ UNF                : chr "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="
 $ rootDataFileId     : int -1
 $ md5                : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum           :List of 2
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ creationDate       : chr "2022-03-10"
> str(dataverse_search(entityId = 1951382, server = "", key = Sys.getenv("DATAVERSE_KEY")))
1 of 1 result retrieved
'data.frame':   1 obs. of  13 variables:
 $ name                 : chr "mtcars.csv"
 $ type                 : chr "file"
 $ url                  : chr ""
 $ file_id              : chr "1951382"
 $ file_type            : chr "Comma Separated Values"
 $ file_content_type    : chr "text/csv"
 $ size_in_bytes        : int 1700
 $ md5                  : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum             :'data.frame':  1 obs. of  2 variables:
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ dataset_name         : chr "Permanent draft dataset for testing"
 $ dataset_id           : chr "1951381"
 $ dataset_persistent_id: chr "doi:10.70122/FK2/4XHVAP"
 $ dataset_citation     : chr "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\",, Demo Datav"| __truncated__

pdurbin commented 2 years ago

The get commands seem to go ok except for my unpublished test file does not have a UNF under the SEARCH API even though it does with the File API.

Huh. This is news to me but I see what you mean.

No UNF from the Search API when I look at your unpublished file...

curl -H X-Dataverse-key:$API_TOKEN

{
  "status": "OK",
  "data": {
    "q": "id:datafile_1951382_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "mtcars.csv",
        "type": "file",
        "url": "",
        "file_id": "1951382",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 1700,
        "md5": "c502359c26a0931eef53b2207b2344f9",
        "checksum": {
          "type": "MD5",
          "value": "c502359c26a0931eef53b2207b2344f9"
        },
        "dataset_name": "Permanent draft dataset for testing",
        "dataset_id": "1951381",
        "dataset_persistent_id": "doi:10.70122/FK2/4XHVAP",
        "dataset_citation": "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\",, Demo Dataverse, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but when I look at a published file (different server but shouldn't matter), I do see a UNF:

{
  "status": "OK",
  "data": {
    "q": "id:datafile_3371438",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "",
        "type": "file",
        "url": "",
        "file_id": "3371438",
        "description": "",
        "published_at": "2019-02-26T03:03:13Z",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 17232,
        "md5": "9bd94d028049c9a53bca9bb19d4fb57e",
        "checksum": {
          "type": "MD5",
          "value": "9bd94d028049c9a53bca9bb19d4fb57e"
        },
        "unf": "UNF:6:2MMoV8KKO8R7sb27Q5GXtA==",
        "file_persistent_id": "doi:10.7910/DVN/TJCLKP/3VSTKY",
        "dataset_name": "Open Source at Harvard",
        "dataset_id": "3035124",
        "dataset_persistent_id": "doi:10.7910/DVN/TJCLKP",
        "dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\",, Harvard Dataverse, DRAFT VERSION, UNF:6:2MMoV8KKO8R7sb27Q5GXtA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Perhaps we don't reindex the file after ingest is complete? I'm not sure. You could test this by making a change to your draft dataset metadata (add a keyword or something). This will reindex the dataset and its files.

kuriwaki commented 2 years ago

Yes! It was sufficient to add a data description to the draft dataset, and it somehow updated. Thank you.

pdurbin commented 2 years ago

@kuriwaki hmm, I can replicate this on "develop" on my laptop (around 0d853b74e9). When I first upload a file to a draft, the UNF does not appear in search results...

$ curl -s -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/search?q=id:datafile_5_draft" | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.csv",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 58690,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "dataset_name": "zzz",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzz\",, Root, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but if I edit the metadata of the draft dataset (forcing the file to be reindexed), the UNF appears:

$ curl -s -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/search?q=id:datafile_5_draft" | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 59208,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "unf": "UNF:6:6YVg+pUWsYD52stDkZuzUA==",
        "dataset_name": "zzzyyy",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzzyyy\",, Root, DRAFT VERSION, UNF:6:6YVg+pUWsYD52stDkZuzUA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Please feel free to open an issue about this at if you'd like.

kuriwaki commented 2 years ago

I will put a tip about this in the dataverse download vignette. I think this limitation may be common among people who try to download draft datasets, but the current workaround of editing something in the draft does not seem too onerous.
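
Besides editing the draft to trigger a reindex, one possible fallback is to read the UNF from the File API listing, which did report it for the draft file earlier in this thread. The sketch below is only an illustration under assumptions: `unf_from_files()` is a hypothetical helper, and the toy list merely imitates the structure printed by str(dataset_files(...)) above (elements with a `dataFile` entry holding `id` and `UNF`).

```r
# Hypothetical helper: look up a file's UNF in a dataset_files()-style list.
# Returns NA_character_ when the file is absent or has no UNF.
unf_from_files <- function(files, fileid) {
  for (f in files) {
    df <- f$dataFile
    if (!is.null(df$id) && df$id == fileid) {
      return(if (is.null(df$UNF)) NA_character_ else df$UNF)
    }
  }
  NA_character_
}

# Toy list mimicking the structure printed earlier in this thread.
files <- list(
  list(dataFile = list(id = 1951382L,
                       UNF = "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="))
)

unf_from_files(files, 1951382)  # "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="
unf_from_files(files, 42)       # NA
```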

kuriwaki commented 2 years ago

Addressed by 0.3.11.