IQSS / dataverse-client-r

R Client for Dataverse Repositories
https://iqss.github.io/dataverse-client-r

Better detection test for whether a file is ingested #113

Closed kuriwaki closed 2 years ago

kuriwaki commented 2 years ago

The current method to detect whether something is_ingested, introduced in v0.3.0, is problematic: it only checks whether there is a metadata file associated with the fileid. But I guess some files, e.g. those that have ingestion warnings, don't have a metadata file. This can cause the wrong download format, as in #80.

If I have a dataset id or name, I now know how to check whether something is ingested: check if the entry originalFileFormat exists (e.g. this JSON).
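
A minimal sketch of that check, assuming the dataset JSON has already been parsed into nested R lists and that each file entry carries its details under a `dataFile` element (an assumption about the JSON layout, not confirmed in this thread). The helper name `has_original_format()` and the two mock entries are hypothetical.

```r
# Hypothetical helper: `file_entry` is one element of the dataset JSON's
# file list, already parsed into an R list.
# Assumption: ingested files carry `originalFileFormat` under `dataFile`.
has_original_format <- function(file_entry) {
  data_file <- file_entry[["dataFile"]]
  !is.null(data_file) && !is.null(data_file[["originalFileFormat"]])
}

# Mock entries (not real API output): one ingested file, one that is not.
ingested <- list(
  label = "data.tab",
  dataFile = list(id = 1L, originalFileFormat = "application/x-stata")
)
not_ingested <- list(
  label = "data.rds",
  dataFile = list(id = 2L)
)

has_original_format(ingested)
has_original_format(not_ingested)
```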

However, in the particular stage of the client, I sometimes don't have a dataset identifier, only the numeric fileid + server. This happens for example with get_*_by_doi where the user only provides a file DOI. @landreev pointed out that the Dataverse api/files API apparently does not contain info like originalFileFormat, perhaps for legacy reasons.

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand? (@pdurbin ?). In the above example, how would I obtain the dataset id doi:10.70122/FK2/PPIAXE knowing only the file id = 1734017 and server = demo.dataverse.org?

pdurbin commented 2 years ago

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand?

My first thought is to get https://demo.dataverse.org/api/search?q=fileId:1734017 and find a dataset_persistent_id of "doi:10.70122/FK2/PPIAXE".

kuriwaki commented 2 years ago

@pdurbin this looks promising. Our function dataverse_search() could possibly mimic this. But when I tried searching for fileId=3123547, which I expected to be this CCES file, I got something completely different: https://dataverse.harvard.edu/api/search?q=fileId:3123547. Do you know why this occurs, and how to fix the query so I get the CCES file instead?

Here is the query confirming that at least the id of the CCES file of interest is 3123547. https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/GDF6Z0

Even though this is a search, for the purpose of this issue I'd need it to be a strict match on the file id (that is, return a single entry if the file id exists, and return 0 results if the file id does not exist).
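
The strict-match requirement can be sketched as a check on an already-parsed search response. This assumes the response's `data` element holds an `items` list and that each file hit carries a `dataset_persistent_id` field (the field name mentioned above). The helper name `strict_parent_dataset()` and the mock responses are hypothetical.

```r
# Hypothetical helper: `search_data` is the parsed `data` element of a
# search response (assumed to contain an `items` list).
strict_parent_dataset <- function(search_data) {
  items <- search_data[["items"]]
  # Strict match: anything other than exactly one hit counts as "not found".
  if (length(items) != 1L) {
    return(NA_character_)
  }
  pid <- items[[1L]][["dataset_persistent_id"]]
  if (is.null(pid)) NA_character_ else pid
}

# Mock responses (not real API output): one hit, and no hits.
one_hit <- list(items = list(
  list(name = "example.tab", dataset_persistent_id = "doi:10.70122/FK2/PPIAXE")
))
no_hit <- list(items = list())

strict_parent_dataset(one_hit)
strict_parent_dataset(no_hit)
```

Returning `NA` for both zero and multiple hits keeps the caller from silently picking the wrong file when a query turns out to be fuzzy.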

kuriwaki commented 2 years ago

Add-on Re:

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand?

We would also need a method that can get the dataset JSON with only the file DOI (persistentId) in hand, to use in get_*_by_doi. Using the same example of dataset id doi:10.70122/FK2/PPIAXE, we'd want to know the dataset id with only persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O and the server.

pdurbin commented 2 years ago

@kuriwaki huh. fileId works on the demo server but not for Harvard Dataverse nor my dev laptop. Can you please try id:datafile_NNN instead, like the example below?

https://dataverse.harvard.edu/api/search?q=id:datafile_3123547
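
A small sketch of how such a URL could be built from a numeric file id and a server. The helper name `file_id_search_url()` is hypothetical, and appending `type=file` to restrict results to files is an assumption, not something required by the query above.

```r
# Hypothetical helper: build the id:datafile_NNN search URL for one file id.
file_id_search_url <- function(fileid, server) {
  id <- suppressWarnings(as.integer(fileid))
  # Reject anything that is not a single numeric id.
  stopifnot(length(id) == 1L, !is.na(id))
  # Assumption: `type=file` limits the search to file entries.
  sprintf("https://%s/api/search?q=id:datafile_%d&type=file", server, id)
}

file_id_search_url(3123547, "dataverse.harvard.edu")
```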

pdurbin commented 2 years ago

I don't think that "MHDB0O" file is indexed. https://dataverse.harvard.edu/api/search?q=id:datafile_1734017 should find it but it doesn't. Can you please open an issue in https://github.com/IQSS/dataverse.harvard.edu/issues about this?

For a file that is properly indexed, like the CCES file we've been talking about ( https://dataverse.harvard.edu/api/search?q=id:datafile_3123547 ), you should be able to search for it by DOI like this (note the quotes around the DOI): https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22
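
To illustrate where the `%22` comes from, here is a sketch that wraps a file DOI in double quotes and percent-encodes only the query string with `utils::URLencode()`. The helper name `file_doi_search_url()` is hypothetical.

```r
# Hypothetical helper: build a filePersistentId search URL for a file DOI.
file_doi_search_url <- function(file_doi, server) {
  # The DOI must be wrapped in double quotes inside the query.
  query <- sprintf('filePersistentId:"%s"', file_doi)
  # reserved = FALSE leaves ":" and "/" alone but encodes the quotes as %22.
  paste0(
    "https://", server, "/api/search?q=",
    utils::URLencode(query, reserved = FALSE)
  )
}

file_doi_search_url("doi:10.7910/DVN/GDF6Z0/JPMOZZ", "dataverse.harvard.edu")
```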

kuriwaki commented 2 years ago

(for numeric id's)

id:datafile_NNN

This is great. The following three examples work as intended - they give me the single entry. I will try implementing it on dev.

library(dataverse)

# rds file on the demo server
dataverse_search(id = "datafile_1734017", server = "demo.dataverse.org", type = "file")$name

# CCES problematic dta
dataverse_search(id = "datafile_3123547", server = "dataverse.harvard.edu", type = "file")$name

# other dataverse
dataverse_search(id = "datafile_204446", server = "dataverse.nl", type = "file")$name

kuriwaki commented 2 years ago

I don't think that "MHDB0O" file is indexed.

That actually came from the demo dataverse, not Harvard dataverse. This one works great: https://demo.dataverse.org/api/search?q=id:datafile_1734017

For a file that is properly indexed, like the CCES file we've been talking about, you should be able to search for it by DOI like this (note the quotes around the DOI)

Thank you. This seems to work in the two examples below, with the quotes escaped:

# CCES
dataverse_search(filePersistentId = "\"doi:10.7910/DVN/GDF6Z0/JPMOZZ\"", server = "dataverse.harvard.edu")$name

# demo.dataverse
dataverse_search(filePersistentId = "\"doi:10.70122/FK2/HXJVJU/SA3Z2V\"", server = "demo.dataverse.org")$name
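
To avoid hand-escaping the quotes each time, a small wrapper could add them. This is only a sketch: the helper name `quote_for_search()` is hypothetical, and the commented-out call simply mirrors the examples above (it needs the dataverse package and network access).

```r
# Hypothetical helper: wrap a persistent id in double quotes for a search
# query, leaving it unchanged if it is already quoted.
quote_for_search <- function(pid) {
  already_quoted <- grepl('^".*"$', pid)
  ifelse(already_quoted, pid, paste0('"', pid, '"'))
}

quote_for_search("doi:10.7910/DVN/GDF6Z0/JPMOZZ")

# Usage mirroring the examples above:
# dataverse_search(
#   filePersistentId = quote_for_search("doi:10.7910/DVN/GDF6Z0/JPMOZZ"),
#   server = "dataverse.harvard.edu"
# )$name
```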