Closed kuriwaki closed 2 years ago
For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand?
My first thought is to get https://demo.dataverse.org/api/search?q=fileId:1734017 and find a dataset_persistent_id of "doi:10.70122/FK2/PPIAXE".
@pdurbin this looks promising. Our function dataverse_search() could possibly mimic this. But when I tried searching for fileId=3123547, which I expected to be this CCES file, I got something completely different: https://dataverse.harvard.edu/api/search?q=fileId:3123547. Do you know why this occurs, and how to fix the query so I get the CCES file instead?
Here is the query confirming that at least the id of the CCES file of interest is 3123547.
https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/GDF6Z0
Even though this is a search, for the purpose of this issue I need it to be a strict match on the file id (that is, return a single entry if the file id exists, and return 0 results if it does not).
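A rough sketch of the strict-match behaviour described above, in base R. It assumes the search results arrive as a data frame with `file_id` and `dataset_persistent_id` columns (the field names the Search API uses for file hits); whether `dataverse_search()` returns exactly this shape is an assumption, and `parent_dataset_pid()` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: keep only exact file-id matches from a search result
# and return the parent dataset's persistent id.
# Assumes `results` has columns `file_id` and `dataset_persistent_id`.
parent_dataset_pid <- function(results, file_id) {
  if (is.null(results) || NROW(results) == 0) {
    return(character(0))
  }
  # Compare as strings so 1734017 and "1734017" are treated the same,
  # and avoid scientific notation for large numeric ids.
  wanted <- format(file_id, scientific = FALSE, trim = TRUE)
  hit <- results[as.character(results[["file_id"]]) == wanted, , drop = FALSE]
  # Zero-length result means "no such file id"; length one means a strict match.
  unique(as.character(hit[["dataset_persistent_id"]]))
}

# Made-up rows for illustration only (the second row is not a real record).
res <- data.frame(
  name = c("a.rds", "b.dta"),
  file_id = c("1734017", "17340170"),
  dataset_persistent_id = c("doi:10.70122/FK2/PPIAXE", "doi:10.70122/FK2/OTHER"),
  stringsAsFactors = FALSE
)
parent_dataset_pid(res, 1734017)   # "doi:10.70122/FK2/PPIAXE"
parent_dataset_pid(res, 999)       # character(0)
```

Returning a zero-length vector for a miss keeps the "0 results" case easy to test with `length()`.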
Add-on Re:
For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand?
We would also need a method that can get the dataset JSON with only the file DOI (persistentId) in hand (to use in get_*_by_doi). Using the same example, where the dataset id is doi:10.70122/FK2/PPIAXE, we'd want to recover that dataset id knowing only persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O and the server.
@kuriwaki huh. fileId works on the demo server but not for Harvard Dataverse nor my dev laptop. Can you please try id:datafile_NNN instead, like the example below?
https://dataverse.harvard.edu/api/search?q=id:datafile_3123547
I don't think that "MHDB0O" file is indexed. https://dataverse.harvard.edu/api/search?q=id:datafile_1734017 should find it but it doesn't. Can you please open an issue in https://github.com/IQSS/dataverse.harvard.edu/issues about this?
For a file that is properly indexed, like the CCES file we've been talking about ( https://dataverse.harvard.edu/api/search?q=id:datafile_3123547 ), you should be able to search for it by DOI like this (note the quotes around the DOI): https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22
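To make the two query forms above concrete, here is a small base-R sketch that builds the search URL from either a numeric file id or a file DOI. `file_search_url()` is a hypothetical name used only for illustration; it just assembles the same URLs shown in this thread and does not call the server.

```r
# Hypothetical helper: build a Search API URL for one file, either by
# numeric id (id:datafile_NNN) or by file DOI (filePersistentId:"...").
file_search_url <- function(server, file_id = NULL, file_doi = NULL) {
  if (!is.null(file_id)) {
    # Avoid scientific notation if a large id is passed as a number.
    query <- paste0("id:datafile_", format(file_id, scientific = FALSE, trim = TRUE))
  } else if (!is.null(file_doi)) {
    # The DOI must be wrapped in double quotes, URL-encoded here as %22.
    query <- paste0("filePersistentId:%22", file_doi, "%22")
  } else {
    stop("Provide either file_id or file_doi")
  }
  paste0("https://", server, "/api/search?q=", query)
}

file_search_url("dataverse.harvard.edu", file_id = 3123547)
# "https://dataverse.harvard.edu/api/search?q=id:datafile_3123547"
file_search_url("dataverse.harvard.edu", file_doi = "doi:10.7910/DVN/GDF6Z0/JPMOZZ")
# "https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22"
```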
(for numeric ids) id:datafile_NNN
This is great. The following three examples work as intended: each returns the single expected entry. I will try implementing it on dev.
library(dataverse)
# rds
dataverse_search(id = "datafile_1734017", server = "demo.dataverse.org", type = "file")$name
# CCES problematic dta
dataverse_search(id = "datafile_3123547", server = "dataverse.harvard.edu", type = "file")$name
# other dataverse
dataverse_search(id = "datafile_204446", server = "dataverse.nl", type = "file")$name
I don't think that "MHDB0O" file is indexed.
That actually came from the demo dataverse, not Harvard dataverse. This one works great: https://demo.dataverse.org/api/search?q=id:datafile_1734017
For a file that is properly indexed, like the CCES file we've been talking about, you should be able to search for it by DOI like this (note the quotes around the DOI)
Thank you. This seems to work in the two examples below, with the quotes escaped:
# CCES
dataverse_search(filePersistentId = "\"doi:10.7910/DVN/GDF6Z0/JPMOZZ\"", server = "dataverse.harvard.edu")$name
# demo.dataverse
dataverse_search(filePersistentId = "\"doi:10.70122/FK2/HXJVJU/SA3Z2V\"", server = "demo.dataverse.org")$name
The current method to detect whether something is_ingested, introduced in v0.3.0, is problematic: it only checks if there is a metadata file associated with the fileid. But I guess some files, e.g. those that have ingestion warnings, don't have a metadata file. This can cause the wrong download format as in #80.
If I have a dataset id or name, I now know how to check whether something is ingested: check if the entry originalFileFormat exists (e.g. this JSON).
However, in the particular stage of the client, I sometimes don't have a dataset identifier, only the numeric fileid + server. This happens for example with get_*_by_doi, where the user only provides a file DOI. @landreev pointed out that the Dataverse api/files API apparently does not contain info like originalFileFormat, perhaps for legacy reasons.
For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand? (@pdurbin ?). In the above example, how would I obtain the dataset id doi:10.70122/FK2/PPIAXE only by knowing fileid=1734017 and server = demo.dataverse.org?
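For reference, a minimal sketch of the originalFileFormat check mentioned above, in base R. It assumes the dataset JSON has already been parsed into a list of file entries, each with a `dataFile` element carrying `id` and (for ingested files) `originalFileFormat`; that shape and the name `is_ingested_from_json()` are assumptions for illustration, not the package's actual implementation.

```r
# Hypothetical helper: decide whether a file is ingested by checking for
# `originalFileFormat` on its entry in the parsed dataset JSON.
# `files` is assumed to be a list of entries shaped like
# list(dataFile = list(id = ..., originalFileFormat = ...)).
is_ingested_from_json <- function(files, file_id) {
  for (f in files) {
    df <- f[["dataFile"]]
    # [[ ]] is used instead of $ to avoid partial name matching on lists.
    if (!is.null(df[["id"]]) && df[["id"]] == file_id) {
      return(!is.null(df[["originalFileFormat"]]))
    }
  }
  NA  # file id not found in this dataset
}

# Made-up entries for illustration only.
files <- list(
  list(dataFile = list(id = 1L, filename = "a.tab",
                       originalFileFormat = "application/x-stata")),
  list(dataFile = list(id = 2L, filename = "b.rds"))
)
is_ingested_from_json(files, 1)  # TRUE
is_ingested_from_json(files, 2)  # FALSE
is_ingested_from_json(files, 3)  # NA
```

Returning NA for an unknown id keeps "not ingested" distinct from "not in this dataset".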