datalad / datalad-usage-dashboard

Dashboard of detected usages of DataLad

Support discovery of datalad datasets on dataverse #46

Open yarikoptic opened 4 months ago

yarikoptic commented 4 months ago

Sample dataset on the demo node, in the non-exported (key store) flavor of the special remote:

So it seems we need to search for datasets that have a file like XDLRA-2D--2D-refs, probably any filename starting with XDLRA- and ending with -refs.
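A minimal sketch of that filename check, assuming the pattern just described (the helper name is hypothetical):

```python
import re

# Assumed pattern for the "refs" object the datalad-dataverse special remote
# stores in key-store mode, e.g. "XDLRA-2D--2D-refs".
XDLRA_REFS = re.compile(r"XDLRA-.*-refs")

def looks_like_datalad_refs(filename: str) -> bool:
    """Return True if a filename matches the assumed special-remote pattern."""
    return XDLRA_REFS.fullmatch(filename) is not None

assert looks_like_datalad_refs("XDLRA-2D--2D-refs")
```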

There is a JSON file listing all current Dataverse deployments (in case we are greedy enough to search through all of them):

For now we could just go through https://demo.dataverse.org/ and https://dataverse.harvard.edu as "groups" (analogous to organizations for GitHub) and not care about any others.

An example search API invocation for that exact filename (for now):
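Presumably the invocation followed the same fileName query pattern @pdurbin shows later in this thread; a hedged reconstruction, with the exact host and filename as assumptions:

https://demo.dataverse.org/api/search?q=fileName:%22XDLRA-2D--2D-refs%22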

In the returned record we get:

"dataset_name": "Alt",
"dataset_id": "2349618",
"dataset_persistent_id": "doi:10.70122/FK2/BUOCCS",

The "things" to record would be the

per each dataset. Hyperlink for a dataset would be constructed as https://{hostname}/dataset.xhtml?persistentId=doi:{dataset_persistent_id}.
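A minimal sketch of that construction, using the field names from the search result above (the function name is hypothetical):

```python
def dataset_url(hostname: str, record: dict) -> str:
    """Landing-page URL for a Dataverse dataset search record."""
    # dataset_persistent_id already includes the "doi:" prefix,
    # e.g. "doi:10.70122/FK2/BUOCCS"
    return f"https://{hostname}/dataset.xhtml?persistentId={record['dataset_persistent_id']}"

# dataset_url("demo.dataverse.org", {"dataset_persistent_id": "doi:10.70122/FK2/BUOCCS"})
# -> "https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/BUOCCS"
```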

Note: for those URLs to become clonable, DataLad should first be configured to load the dataverse and next extensions via changes to ~/.gitconfig:

[datalad "extensions"]
    load = next
    load = dataverse
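Equivalently, the same multi-valued keys could be added from the command line (a sketch, assuming a standard git setup):

    git config --global --add datalad.extensions.load next
    git config --global --add datalad.extensions.load dataverse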
pdurbin commented 4 months ago

Right, as we discussed at the Distribits hackathon, now that @yarikoptic has a published dataset in Harvard Dataverse that came from DataLad, we can find it with this query:

https://dataverse.harvard.edu/api/search?q=fileName:%22repo.zip%22

Here's how the search result looks:

    {
      "status": "OK",
      "data": {
        "q": "fileName:\"repo.zip\"",
        "total_count": 1,
        "start": 0,
        "spelling_alternatives": {},
        "items": [
          {
            "name": "repo.zip",
            "type": "file",
            "url": "https://dataverse.harvard.edu/api/access/datafile/10069635",
            "file_id": "10069635",
            "description": "",
            "published_at": "2024-04-08T11:44:45Z",
            "file_type": "Unknown",
            "file_content_type": "application/octet-stream",
            "size_in_bytes": 155736,
            "md5": "b83bbf83371526579887b5879c3dce1f",
            "checksum": {
              "type": "MD5",
              "value": "b83bbf83371526579887b5879c3dce1f"
            },
            "dataset_name": "OpenNeuro:ds000003 Rhyme judgment (trimmed)",
            "dataset_id": "10069469",
            "dataset_persistent_id": "doi:10.7910/DVN/VMSH8U",
            "dataset_citation": "Halchenko, Yaroslav, 2024, \"OpenNeuro:ds000003 Rhyme judgment (trimmed)\", https://doi.org/10.7910/DVN/VMSH8U, Harvard Dataverse, V1"
          }
        ],
        "count_in_response": 1
      }
    }

As mentioned above, the dataset-level fields to focus on are these:

"dataset_name": "OpenNeuro:ds000003 Rhyme judgment (trimmed)",
"dataset_id": "10069469",
"dataset_persistent_id": "doi:10.7910/DVN/VMSH8U",
"dataset_citation": "Halchenko, Yaroslav, 2024, \"OpenNeuro:ds000003 Rhyme judgment (trimmed)\", https://doi.org/10.7910/DVN/VMSH8U, Harvard Dataverse, V1"

https://doi.org/10.7910/DVN/VMSH8U will resolve and redirect to the dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VMSH8U
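A minimal end-to-end sketch of this discovery step, assuming the requests library and the standard Dataverse search API paging parameters (start, per_page); the function name and the tuple layout are hypothetical:

```python
import requests

def find_datalad_datasets(hostname: str, filename: str = "repo.zip"):
    """Yield (name, id, persistent_id, citation, url) for each dataset on a
    Dataverse instance containing a file with the given name."""
    start, per_page = 0, 100
    while True:
        resp = requests.get(
            f"https://{hostname}/api/search",
            params={
                "q": f'fileName:"{filename}"',
                "type": "file",
                "start": start,
                "per_page": per_page,
            },
        )
        resp.raise_for_status()
        data = resp.json()["data"]
        for item in data["items"]:
            yield (
                item["dataset_name"],
                item["dataset_id"],
                item["dataset_persistent_id"],
                item.get("dataset_citation", ""),
                # landing page, constructed as described earlier in this thread
                f"https://{hostname}/dataset.xhtml"
                f"?persistentId={item['dataset_persistent_id']}",
            )
        # page through results using the counters in the response
        got = data["count_in_response"]
        start += got
        if got == 0 or start >= data["total_count"]:
            break

for record in find_datalad_datasets("dataverse.harvard.edu"):
    print(record)
```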

@yarikoptic and I talked about different ways to identify DataLad datasets. This "search for repo.zip" approach seems promising but could probably be refined. It's a good start!

yarikoptic commented 4 months ago

I think we are now doomed to wait (hopefully just a little) for @joeyh to (re)implement support for "git remotes in git-annex special remotes" natively in git-annex; that is the design project he worked on with @mih during the Distribits hackathon.