datalad / datalad-registry

Public DataLad dataset UUID resolver endpoint #217

Open mih opened 1 year ago

mih commented 1 year ago

This is related to #212. Please forgive any naive questions; I am not at all familiar with the current state of development.

My task is to come up with an IRI (Internationalized Resource Identifier) for datalad datasets. While not necessary, this IRI could be an actual URL that offers information on a dataset.

Do you have plans to operate an "endpoint" that could resolve a known dataset UUID to such a page/structured record? If so, can you tell me (or anticipate) the URL (pattern) for such a resolver?

Just to be clear: For a paper DOI we have https://doi.org/10.21105/joss.03262 that is composed of

I am asking for the datalad-registry equivalent of `https://doi.org/

Thanks in advance!

Ping

candleindark commented 1 year ago

Currently, our API supports queries by dataset UUID. The response to such a query is a set (or page) of dataset URLs. A copy of the dataset with the given UUID resides at each of the returned URLs.

The format of the query is:

GET /api/v2/dataset-urls?ds_id={UUID}

For example, the query of

GET /api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a

gives the response of

{
  "total": 1,
  "cur_pg_num": 1,
  "first_pg": "/api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a&per_page=20&order_by=last_update&order_dir=desc&page=1",
  "last_pg": "/api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a&per_page=20&order_by=last_update&order_dir=desc&page=1",
  "dataset_urls": [
    {
      "url": "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git",
      "id": 4500,
      "ds_id": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
      "describe": "194f0422ca",
      "annex_key_count": 0,
      "annexed_files_in_wt_count": 7100,
      "annexed_files_in_wt_size": 7901879582,
      "last_update": "2023-05-17T07:03:05.174447+00:00",
      "git_objects_kb": 38944,
      "processed": true
    }
  ]
}
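As a rough illustration, the query above could be issued from Python as sketched below. The base URL `https://registry.example.org` is a placeholder assumption (the host of the registry instance is not stated in this thread), and the helper names are hypothetical. Only the first page of results is read.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder host; substitute the actual registry instance.
BASE_URL = "https://registry.example.org"


def dataset_urls_query(ds_id: str, base_url: str = BASE_URL) -> str:
    """Build the query URL listing all dataset URLs known for a dataset UUID."""
    return f"{base_url}/api/v2/dataset-urls?{urlencode({'ds_id': ds_id})}"


def locations(response: dict) -> list[str]:
    """Extract the clone URLs from a decoded dataset-urls response."""
    return [record["url"] for record in response["dataset_urls"]]


def resolve(ds_id: str, base_url: str = BASE_URL) -> list[str]:
    """Fetch the first result page and return its clone URLs (needs network)."""
    with urlopen(dataset_urls_query(ds_id, base_url)) as resp:
        return locations(json.load(resp))
```

`locations()` works on any decoded response of the shape shown above, so it can be tried on the example payload without contacting a server.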

One can also have the related metadata included in the response by using the following query format:

GET /api/v2/dataset-urls?ds_id={UUID}&return_metadata=content

For example, the query of

GET /api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a&return_metadata=content

gives the response of

{
  "total": 1,
  "cur_pg_num": 1,
  "first_pg": "/api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a&return_metadata=content&per_page=20&order_by=last_update&order_dir=desc&page=1",
  "last_pg": "/api/v2/dataset-urls?ds_id=3304e775-5f5f-435a-b68e-d98c9f5fb72a&return_metadata=content&per_page=20&order_by=last_update&order_dir=desc&page=1",
  "dataset_urls": [
    {
      "url": "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git",
      "id": 4500,
      "ds_id": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
      "describe": "194f0422ca",
      "annex_key_count": 0,
      "annexed_files_in_wt_count": 7100,
      "annexed_files_in_wt_size": 7901879582,
      "last_update": "2023-05-17T07:03:05.174447+00:00",
      "git_objects_kb": 38944,
      "processed": true,
      "metadata": [
        {
          "extractor_name": "datacite_gin",
          "dataset_describe": "194f0422ca",
          "dataset_version": "194f0422ca5630adf10c52159188ee2599c93706",
          "extractor_version": "0.0.1",
          "extraction_parameter": {},
          "extracted_metadata": {
            "authors": [
              {
                "firstname": "Michael",
                "lastname": "Hanke",
                "id": "ORCID:0000-0001-6398-6370"
              },
              {
                "firstname": "Adina",
                "lastname": "Wagner",
                "id": "ORCID:0000-0003-2917-3450"
              },
              {
                "firstname": "Laura",
                "lastname": "Waite",
                "id": "ORCID:0000-0003-2213-7465"
              },
              {
                "firstname": "Christian",
                "lastname": "Mönch",
                "id": "ORCID:0000-0002-3092-0612"
              }
            ],
            "title": "Cortical Surface Freesurfer",
            "description": "\"High-resolution structural images were used to generate the cortical\n surfaces using FreeSurfer (v5.3.0, freely available at\n http://surfer.nmr.mgh.harvard.edu, [Dale et al., 1999]). Additional\n high-resolution T2w images were included in the reconstruction\n (recon-all -T2 t2file).\n\n The surface quality was checked by inspecting the slice screenshots\n of QATool (v1.1, freely available at\n http://ftp.nmr.mgh.harvard.edu/fswiki/QATools). The QATool was adopted\n to take sreenshots of the high-resoluton pial surface.\"",
            "keywords": [
              "Neuroscience",
              "Studyforrest"
            ],
            "license": {
              "name": "Open Data Commons Public Domain Dedication and License (PDDL)",
              "url": "https://opendatacommons.org/licenses/pddl/"
            },
            "resourcetype": "Dataset",
            "templateversion": 1.2,
            "@context": {
              "@id": "https://gin.g-node.org/G-Node/Info/src/master/datacite.yml",
              "description": "ad-hoc vocabulary for the DataCite GIN yml format",
              "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"
            }
          }
        },
        {
          "extractor_name": "metalad_core",
          "dataset_describe": "194f0422ca",
          "dataset_version": "194f0422ca5630adf10c52159188ee2599c93706",
          "extractor_version": "1",
          "extraction_parameter": {},
          "extracted_metadata": {
            "@context": {
              "@vocab": "http://schema.org/",
              "datalad": "http://dx.datalad.org/"
            },
            "@graph": [
              {
                "@id": "d3765bf6e3a68497b42584fa0774695e",
                "@type": "agent",
                "name": "Adina Wagner",
                "email": "adina.wagner@t-online.de"
              },
              {
                "@id": "59286713dacabfbce1cecf4c865fff5a",
                "@type": "agent",
                "name": "Christian Mönch",
                "email": "christian.moench@web.de"
              },
              {
                "@id": "a2505ceb12765ca64567ea3669144deb",
                "@type": "agent",
                "name": "Christian Olaf Häusler",
                "email": "der.haeusler@gmx.net"
              },
              {
                "@id": "eb31842ad876d85c9d8ab744ecb81ac9",
                "@type": "agent",
                "name": "Laura Waite",
                "email": "laura@waite.eu"
              },
              {
                "@id": "ffa915b768c7d3096081265387bdaa4b",
                "@type": "agent",
                "name": "Michael Hanke",
                "email": "michael.hanke@gmail.com"
              },
              {
                "@id": "194f0422ca5630adf10c52159188ee2599c93706",
                "identifier": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
                "@type": "Dataset",
                "version": "0-11-g194f0422ca",
                "dateCreated": "2016-02-11T14:24:11+01:00",
                "dateModified": "2021-08-30T14:43:31+02:00",
                "hasContributor": [
                  {
                    "@id": "d3765bf6e3a68497b42584fa0774695e"
                  },
                  {
                    "@id": "59286713dacabfbce1cecf4c865fff5a"
                  },
                  {
                    "@id": "a2505ceb12765ca64567ea3669144deb"
                  },
                  {
                    "@id": "eb31842ad876d85c9d8ab744ecb81ac9"
                  },
                  {
                    "@id": "ffa915b768c7d3096081265387bdaa4b"
                  }
                ],
                "hasPart": [
                  {
                    "@id": "datalad:54e555acea857acdb2a8b713cc4c7c7b09607b3d",
                    "@type": "Dataset",
                    "name": "src/structural"
                  }
                ],
                "distribution": [
                  {
                    "@id": "05fc79a8-451d-4786-bc31-8634c47aa5db"
                  },
                  {
                    "@id": "3dd02e1b-954e-4f67-a1ef-faa238ef6a17"
                  },
                  {
                    "@id": "42062f79-73c1-43f2-b930-dae987e9e071"
                  },
                  {
                    "@id": "7c374b09-ac2b-4af3-bf05-8915b1b40d90"
                  },
                  {
                    "@id": "9536f86d-eb34-42ed-8ffc-fafd63a2b87e"
                  },
                  {
                    "@id": "98580387-2071-4d25-ba5e-e947443ea9be"
                  },
                  {
                    "@id": "b6e9399b-b95a-4325-8f29-0e41ca779df8"
                  },
                  {
                    "@id": "d0bd443a-f54e-4e9c-91b3-7b0e8089b4c6"
                  },
                  {
                    "name": "mddatasrc",
                    "url": "https://datapub.fz-juelich.de/studyforrest/studyforrest/freesurfer/.git",
                    "@id": "datalad:db2e8480-0894-4e67-93b3-28d0d64d629b"
                  },
                  {
                    "@id": "df6def2f-be2b-4bc5-9db0-9fa5997316a5"
                  },
                  {
                    "@id": "ea71d260-a10c-42ce-8864-826598ebc5d0"
                  },
                  {
                    "@id": "f1954a8f-e146-4f25-91fc-590509511321"
                  },
                  {
                    "@id": "fb94e9d2-35de-4ef9-91e1-af7235d16858"
                  },
                  {
                    "@id": "fdb6d635-1e66-44c6-9644-08ab5010b108"
                  },
                  {
                    "name": "gin",
                    "url": "https://gin.g-node.org/studyforrest/visual-areas.git"
                  },
                  {
                    "name": "origin",
                    "url": "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git"
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  ]
}
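Since each dataset URL can carry metadata from several extractors, a client will usually want to pick one of them. A minimal sketch, assuming only the response shape shown above (the function name is hypothetical):

```python
def metadata_by_extractor(response: dict, extractor_name: str) -> list[dict]:
    """Collect the metadata produced by one extractor across all dataset URLs."""
    return [
        entry["extracted_metadata"]
        for record in response["dataset_urls"]
        # "metadata" is absent when return_metadata was not requested.
        for entry in record.get("metadata", [])
        if entry["extractor_name"] == extractor_name
    ]
```

For the response above, `metadata_by_extractor(response, "datacite_gin")` would return a one-element list holding the record with the authors, title, and license.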

One can also choose to access the full representation of a dataset URL by using the "internal" ID of the dataset URL from an initial query by UUID. For example, from any one of the queries above, we can use the ID of 4500 to perform the following query.

GET /api/v2/dataset-urls/4500

and the response to that is

{
  "url": "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git",
  "id": 4500,
  "ds_id": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
  "describe": "194f0422ca",
  "annex_key_count": 0,
  "annexed_files_in_wt_count": 7100,
  "annexed_files_in_wt_size": 7901879582,
  "last_update": "2023-05-17T07:03:05.174447+00:00",
  "git_objects_kb": 38944,
  "processed": true,
  "metadata": [
    {
      "extractor_name": "datacite_gin",
      "dataset_describe": "194f0422ca",
      "dataset_version": "194f0422ca5630adf10c52159188ee2599c93706",
      "extractor_version": "0.0.1",
      "extraction_parameter": {},
      "extracted_metadata": {
        "authors": [
          {
            "firstname": "Michael",
            "lastname": "Hanke",
            "id": "ORCID:0000-0001-6398-6370"
          },
          {
            "firstname": "Adina",
            "lastname": "Wagner",
            "id": "ORCID:0000-0003-2917-3450"
          },
          {
            "firstname": "Laura",
            "lastname": "Waite",
            "id": "ORCID:0000-0003-2213-7465"
          },
          {
            "firstname": "Christian",
            "lastname": "Mönch",
            "id": "ORCID:0000-0002-3092-0612"
          }
        ],
        "title": "Cortical Surface Freesurfer",
        "description": "\"High-resolution structural images were used to generate the cortical\n surfaces using FreeSurfer (v5.3.0, freely available at\n http://surfer.nmr.mgh.harvard.edu, [Dale et al., 1999]). Additional\n high-resolution T2w images were included in the reconstruction\n (recon-all -T2 t2file).\n\n The surface quality was checked by inspecting the slice screenshots\n of QATool (v1.1, freely available at\n http://ftp.nmr.mgh.harvard.edu/fswiki/QATools). The QATool was adopted\n to take sreenshots of the high-resoluton pial surface.\"",
        "keywords": [
          "Neuroscience",
          "Studyforrest"
        ],
        "license": {
          "name": "Open Data Commons Public Domain Dedication and License (PDDL)",
          "url": "https://opendatacommons.org/licenses/pddl/"
        },
        "resourcetype": "Dataset",
        "templateversion": 1.2,
        "@context": {
          "@id": "https://gin.g-node.org/G-Node/Info/src/master/datacite.yml",
          "description": "ad-hoc vocabulary for the DataCite GIN yml format",
          "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"
        }
      }
    },
    {
      "extractor_name": "metalad_core",
      "dataset_describe": "194f0422ca",
      "dataset_version": "194f0422ca5630adf10c52159188ee2599c93706",
      "extractor_version": "1",
      "extraction_parameter": {},
      "extracted_metadata": {
        "@context": {
          "@vocab": "http://schema.org/",
          "datalad": "http://dx.datalad.org/"
        },
        "@graph": [
          {
            "@id": "d3765bf6e3a68497b42584fa0774695e",
            "@type": "agent",
            "name": "Adina Wagner",
            "email": "adina.wagner@t-online.de"
          },
          {
            "@id": "59286713dacabfbce1cecf4c865fff5a",
            "@type": "agent",
            "name": "Christian Mönch",
            "email": "christian.moench@web.de"
          },
          {
            "@id": "a2505ceb12765ca64567ea3669144deb",
            "@type": "agent",
            "name": "Christian Olaf Häusler",
            "email": "der.haeusler@gmx.net"
          },
          {
            "@id": "eb31842ad876d85c9d8ab744ecb81ac9",
            "@type": "agent",
            "name": "Laura Waite",
            "email": "laura@waite.eu"
          },
          {
            "@id": "ffa915b768c7d3096081265387bdaa4b",
            "@type": "agent",
            "name": "Michael Hanke",
            "email": "michael.hanke@gmail.com"
          },
          {
            "@id": "194f0422ca5630adf10c52159188ee2599c93706",
            "identifier": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
            "@type": "Dataset",
            "version": "0-11-g194f0422ca",
            "dateCreated": "2016-02-11T14:24:11+01:00",
            "dateModified": "2021-08-30T14:43:31+02:00",
            "hasContributor": [
              {
                "@id": "d3765bf6e3a68497b42584fa0774695e"
              },
              {
                "@id": "59286713dacabfbce1cecf4c865fff5a"
              },
              {
                "@id": "a2505ceb12765ca64567ea3669144deb"
              },
              {
                "@id": "eb31842ad876d85c9d8ab744ecb81ac9"
              },
              {
                "@id": "ffa915b768c7d3096081265387bdaa4b"
              }
            ],
            "hasPart": [
              {
                "@id": "datalad:54e555acea857acdb2a8b713cc4c7c7b09607b3d",
                "@type": "Dataset",
                "name": "src/structural"
              }
            ],
            "distribution": [
              {
                "@id": "05fc79a8-451d-4786-bc31-8634c47aa5db"
              },
              {
                "@id": "3dd02e1b-954e-4f67-a1ef-faa238ef6a17"
              },
              {
                "@id": "42062f79-73c1-43f2-b930-dae987e9e071"
              },
              {
                "@id": "7c374b09-ac2b-4af3-bf05-8915b1b40d90"
              },
              {
                "@id": "9536f86d-eb34-42ed-8ffc-fafd63a2b87e"
              },
              {
                "@id": "98580387-2071-4d25-ba5e-e947443ea9be"
              },
              {
                "@id": "b6e9399b-b95a-4325-8f29-0e41ca779df8"
              },
              {
                "@id": "d0bd443a-f54e-4e9c-91b3-7b0e8089b4c6"
              },
              {
                "name": "mddatasrc",
                "url": "https://datapub.fz-juelich.de/studyforrest/studyforrest/freesurfer/.git",
                "@id": "datalad:db2e8480-0894-4e67-93b3-28d0d64d629b"
              },
              {
                "@id": "df6def2f-be2b-4bc5-9db0-9fa5997316a5"
              },
              {
                "@id": "ea71d260-a10c-42ce-8864-826598ebc5d0"
              },
              {
                "@id": "f1954a8f-e146-4f25-91fc-590509511321"
              },
              {
                "@id": "fb94e9d2-35de-4ef9-91e1-af7235d16858"
              },
              {
                "@id": "fdb6d635-1e66-44c6-9644-08ab5010b108"
              },
              {
                "name": "gin",
                "url": "https://gin.g-node.org/studyforrest/visual-areas.git"
              },
              {
                "name": "origin",
                "url": "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git"
              }
            ]
          }
        ]
      }
    }
  ]
}

The examples above all produce dataset URLs at which a dataset with the given UUID resides. I don't think they are exactly what you want, because there can be multiple URLs serving the same dataset. However, based on what our system is already capable of, we can implement an endpoint of

GET /api/v2/dl-dataset?ds_id={UUID}

which would return a representation of a dataset instead of a representation of a dataset URL by compiling the data we have of different dataset URLs that are serving the same dataset with the given UUID.

The implementation of this GET /api/v2/dl-dataset?ds_id={UUID} endpoint has been neither planned nor discussed previously. It is just something I came up with in response to your question, but we can certainly do it.
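To make the idea concrete, here is one way such a compilation could look. This is only a sketch: the endpoint does not exist, and both the function name and the shape of the returned record are assumptions for illustration. It merges the dataset-URL records that share a UUID and reports the state of the most recently updated one.

```python
from datetime import datetime


def compile_dataset(ds_id: str, records: list[dict]) -> dict:
    """Merge dataset-URL records sharing one dataset UUID into a single record."""
    matching = [r for r in records if r["ds_id"] == ds_id]
    if not matching:
        raise LookupError(f"no dataset URL registered for {ds_id}")
    # Compare parsed timestamps rather than strings, so differing UTC
    # offsets cannot distort the ordering.
    latest = max(matching, key=lambda r: datetime.fromisoformat(r["last_update"]))
    return {
        "ds_id": ds_id,
        "urls": sorted(r["url"] for r in matching),
        "latest_describe": latest["describe"],
        "last_update": latest["last_update"],
    }
```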

mih commented 1 year ago

Thank you for the detailed overview!

From the example metadata report I see that we already settled on a location to host such a resolver: http://dx.datalad.org/ (likely in https fashion, eventually).

So we could set it up in a way that

https://dx.datalad.org/dataset/<ds_uuid> retrieves the result of /api/v2/dataset-urls?ds_id=<ds_uuid> (with "Accept: application/json"), or renders an HTML page with the equivalent.

This would make https://dx.datalad.org/dataset/<ds_uuid> a great "concept IRI" for a DataLad dataset (i.e. an unversioned dataset record). This would match the summary-level Dataset concept of the HCLSDatasetDescription.
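The mapping from such a concept IRI to the existing registry query is mechanical. A minimal sketch of that translation step, assuming the `/dataset/<ds_uuid>` path layout proposed above (the function name is hypothetical, and no resolver is running at that address as far as this thread states):

```python
import uuid
from urllib.parse import urlencode, urlsplit

RESOLVER_PREFIX = "/dataset/"  # proposed path layout, not an existing service


def api_query_for(iri: str) -> str:
    """Translate a dataset concept IRI into the registry API query path."""
    path = urlsplit(iri).path
    if not path.startswith(RESOLVER_PREFIX):
        raise ValueError(f"not a dataset IRI: {iri}")
    # uuid.UUID() rejects malformed input and normalizes the spelling.
    ds_id = str(uuid.UUID(path[len(RESOLVER_PREFIX):]))
    return "/api/v2/dataset-urls?" + urlencode({"ds_id": ds_id})
```

A resolver could then forward the request with "Accept: application/json" or render the same data as HTML.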

For the version-level description, it would also be useful to have an identifier that matches a query (endpoint). We could use the gitsha directly (it is precise and globally unique). Alternatively, we could use <ds_uuid>@<gitsha>. This would make the identifier rather long and no more unique, but maybe "nicer" for a human to process. I am unsure whether it is worth the effort.

The query GET /api/v2/dataset-urls/4500 seems to match the distribution-level description (where can I get it from). In the HCLSDatasetDescription this is a particular materialization of a version-level description. For us this would be a bit different (we generally have many versions in the same "distribution"). Yet, having a unique ID would be very useful here too. The 4500 seems to be a registry endpoint-specific counter (more than a unique ID). What about using a reported annex UUID for a location, or turning the effective download URL into a UUID?

uuid.uuid5(uuid.NAMESPACE_URL, 'https://example.com/myds1')

candleindark commented 1 year ago

For the version-level description, it would also be useful to have an identifier that matches a query (endpoint). We could use the gitsha directly (it is precise and globally unique). Alternatively, we could use <ds_uuid>@<gitsha>. This would make the identifier rather long and no more unique, but maybe "nicer" for a human to process. I am unsure whether it is worth the effort.

I prefer <ds_uuid>@<gitsha>. It is indeed nicer for humans. It also makes any kind of search easier.
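For illustration, a combined identifier of this form is easy to split and validate. The sketch below assumes a full 40-character lowercase SHA-1 as the gitsha; the type and function names are hypothetical.

```python
import re
import uuid
from typing import NamedTuple


class VersionId(NamedTuple):
    ds_id: uuid.UUID
    gitsha: str


# Assumption: full-length, lowercase SHA-1 commit hashes only.
_SHA1 = re.compile(r"[0-9a-f]{40}")


def parse_version_id(identifier: str) -> VersionId:
    """Split '<ds_uuid>@<gitsha>' into its validated parts."""
    ds_part, sep, sha = identifier.partition("@")
    if not sep or not _SHA1.fullmatch(sha):
        raise ValueError(f"expected <ds_uuid>@<gitsha>, got {identifier!r}")
    # uuid.UUID() raises ValueError for a malformed dataset UUID.
    return VersionId(uuid.UUID(ds_part), sha)
```

Searching by either component then reduces to a lookup on `ds_id` or `gitsha`.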

The query GET /api/v2/dataset-urls/4500 seems to match the distribution-level description (where can I get it from). In the HCLSDatasetDescription this is a particular materialization of a version-level description. For us this would be a bit different (we generally have many versions in the same "distribution"). Yet, having a unique ID would be very useful here too. The 4500 seems to be a registry endpoint-specific counter (more than a unique ID). What about using a reported annex UUID for a location, or turning the effective download URL into a UUID?

uuid.uuid5(uuid.NAMESPACE_URL, 'https://example.com/myds1')

You are right. The 4500 is not really a unique identifier in a global sense. It is just the primary key of the internal representation of that particular dataset URL.

I think we may have to use the uuid.uuid5(uuid.NAMESPACE_URL, 'https://example.com/myds1') option because some of the datasets we are dealing with are not annexed datasets.
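A minimal sketch of that option, using only the standard-library `uuid` module (the function name `location_id` is hypothetical):

```python
import uuid


def location_id(url: str) -> uuid.UUID:
    """Derive a deterministic, name-based (version 5) UUID from a dataset URL."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


# The same URL always yields the same ID, on any machine.
origin = "https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git"
print(location_id(origin))
```

One caveat with this approach: the URL is hashed verbatim, so spelling variants of the same location (with or without a trailing `.git` or slash, for example) produce different IDs. If that matters, the URL would need to be normalized before hashing.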

mih commented 1 year ago

I prefer <ds_uuid>@<gitsha>. It is indeed nicer for humans. It also makes any kind of search easier.

It appears that just <gitsha> will be the way to go. https://github.com/psychoinformatics-de/datalad-tabby/issues/76#issuecomment-1643747142

Could we consider having

https://dx.datalad.org/dataset-version/<gitsha>

?