Brainstorming optimized queries

mih commented 2 years ago

Ultimately all files to-be-downloaded are associated with an "experiment", which is an acquisition for a subject within a project. If I browse the XNAT UI, I need to click through project -> subject -> experiment to see its accession number.

The experiment accession number can also be queried via the subject accession ID, like so:

% curl -H "Content-Type: application/json" 'https://www.nitrc.org/ir/data/experiments?subject_ID=xnat_S00001&format=json' | jq
{
  "ResultSet": {
    "Result": [
      {
        "subject_ID": "xnat_S00001",
        "date": "",
        "xsiType": "xnat:mrSessionData",
        "xnat:subjectassessordata/id": "xnat_E00001",
        "subject_label": "AnnArbor_sub04111",
        "insert_date": "2011-06-06 04:44:48.0",
        "project": "fcon_1000",
        "ID": "xnat_E00001",
        "label": "AnnArbor_sub04111",
        "URI": "/data/experiments/xnat_E00001"
      }
    ],
    "totalRecords": "1",
    "title": "Matching experiments"
  }
}

This works because an experiment is unique within the scope of a subject (as far as I can tell).

Similarly, all experiments (acquisitions for subjects in a project) can be discovered via the project accession number:

% curl -H "Content-Type: application/json" 'https://www.nitrc.org/ir/data/experiments?project=fcon_1000&format=json' | jq
{
  "ResultSet": {
    "Result": [
      {
        "date": "",
        "xsiType": "xnat:mrSessionData",
        "insert_date": "2011-06-06 04:44:56.0",
        "project": "fcon_1000",
        "ID": "xnat_E00002",
        "label": "AnnArbor_sub04619",
        "URI": "/data/experiments/xnat_E00002"
      },
...

These queries together cover the two main use cases:

get all acquisitions in a project
get all acquisitions for a subject in a project

In contrast to the current implementation, all accession numbers can be determined in a single query rather than in successive queries.
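
For illustration, here is a minimal Python sketch of the two experiment queries. The helper names `experiments_url` and `experiment_ids` are hypothetical (they are not part of datalad-xnat), and the sample payload is an abridged copy of the response shown earlier, used in place of a live request.

```python
from urllib.parse import urlencode

# Instance root taken from the curl calls in this thread.
XNAT_BASE = "https://www.nitrc.org/ir"


def experiments_url(base, **filters):
    """Build an experiment listing URL (hypothetical helper).

    `filters` are passed through as query parameters, e.g.
    project="fcon_1000" or subject_ID="xnat_S00001".
    """
    params = dict(filters)
    params["format"] = "json"
    return f"{base.rstrip('/')}/data/experiments?{urlencode(params)}"


def experiment_ids(payload):
    """Return the experiment accession numbers from a decoded ResultSet."""
    return [row["ID"] for row in payload["ResultSet"]["Result"]]


# Abridged copy of the subject query response shown in this thread.
sample = {
    "ResultSet": {
        "Result": [
            {
                "subject_ID": "xnat_S00001",
                "project": "fcon_1000",
                "ID": "xnat_E00001",
                "URI": "/data/experiments/xnat_E00001",
            }
        ],
        "totalRecords": "1",
    }
}

subject_query = experiments_url(XNAT_BASE, subject_ID="xnat_S00001")
project_query = experiments_url(XNAT_BASE, project="fcon_1000")
ids = experiment_ids(sample)
print(subject_query)
print(project_query)
print(ids)
```

Either URL yields the full list of accession numbers in one response, so the per-experiment file query only needs the `ID` field of each row.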

Given a single experiment accession number, I can now get ALL files associated with it:

% curl -v -H "Content-Type: application/json" 'https://www.nitrc.org/ir/data/experiments/xnat_E00001/scans/ALL/files?format=json' | jq
{
  "ResultSet": {
    "Columns": [
      {
        "key": "URI",
        "serverRoot": "/ir"
      }
    ],
    "Result": [
      {
        "file_content": "",
        "Size": "64860721",
        "file_tags": "",
        "cat_ID": "11914",
        "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb",
        "collection": "NIfTI",
        "URI": "/data/experiments/xnat_E00001/scans/func_rest/resources/11914/files/scan_rest.nii.gz",
        "file_format": "",
        "Name": "scan_rest.nii.gz"
      },
      {
        "file_content": "",
        "Size": "64860721",
        "file_tags": "",
        "cat_ID": "3896",
        "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb",
        "collection": "BRIK",
        "URI": "/data/experiments/xnat_E00001/scans/func_rest/resources/3896/files/scan_rest.nii.gz",
        "file_format": "",
        "Name": "scan_rest.nii.gz"
      },
...

This is a comprehensive list that includes multiple "resources" (named "collection" here).

Importantly, this list contains the direct URLs and a "digest" plus file "Size" in bytes. This is sufficient information for calling git annex registerurl:

a URL
and a key: MD5E-s<size-in-bytes>--<md5sum>.<file-extension>
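
As a rough sketch of how one file record maps onto that URL and key pair: the helpers `file_url` and `annex_key` below are hypothetical names. The URL is assumed to be the instance root plus the record's `URI`. The extension rule (at most two trailing suffixes, each at most four alphanumeric characters) is only an approximation of what git-annex does by default, not a reimplementation.

```python
from pathlib import PurePosixPath

# Instance root taken from the curl calls in this thread.
XNAT_BASE = "https://www.nitrc.org/ir"


def file_url(base, record):
    """Assumption: the download URL is the instance root plus the record URI."""
    return base.rstrip("/") + record["URI"]


def annex_key(record):
    """Build an MD5E-style key from a file record (approximate extension rule)."""
    digest = record["digest"]
    if not digest:
        raise ValueError("record has no digest, cannot build a checksum key")
    size = int(record["Size"])
    kept = []
    # Walk the suffixes from the end, e.g. ['.nii', '.gz'] -> 'gz', then 'nii'.
    for suffix in reversed(PurePosixPath(record["Name"]).suffixes):
        ext = suffix[1:]
        if len(kept) == 2 or len(ext) > 4 or not ext.isalnum():
            break
        kept.append(ext)
    extension = "".join("." + ext for ext in reversed(kept))
    return f"MD5E-s{size}--{digest}{extension}"


# First file record from the response shown in this thread.
record = {
    "Size": "64860721",
    "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb",
    "collection": "NIfTI",
    "URI": "/data/experiments/xnat_E00001/scans/func_rest/resources/11914/files/scan_rest.nii.gz",
    "Name": "scan_rest.nii.gz",
}

url = file_url(XNAT_BASE, record)
key = annex_key(record)
print(url)
print(key)
```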

In summary, using one query per experiment, we can generate a complete dataset for single file access (capable of content verification) without performing any file downloads.

This should be fast ;-)

mih commented 2 years ago

To clarify: we will still use datalad addurls; it does all we need. There is no point in wrapping git annex registerurl.
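
A minimal sketch of turning the file listing into a table that datalad addurls could consume. The column names, the `<scan>/<collection>/<name>` path layout, and the helper names are all assumptions made for illustration; they are not necessarily what datalad-xnat does.

```python
import csv
import io

# Instance root taken from the curl calls in this thread.
XNAT_BASE = "https://www.nitrc.org/ir"
COLUMNS = ["url", "path", "size", "md5sum"]  # hypothetical column names


def addurls_rows(base, records):
    """Turn XNAT file records into one row per file."""
    rows = []
    for rec in records:
        # URI layout: /data/experiments/<exp>/scans/<scan>/resources/<id>/files/<name>
        parts = rec["URI"].strip("/").split("/")
        scan = parts[parts.index("scans") + 1]
        rows.append(
            {
                "url": base.rstrip("/") + rec["URI"],
                # The collection keeps same-named files from different resources apart.
                "path": f"{scan}/{rec['collection']}/{rec['Name']}",
                "size": rec["Size"],
                "md5sum": rec["digest"],
            }
        )
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# The two file records from the response shown in this thread.
records = [
    {
        "Size": "64860721",
        "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb",
        "collection": "NIfTI",
        "URI": "/data/experiments/xnat_E00001/scans/func_rest/resources/11914/files/scan_rest.nii.gz",
        "Name": "scan_rest.nii.gz",
    },
    {
        "Size": "64860721",
        "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb",
        "collection": "BRIK",
        "URI": "/data/experiments/xnat_E00001/scans/func_rest/resources/3896/files/scan_rest.nii.gz",
        "Name": "scan_rest.nii.gz",
    },
]

table = to_csv(addurls_rows(XNAT_BASE, records))
print(table)
```

Such a table could then be passed to datalad addurls with '{url}' and '{path}' as the URL and filename format strings. If the installed version supports the --key option, a key format built from the size and md5sum columns should avoid downloads entirely.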

mih commented 2 years ago

Sadly, not all XNAT instances provide a digest (i.e., an md5sum).
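
One possible way to cope, sketched under the assumption that a missing digest shows up as an empty or absent "digest" field: split the records, so that only those with a digest get a checksum-based key, while the rest are handled by URL alone. The helper name `split_by_digest` and the second sample record are made up for illustration.

```python
def split_by_digest(records):
    """Separate file records that carry a digest from those that do not."""
    with_digest, without_digest = [], []
    for rec in records:
        digest = (rec.get("digest") or "").strip()
        if digest:
            with_digest.append(rec)
        else:
            without_digest.append(rec)
    return with_digest, without_digest


records = [
    # Record from the response shown in this thread.
    {"Name": "scan_rest.nii.gz", "Size": "64860721",
     "digest": "9f99db6fcbf3a4ea9cba8be829d86bbb"},
    # Made-up record imitating an instance that reports no digest.
    {"Name": "scan_anat.nii.gz", "Size": "123", "digest": ""},
]

keyed, unkeyed = split_by_digest(records)
print(len(keyed), "record(s) with digest,", len(unkeyed), "without")
```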

mih commented 2 years ago

This type of querying is implemented now.

datalad / datalad-xnat

Brainstorming optimized queries #90