[ENH] Implement dataset description for CAPS datasets

NicolasGensollen commented 5 months ago

Closes #1101

After a little bit of thinking, I made some changes to the structure of the dataset_description.json file for CAPS datasets compared to what I originally described in #1101.

Using the "Name" key to encode which processing pipelines were run (i.e. something like "t1-linear + pet-linear") wasn't a good idea. Actually, it was a very bad idea, as it would probably require very complicated logic to work out which pipelines were run on a given CAPS.

CAPS datasets are dynamic: additional processing pipelines can be run on them at any time. This means that the dataset_description.json file also has to change over time to incorporate the metadata describing each of these processing steps.

This PR proposes to have a field named "Processing" which is a list of objects describing the different processing pipelines that were run. Each entry has a name, a date, an author, the machine on which it was executed, and the path to its input dataset.

Here is an example of a dataset_description.json file obtained when running t1-linear and pet-linear on a CI machine:

{
    "Name": "e6719ef6-2411-4ad2-8abd-da1fd8fbdf32",
    "BIDSVersion": "1.6.0",
    "CAPSVersion": "1.0.0",
    "DatasetType": "derivative",
    "Processing": [
        {
            "Name": "t1-linear",
            "Date": "2024-08-06T10:28:21.848950",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/T1Linear/in/bids"
        },
        {
            "Name": "pet-linear",
            "Date": "2024-08-06T10:36:27.403373",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/PETLinear/in/bids"
        }
    ]
}
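To make the structure above concrete, here is a minimal Python sketch of a model that serializes to this layout. The class names (`Processing`, `CAPSDescription`) and their methods are hypothetical and are not taken from the Clinica code base; the default version strings are simply the ones from the example.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import List
from uuid import uuid4


@dataclass
class Processing:
    """Hypothetical model of one entry of the "Processing" list."""
    name: str
    date: datetime
    author: str
    machine: str
    input_path: str

    def to_dict(self) -> dict:
        # Keys follow the JSON example; the date is written in ISO 8601.
        return {
            "Name": self.name,
            "Date": self.date.isoformat(),
            "Author": self.author,
            "Machine": self.machine,
            "InputPath": self.input_path,
        }


@dataclass
class CAPSDescription:
    """Hypothetical model of a CAPS dataset_description.json file."""
    # The name defaults to a random identifier, as described in the PR.
    name: str = field(default_factory=lambda: str(uuid4()))
    bids_version: str = "1.6.0"  # value from the example, not a fixed rule
    caps_version: str = "1.0.0"  # value from the example, not a fixed rule
    processing: List[Processing] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(
            {
                "Name": self.name,
                "BIDSVersion": self.bids_version,
                "CAPSVersion": self.caps_version,
                "DatasetType": "derivative",
                "Processing": [p.to_dict() for p in self.processing],
            },
            indent=4,
        )
```

Calling `CAPSDescription(processing=[...]).to_json()` with two `Processing` entries should produce a document shaped like the example above.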

The name can be any user-provided string, and defaults to a random identifier (like the one shown above). Once set, it cannot change: re-running a pipeline on the same CAPS dataset will not change the name of the dataset, even if the user explicitly asks for another name. Only the "Processing" entry will be updated, if needed.
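The naming rule can be sketched as follows. The function `resolve_dataset_name` is hypothetical and only illustrates the behavior described above; it is not the actual implementation.

```python
from typing import Optional
from uuid import uuid4


def resolve_dataset_name(
    existing: Optional[dict], requested: Optional[str] = None
) -> str:
    """Return the name to write in dataset_description.json.

    `existing` is the already-parsed description of the CAPS dataset,
    or None if the dataset is new (hypothetical calling convention).
    """
    if existing is not None:
        # An existing name always wins, even over an explicit request.
        return existing["Name"]
    # New dataset: use the requested name, or fall back to a random one.
    return requested if requested is not None else str(uuid4())
```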

A processing is identified by default by its name and its input path. This means that:

- if you run the same pipeline (e.g. t1-linear) twice on the same BIDS input, the corresponding processing metadata will be replaced with the new one.
- if you run the same pipeline on different BIDS inputs (not really recommended, but possible...), there will be one processing metadata entry for each input.
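The two cases above amount to an "upsert" keyed on the pair (name, input path). Here is a minimal sketch working on plain dictionaries; `upsert_processing` is a hypothetical helper, and keeping a replaced entry at its original position in the list is my own assumption.

```python
from typing import Dict, List


def upsert_processing(processings: List[Dict], new: Dict) -> List[Dict]:
    """Replace the entry with the same (Name, InputPath), or append."""
    key = (new["Name"], new["InputPath"])
    result = []
    replaced = False
    for entry in processings:
        if (entry["Name"], entry["InputPath"]) == key:
            # Same pipeline on the same input: the new metadata wins.
            result.append(new)
            replaced = True
        else:
            result.append(entry)
    if not replaced:
        # New pipeline, or same pipeline on a different input.
        result.append(new)
    return result
```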

The two version fields "BIDSVersion" and "CAPSVersion" are delicate because they are supposed to version the metadata models used. In theory, when an existing CAPS dataset is used as the output of a pipeline, the BIDS and CAPS specification versions declared in its file should match the ones used by Clinica, otherwise there is no guarantee that things will not break. For this reason, this PR proposes to raise an error when the versions do not match. I'm still debating this, as it could easily be perceived as annoying (for example, in the CI data we have tons of BIDS dataset_description.json files with old versions that were never updated...), but it will force us to have meaningful metadata with our datasets.
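As an illustration of the proposed check, here is a minimal sketch. The exception class, the function name, and the use of strict string equality are assumptions; the expected versions are the ones from the example file, not necessarily the ones Clinica uses.

```python
class VersionMismatchError(Exception):
    """Hypothetical error raised when specification versions differ."""


# Versions taken from the example above (assumed, for illustration only).
EXPECTED_VERSIONS = {"BIDSVersion": "1.6.0", "CAPSVersion": "1.0.0"}


def check_versions(description: dict, expected: dict = EXPECTED_VERSIONS) -> None:
    """Raise if the dataset declares other versions than the expected ones."""
    for key, wanted in expected.items():
        found = description.get(key)  # None if the field is missing
        if found != wanted:
            raise VersionMismatchError(
                f"{key} is {found!r} in dataset_description.json, "
                f"but {wanted!r} is expected."
            )
```

A softer variant could emit a warning instead of raising, which is the trade-off discussed above.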

Finally, this PR also proposes to require the presence of the dataset_description.json file in CAPS folders. When running a pipeline with a new folder for CAPS, the file will be generated automatically. When running a pipeline with an existing folder that has no dataset_description.json, Clinica will raise an error suggesting a minimal file that should be added.
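The behavior described above could look roughly like this. Everything here is a sketch under assumptions: `ensure_dataset_description` is a hypothetical name, a "new" folder is taken to mean a missing or empty one, and the content of the minimal file is guessed from the example rather than taken from the implementation.

```python
import json
from pathlib import Path
from uuid import uuid4


def ensure_dataset_description(caps_dir: Path) -> Path:
    """Return the path to dataset_description.json, creating it if allowed."""
    description = caps_dir / "dataset_description.json"
    if description.exists():
        return description
    # Assumed minimal content, modeled on the example in this PR.
    minimal = {
        "Name": str(uuid4()),
        "BIDSVersion": "1.6.0",
        "CAPSVersion": "1.0.0",
        "DatasetType": "derivative",
    }
    is_new = not caps_dir.exists() or not any(caps_dir.iterdir())
    if is_new:
        # New CAPS folder: generate the file automatically.
        caps_dir.mkdir(parents=True, exist_ok=True)
        description.write_text(json.dumps(minimal, indent=4))
        return description
    # Existing, non-empty folder without the file: refuse and suggest a fix.
    raise FileNotFoundError(
        f"{caps_dir} has no dataset_description.json. "
        f"Please add at least:\n{json.dumps(minimal, indent=4)}"
    )
```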

Link to the documentation page: https://aramislab.paris.inria.fr/clinica/docs/public/PR-1158/CAPS/Specifications/

Requires data PR: https://github.com/aramis-lab/clinica_data_ci/pull/68

AliceJoubert commented 1 month ago

Unless I missed it, the documentation would need changes too, but otherwise LGTM! Thanks @NicolasGensollen :)

NicolasGensollen commented 1 month ago

> Unless I missed it, the documentation would need changes too, but otherwise LGTM! Thanks @NicolasGensollen :)

Absolutely! I couldn't find the time to do it this afternoon, but will have a go at it tomorrow. Thanks for the review @AliceJoubert!

aramis-lab / clinica

[ENH] Implement dataset description for CAPS datasets #1158