YODA-friendly BIDS-derivatives #618

Open Remi-Gau opened 3 years ago

Remi-Gau commented 3 years ago

Quick question to make sure on speaking terms YODA and I are. :wink: (I am sure this kind of "joke" must be SUPER old for the datalad team.)

The ~~Bible~~ BIDS specification suggests several ways to organize one's derivative data.

I am trying to figure out which ones are "YODA friendly".

1. Example of derivatives with one directory per pipeline:

   ```
   <dataset>/raw/sub-0001
   <dataset>/raw/sub-0002
   ...
   <dataset>/derivatives/fmriprep-v1.4.1/sub-0001
   <dataset>/derivatives/fmriprep-v1.4.1/sub-0002
   ...
   <dataset>/derivatives/spm/sub-0001
   <dataset>/derivatives/spm/sub-0002
   ...
   <dataset>/derivatives/vbm/sub-0001
   <dataset>/derivatives/vbm/sub-0002
   ...
   ```

2. Example of a pipeline with split derivative directories:

   ```
   /raw/sub-0001
   /raw/sub-0002
   ...
   /derivatives/spm-preproc/sub-0001
   /derivatives/spm-preproc/sub-0002
   ...
   /derivatives/spm-stats/sub-0001
   /derivatives/spm-stats/sub-0002
   ...
   ```

3. Example of a pipeline with nested derivative directories:

   ```
   /raw/sub-0001
   /raw/sub-0002
   ...
   /derivatives/spm-preproc/sub-0001
   /derivatives/spm-preproc/sub-0002
   ...
   /derivatives/spm-preproc/derivatives/spm-stats/sub-0001
   /derivatives/spm-preproc/derivatives/spm-stats/sub-0002
   ...
   ```

As far as I understand, YODA cares more about provenance than the actual folder structure, correct? So technically all those options are potentially YODA friendly, provided that:

- in all cases provenance is actually tracked (see the sketch below);
- in example 1, the `preprocessing`, `subject GLM`, and `group GLM` outputs are not all dumped together in there in a way that breaks "data modularity".

The "advantage" of example 3 is that the provenance is also a bit more explicit in the folder structure itself.

FYI: I am trying to figure out the best "practical" way to organize some analysis pipelines, with the idea to throw datalad into the mix. So I want to make sure I got the idea right.
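
Something like this `datalad run` call is what I have in mind by "provenance is actually tracked" (the script name and paths are made up):

```sh
# records the command, its inputs, and its outputs in the git history
datalad run \
  -m "Preprocess sub-0001 with SPM" \
  --input "raw/sub-0001" \
  --output "derivatives/spm-preproc/sub-0001" \
  "code/preprocess.sh sub-0001"
```

(`datalad rerun` can then replay the command from that record.)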
mih commented 3 years ago

Unfortunately, the BIDS spec is rather impractical in this regard, and all three alternatives are equally bad, or each even worse than the next. They all work when the datasets are small, or with just 1 or 2 pipelines, but they fall apart at larger scales.

Here is what I would recommend:

- every pipeline output is its own dataset, with the raw data it consumed linked as a subdataset;
- different types of data (raw data, each flavor of derivatives) live in separate, modular datasets.

Both principles are violated by all three examples you gave above: no pipeline output contains the raw data, and all types of data (raw and derivatives) are in a single dataset.
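
In DataLad terms, a minimal sketch of such a per-pipeline dataset (the URL is a placeholder):

```sh
# one dataset per pipeline output; -c yoda pre-structures it (code/, README, ...)
datalad create -c yoda spm-stats
cd spm-stats
# link the raw data as a subdataset instead of copying it in
datalad clone -d . https://example.com/raw-bids-dataset inputs/raw
```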

Now imagine a use case where you want to consume the outputs of spm-stats on an HPC system. Maybe you only need three files, but you are forced to deploy datasets with potentially 100k files. That is slow and will limit you.
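
With `spm-stats` as its own dataset, that deployment stays cheap. A sketch (the URL and file name are placeholders):

```sh
# clone only the results dataset; no file content is transferred yet
datalad clone https://example.com/spm-stats
cd spm-stats
# fetch just the files you actually need
datalad get sub-0001/stats/con_0001.nii.gz
```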

Consider further that you write a paper on the spm-stats results. The paper will be its own dataset (why would you want to host your raw fMRI data on Overleaf, right?). It will have the dataset with the spm-stats as a subdataset, to make clear which state of the results you describe in the paper. If that same dataset also contains other pipeline output, it will continue to accumulate changes that have nothing to do with the manuscript, and it will be up to you to manually determine, each time, whether the manuscript needs an update.
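
A sketch of such a paper dataset (the URL is a placeholder). Because subdatasets are plain git submodules, the superdataset records the exact commit of the results it describes:

```sh
datalad create paper
cd paper
# register the results at their current state
datalad clone -d . https://example.com/spm-stats inputs/spm-stats
# show the pinned commit of the results subdataset
git submodule status inputs/spm-stats
```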

Again, none of these problems become significant if you work alone, the data you work on are small, and the processing strategies are few. That is how most people work, and that is why what BIDS recommends works for them.

But imagine a large dataset (UKB or HCP) that is processed in many, many ways, by loads of people, for all kinds of things, all the time. The cost of keeping track of all that movement, which would have to be paid by each individual and the underlying technical infrastructure, is way too high.

If you follow the two principles that I outlined above, you can avoid all these issues and have technology work for you rather than against you. Here are a few rough guidelines on what your data "modules" should be (pulled from a 30 min talk just on these aspects, so please forgive me that some of them might seem a little far-fetched in the context of this issue).

[image: slide with guidelines on choosing data "modules"]

Remi-Gau commented 3 years ago

> the BIDS spec is rather impractical in this regard, and all three alternatives are equally bad, or each even worse than the next. They all work when the datasets are small, or with just 1 or 2 pipelines, but they fall apart at larger scales.

> Both principles are violated by all three examples you gave above.

One moment you think you understood something, the next you get told it does not work that way at all. :rofl:


Thanks @mih for the detailed reply. That really helps.

OK I think I will need to reflect on that for a bit (also... this is a busy week).

I forgot to mention another example from the BIDS spec where the derivatives dataset does contain the raw data, and seems not to break the principles you mentioned (but now I am not sure of anything anymore).

```
my_processed_data/             # could be a datalad dataset for a given pipeline
  code/
    processing_pipeline-1.0.0.img
    hpc_submitter.sh
    ...
  sourcedata/                  # could be a datalad subdataset with the raw data,
                               # if we are talking about just preprocessing
    dataset_description.json
    participants.tsv
    sub-01/
    sub-02/
    ...
  dataset_description.json
  sub-01/
  sub-02/
  ...
```
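
For what it's worth, here is how I imagine setting that layout up with DataLad (assuming the `datalad-container` extension; the raw-data URL is made up):

```sh
datalad create -c yoda my_processed_data
cd my_processed_data
# raw data as a subdataset (URL is made up)
datalad clone -d . https://example.com/raw-bids sourcedata
# register the pipeline image so runs are reproducible
datalad containers-add pipeline --url code/processing_pipeline-1.0.0.img
# execute inside the container, with inputs/outputs recorded as provenance
datalad containers-run -n pipeline -m "Process sub-01" \
  --input sourcedata/sub-01 --output sub-01 \
  "code/hpc_submitter.sh sub-01"
```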