Unfortunately, the BIDS spec is rather impractical in this regard, and all three alternatives are equally bad, if not one worse than the next. They all work when the datasets are small, or when there are just one or two pipelines, but they fall apart at larger scales.
Here is what I would recommend:

- make sure that each modular unit of data becomes its own DataLad dataset (more on the modularity below)
- make sure that datasets are nested according to their provenance (each dataset contains all its sources)
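To make this concrete, here is a minimal sketch with the DataLad command line; the dataset name and URL below are placeholders for illustration, not anything from this issue:

```sh
# create a dataset that will hold the outputs of one pipeline
# ("spm-stats" and the URL below are hypothetical placeholders)
datalad create spm-stats
cd spm-stats

# register the raw BIDS data as a subdataset, so the results
# carry a record of exactly which raw data they were computed from
datalad clone -d . https://example.com/raw-bids-dataset sourcedata
```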
Both principles are violated by all three examples you gave above: neither pipeline's output contains the raw data, and all types of data (raw and derivatives) live in a single dataset.
Now imagine a use case where you want to consume the outputs of `spm-stats` on an HPC system. Maybe you only need three files, but you are forced to deploy datasets with potentially 100k files. That is slow and will limit you.
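Because file content is not transferred at clone time, a consumer on the HPC system could do something like the following (the URL and file paths are invented for illustration):

```sh
# clone only the lightweight results dataset; no file content is downloaded yet
datalad clone https://example.com/spm-stats spm-stats
cd spm-stats

# fetch only the handful of files the job actually needs
datalad get sub-01/con_0001.nii sub-01/spmT_0001.nii sub-01/SPM.mat
```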
Consider further that you write a paper on the `spm-stats` results. The paper will be its own dataset (why would you want to host your raw fMRI data on Overleaf, right?). It will have the dataset with `spm-stats` as a subdataset, to make clear which state of the results you describe in the paper. If that same dataset also contains additional pipeline output, it will continue to accumulate changes that have nothing to do with the manuscript, and it will be up to you to manually determine, each time, whether the manuscript needs an update.
Again, none of these problems become significant if you work alone, the data you work on are small, and the processing strategies are few. That is how most people work, and that is why what BIDS recommends works for them.
But imagine a large dataset (UKB or HCP) that is processed in many, many ways by lots of people, for all kinds of purposes, all the time. The cost of keeping track of all of that, which would have to be borne by individuals and by the underlying technical infrastructure, is way too high.
If you follow the two principles that I outlined above, you can avoid all these issues and have technology work for you rather than against you. Here are a few rough guidelines on what your data "modules" should be (pulled from a 30-minute talk on just these aspects, so please forgive me if some of them seem a little far-fetched in the context of this issue).
> the BIDS spec is rather impractical in this regard, and all three alternatives are equally bad, if not one worse than the next. They all work when the datasets are small, or when there are just one or two pipelines, but they fall apart at larger scales.

> Both principles are violated by all three examples you gave above.
One moment you think you understood something, the next you get told it does not work that way at all. :rofl:
Thanks @mih for the detailed reply. That really helps.
OK I think I will need to reflect on that for a bit (also... this is a busy week).
I forgot to mention another example from the BIDS specs where the `derivatives` does contain the `raw` and seems not to break the principles you mentioned (but now I am not sure of anything anymore):
```
my_processed_data/  # could be a datalad dataset for a given pipeline
    code/
        processing_pipeline-1.0.0.img
        hpc_submitter.sh
        ...
    sourcedata/  # this could be a datalad sub-dataset with the raw data if we are talking about just preprocessing
        dataset_description.json
        participants.tsv
        sub-01/
        sub-02/
        ...
    dataset_description.json
    sub-01/
    sub-02/
    ...
```
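For what it is worth, a rough sketch of how this layout could map onto nested DataLad datasets (the URL and the exact pipeline invocation are placeholders, not taken from the spec, and assume `code/` with the container image has already been added):

```sh
# superdataset holds the pipeline outputs
datalad create my_processed_data
cd my_processed_data

# the raw BIDS data becomes the sourcedata/ subdataset
datalad clone -d . https://example.com/raw-bids-dataset sourcedata

# run the (containerized) pipeline so that inputs, outputs, and the
# exact command are recorded in the dataset history
datalad run -m "Preprocess sub-01" \
  --input "sourcedata/sub-01" \
  --output "sub-01" \
  "singularity run code/processing_pipeline-1.0.0.img sourcedata . participant --participant_label 01"
```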
Quick question to make sure on speaking terms YODA and I are. :wink: (I am sure this kind of "joke" must be SUPER old for the datalad team.)
The ~~Bible~~ BIDS specification suggests several ways to organize one's derivative data. I am trying to figure out which ones are "YODA friendly".