bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

Symbolic linking within datasets #526

tsalo closed 4 years ago

tsalo commented 4 years ago

#508 proposes a number of new suffixes meant for qMRI workflows. These suffixes all require multiple files, and in some cases some of those files may be equivalent to existing suffixes. For example, one file from a multi-parametric mapping (MPM) scheme may be the same as a T1w scan, and if the dataset curator knows this, they could identify it as such.

#508 also introduces the idea of symbolically linking dataset files to derivatives, in cases where the scanner automatically generates what would typically be considered a derivative (e.g., a T2map).

Would it be reasonable for the curator to symbolically link files within a dataset?

So, for example, we could have the following:

sub-X/
    anat/
        sub-X_fa-1_mt-on_MPM.nii.gz ---
        sub-X_fa-1_mt-off_MPM.nii.gz  |
        sub-X_fa-2_mt-on_MPM.nii.gz   |  symbolic link
        sub-X_fa-2_mt-off_MPM.nii.gz  | 
        sub-X_T1w.nii.gz  <------------

Tagging @yarikoptic and @adswa to get Datalad-related thoughts, as well as @agahkarakuzu and @emdupre because they were involved in the initial conversation that spawned this issue.

This issue is related to #508 and #512.

effigies commented 4 years ago

Symlinks and deduplication seem like problems for the filesystem or a storage system like datalad, and should not be part of the specification. Not all filesystems support symlinks, so I think it would be unwise for us to recommend or require them in the spec.

Do we currently have a principle in which we say files must not be duplicated?

tsalo commented 4 years ago

I haven't seen anything about duplication in the spec, but I could have missed it. Are your concerns specifically about symlinks, or about duplicate files in general? I don't think having copies of files would be a problem for Datalad, but then I think there'd be a need for unique identifiers stored in the sidecars. Perhaps this could just be reflected in the scans file, which, at least with heudiconv, generally has some random string that is unique to each file?
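
To make the "unique identifier" idea concrete, here is a minimal sketch that treats a content hash as the identifier and reports files in a dataset that are byte-for-byte copies of each other. The function names `sha256_of` and `find_duplicates` are hypothetical, and using SHA-256 is an assumption for illustration; it is not what heudiconv writes to the scans file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(dataset_root):
    """Group regular files by content hash and keep groups with 2+ members.

    Symlinks are skipped so that a link and its target are not reported
    as a duplicate pair. Paths are returned relative to dataset_root.
    """
    groups = defaultdict(list)
    for path in sorted(Path(dataset_root).rglob("*")):
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(dataset_root).as_posix()
            groups[sha256_of(path)].append(rel)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("ds-example")` would return a dictionary keyed by hash, where each value lists the files that share identical content, for example an MPM file and the T1w copy made from it.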

Since symlinks are a part of BEP001, should they be replaced with file duplication?
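
If duplication were preferred, replacing links with copies is mechanically simple. The following is a rough sketch only, assuming a POSIX-like filesystem where the links already exist; `materialize_symlinks` is a hypothetical helper name, not something from BEP001 or any BIDS tool.

```python
import shutil
from pathlib import Path


def materialize_symlinks(dataset_root):
    """Replace each symlinked file under dataset_root with a copy of its target.

    Returns the relative paths that were replaced. Broken links and links
    to directories are left untouched.
    """
    replaced = []
    # sorted() builds the full list first, so the tree is not modified
    # while it is still being walked.
    for path in sorted(Path(dataset_root).rglob("*")):
        if not path.is_symlink():
            continue
        target = path.resolve()
        if not target.is_file():
            continue  # broken link or link to a directory: skip
        path.unlink()  # removes the link itself, not the target
        shutil.copy2(target, path)  # copy content and metadata into place
        replaced.append(path.relative_to(dataset_root).as_posix())
    return replaced
```

After running this, every former link is an ordinary file, so the dataset no longer depends on symlink support in the storage system, at the cost of storing the data twice.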

tsalo commented 4 years ago

Per https://github.com/bids-standard/bids-2-devel/issues/43#issuecomment-674988786, @satra agrees that symlinking would not be compatible with common storage systems.

Does anyone have any ideas for a good alternative that will work well with scanner-generated "derivatives"?

effigies commented 4 years ago

My suggestion would be to generate your dataset as a compliant derivatives dataset, and stick derivatives side-by-side with raw files. I'm not sure if this is anything like a consensus position, but given that derivatives datasets may contain raw filenames IFF they are raw files, I think it's a kind of nice way to handle the case. If it becomes common behavior, it drives us toward the end state where we acknowledge that all datasets are derivative.

tsalo commented 4 years ago

To tie it back to #508, the BEP001 team has proposed the following format for a dataset with scanner-generated derivatives and sufficient provenance (with minor adjustments to add functional data):

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_T1map.nii.gz ─────────┐ L
 |               ├── sub-01_T1map.json   ───────┐ | I
 |               ├── sub-01_MTsat.nii.gz ─────┐ | | N
 |               └── sub-01_MTsat.json   ───┐ | | | K
 └── sub-01/                                | | | |
     ├── anat/                              | | | |
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz   | | | | T
     |   ├── sub-01_fa-1_mt-on_MTS.json     | | | | O
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz  | | | |
     |   ├── sub-01_fa-1_mt-off_MTS.json    | | | | A
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz  | | | | N
     |   ├── sub-01_fa-2_mt-off_MTS.json    | | | | A
     |   ├── sub-01_T1map.nii.gz <──────────├─├─├─┘ T
     |   ├── sub-01_T1map.json   <──────────├─├─┘
     |   ├── sub-01_MTsat.nii.gz <──────────├─┘
     |   └── sub-01_MTsat.json   <──────────┘
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

In the case of scanner-generated derivatives without provenance, I believe that their proposal is to simply have the data in the raw data folder:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

If I understand correctly, you're proposing that folks do almost the opposite: put everything in the derivatives folder? Like this:

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_fa-1_mt-on_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-on_MTS.json
 |               ├── sub-01_fa-1_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-off_MTS.json
 |               ├── sub-01_fa-2_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-2_mt-off_MTS.json
 |               ├── sub-01_T1map.nii.gz
 |               ├── sub-01_T1map.json
 |               ├── sub-01_MTsat.nii.gz
 |               └── sub-01_MTsat.json
 └── sub-01/
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

effigies commented 4 years ago

No, I'm proposing:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

With dataset_description.json:

{
  ...
  "DatasetType": "derivatives",
  "GeneratedBy": [...]
}
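
For anyone scripting this, a minimal sketch of writing such a file follows. The helper name `write_description` is hypothetical, and the values passed for `Name`, `BIDSVersion`, and the `GeneratedBy` entry are placeholders supplied by the caller; they do not fill in the fields elided above.

```python
import json
from pathlib import Path


def write_description(dataset_root, name, bids_version, generated_by):
    """Write a dataset_description.json that marks the dataset as derivatives.

    generated_by is a list of dicts, each with at least a "Name" key.
    Returns the path of the file that was written.
    """
    description = {
        "Name": name,
        "BIDSVersion": bids_version,
        "DatasetType": "derivatives",
        "GeneratedBy": generated_by,
    }
    out = Path(dataset_root) / "dataset_description.json"
    out.write_text(json.dumps(description, indent=2) + "\n")
    return out
```

A call such as `write_description("ds-example", "ds-example", "1.4.0", [{"Name": "qMRI-software"}])` would produce the file, with the version string and pipeline name standing in for whatever the real dataset uses.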

tsalo commented 4 years ago

Ohhhh okay. Thanks! Now that there's a symlink-less solution on the table, I'll feed it back into the BEP001 review.

tsalo commented 4 years ago

I commented on the BEP001 PR with the proposed solution, so I'm going to close this.