bids-standard / bids-matlab

MATLAB / Octave tools for BIDS datasets
https://bids-matlab.readthedocs.io
MIT License
52 stars 32 forks source link

Return layout subset #122

Open Remi-Gau opened 3 years ago

Remi-Gau commented 3 years ago

Make query return subset of layout instead of flat list

In current state querry returns flat lists of found data, subjects etc. This may create some difficulties during data processing, for ex. if one need to access to metadata in json file, he need to reparse the filenames.

I propose that querry returns instead the either subset subject field, or full layout containing only selected data files. Retrieve a flat list of files from it is trivial, so it will nod add complexity if one just need data files, but it will provide additional metadata together with data files, if they are needed.

Originally posted by @nbeliy in https://github.com/bids-standard/bids-matlab/issues/60#issuecomment-772338286

Remi-Gau commented 3 years ago

In some modalities (meg/eeg. func) images are accompanied by tabular data, like events.tsv. In actual state, one need to extract data file names and then replace the suffix to access these tabular files.

Technically the content of those files is loaded in the layout: https://github.com/bids-standard/bids-matlab/blob/6c1dac65880df28b69ec131983cd8bbbfce1c258/%2Bbids/layout.m#L250

But it seems that we don't have the code in query to access it:

BIDS = bids.layout(fullfile(pth_bids_example, 'ds001'));

bids.query(BIDS, 'metadata', 'sub', '01', 'run', '01', 'type', 'events')

ans = 
  struct with no fields.

 % But we get this:

 BIDS.subjects(1).func(4).meta

ans = 

  struct with fields:

                   onset: [158×1 double]
                duration: [158×1 double]
              trial_type: {158×1 cell}
             cash_demean: [158×1 double]
    control_pumps_demean: [158×1 double]
          explode_demean: [158×1 double]
            pumps_demean: [158×1 double]
           response_time: [158×1 double]

(Will add a test for this, because I doubt this is the intended behavior.)

So I would not necessarily think this is needed for events but definitely, we need a way to deal with the intendedFor for fieldmaps, especially that it is being reused for other data types as well (ASL, PET...) .

See #66

nbeliy commented 3 years ago

Technically the content of those files is loaded in the layout:

Yes, I've seen it. But you load them only for some modalities, and parce it. I propose to treat them as independent data files, for example querry for func will return something like:

sub-xxx/ses-yyy/sub-xxx_ses-yyy_task-zzz_bold.nii.gz
sub-xxx/ses-yyy/sub-xxx_ses-yyy_task-zzz_events.tsv

So if one want to run some scripts on all fMRI, he has the list of both image and events, instead to try to generate filename from image name.

Remi-Gau commented 3 years ago

In current state querry returns flat lists of found data, subjects etc. This may create some difficulties during data processing, for ex. if one need to access to metadata in json file, he need to reparse the filenames.

I am not sure I follow.

From the test for query: https://github.com/bids-standard/bids-matlab/blob/6c1dac65880df28b69ec131983cd8bbbfce1c258/tests/test_bids_query.m#L63

BIDS = bids.layout(fullfile(pth_bids_example, 'ds007'));

  md = bids.query(BIDS, 'metadata', ...
                  'sub', '05', ...
                  'run', '02', ...
                  'task', 'stopsignalwithmanualresponse', ...
                  'type', 'bold');

  assert(isstruct(md) & isfield(md, 'RepetitionTime') & isfield(md, 'TaskName'));
  assert(md.RepetitionTime == 2);
  assert(strcmp(md.TaskName, 'stop signal with manual response'));

Sure there are still some issues with the getting metadata (#23, #79) and sure some reparsing is happening under the hood when querying for metada, but I am not sure to see where the issue is.

Remi-Gau commented 3 years ago

Technically the content of those files is loaded in the layout:

Yes, I've seen it. But you load them only for some modalities, and parce it. I propose to treat them as independent data files, for example querry for func will return something like:

sub-xxx/ses-yyy/sub-xxx_ses-yyy_task-zzz_bold.nii.gz
sub-xxx/ses-yyy/sub-xxx_ses-yyy_task-zzz_events.tsv

So if one want to run some scripts on all fMRI, he has the list of both image and events, instead to try to generate filename from image name.

But you can already do that, no?

BIDS = bids.layout(fullfile(pth_bids_example, 'ds001'));
bids.query(BIDS, 'data', 'sub', '01', 'task', 'balloonanalogrisktask', 'run', '01')

ans =

  2×1 cell array

    '/home/remi/github/BIDS-matlab/tests/bids-examples/ds001/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz'
    '/home/remi/github/BIDS-matlab/tests/bids-examples/ds001/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_events.tsv'
nbeliy commented 3 years ago

Simple example: someone want to process T1w image with his script, which also needs some metadata from json.

In actual state: He get the list of files using querry, and retrive needed parameters eithr using second querry, or just changing extention and parcing json file in his script.

If query returns list of subject structure: he loops over these structures wich contains the path to file and the path to json (and probably other useful information).

nbeliy commented 3 years ago

But you can already do that, no?

Probably (I didn't looked to much in query part). But it works only if you define it for given modality. If someone want to do a task during a structural mri (I don't discuss the usefulness!) the query will ignore the events.tsv, because you impose that tsv files exists only for some modalities.

Remi-Gau commented 3 years ago

In actual state layout function checks for comliency with current (hardcoded) bids schema.

Technically we are not "yet" following the schema (see #104) but I see what you mean. Official schema support is on its way though see PR #121)

Which will automatically made it unusable if dataset do not follow defined schema for whatever reason. For example if one uses an old bidsified dataset, not compatible with actual state.

Yup we actually already have this problem in the PR #121 as the FLASH suffix is now deprecated and not encoded in the BIDS schema.

So I propose to move the checks for modalities, file extentions, entities to validate function, and make layout working on any dataset which follows bids syntax: sub-<>_(<key>-<value>)*_<suffix>.<extention>.

Actually, we try to leave the extensive validation to the validator but there is a bit of entity list validation at parsing.

It will not only simplify the code (all complex ifs will move to validate), but also make core code independent of version of BIDS, simplifying the maintenance.

Actually using the official schema is already a way to simplify the code (but work in progress). The issue of how to deal with several BIDS schema version though is an important one.

One thing that could be done is to use "tolerant" option of the "layout" to actually allow parsing of dataset with almost no rule on filenaming except the sub-<>_(<key>-<value>)*_<suffix>.<extention> you mentioned.

nbeliy commented 3 years ago

One thing that could be done is to use "tolerant" option of the "layout" to actually allow parsing of dataset with almost no rule on filenaming except the sub-<>(-)*. you mentioned.

Or just load the layout and run validation afterwards. It will significally simplify the loading part, and allow more complex verifications in validator part. For example check for magintude1 if phasediff is present, or presence of specific json fields and values. The main functional code will remain simple not chainging (once debugged), while the validator can increase the complexity folowwing the BIDS requirements. Validator can even automatically launched after the layout.

Remi-Gau commented 3 years ago

One thing that could be done is to use "tolerant" option of the "layout" to actually allow parsing of dataset with almost no rule on filenaming except the sub-<>(-)*. you mentioned.

Or just load the layout and run validation afterwards. It will significally simplify the loading part, and allow more complex verifications in validator part. For example check for magintude1 if phasediff is present, or presence of specific json fields and values. The main functional code will remain simple not chainging (once debugged), while the validator can increase the complexity folowwing the BIDS requirements. Validator can even automatically launched after the layout.

I am actually reluctant to have layout return a structure that is not about a valid BIDS dataset by default. I would rather have the non-bids-compliance be somethings users have to put extra effort to opt-in.

Having defaults follow "best practice" is actually a good way to nudge good behaviors.

nbeliy commented 3 years ago

I am actually reluctant to have layout return a structure that is not about a valid BIDS dataset by default. I would rather have the non-bids-compliance be somethings users have to put extra effort to opt-in.

Easy:

function result = layout(...)
 <cration of layout in ''tolerant way'>

 if ~result.validate()
    error(...)
end
nbeliy commented 3 years ago

Having defaults follow "best practice" is actually a good way to nudge good behaviors.

Only if the best practices are not restrictive and not changing every month. Integration of new modalities can take years (qMRI and PET).

Remi-Gau commented 3 years ago

@nbeliy moved your comment to a different issue. :-)

Remi-Gau commented 3 years ago

The PR #124 will at least start taking care of the first point and allow to parse datasets in a schemaless fashion.

Just need to switch the tolerant parameter of layout to true.

https://github.com/Remi-Gau/bids-matlab/blob/5a6a598d495d2189964b4dfb0f020f75e2d4be58/tests/test_layout_derivatives.m#L37