jsheunis commented 2 years ago

This PR is in response to #94. It adds a dataset-level extractor for BIDS datasets, called bids_dataset that:

uses gen4 metadata handling with datalad-metalad (and introduces this dependency, datalad-metalad>=0.3.1)
ensures the required files dataset_description.json and participants.tsv are available locally before proceeding with the extraction process
does not require locally available file content, other than the above mentioned or README text file content, which it makes part of the extraction output (automatically running get where applicable)
is compatible with pybids>=0.15.1 and BIDS v1.6.0
does not change the existing bids extractor in any way
extracts extra information about the BIDS dataset (compared to the existing bids extractor), including information about subjects, sessions, runs, tasks, entities, and variables.

Old PR comment:

This WIP PR intended to address #94 introduces a new extractor bids_dataset that:

builds on the new/next generation datalad-metalad functionality
only extracts metadata from content that does not require getting annexed data
replaces dataset-level metadata extraction functionality of the existing bids.py extractor

Main changes:

adds bids_dataset.py extractor
adds test_bids_dataset.py test
adds new extractor entrypoint to setup.cfg
adds metalad requirement to setup.cfg

yarikoptic commented 2 years ago

only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

codecov[bot] commented 2 years ago

Codecov Report

Merging #104 (9cd3a3b) into master (caec500) will increase coverage by 0.69%. The diff coverage is 90.44%.

:exclamation: Current head 9cd3a3b differs from pull request most recent head 73cc22f. Consider uploading reports for the commit 73cc22f to get more accurate results

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
+ Coverage   85.12%   85.82%   +0.69%     
==========================================
  Files          21       23       +2     
  Lines        1049     1206     +157     
==========================================
+ Hits          893     1035     +142     
- Misses        156      171      +15

Impacted Files	Coverage Δ
datalad_neuroimaging/extractors/bids_dataset.py	`88.00% <88.00%> (ø)`
...neuroimaging/extractors/tests/test_bids_dataset.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update caec500...73cc22f. Read the comment docs.

jsheunis commented 2 years ago

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

good point. I don't know that they won't be annexed.

Something that I think is important with the use of metalad is to make a distinction between dataset and file level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files that would be necessary for extraction of dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata but which, if combined/aggregated/derived, could turn into metadata that is interpretable on the dataset-level. But the amount or combination of such files is arbitrary and not easily definable on the dataset-level. And I think even without that information, the currently proposed dataset-level extractor could still extract useful information.

jsheunis commented 2 years ago

TODO for @jsheunis:

[x] ensure dataset_description.json file is available locally, if not get it.
[x] decide if README extraction is handled as is, or by tbd helper
[x] UUID
[x] proper logging, result yielding, error handling, and user messaging

jsheunis commented 2 years ago

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

good point. I don't know that they won't be annexed.

Something that I think is important with the use of metalad is to make a distinction between dataset and file level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files that would be necessary for extraction of dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata but which, if combined/aggregated/derived, could turn into metadata that is interpretable on the dataset-level. But the amount or combination of such files is arbitrary and not easily definable on the dataset-level. And I think even without that information, the currently proposed dataset-level extractor could still extract useful information.

The challenge of whether required files are annexed or not is addressed by correctly specifying the get_required_content functionality, as per https://github.com/datalad/datalad-neuroimaging/pull/104/commits/f5655da754c86dee299e3324f68810019b541d7f

jsheunis commented 2 years ago

Ok, this PR has been open for a long time and I want to merge it. I'm going to finish whatever is remaining and easy to complete, and will then merge unless there is strong disagreement @datalad/developers

jsheunis commented 2 years ago

only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

This initial statement is not correct anymore. Any annexed content that is necessary for the extraction process will be fetched before extraction starts via metalad's get_required_content functionality, or where applicable on a case by case basis (e.g. readme content).

jsheunis commented 2 years ago

Remaining issue is the failing test, will sort that now.

jsheunis commented 2 years ago

Remaining test failure seems to be related to some dataset not containing gen4 metadata (probably outdated testdata?, or could be related to the metalad version bump?), while test_bids_dataset.py now succeeds. Will create a separate issue for remaining failure.

datalad / datalad-neuroimaging

NF: Create updated dataset-level extractor for BIDS datasets #104

Codecov Report