datalad / datalad-neuroimaging

DataLad extension for neuroimaging research
http://datalad.org
Other
17 stars 14 forks source link

NF: Create updated dataset-level extractor for BIDS datasets #104

Closed jsheunis closed 2 years ago

jsheunis commented 2 years ago

This PR is in response to #94. It adds a dataset-level extractor for BIDS datasets, called bids_dataset that:


Old PR comment:

This WIP PR intended to address #94 introduces a new extractor bids_dataset that:

Main changes:

yarikoptic commented 2 years ago
  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

codecov[bot] commented 2 years ago

Codecov Report

Merging #104 (9cd3a3b) into master (caec500) will increase coverage by 0.69%. The diff coverage is 90.44%.

:exclamation: Current head 9cd3a3b differs from pull request most recent head 73cc22f. Consider uploading reports for the commit 73cc22f to get more accurate results

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
+ Coverage   85.12%   85.82%   +0.69%     
==========================================
  Files          21       23       +2     
  Lines        1049     1206     +157     
==========================================
+ Hits          893     1035     +142     
- Misses        156      171      +15     
Impacted Files Coverage Δ
datalad_neuroimaging/extractors/bids_dataset.py 88.00% <88.00%> (ø)
...neuroimaging/extractors/tests/test_bids_dataset.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update caec500...73cc22f. Read the comment docs.

jsheunis commented 2 years ago

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

good point. I don't know that they won't be annexed.

Something that I think is important with the use of metalad is to make a distinction between dataset and file level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files that would be necessary for extraction of dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata but which, if combined/aggregated/derived, could turn into metadata that is interpretable on the dataset-level. But the amount or combination of such files is arbitrary and not easily definable on the dataset-level. And I think even without that information, the currently proposed dataset-level extractor could still extract useful information.

jsheunis commented 2 years ago

TODO for @jsheunis:

jsheunis commented 2 years ago

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

good point. I don't know that they won't be annexed.

Something that I think is important with the use of metalad is to make a distinction between dataset and file level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files that would be necessary for extraction of dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata but which, if combined/aggregated/derived, could turn into metadata that is interpretable on the dataset-level. But the amount or combination of such files is arbitrary and not easily definable on the dataset-level. And I think even without that information, the currently proposed dataset-level extractor could still extract useful information.

The challenge of whether required files are annexed or not is addressed by correctly specifying the get_required_content functionality, as per https://github.com/datalad/datalad-neuroimaging/pull/104/commits/f5655da754c86dee299e3324f68810019b541d7f

jsheunis commented 2 years ago

Ok, this PR has been open for a long time and I want to merge it. I'm going to finish whatever is remaining and easy to complete, and will then merge unless there is strong disagreement @datalad/developers

jsheunis commented 2 years ago
  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

This initial statement is not correct anymore. Any annexed content that is necessary for the extraction process will be fetched before extraction starts via metalad's get_required_content functionality, or where applicable on a case by case basis (e.g. readme content).

jsheunis commented 2 years ago

Remaining issue is the failing test, will sort that now.

jsheunis commented 2 years ago

Remaining test failure seems to be related to some dataset not containing gen4 metadata (probably outdated testdata?, or could be related to the metalad version bump?), while test_bids_dataset.py now succeeds. Will create a separate issue for remaining failure.