dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

add bids metadata extraction #432

Open satra opened 3 years ago

satra commented 3 years ago

as we deal with non-nwb dandisets, it would be good to add bids metadata extraction, which may require parsing the tree (e.g., to get age and sex from participants.tsv).
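
As a rough illustration of the participants.tsv part, here is a minimal sketch using only the Python standard library. It assumes the usual BIDS layout (a tab-separated file with a `participant_id` column and `n/a` for missing values); `read_participants` is a hypothetical helper name, not an existing dandi-cli function.

```python
import csv
import io


def read_participants(tsv_text):
    """Return {participant_id: {column: value}} from participants.tsv content.

    Hypothetical helper for illustration only; not part of dandi-cli.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    participants = {}
    for row in reader:
        pid = row.pop("participant_id")
        # BIDS marks missing values as "n/a"; map those to None
        participants[pid] = {
            key: (None if value == "n/a" else value) for key, value in row.items()
        }
    return participants


# Made-up example content, not taken from a real dataset
example = "participant_id\tage\tsex\nsub-01\t34\tF\nsub-02\tn/a\tM\n"
participants = read_participants(example)
print(participants["sub-01"])
```

Values are kept as strings here; converting age to a number or a duration would be a separate decision.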

yarikoptic commented 3 years ago

I think this functionality should, as much as possible, align with or reuse https://github.com/datalad/datalad-neuroimaging/blob/master/datalad_neuroimaging/extractors/bids.py, which ATM is just a dump of the metadata as provided by pybids.
But IIRC @mih mentioned that, in the scope of ebrains openminds, he is considering (or just advising on?) providing a "tighter" harmonization. @mih could you briefly chime in here on the plans on that end? (or just add references)

satra commented 3 years ago

alignment is good, but we will also want to fill in the fields of our asset metadata structure that describe participants and biosamples.

mih commented 3 years ago

What I was talking about in that meeting was that a bids2openminds conversion is taking place outside the scope of a metadata extractor. An extractor should report "as-is". If the metadata source (like BIDS) is not "semantically clean", a subsequent (and updatable) transformation can be used to yield a "better" (or just different) record.

I realized at some point that doing the standardization at the level of an extractor implies that applying any update to that standardization requires actual data access, and also makes metadata extraction an inherently open-ended process. Adding the possibility of customizable transformations of metadata seems much more practical when data access is complicated (which it seems to be for most datasets).
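
The separation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `extract_as_is`, `harmonize`, `SEX_LABELS`, and the output field names are all hypothetical and do not correspond to the datalad extractor API, the openMINDS schema, or the DANDI schema.

```python
def extract_as_is(row):
    """'Extractor' step: report the source record unchanged (needs data access)."""
    return dict(row)


# Hypothetical vocabulary mapping; it can be revised later without touching the data
SEX_LABELS = {
    "f": "female", "female": "female",
    "m": "male", "male": "male",
    "o": "other", "other": "other",
}


def harmonize(record):
    """Transformation step: derive a cleaner record from the stored raw one."""
    raw_sex = record.get("sex")
    sex = SEX_LABELS.get(raw_sex.lower()) if raw_sex else None
    # Output field names are made up for this sketch
    return {"subject_id": record["participant_id"], "sex": sex}


raw = extract_as_is({"participant_id": "sub-01", "sex": "F", "age": "34"})
clean = harmonize(raw)
print(clean)
```

Because `harmonize` only reads the stored raw record, it can be rerun whenever the mapping changes, without going back to the data files.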

satra commented 3 years ago

@yarikoptic - perhaps we can add some bids support in the short term with respect to participant id and a few other things.

@mih - in our case metadata extraction is performed at the point of validation/upload so access is there. in the future we may want to extend the schema, for which we would indeed need to pull in the directory structure (especially for bids, where the inheritance principle does apply for some metadata).
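
To make the inheritance point concrete, here is a simplified sketch of why the directory structure is needed. It assumes JSON sidecars only, treats a sidecar as applicable when it has the same suffix and its filename entities are a subset of the data file's, and lets sidecars deeper in the tree override shallower ones. The full BIDS rules are stricter than this, and `inherited_metadata` is a hypothetical helper, not a pybids or dandi-cli function.

```python
import json
import tempfile
from pathlib import Path


def split_name(path):
    """Split a BIDS-style filename into (set of entities, suffix)."""
    stem = path.name.split(".", 1)[0]
    *entities, suffix = stem.split("_")
    return set(entities), suffix


def inherited_metadata(data_file, dataset_root):
    """Merge applicable JSON sidecars from the dataset root down to the file."""
    data_file, dataset_root = Path(data_file), Path(dataset_root)
    entities, suffix = split_name(data_file)
    # Directories from the dataset root down to the file's own directory
    dirs = [dataset_root]
    for part in data_file.parent.relative_to(dataset_root).parts:
        dirs.append(dirs[-1] / part)
    merged = {}
    for directory in dirs:
        for sidecar in sorted(directory.glob("*.json")):
            side_entities, side_suffix = split_name(sidecar)
            if side_suffix == suffix and side_entities <= entities:
                # Later (deeper) sidecars override earlier (shallower) ones
                merged.update(json.loads(sidecar.read_text()))
    return merged


# Made-up miniature dataset for demonstration
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    func = root / "sub-01" / "func"
    func.mkdir(parents=True)
    (root / "task-rest_bold.json").write_text(
        json.dumps({"RepetitionTime": 2.0, "TaskName": "rest"})
    )
    (func / "sub-01_task-rest_bold.json").write_text(
        json.dumps({"RepetitionTime": 1.5})
    )
    bold = func / "sub-01_task-rest_bold.nii.gz"
    bold.touch()
    metadata = inherited_metadata(bold, root)

print(metadata)
```

The metadata for a single file therefore depends on files elsewhere in the tree, which is why per-file extraction alone would miss part of it.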

yarikoptic commented 3 years ago

yeah, I guess we shouldn't postpone for too long. I do not think we should, at this point, amalgamate data + sidecar files into a single asset, so we will keep it simple (KISS) and have one asset per file, be it a data, sidecar, or metadata file. dataset_description.json will also be a first-class citizen and have its own asset.

Do you know of any prospective datasets which would be uploaded and should be BIDS?