Upgrade `bids.py` metadata extractor

jsheunis commented 2 years ago

It would be useful if the bids extractor (and for some points, eventually all other extractors in this extension) could:

be compatible with the new generation of metadata handling, i.e.:
- inherit metadata classes from datalad-metalad (not datalad.metadata.*)
- make use of the distinction between dataset-level and file-level metadata extraction
extract everything there is to extract from a datalad dataset (i.e. not from annexed data) before datalad getting any file content that might be necessary for further extraction. (currently the extraction starts by getting all required file content...)
be compatible with, and use updated functionality of, the latest stable version of pybids

I've made a start at this. I'm working on this within the context of the catalog: likely many of our future users will be working with BIDS data and would want to extract BIDS metadata and have it rendered in the catalog. So I have an idea of the BIDS-related metadata that would be useful in the catalog, but I'm keen to get input from other @datalad/developers if there are features that you think will be useful to include.

jsheunis commented 2 years ago

Also tagging @surchs and @CPernet based on likely shared interest.

yarikoptic commented 2 years ago

I Wholeheartedly support this initiative!

FWIW, Regarding getting content: for use/integration with datalad-fuse I am thinking of adding some config variable to metalad to bypass getting content by metalad since it would be available (only needed portions of the file) via fuse. Might need though a mode which would first query datalad-fuse on either it can access that file data, and get only when it can't (eg file on some fancy special remote datalad-fuse has no clue of how to access via fsspec)

jsheunis commented 2 years ago

Oh wow, didn't know about datalad-fuse and now I do, great! I'm guessing that functionality (check whether possible to get partial content via fuse, otherwise datalad get) would probably live in metalad? The extension extractor could possibly specify whether it needs access to no / partial / full file content.

yarikoptic commented 2 years ago

I'm guessing that functionality (check whether possible to get partial content via fuse, otherwise datalad get) would probably live in metalad?

May be ;-) in principle it could be a mode of operation on datalad-fuse , so that is it fails to access file via fsspec, it would just get it in full. Then metalad wouldn't need to deal with that.

The extension extractor could possibly specify whether it needs access to no / partial / full file content.

I thought about the same but still hope we could avoid that.

jsheunis commented 2 years ago

Here's a sample of output from the BOLD5000 dataset that I ran locally:

datalad clone https://github.com/OpenNeuroDatasets/ds001499
cd ds001499
datalad -f json_pp meta-extract -d . bids_dataset

yields:

{
  "action": "meta_extract",
  "metadata_record": {
    "agent_email": "s.heunis@fz-juelich.de",
    "agent_name": "Stephan Heunis",
    "dataset_id": "3e874376-b053-11e8-b9ac-0242ac130026",
    "dataset_version": "5be66b27ab5e033e9163caa94cec882bd4cee1d0",
    "extracted_metadata": {
      "@context": {
        "age(years)": {
          "@id": "pato:0000011",
          "description": "age of a sample (organism) at the time of data acquisition in years",
          "unit": "uo:0000036",
          "unit_label": "year"
        },
        "bids": {
          "@id": "http://bids.neuroimaging.io/bids_spec1.0.2.pdf#",
          "description": "ad-hoc vocabulary for the Brain Imaging Data Structure (BIDS) standard",
          "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"
        }
      },
      "Acknowledgements": "We thank Scott Kurdilla for his patience as our MRI technologist throughout all data collection. We would also like to thank Austin Marcus for his assistance in various stages of this project, Jayanth Koushik for his assistance in AlexNet feature extractions, and Ana Van Gulick for her assistance with public data distribution and open science issues. Finally, we thank our participants for their participation and patience, without them this dataset would not have been possible.",
      "BIDSVersion": "1.0.2",
      "DatasetDOI": "10.18112/openneuro.ds001499.v1.3.1",
      "HowToAcknowledge": "Please cite our paper available on arXiv: http://arxiv.org/abs/1809.01281",
      "author": [
        "Nadine Chang",
        "John A. Pyles",
        "Austin Marcus",
        "Abhinav Gupta",
        "Michael J. Tarr",
        "Elissa M. Aminoff"
      ],
      "citation": [
        "https://bold5000.org"
      ],
      "conformsto": "http://bids.neuroimaging.io/bids_spec1.0.2.pdf",
      "entities": {
        "acquisition": [
          "spinecho",
          "spinechopf68",
          "PA",
          "AP"
        ],
        "datatype": [
          "anat",
          "func",
          "fmap",
          "dwi"
        ],
        "direction": [
          "PA",
          "AP"
        ],
        "extension": [
          ".tsv",
          ".bval",
          ".nii.gz",
          ".tsv.gz",
          ".bvec",
          ".json"
        ],
        "fmap": [
          "epi"
        ],
        "recording": [
          "cardiac",
          "respiratory",
          "trigger"
        ],
        "run": [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
          8,
          9,
          10
        ],
        "session": [
          "12",
          "15",
          "04",
          "06",
          "13",
          "14",
          "10",
          "03",
          "08",
          "05",
          "11",
          "07",
          "02",
          "01",
          "09",
          "16"
        ],
        "subject": [
          "CSI1",
          "CSI4",
          "CSI2",
          "CSI3"
        ],
        "suffix": [
          "events",
          "participants",
          "sessions",
          "T2w",
          "physio",
          "T1w",
          "epi",
          "dwi",
          "bold",
          "description"
        ],
        "task": [
          "localizer",
          "5000scenes"
        ]
      },
      "fundedby": "This dataset was collected with the support of NSF Award BCS-1439237 to Elissa M. Aminoff and Michael J. Tarr, ONR MURI N000141612007 and Sloan, Okawa Fellowship to Abhinav Gupta, and NSF Award BSC-1640681 to Michael Tarr.",
      "license": "CC0",
      "name": "BOLD5000",
      "readme": "BOLD5000: Brains, Objects, Landscapes Dataset\n\nFor details please refer to BOLD5000.org and our paper on arXiv (http://arxiv.org/abs/1809.01281)\n\n*Participant Directories Content*\n1) Four participants: CSI1, CSI2, CSI3, & CSI4\n2) Functional task data acquisition sessions: sessions #1-15\nEach functional session includes:\n-3 sets of fieldmaps (EPI opposite phase encoding; spin-echo opposite phase encoding pairs with partial & non-partial Fourier)\n-9 or 10 functional scans of slow event-related 5000 scene data (5000scenes)\n-1 or 0 functional localizer scans used to define scene selective regions (localizer)\n-each event.json file lists each stimulus, the onset time, and the participant\u2019s response (participants performed a simple valence task) \n3) Anatomical data acquisition session: #16\nAnatomical Data: T1 weighted MPRAGE scan, a T2 weighted SPACE, diffusion spectrum imaging   \n\nNotes:\n-All MRI and fMRI data provided is with Siemens pre-scan normalization filter.  \n-CSI4 only participated in 10 MRI sessions: 1-9 were functional acquisition sessions, and 10 was the anatomical data acquisition session.\n\n*Derivatives Directory Content*\n1) fMRIprep: \n-Preprocessed data for all functional data of CSI1 through CSI4 (listed in folders for each participant: derivatives/fmriprep/sub-CSIX). Data was preprocessed both in T1w image space and on surface space. Functional data was motion corrected, susceptibility distortion corrected, and aligned to the anatomical data using bbregister. Please refer to the paper for the details on preprocessing.\n-Reports resulting from fMRI prep, which include the success of anatomical alignment and distortion correction, among other measures of preprocessing success are all listed in the sub-CSIX.html files.  \n2) Freesurfer: Freesurfer reconstructions as a result of fMRIprep preprocessing stream. \n3) MRIQC: Image quality metrics (IQMs) of the dataset using MRIQC. \n-CSIX-func.csv files are text files with a list of all IQMs for each session, for each run.\n-CSIX-anat.csv files are text files with a list of all IQMs for the scans acquired in the anatomical session (e.g., MPRAGE). \n-CSIX_IQM.xls an excel workbook, each sheet of workbook lists the IQMs for a single run. This is the same data as CSIX-func.csv, except formatted differently. \n-sub-CSIX/derivatives: contain .json with the MRIQC/IQM results for each run. \n-sub-CSIX/reports: contains .html file with MRIQC/IQM results for each run along with mean signal and standard deviation maps. \n4)spm: A directory that contains the masks used to define each region of interest (ROI) in each participant. There were 10 ROIs: early visual (EarlyVis), lateral occipital cortex (LOC), occipital place area (OPA), parahippocampal place area (PPA), retrosplenial complex (RSC) for the left hemisphere (LH) and right hemisphere (RH).",
      "variables": {
        "dataset": [
          "subject",
          "age",
          "handedness",
          "sex",
          "suffix"
        ],
        "subject": [
          "session",
          "subject",
          "At this point in the day, you have eaten...",
          "Date",
          "Did you work out today?",
          "Do you drink alcoholic beverages?",
          "Do you drink caffeinated beverages (i.e. coffee, tea, coke, etc.)?",
          "Do you smoke?",
          "Duration (in seconds)",
          "End Date",
          "Have you taken ibuprofen today (e.g. Advil, Motrin)?",
          "How long ago was your last meal?",
          "How many hours of sleep did you get last night?",
          "If so, what was the activity, and how long ago?",
          "If so, when was the last time you had a caffeinated beverage?",
          "If so, when was the last time you had an alcoholic beverage?",
          "If so, when was the last time you smoked?",
          "If so, when was the last time you took it?",
          "In the scanner today I was mentally: - Click to write Choice 1",
          "In the scanner today I was physically: - Click to write Choice 1",
          "Is there anything you think we should know about your experience in the MRI (e.g. were you tired, confused about task instructions, etc)?",
          "Is this...",
          "Is this....1",
          "Is this....2",
          "Is this....3",
          "Is this....4",
          "Is this....5",
          "Please comment on which of these things, if any, are particularly different today and/or if you think they have affected your performance (for better or for worse):",
          "Progress",
          "Recorded Date",
          "Response ID",
          "Session",
          "Start Date",
          "Subject ID",
          "Time",
          "suffix"
        ]
      }
    },
    "extraction_parameter": {},
    "extraction_time": 1643231572.23382,
    "extractor_name": "bids_dataset",
    "extractor_version": "0.0.1",
    "type": "dataset"
  },
  "path": "/Users/jsheunis/Documents/psyinf/Data/ds001499",
  "status": "ok",
  "type": "dataset"
}

It contains all the fields extracted by the existing bids extractor, and additionally:

entities: the results of bids.BIDSLayout.get_[entity] (where entity could be subject, run, etc).
variables: the results of bids.BIDSLayout.get_collections(level=[entity])[0].to_df().columns (where entity is one of run, session, subject, dataset). The variable collection from entity='dataset' gives the column headings of the participants.tsv file. The variable collection from entity='run' will only yield a result if annexed content is available locally.
readme: populated from the readme.md file, if available; not replacing the description field.

The extra info from entities and variables could all be useful tags on the dataset level, especially when filtering/quering.

Lastly, the above all runs on the light-weight datalad dataset, i.e. does not require local access to annexed content (assuming a text2git configuration).

jsheunis commented 2 years ago

New bids_dataset extractor that works with next generation metalad has been added here in the update-bids-extractor branch: https://github.com/datalad/datalad-neuroimaging/tree/update-bids-extractor

No PR yet since I am uncertain about:

This update adds the metalad dependency, which would eventually be the goal but is that something we want to introduce into master now already?

jsheunis commented 2 years ago

TODO for @jsheunis:

merge updated master
Create WIP PR

christian-monch commented 2 years ago

Now that metalad 0.3.0 is released. You could add your bids extractor to the datalad-metalad repo, if you want. Feel free to create a PR against master, I will cherry-pick or merge it into the maint-0.3-branch and it will be contained in the next release.

jsheunis commented 2 years ago

Thanks. AFAIK it probably still makes sense to keep the bids_dataset extractor as part of datalad-neuroimaging, but happy to hear arguments for the alternative.

datalad / datalad-neuroimaging

Upgrade `bids.py` metadata extractor #94