Closed jsheunis closed 1 year ago
Ok, it looks like the issue results from this functionality in the `bids_dataset` extractor, which is called inside `get_required_content`:
```python
def _find_bids_root(dataset_path) -> Path:
    """
    Find relative location of BIDS directory within datalad dataset
    """
    participant_paths = list(Path(dataset_path).glob("**/participants.tsv"))
    # 1 - if more than one, select first and output warning
    # 2 - if zero, output error
    # 3 - if 1, add to dataset path and set as bids root dir
    if len(participant_paths) == 0:
        msg = ("The file 'participants.tsv' should be part of the BIDS dataset "
               "in order for the 'bids_dataset' extractor to function correctly")
        raise FileNotFoundError(msg)
    elif len(participant_paths) > 1:
        msg = (f"Multiple 'participants.tsv' files ({len(participant_paths)}) "
               f"were found in the recursive filetree of {dataset_path}, selecting "
               "first path.")
        lgr.warning(msg)
        return Path(participant_paths[0]).parent
    else:
        return Path(participant_paths[0]).parent
```
So according to the code, it is possible for the BIDS root directory to be further down in the tree of the dataset for which extraction is to be done. This causes the problems observed above. I'm not sure if we should actually support this use case implicitly, since it causes these problems. IMO the ideal way of running an extractor is on the specified dataset; if it doesn't contain the required files/metadata, the extractor returns something like an "extraction not possible" result, and result handling continues with whatever is next in line.
We might want to support the use case explicitly, e.g. with an extraction argument (like `allow_relative_root = True | False`), but IMO this should come with a serious disclaimer stating what the result would be if this is run on data similar to the super- and subdataset setup presented above.
Alternatively, there could be a check whether the relative root directory (if found) is actually a subdataset, in which case extraction should not continue.
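A filesystem-level version of that check could look like this (a sketch under the assumption that any intermediate directory containing a `.datalad` or `.git` entry marks a dataset boundary; a real implementation would presumably ask the DataLad API instead):

```python
from pathlib import Path


def crosses_dataset_boundary(dataset_path, bids_root) -> bool:
    """Return True if bids_root lies inside a nested dataset below dataset_path.

    Sketch: a directory is taken to be a (sub)dataset if it contains
    a '.datalad' or '.git' entry.
    """
    dataset_path = Path(dataset_path).resolve()
    current = Path(bids_root).resolve()
    while current != dataset_path:
        if (current / ".datalad").exists() or (current / ".git").exists():
            return True
        if current == current.parent:  # reached the filesystem root
            break
        current = current.parent
    return False
```

If this returns `True` for the BIDS root found by the glob, the extractor could stop and report that extraction is not possible on the specified dataset itself.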
(TODO: if a version of this code remains, the file that is searched for should be changed to `dataset_description.json`, due to https://github.com/datalad/datalad-neuroimaging/issues/121)
Interested in what others would regard as sensible behaviour here.
FTR, the BIDS specification does support BIDS-compliant directories that are further down in the tree of the BIDS dataset root directory: https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#source-vs-raw-vs-derived-data. So this might be the case for some BIDS datasets in the wild.
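This is easy to reproduce: as soon as derivatives (or sourcedata) nested under the BIDS root ship their own `participants.tsv`, the recursive glob used above returns more than one match (illustrative layout, not a real dataset):

```python
import tempfile
from pathlib import Path

# Build a minimal layout where both the raw dataset and a derivative
# pipeline carry their own participants.tsv
root = Path(tempfile.mkdtemp())
(root / "participants.tsv").write_text("participant_id\nsub-01\n")
deriv = root / "derivatives" / "fmriprep"
deriv.mkdir(parents=True)
(deriv / "participants.tsv").write_text("participant_id\nsub-01\n")

hits = sorted(root.glob("**/participants.tsv"))
print(len(hits))  # 2 matches -- the extractor warns and picks the first
```

So even without DataLad nesting, "pick the first recursive match" can silently select the wrong directory.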
thank you @jsheunis for digging! Unrelated to the issue: I would also like to check how far we can get with https://github.com/ANCPLabOldenburg/ancp-bids, which should be more lightweight, represent/use the current BIDS schema, and be co-installable in modern systems (unlike pybids, which has an upper bound on sqlalchemy ATM)
as for nested bids datasets -- yeah, we need to see on best way to decouple the notion of BIDS dataset from DataLad dataset, since:
- as you mentioned, a DataLad dataset can contain multiple BIDS datasets (corner case -- top level of the DataLad dataset is not a BIDS dataset, e.g. YODA-style results without a dedicated subdataset for `rawdata/`, and then a BIDS dataset nested within e.g. `rawdata/`)
- and a BIDS dataset can be represented by multiple DataLad datasets (e.g. per subject)
Thanks for the input
> - as you mentioned DataLad dataset can contain multiple BIDS datasets -- we would need to stick nested datasets' metadata description somewhere too ideally, and then traverse files there; corner case -- top level of the DataLad dataset is not a BIDS dataset (e.g. consider YODA style results but without dedicated subdataset for `rawdata/`), and then BIDS dataset nested within e.g. `rawdata/`
Exactly. Ideally an extractor would be able to figure out what and where to extract automatically, but I think with the combination of (1) datalad dataset nesting and (2) BIDS allowing flexibility in where the dataset directory is located, we cannot leave it up to the extractor to decide. I think this would need some extra user input via extraction parameters.
> - and BIDS dataset can be represented by multiple DataLad datasets (e.g. per subject etc) -- dataset level metadata is kinda easy, but then where do we stick per-file metadata -- into superdataset or individual ones (but those cannot even extract it on their own, rely on hierarchy)
Good point, I didn't consider this before. What could perhaps be useful here is to look at the updated `genericjson_file` extractor and see if the BIDS file-level extractor can inherit from it (see issue: https://github.com/datalad/datalad-neuroimaging/issues/120). That extractor allows the user to specify a sidecar file pattern as the source of the metadata to be extracted. Combining this with an extractor command argument instructing it to traverse into subdatasets could perhaps be a good direction to investigate.
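As a rough illustration of the sidecar-pattern idea (a hypothetical helper; `genericjson_file`'s actual interface may well differ), file-level metadata could be looked up next to each data file via a user-supplied pattern:

```python
import json
from pathlib import Path


def sidecar_metadata(data_file, pattern="{stem}.json"):
    """Load metadata from a sidecar file next to data_file, if present.

    Sketch: ``pattern`` is a hypothetical user-supplied template,
    formatted with the data file's stem.
    """
    data_file = Path(data_file)
    sidecar = data_file.with_name(pattern.format(stem=data_file.stem))
    if sidecar.exists():
        return json.loads(sidecar.read_text())
    return None  # "extraction not possible" for this file
```

The per-file extractor would then yield a result only when a sidecar match exists, which maps naturally onto the "impossible result" behaviour discussed above.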
I will merge PR https://github.com/datalad/datalad-neuroimaging/pull/124 that will fix this issue by removing the option to search for the BIDS dataset root location further down in the filetree of the dataset. I have created a new issue to deal with the possibilities of having BIDS dataset(s) embedded in datalad datasets: https://github.com/datalad/datalad-neuroimaging/issues/126
**The context**
I'm running into a weird issue. I have a superdataset (https://github.com/jsheunis/datalad-catalog-demo-super) which has several subdatasets, including the one at `data/ds001499`, which is a BIDS dataset. I am running metadata extraction on the superdataset using multiple extractors.

**The problem**
When I run the `bids_dataset` extractor (from `datalad-neuroimaging`), `meta-extract` goes into the subdataset, extracts BIDS metadata, and then reports that for the superdataset. Here you can see the call and the full debug output:
Command output with level set to debug:
```
datalad -f json -l debug meta-extract -d . bids_dataset
[DEBUG ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '../datalad-catalog-demo-super', 'bids_dataset']
[DEBUG ] Processing entrypoints
[DEBUG ] Loading entrypoint deprecated from datalad.extensions
[DEBUG ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG ] Loading entrypoint metalad from datalad.extensions
[DEBUG ] Loaded entrypoint metalad from datalad.extensions
[DEBUG ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loading entrypoint catalog from datalad.extensions
[DEBUG ] Loaded entrypoint catalog from datalad.extensions
[DEBUG ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG ] Done processing entrypoints
[DEBUG ] Building doc for
```

**More info:**
The superdataset ID and VERSION are shown in the json output:
The datalad ID of the subdataset:
**Relevant comments**

**Comment 1**
This same problem occurs when I run `meta-conduct` on the superdataset with `traverser.traverse_sub_datasets=True` (I actually came across the issue the first time when using `meta-conduct`):

Note in the output that there are two extraction results containing BIDS metadata, one for the superdataset and one for the subdataset. Note also that these objects differ in their content, specifically that the superdataset object has field `description` equal to `null`, and the subdataset object has field `description` equal to a JSON string:

**Comment 2**
The problem seems to only occur for `bids_dataset` and not for other extractors. I created an analogous test with `metalad_studyminimeta` (i.e. a superdataset with no metadata, and a subdataset with a `.studyminimeta.yaml` file). This only reported that extraction was not possible, since the required metadata file is absent:

**Meta-extract output:**
```
[DEBUG ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '.', 'metalad_studyminimeta']
[DEBUG ] Processing entrypoints
[DEBUG ] Loading entrypoint deprecated from datalad.extensions
[DEBUG ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG ] Loading entrypoint metalad from datalad.extensions
[DEBUG ] Loaded entrypoint metalad from datalad.extensions
[DEBUG ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loading entrypoint catalog from datalad.extensions
[DEBUG ] Loaded entrypoint catalog from datalad.extensions
[DEBUG ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG ] Done processing entrypoints
[DEBUG ] Building doc for
```

**Comment 3**
The above comment suggests that the problem lies in the extractor code. But something that confuses me from the initial `meta-extract` debug logs is when the process dives into the subdatasets:

I'm not sure why/how this happens.