Closed jsheunis closed 1 year ago
Ok, it looks like the issue results from this functionality in the `bids_dataset` extractor, which is called inside `get_required_content`:
```python
def _find_bids_root(dataset_path) -> Path:
    """
    Find relative location of BIDS directory within datalad dataset
    """
    participant_paths = list(Path(dataset_path).glob("**/participants.tsv"))
    # 1 - if more than one, select first and output warning
    # 2 - if zero, output error
    # 3 - if 1, add to dataset path and set as bids root dir
    if len(participant_paths) == 0:
        msg = ("The file 'participants.tsv' should be part of the BIDS dataset "
               "in order for the 'bids_dataset' extractor to function correctly")
        raise FileNotFoundError(msg)
    elif len(participant_paths) > 1:
        msg = (f"Multiple 'participants.tsv' files ({len(participant_paths)}) "
               f"were found in the recursive filetree of {dataset_path}, selecting "
               "first path.")
        lgr.warning(msg)
        return Path(participant_paths[0]).parent
    else:
        return Path(participant_paths[0]).parent
```
So according to the code, it is possible for the BIDS root directory to be further down in the tree of the dataset for which extraction is to be done. This causes the problems observed above. I'm not sure if we should actually support this use case implicitly, since it causes these problems. IMO the ideal way of running an extractor is on the specified dataset; if it doesn't contain the required files/metadata, the extractor returns something like an "extraction not possible" result, and result handling continues with whatever is next in line.
We might want to support the use case explicitly, e.g. with an extraction argument (like `allow_relative_root = True | False`), but IMO this should come with a serious disclaimer stating what the result would be if this is run on data similar to the super- and subdataset setup presented above.
Alternatively, there could be a check whether the relative root directory (if found) is actually a subdataset, in which case extraction should not continue.
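A filesystem-level version of that check could look like this (a sketch under the assumption that any intermediate directory containing a `.datalad` or `.git` entry marks a dataset boundary; a real implementation would presumably ask the DataLad API instead):

```python
from pathlib import Path


def crosses_dataset_boundary(dataset_path, bids_root) -> bool:
    """Return True if bids_root lies inside a nested dataset below dataset_path.

    Sketch: a directory is taken to be a (sub)dataset if it contains
    a '.datalad' or '.git' entry.
    """
    dataset_path = Path(dataset_path).resolve()
    current = Path(bids_root).resolve()
    while current != dataset_path:
        if (current / ".datalad").exists() or (current / ".git").exists():
            return True
        if current == current.parent:  # reached the filesystem root
            break
        current = current.parent
    return False
```

If this returns `True` for the BIDS root found by the glob, the extractor could stop and report that extraction is not possible on the specified dataset itself.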
(TODO: if a version of this code remains, the file that is searched for should be changed to `dataset_description.json`, due to https://github.com/datalad/datalad-neuroimaging/issues/121)
Interested in what others would regard as sensible behaviour here.
FTR, the BIDS specification does support BIDS-compliant directories that are further down in the tree of the BIDS dataset root directory: https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#source-vs-raw-vs-derived-data. So this might be the case for some BIDS datasets in the wild.
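This is easy to reproduce: as soon as derivatives (or sourcedata) nested under the BIDS root ship their own `participants.tsv`, the recursive glob used above returns more than one match (illustrative layout, not a real dataset):

```python
import tempfile
from pathlib import Path

# Build a minimal layout where both the raw dataset and a derivative
# pipeline carry their own participants.tsv
root = Path(tempfile.mkdtemp())
(root / "participants.tsv").write_text("participant_id\nsub-01\n")
deriv = root / "derivatives" / "fmriprep"
deriv.mkdir(parents=True)
(deriv / "participants.tsv").write_text("participant_id\nsub-01\n")

hits = sorted(root.glob("**/participants.tsv"))
print(len(hits))  # 2 matches -- the extractor warns and picks the first
```

So even without DataLad nesting, "pick the first recursive match" can silently select the wrong directory.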
thank you @jsheunis for digging! Unrelated to the issue: I would also like to check how far we can get with https://github.com/ANCPLabOldenburg/ancp-bids, which should be more lightweight, represent/use the current BIDS schema, and be co-installable in modern systems (unlike pybids, which has an upper bound on sqlalchemy ATM)
as for nested bids datasets -- yeah, we need to see on best way to decouple the notion of BIDS dataset from DataLad dataset, since:
- as you mentioned, a DataLad dataset can contain multiple BIDS datasets (corner case -- top level of the DataLad dataset is not a BIDS dataset, e.g. YODA-style results without a dedicated subdataset for `rawdata/`, and then a BIDS dataset nested within e.g. `rawdata/`)
- and a BIDS dataset can be represented by multiple DataLad datasets (e.g. per subject)
Thanks for the input
> - as you mentioned DataLad dataset can contain multiple BIDS datasets -- we would need to stick nested datasets' metadata description somewhere too ideally, and then traverse files there; corner case -- top level of the DataLad dataset is not a BIDS dataset (e.g. consider YODA style results but without dedicated subdataset for `rawdata/`), and then BIDS dataset nested within e.g. `rawdata/`
Exactly. Ideally an extractor would be able to figure out what and where to extract automatically, but I think with the combination of (1) datalad dataset nesting and (2) BIDS allowing flexibility in where the dataset directory is located, we cannot leave it up to the extractor to decide. I think this would need some extra user input via extraction parameters.
> - and BIDS dataset can be represented by multiple DataLad datasets (e.g. per subject etc) -- dataset level metadata is kinda easy, but then where do we stick per-file metadata -- into superdataset or individual ones (but those cannot even extract it on their own, rely on hierarchy)
Good point, I didn't consider this before. What could perhaps be useful here is to look at the updated `genericjson_file` extractor and see if the BIDS file-level extractor can inherit from it (see issue: https://github.com/datalad/datalad-neuroimaging/issues/120). That extractor allows the user to specify a sidecar file pattern as the source of the metadata to be extracted. Combining this with an extractor command argument instructing it to traverse into subdatasets could perhaps be a good direction to investigate.
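As a rough illustration of the sidecar-pattern idea (a hypothetical helper; `genericjson_file`'s actual interface may well differ), file-level metadata could be looked up next to each data file via a user-supplied pattern:

```python
import json
from pathlib import Path


def sidecar_metadata(data_file, pattern="{stem}.json"):
    """Load metadata from a sidecar file next to data_file, if present.

    Sketch: ``pattern`` is a hypothetical user-supplied template,
    formatted with the data file's stem.
    """
    data_file = Path(data_file)
    sidecar = data_file.with_name(pattern.format(stem=data_file.stem))
    if sidecar.exists():
        return json.loads(sidecar.read_text())
    return None  # "extraction not possible" for this file
```

The per-file extractor would then yield a result only when a sidecar match exists, which maps naturally onto the "impossible result" behaviour discussed above.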
I will merge PR https://github.com/datalad/datalad-neuroimaging/pull/124 that will fix this issue by removing the option to search for the BIDS dataset root location further down in the filetree of the dataset. I have created a new issue to deal with the possibilities of having BIDS dataset(s) embedded in datalad datasets: https://github.com/datalad/datalad-neuroimaging/issues/126
**The context**
I'm running into a weird issue. I have a superdataset (https://github.com/jsheunis/datalad-catalog-demo-super) which has several subdatasets, including the one at `data/ds001499`, which is a BIDS dataset. I am running metadata extraction on the superdataset using multiple extractors.

**The problem**
When I run the `bids_dataset` extractor (from `datalad-neuroimaging`), `meta-extract` goes into the subdataset, extracts BIDS metadata, and then reports that for the superdataset. Here you can see the call and the full debug output:
Command output with level set to debug:
```
datalad -f json -l debug meta-extract -d . bids_dataset
[DEBUG ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '../datalad-catalog-demo-super', 'bids_dataset']
[DEBUG ] Processing entrypoints
[DEBUG ] Loading entrypoint deprecated from datalad.extensions
[DEBUG ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG ] Loading entrypoint metalad from datalad.extensions
[DEBUG ] Loaded entrypoint metalad from datalad.extensions
[DEBUG ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loading entrypoint catalog from datalad.extensions
[DEBUG ] Loaded entrypoint catalog from datalad.extensions
[DEBUG ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG ] Done processing entrypoints
[DEBUG ] Building doc for
```

**More info:**
The superdataset ID and VERSION are shown in the json output:
The datalad ID of the subdataset:
**Relevant comments**

**Comment 1**
This same problem occurs when I run `meta-conduct` on the superdataset with `traverser.traverse_sub_datasets=True` (I actually came across the issue the first time when using `meta-conduct`):

Note in the output that there are two extraction results containing BIDS metadata, one for the superdataset and one for the subdataset. Note also that these objects differ in their content, specifically that the superdataset object has field `description` equal to `null`, and the subdataset object has field `description` equal to a JSON string:

**Comment 2**
The problem seems to only occur for `bids_dataset` and not for other extractors. I created an analogous test with `metalad_studyminimeta` (i.e. a superdataset with no metadata, and a subdataset with a `.studyminimeta.yaml` file). This only reported that extraction was not possible, since the required metadata file is absent:

**Meta-extract output:**
```
[DEBUG ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '.', 'metalad_studyminimeta']
[DEBUG ] Processing entrypoints
[DEBUG ] Loading entrypoint deprecated from datalad.extensions
[DEBUG ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG ] Loading entrypoint metalad from datalad.extensions
[DEBUG ] Loaded entrypoint metalad from datalad.extensions
[DEBUG ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG ] Loading entrypoint catalog from datalad.extensions
[DEBUG ] Loaded entrypoint catalog from datalad.extensions
[DEBUG ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG ] Done processing entrypoints
[DEBUG ] Building doc for
```

**Comment 3**
The above comment suggests that the problem lies in the extractor code. But something that confuses me from the initial `meta-extract` debug logs is when the process dives into the subdatasets:

I'm not sure why/how this happens.