@yarikoptic I'm thinking something like this:

- `DatasetDescriptionAsset(LocalFileAsset)`: class for representing `dataset_description.json` files
- `BIDSFileAsset(LocalFileAsset)`: class for representing any other file underneath a directory with a `dataset_description.json`; instances will have an attribute storing the corresponding `DatasetDescriptionAsset`
- `NWBBidsAsset(NWBAsset, BIDSFileAsset)`: class for representing NWBs in BIDS datasets, with any necessary custom behavior
- `find_dandi_files()`: to check whether each directory contains a `dataset_description.json` and, if it does, produce the appropriate asset classes for the files within
- `dandi_file()`: gains a `dataset_description: Optional[DatasetDescriptionAsset] = None` argument for use when the file is known to reside within a BIDS dataset; if this argument is `None`, it is assumed to not be in one

Note that this assumes that, if a Dandiset contains one or more BIDS datasets that aren't at the root, then we want any non-NWB, non-Zarr files outside of the BIDS datasets to not be uploaded as assets by default.
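For illustration, a minimal sketch of that hierarchy (class bodies are placeholders; `LocalFileAsset` and `NWBAsset` stand in for the existing classes in `dandi/files.py`, and none of this is the final implementation):

```python
from typing import Optional

class LocalFileAsset:            # placeholder for the existing dandi.files class
    ...

class NWBAsset(LocalFileAsset):  # placeholder for the existing dandi.files class
    ...

class DatasetDescriptionAsset(LocalFileAsset):
    """Represents a dataset_description.json file."""

class BIDSFileAsset(LocalFileAsset):
    """Any other file under a directory containing a dataset_description.json."""
    dataset_description: DatasetDescriptionAsset

class NWBBidsAsset(NWBAsset, BIDSFileAsset):
    """An NWB file inside a BIDS dataset, with any custom behavior needed."""

def dandi_file(path, dataset_description: Optional[DatasetDescriptionAsset] = None):
    """Placeholder signature: None means the file is not inside a BIDS dataset."""
    ...
```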
Sounds good for a start! I dislike a little the implicit assumption that a `DatasetDescriptionAsset` implies having a BIDS dataset. I guess we can start with that since I don't have a better idea ATM. Let's just make it more explicit: `BIDSDatasetDescriptionAsset` and `bids_dataset_description`.
My main concern is the combinatoric `NWBBidsAsset(NWBAsset, BIDSFileAsset)`, since we might then need to also do it for any other asset type, such as `VideoAsset`, and get `VideoBidsAsset`. I guess we can keep doing this until we run into some real explosion which would require establishing some more flexible construct instead. But what would you say about having a `CompositeBIDSFileAsset(BIDSFileAsset)` which would just be parametrized at instantiation with an instance of the underlying `NWBAsset` (or `VideoAsset`), which could be asked for its metadata/validation? This way it would just delegate to some "underlying" data type consistently across all such combinations without requiring multiple inheritance. WDYT?
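A rough sketch of that delegation alternative, reusing the placeholder classes from the sketch above (the `get_metadata` method name is illustrative, not the actual `dandi.files` API):

```python
class CompositeBIDSFileAsset(BIDSFileAsset):
    """Wraps the format-specific asset instead of inheriting from it."""

    def __init__(self, inner: LocalFileAsset) -> None:
        self.inner = inner  # e.g. an NWBAsset or VideoAsset instance

    def get_metadata(self) -> dict:
        # delegate to the underlying asset; BIDS-path metadata could be merged in here
        return self.inner.get_metadata()
```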
NB I would prefer consistent capitalization of BIDS, so to have `NWBBIDSAsset`, even though parsing between the two acronyms becomes impossible.
@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect `isinstance(nwb, NWBAsset)` and `isinstance(nwb, BIDSFileAsset)` to both be true.
> Let's just make it more explicit ... `bids_dataset_description`.

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.
> > Let's just make it more explicit ... `bids_dataset_description`.
>
> Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.

No, I am just suggesting adding BIDS to our class/option names, e.g. in the `dataset_description: Optional[DatasetDescriptionAsset] = None` argument you suggested.
> @yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect `isinstance(nwb, NWBAsset)` and `isinstance(nwb, BIDSFileAsset)` to both be true.

hm... ok... but then let's still have the logic of the behavior centralized; maybe it should be a mixin, `NWBBidsAsset(NWBAsset, BIDSFileAsset, CompositeBIDSFileAssetMixIn)`, where it would be the `CompositeBIDSFileAssetMixIn` centralizing that logic of preferring `BIDSFileAsset`-extracted metadata and doing the validation for consistency?
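A sketch of how such a mixin could centralize that logic, again building on the placeholder classes above (the `get_bids_metadata`/`get_format_metadata` names are hypothetical, only the class names come from this thread):

```python
class CompositeBIDSFileAssetMixIn:
    """Shared logic: prefer BIDS-extracted metadata, fall back to the format-specific extractor."""

    def combined_metadata(self) -> dict:
        bids_meta = self.get_bids_metadata()      # hypothetical: path/sidecar-derived metadata
        format_meta = self.get_format_metadata()  # hypothetical: e.g. NWB-derived metadata
        merged = dict(format_meta)
        # BIDS-extracted values win wherever they are present
        merged.update({k: v for k, v in bids_meta.items() if v is not None})
        return merged

class NWBBidsAsset(NWBAsset, BIDSFileAsset, CompositeBIDSFileAssetMixIn):
    pass
```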
@jwodder will work on RFing code per above, @TheChymera please point him to the interfaces for BIDS validation and metadata extraction he should use.
@yarikoptic @TheChymera Question: What happens if a directory contains a `dataset_description.json` file and one of its (possibly nested) subdirectories also contains a `dataset_description.json` file? Does the inner file start a new BIDS dataset, or is it treated as just another file of the outer dataset?
Treat it as the "start of a new BIDS dataset". It will be for the validator to decide whether that is a permitted location to have such a nested BIDS dataset (some, like `rawdata/` or any directory under a `derivatives/` folder, are ok to be nested BIDS datasets, and in turn have other nested BIDS datasets).
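An illustrative sketch of that rule (not the actual `find_dandi_files()` code): every directory containing a `dataset_description.json`, nested or not, counts as the root of a BIDS dataset, and a file belongs to the innermost such root above it.

```python
from pathlib import Path
from typing import List, Optional

def find_bids_roots(dandiset_root: Path) -> List[Path]:
    # nested roots (e.g. under derivatives/ or rawdata/) are reported as well;
    # whether nesting is permitted there is left to the validator
    return sorted(p.parent for p in dandiset_root.rglob("dataset_description.json"))

def bids_root_for(path: Path, roots: List[Path]) -> Optional[Path]:
    # the innermost root wins, so files of a nested dataset are not attributed to the outer one
    candidates = [r for r in roots if r in path.parents]
    return max(candidates, key=lambda r: len(r.parts), default=None)
```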
@jwodder from the point of view of the pilot DANDI support we have just merged, both would be detected as candidates, and the validation would be run for both. However, running the validator on the parent directory will also index the child directory and will flag any duplicate “top level” files (such as README) as invalid.
This is a repercussion of top level matches being necessarily unique, i.e. unlike entity files there should only ever be one file matching one pattern...
Ultimately this is a shortcoming of the validator path selection and not of our DANDI wrapper. I'll work on fixing it in the validator code.
@TheChymera So, if I have a `dataset_description.json` file and a list of all files in a BIDS dataset:
@jwodder if the `dataset_description.json` file is inside the BIDS directory, you would just point the validator to the BIDS directory, e.g.:
```
chymera@decohost ~/src/bids-examples $ dandi validate-bids asl003/
2022-07-19 16:53:10,234 [ WARNING] A newer version (0.45.1) of dandi/dandi-cli is available. You are using 0.44.1+45.g2ed8929.dirty
2022-07-19 16:53:10,384 [ INFO] Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-07-19 16:53:10,384 [ INFO] NumExpr defaulting to 8 threads.
2022-07-19 16:53:11,608 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
All filenames are BIDS-valid and no mandatory files are missing.
2022-07-19 16:53:12,824 [ INFO] Logs saved in /home/chymera/.cache/dandi-cli/log/20220719205309Z-9230.log
```
If you do not have the directory at hand and just a list of paths which do not exist on your system, this will only be possible after we have merged `dummy_path` support:
> How does the DANDI metadata for the assets differ from non-BIDS assets?
Currently it differs in that the keys are not fully homogenized with the DANDI nomenclature. I did a bit of standardizing here, but there are a few more keys coming in a new PR soon.
@TheChymera

@jwodder

> I meant, how would I do the validation using the Python API, not the CLI?

currently that can't be done on :master because we need the actual paths; however, as soon as the `dummy_paths` parameter is in, you could do it as I have in this PR (it's the same PR which implements dummy path support ahead of upstream):

https://github.com/dandi/dandi-cli/blob/1ae0a3b2a8ab407b9f7347ba0d7e33c3b627fc87/dandi/bids_utils.py#L116-L131

> How should I generate the metadata for BIDS assets?

The validator returns a list of dictionaries for all matches found, where the keys are BIDS-standard (but not DANDI-standard) metadata fields; that is the standardization I previously mentioned: https://github.com/dandi/dandi-cli/blob/1ae0a3b2a8ab407b9f7347ba0d7e33c3b627fc87/dandi/bids_utils.py#L134
> > I meant, how would I do the validation using the Python API, not the CLI?
>
> currently that can't be done on :master because we need the actual paths

AFAIK @jwodder would have access to actual paths for any purpose of validation or upload
> AFAIK @jwodder would have access to actual paths for any purpose of validation or upload

Then it's as simple as:
```
chymera@decohost ~ $ cat lala.py
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")
for i in result["match_listing"]:
    print(i["path"])
    try:
        print(i["subject"])
        print(i["session"])
    except:
        pass
chymera@decohost ~ $ python lala.py
2022-07-19 17:48:55,647 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Sub1
None
```
@yarikoptic Should the validation errors for a BIDS dataset be attached to the `dataset_description.json` object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.
@yarikoptic @satra @TheChymera Also, the `get_validation_errors()` method of asset objects takes a `schema_version` argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?
> Is there a way to map this version to a BIDS schema version?

@jwodder - there isn't, since those are two distinct schemas. just like NWB has a version, i would keep the bids version as described in the `dataset_description.json`. and if you are referring to the bids schema as implemented in dandi, one could add validation information, including the dandi-specific bids schema version, in the provenance section of the metadata (`wasGeneratedBy`) since that is a list of activities.
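For example, a hypothetical shape such a provenance record could take (field names loosely follow dandischema's Activity/Software models; this is only an assumption-laden sketch, not an agreed format):

```python
# hypothetical sketch: record the BIDS validation step (and the dandi-specific
# BIDS schema version used) as an Activity appended to the asset's wasGeneratedBy list
bids_validation_activity = {
    "schemaKey": "Activity",
    "name": "BIDS validation",
    "wasAssociatedWith": [
        {
            "schemaKey": "Software",
            "name": "dandi-cli BIDS validator",
            "version": "0.44.1",  # assumed: the dandi-cli version that ran the validation
        }
    ],
    # assumed free-form note carrying the dandi-specific BIDS schema version
    "description": "validated against BIDS schema 1.7.0+369",
}
```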
> @yarikoptic Should the validation errors for a BIDS dataset be attached to the `dataset_description.json` object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.

Short overall answer before I outline a possible design I see: I think errors associated with paths should be available (assigned or query-able) for those specific paths, and it should be possible to get all errors across all paths of the dataset (#943 relates, as it might need tune-up depending on what we come up with here or there ;)).
ATM, `upload` allows skipping "invalid" files while uploading "valid" files. Originally it was done because validation was done "in-line" with upload -- it was taking time to validate, so we did not pre-validate all files to say that the entire dataset is bad, but discovered/reported bad assets in the course of the upload. Moreover, validation of `dandiset.yaml` was done independently from all other files, although it is more of a "Dandiset-level validation". For BIDS we can still do the same, since the current validation can take an individual path at a time, I believe. But overall we might want to step away from that practice.

With that in mind, even though in the short term we could associate errors with some superficial `DatasetDescriptionAsset` to signal that it is a BIDS dataset, we might better provide a more explicit structure.
> @yarikoptic @satra @TheChymera Also, the `get_validation_errors()` method of asset objects takes a `schema_version` argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?

don't worry about BIDS version here! And looking at how we "use" `schema_version` in `get_validation_errors`, it seems that the ability to provide a version is "superficial", since we simply fail to validate if the specified version is not current... And in general we rely on `dandischema` to do a possible upgrade to the current dandischema version if the version in the record is outdated. So I feel that maybe the ability to specify a version is "legacy" (@satra -- do you remember anything on that?).

For `validate-bids` I believe we provided the ability to specify a BIDS schema version, since we keep working on the BIDS spec to fit our needs, so we sometimes "cheat" in that we provide another (our) version of the BIDS schema to use.
@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR. For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the `dataset_description.json` asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the `dataset_description.json` asset?
> @yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR.

done - #1082

> For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the `dataset_description.json` asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the `dataset_description.json` asset?

For consistency with how NWBs are handled, and since ATM it is allowed -- easiest is to do it at the point of getting to that file during upload. It would cause some avoidable duplicate "find the root of the BIDS dataset" work, but that could (if not already) be cached, so hopefully it shouldn't add too much penalty. In the course of the #1082 refactoring, as I have mentioned, for BIDS such validation might need to happen at the level of the BIDS dataset first, before any file from it gets considered for upload. We might even later introduce a policy/behavior to disallow upload of DANDI or BIDS datasets altogether by default if any file fails validation (that is not the current behavior).
@TheChymera Does the structure returned by `validate_bids()` currently contain any information on asset-specific validation errors, or are all the errors dataset-wide things like required files being missing? If the former, how do I extract the relevant information from the return value?
ping @TheChymera. After all the current BIDS work is merged, we really need to work on establishing a more "accessible" interface to BIDS validation (proper "error records", #943) -- the currently returned `dict` is hard to "understand".
@jwodder - consider the following keys in the returned dict (and @TheChymera please expand/fix if I am incomplete/wrong in any of that):

- `path_listing` -- should have the paths which were "tested". So if you gave a folder for validation, this would list all paths which were tested
- `path_tracking` -- if a path is listed in that list, it is invalid. So these are the per-asset errors
- `schema_tracking` -- dataset-level errors for missing required assets among those which were given to be validated

The deeper meaning of those keys could be gathered by looking at the reporting code at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/bids_validator_xs.py#L513
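A minimal sketch of consuming those keys, assuming only the semantics described above (the exact shape of the entries may differ):

```python
from dandi.validate import validate_bids

result = validate_bids("path/to/bids-dataset")

tested = result.get("path_listing", [])             # all paths that were tested
invalid = result.get("path_tracking", [])           # per-asset errors: paths left unmatched
dataset_errors = result.get("schema_tracking", [])  # dataset-level errors (missing required files)

print(f"{len(tested)} paths tested, {len(invalid)} invalid")
for p in invalid:
    print(f"  invalid path: {p}")
if dataset_errors:
    print(f"{len(dataset_errors)} required pattern(s) not satisfied")
```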
@TheChymera @yarikoptic Should files listed in `path_tracking` be treated as failing validation? The code here implies that such a state is just a warning rather than an error.
Yes AFAIK, treat as error
> @TheChymera please expand/fix if I am incomplete/wrong in any of that

the description is correct. Sorry if it seemed convoluted; what it's doing is that the "tracking" list is eroded with each match.
@TheChymera If I already have the paths of all the BIDS assets in the dataset, is there a preferred way to call the validation code with those paths so that it doesn't fetch the paths again, or do I have to modify what's already in `bids_validator_xs.py`?
@jwodder you can pass a list of files to `validate_bids()` under the `bids_paths` parameter.

The get-file logic will be run, but it will bypass most of the checks: https://github.com/dandi/dandi-cli/blob/468e39c9c3f7962e5078e9bdc4b6ad1f16fa4ee8/dandi/support/bids/validator.py#L61-L63

If you want to not run the get-file logic at all (in which case you can also use paths which don't exist locally), you can use the `dummy_path` mechanic I described earlier.
I believe I've properly integrated validation in b83ba3f.
@TheChymera @satra @yarikoptic Could someone please explain exactly how to acquire BIDS metadata and integrate it into Dandi metadata?
@jwodder I had some basic support for that here:

Basically the keys in the `result["match_listing"]` dictionaries returned by the bundled module are just changed to the DANDI equivalent.
@TheChymera That involves running `validate_bids()` on just the individual path. If I've already run it on all the assets in the BIDS dataset at once, how do I get an individual asset's metadata out of the result?
@jwodder to do that you need a set of keys the values of which unambiguously identify the file; `path` is of course the obvious one, though it can be larger sets as well.
```
chymera@decohost ~ $ cat /tmp/lala.py
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")
print("These are the valid paths:")
for i in result["match_listing"]:
    print(i["path"])
selection = "/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz"
print(f"Now let's take one and get the info: {selection}")
print([i for i in result["match_listing"] if i["path"] == selection])
chymera@decohost ~ $ python /tmp/lala.py
2022-07-27 16:15:57,799 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
These are the valid paths:
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Now let's take one and get the info: /home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
[{'subject': 'Sub1', 'session': None, 'acquisition': None, 'reconstruction': None, 'direction': None, 'run': None, 'path': '/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz'}]
```
@TheChymera @yarikoptic @satra The code linked by Horea above only applies BIDS metadata to NWB files, and it replaces the normal NWB metadata. Are either of those things that we want to do/keep doing?
the normal metadata for NWB should not be replaced nor should the bids metadata be applied to NWB files.

if it's bids, then there is the possibility of an NWB file inside it, so it should also do the related NWB processing for those files.

however, setting that aside, any bids path should result in metadata that augments all the standard metadata we have for an asset. so similar to `nwb2asset`, we will want a `bidspath2asset`. this latter function also needs to handle a `participants.tsv` file and, if present, `sessions.tsv` and `samples.tsv` files to fill in the relevant pieces of the asset metadata.

the original use of the metadata dictionary was to support organize, and that would be irrelevant for a bids dandiset.
@jwodder

> only applies BIDS metadata to NWB files,

From my local testing with `ls` it seemed to source the data of other files as well, e.g. `.nii.gz`.

> it replaces the normal NWB metadata

Yes, this is true. It doesn't replace it, though; it just doesn't go on to source the NWB data at all if it's a BIDS dataset. Perhaps this isn't such a bad idea? i.e. if it's a BIDS dataset, we source the information via the BIDS standard.
@satra So:

- Given a BIDS asset that is not an NWB file, how should I determine the metadata?
- Given a BIDS asset that is also an NWB file, how should I determine the metadata?
@TheChymera

> From my local testing with `ls` it seemed to source the data of other files as well, e.g. `.nii.gz`.

`dandi ls` does not accurately reflect metadata for non-NWB files, as it uses the `get_metadata()` function, which the upload routines only use for NWBs.
nwb files have had two notions of metadata: one that is extracted from the file, and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

for the current flow: nwb --> metadata --> Asset metadata

for bids it would be:

bidspath --> metadata --> asset metadata
bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

perhaps the refactoring can simply create one metadata structure, the Asset structure, which can then be used for other things (ls, organize, validate, etc.).
@satra

> nwb files have had two notions of metadata, one that is extracted from the file and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

I'm talking about the final `dandischema.BareAsset` value.
> for bids it would be:
> bidspath --> metadata --> asset metadata
> bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

Describe these steps in detail.
> Describe these steps in detail.

i would leave that to @TheChymera and can review it, but describing those steps would be almost equivalent to writing the code. essentially one has to follow the details of the bids-specification here and map each field or piece of information into the bareasset structure.
> @satra So:
>
> - Given a BIDS asset that is not an NWB file, how should I determine the metadata?

This would be for @TheChymera to clarify.

> - Given a BIDS asset that is also an NWB file, how should I determine the metadata?

Given the current (still absent overall) agreement in https://github.com/bids-standard/bids-specification/pull/761#issuecomment-1183710164 about overloading of metadata in side-car files (and thus in the filepath), I would say: for extraction (not validation), take the metadata contained in the .nwb file, and only complement it (add fields which have no non-None value) with the metadata from the BIDS filename (we aren't even getting anything from the sidecar yet, right @TheChymera?)
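A minimal sketch of that rule (flat dicts with illustrative field names; the real code would operate on the extracted metadata / `BareAsset` fields):

```python
from typing import Any, Dict

def complement_with_bids(nwb_meta: Dict[str, Any], bids_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Prefer values extracted from the .nwb file; fill in only fields that are
    missing/None using metadata derived from the BIDS filename."""
    merged = dict(nwb_meta)
    for key, value in bids_meta.items():
        if merged.get(key) is None and value is not None:
            merged[key] = value
    return merged

# e.g. subject_id from the NWB wins; session_id comes from BIDS only because the NWB lacks it
merged = complement_with_bids(
    {"subject_id": "Sub1", "session_id": None},
    {"subject_id": "sub-Sub1", "session_id": "ses-01"},
)
assert merged == {"subject_id": "Sub1", "session_id": "ses-01"}
```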
@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?
> @yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?

per my comment above, let's do it from the .nwb for now, for consistency (unless empty there). It would be for (our) validator to complain if there is an inconsistency, and it will be for .nwb overloading (WiP) to be used to "fix it" if needed.
@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields:

- `subject` (currently renamed to `subject_id`)
- `session` (currently renamed to `session_id`)
- `path`
- `acquisition`
- `reconstruction`
- `direction`
- `run`
- `bids_schema_version`

Just where in a `dandischema.BareAsset` should these fields be mapped to? Aside from the first three fields, none of these seem to have any presence in the Dandi schema, and I believe they're currently discarded when uploading.
@yarikoptic

> we aren't even getting anything from sidecar yet,

yes, no files are being opened.

I don't have any strong opinions either way, really, but the reason why I suggested going BIDS-first (or rather path-first) is what I thought was consistency/usability. We don't get any metadata from the NIfTI headers either, so in addition to privileging NWB over BIDS we'd be privileging NWB over NIfTI. Not least of all, path metadata can be easily sourced without downloading any files... But if you disagree, don't mind me :) just explaining why I suggested BIDS first.
> @yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields: ...

Those are the fields for a file with that specific suffix. In total, BIDS files can have any of the following entities, though they might be `None` if none is given (most are optional), or not present as keys in the output if they would be invalid for the specific suffix:
```
chymera@darkhost ~/src/bids-specification/src/schema $ ag "^[a-z]*?:" objects/entities.yaml
6:acquisition:
28:atlas:
38:ceagent:
49:chunk:
58:density:
69:description:
79:direction:
88:echo:
101:flip:
113:hemisphere:
125:inversion:
137:label:
148:modality:
157:mtransfer:
171:part:
195:processing:
209:reconstruction:
218:recording:
228:resolution:
239:run:
256:sample:
268:session:
289:space:
307:split:
329:stain:
341:subject:
348:task:
358:tracer:
```
@jwodder I do not think we map much from BIDS to our metadata ATM. Just use what we have at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/metadata.py#L124

```python
_meta = validate_bids(path)
meta = _meta["match_listing"][0]
meta["bids_schema_version"] = _meta["bids_schema_version"]
meta = _rename_bids_keys(meta)
```

Maybe RF it into a function `get_bids_raw_metadata`. Then eventually we would add more key mappings into that `_rename_bids_keys`, I guess. And also extraction of `sex` is not there yet at all.
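A hedged sketch of that refactoring: wrap the snippet above into a helper (the name `get_bids_raw_metadata` is just the one proposed here; `validate_bids` and `_rename_bids_keys` are the existing functions referenced above, assumed to be in scope in `dandi/metadata.py`):

```python
def get_bids_raw_metadata(path: str) -> dict:
    # run the BIDS validator on the single path and take its (only) match record
    _meta = validate_bids(path)
    meta = _meta["match_listing"][0]
    meta["bids_schema_version"] = _meta["bids_schema_version"]
    # rename BIDS keys (subject -> subject_id, session -> session_id, ...) to DANDI ones
    return _rename_bids_keys(meta)
```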
@yarikoptic - for mapping to bareasset the function should map subject, session, modality info. i would suggest a bids2asset function similar to the nwb2asset function.
@yarikoptic Using that code will result in only subject and session being set, as those are the only keys that are mapped to something that `prepare_metadata()` (née `metadata2asset()`) recognizes. Are you sure that's what you want?
@satra

> for mapping to bareasset the function should map subject, session, modality info.

Where/how is modality represented in the Dandi schema?

> i would suggest a bids2asset function similar to the nwb2asset function.

Assuming this function returns a `BareAsset` instance, do you have a recommendation for how to combine a BIDS `BareAsset` and an NWB `BareAsset` in accordance with this comment of Yarik's?
#1011 is adding basic metadata extraction for BIDS datasets. What is introduced in #1011 is more of a "hack" than a proper addition of support for BIDS datasets. It is "ad-hoc" in part due to the clear separation of "asset types" in https://github.com/dandi/dandi-cli/blob/HEAD/dandi/files.py, e.g. `NWBAsset` (with custom metadata extraction and validation) vs `VideoAsset` (nothing special ATM) to `GenericAsset` (really nothing special ;)). With the introduction of support for BIDS datasets it gets tricky:

- a BIDS dataset is announced by a `dataset_description.json` with `BIDSVersion` in it
- so a dandiset can contain multiple BIDS (sub)datasets
- metadata for `sub-1_slice-1.nwb` would likely come from `sub-1_slice-1.overwrite.json` (if present)
- for `sub-1_slice-1.nwb` it could also come from `sub-1_slice-1.json`
- we would want the `validator` to complain whenever there is incongruence between different sources of metadata

@jwodder -- how `files.py` and anything else needed should be refactored so we support such multiple sources of metadata: file-format-based + BIDS.