dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

RF "files.py" validation/metadata-loading to support BIDS #1044

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago

#1011 adds basic metadata extraction for BIDS datasets. What was introduced in #1011 is more of a "hack" than a proper addition of support for BIDS datasets. It is ad-hoc in part due to the clear separation of "asset types" in https://github.com/dandi/dandi-cli/blob/HEAD/dandi/files.py, e.g. NWBAsset (with custom metadata extraction and validation) vs VideoAsset (nothing special ATM) vs GenericAsset (really nothing special ;) ). With the introduction of support for BIDS datasets it gets tricky:

derivatives/<some subdatasets some of which are BIDS>/
rawdata/ - BIDS dataset

so a dandiset can contain multiple BIDS (sub)datasets

@jwodder -- how should files.py (and anything else needed) be refactored so that we support such multiple sources of metadata: file-format-based + BIDS?

jwodder commented 2 years ago

@yarikoptic I'm thinking something like this:

Note that this assumes that, if a Dandiset contains one or more BIDS datasets that aren't at the root, then we want any non-NWB, non-Zarr files outside of the BIDS datasets to not be uploaded as assets by default.

yarikoptic commented 2 years ago

Sounds good for a start! I dislike a little the implicit DatasetDescriptionAsset implying a BIDS dataset, but I guess we can start with that since I don't have a better idea ATM. Let's just make it more explicit: BIDSDatasetDescriptionAsset and bids_dataset_description.

My main concern is the combinatoric NWBBidsAsset(NWBAsset, BIDSFileAsset), since we might then need to also do it for any other asset type, such as VideoAsset, and get VideoBidsAsset. I guess we can keep doing this until we run into a real explosion that would instead require establishing some more flexible construct. But what would you say about having CompositeBIDSFileAsset(BIDSFileAsset), which would just be parametrized at instantiation with an instance of the underlying NWBAsset (or VideoAsset) that could be asked for its metadata/validation? This way it would just delegate to some "underlying" data type consistently across all such combinations without requiring multiple inheritance (see the sketch below). WDYT?
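
For illustration, a minimal sketch of that delegation idea (the class name CompositeBIDSFileAsset comes from the comment above; the underlying attribute and helper names are hypothetical, not existing dandi-cli API):

# Hypothetical sketch, not actual dandi-cli code: one BIDS wrapper class
# holds an "underlying" file-format asset and delegates to it.
class CompositeBIDSFileAsset:                # would derive from BIDSFileAsset
    def __init__(self, underlying=None):
        self.underlying = underlying         # e.g. an NWBAsset or VideoAsset

    def get_metadata(self) -> dict:
        meta = self._bids_filename_metadata()
        if self.underlying is not None:
            # delegate to the format-specific extractor and let it fill
            # whatever the BIDS path did not provide
            for key, value in self.underlying.get_metadata().items():
                meta.setdefault(key, value)
        return meta

    def _bids_filename_metadata(self) -> dict:
        return {}                            # entities parsed from the BIDS path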

NB I would prefer consistent capitalization of BIDS, so as to have NWBBIDSAsset, even though parsing the boundary between the two acronyms becomes impossible.

jwodder commented 2 years ago

@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect isinstance(nwb, NWBAsset) and isinstance(nwb, BIDSFileAsset) to both be true.

Let's just make it more explicit ... bids_dataset_description.

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.

yarikoptic commented 2 years ago

Let's just make it more explicit ... bids_dataset_description.

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.

no, I am just suggesting adding BIDS to our class/option names, as e.g. in the dataset_description: Optional[DatasetDescriptionAsset] = None argument you suggested.

@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect isinstance(nwb, NWBAsset) and isinstance(nwb, BIDSFileAsset) to both be true.

hm... ok... but then let's still have the logic of the behavior centralized; maybe it should be a mixin, NWBBidsAsset(NWBAsset, BIDSFileAsset, CompositeBIDSFileAssetMixIn), where the CompositeBIDSFileAssetMixIn centralizes the logic of preferring BIDSFileAsset-extracted metadata and doing the validation for consistency?
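
For concreteness, a rough, self-contained sketch of that arrangement (the stub base classes and helper method names are stand-ins, not the real dandi-cli classes or signatures; the merge policy follows this comment's "prefer BIDS-extracted" suggestion, which is revisited later in the thread):

# Illustrative sketch only -- stub classes stand in for the real asset types.
class NWBAsset:                              # stand-in for dandi.files.NWBAsset
    def get_format_metadata(self) -> dict:
        return {"subject_id": "mouse001", "session_id": None}

class BIDSFileAsset:                         # stand-in for the proposed BIDS asset
    def get_bids_metadata(self) -> dict:
        return {"subject_id": "Sub1", "session_id": "20220719"}

class CompositeBIDSFileAssetMixIn:
    """Centralizes how BIDS-aware assets combine the two metadata sources."""
    def get_metadata(self) -> dict:
        merged = dict(self.get_format_metadata())
        # prefer BIDS-extracted values when they are present
        merged.update({k: v for k, v in self.get_bids_metadata().items()
                       if v is not None})
        return merged

class NWBBIDSAsset(NWBAsset, BIDSFileAsset, CompositeBIDSFileAssetMixIn):
    pass

nwb = NWBBIDSAsset()
assert isinstance(nwb, NWBAsset) and isinstance(nwb, BIDSFileAsset)
print(nwb.get_metadata())   # {'subject_id': 'Sub1', 'session_id': '20220719'}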

yarikoptic commented 2 years ago

@jwodder will work on RFing the code per the above; @TheChymera, please point him to the interfaces for BIDS validation and metadata extraction that he should use.

jwodder commented 2 years ago

@yarikoptic @TheChymera Question: What happens if a directory contains a dataset_description.json file and one of its (possibly nested) subdirectories also contains a dataset_description.json file? Does the inner file start a new BIDS dataset, or is it treated as just another file of the outer dataset?

yarikoptic commented 2 years ago

Treat it as the start of a new BIDS dataset. It will be for the validator to decide whether that is a permitted location for such a nested BIDS dataset (some, like rawdata/ or any directory under the derivatives/ folder, are OK to be nested BIDS datasets, and may in turn have other nested BIDS datasets).
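
For illustration, a minimal sketch of that rule (every directory containing a dataset_description.json is treated as the root of its own, possibly nested, BIDS dataset; whether the nesting is permitted is left to the validator -- this is not the actual dandi-cli implementation):

from pathlib import Path

def find_bids_dataset_roots(root: str) -> list:
    """Return every directory under ``root`` that starts a BIDS dataset."""
    return sorted(p.parent for p in Path(root).rglob("dataset_description.json"))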

TheChymera commented 2 years ago

@jwodder from the point of view of the pilot DANDI support we have just merged, both would be detected as candidates, and the validation would be run for both. However, running the validator on the parent directory will also index the child directory and will flag any duplicate “top level” files (such as README) as invalid.

This is a repercussion of top level matches being necessarily unique, i.e. unlike entity files there should only ever be one file matching one pattern...

Ultimately this is a shortcoming of the validator path selection and not of our DANDI wrapper. I'll work on fixing it in the validator code.

TheChymera commented 2 years ago

Fix incoming: https://github.com/bids-standard/bids-specification/pull/1145

jwodder commented 2 years ago

@TheChymera So, if I have a dataset_description.json file and a list of all files in a BIDS dataset:

TheChymera commented 2 years ago

@jwodder if the dataset_description.json file is inside the BIDS directory, you would just point the validator to the BIDS directory, e.g.:

chymera@decohost ~/src/bids-examples $ dandi validate-bids asl003/
2022-07-19 16:53:10,234 [ WARNING] A newer version (0.45.1) of dandi/dandi-cli is available. You are using 0.44.1+45.g2ed8929.dirty
2022-07-19 16:53:10,384 [    INFO] Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-07-19 16:53:10,384 [    INFO] NumExpr defaulting to 8 threads.
2022-07-19 16:53:11,608 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
All filenames are BIDS-valid and no mandatory files are missing.
2022-07-19 16:53:12,824 [    INFO] Logs saved in /home/chymera/.cache/dandi-cli/log/20220719205309Z-9230.log

If you do not have the directory at hand and just have a list of paths which do not exist on your system, this will only be possible after we have merged dummy_path support:

How does the DANDI metadata for the assets differ from non-BIDS assets?

Currently it differs in that the keys are not fully homogenized with the DANDI nomenclature. I did a bit of standardizing here, but there are a few more keys coming in a new PR soon.

jwodder commented 2 years ago

@TheChymera

TheChymera commented 2 years ago

@jwodder

I meant, how would I do the validation using the Python API, not the CLI?

currently that can't be done on :master because we need the actual paths; however, as soon as the dummy_paths parameter is in, you could do it as I have in this PR (it's the same PR which implements dummy path support ahead of upstream): https://github.com/dandi/dandi-cli/blob/1ae0a3b2a8ab407b9f7347ba0d7e33c3b627fc87/dandi/bids_utils.py#L116-L131

How should I generate the metadata for BIDS assets?

The validator returns a list of dictionaries for all matches found, where the keys are BIDS-standard (but not DANDI-standard) metadata fields; that is the standardization I previously mentioned: https://github.com/dandi/dandi-cli/blob/1ae0a3b2a8ab407b9f7347ba0d7e33c3b627fc87/dandi/bids_utils.py#L134

yarikoptic commented 2 years ago

I meant, how would I do the validation using the Python API, not the CLI?

currently that can't be done on :master because we need the actual paths

AFAIK @jwodder would have access to actual paths for any purpose of validation or upload

TheChymera commented 2 years ago

AFAIK @jwodder would have access to actual paths for any purpose of validation or upload

Then it's as simple as:

chymera@decohost ~ $ cat lala.py
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")
for i in result["match_listing"]:
    print(i["path"])
    try:
        print(i["subject"])
        print(i["session"])
    except KeyError:
        pass
chymera@decohost ~ $ python lala.py
2022-07-19 17:48:55,647 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Sub1
None

jwodder commented 2 years ago

@yarikoptic Should the validation errors for a BIDS dataset be attached to the dataset_description.json object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.

jwodder commented 2 years ago

@yarikoptic @satra @TheChymera Also, the get_validation_errors() method of asset objects takes a schema_version argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?

satra commented 2 years ago

Is there a way to map this version to a BIDS schema version?

@jwodder - there isn't, since those are two distinct schemas. just like NWB has a version, i would keep the bids version as described in the dataset_description.json. and if you are referring to the bids schema as implemented in dandi, one could add validation information, including dandi-specific bids schema version, in the provenance section of the metadata (wasGeneratedBy) since that is a list of activities.

yarikoptic commented 2 years ago

@yarikoptic Should the validation errors for a BIDS dataset be attached to the dataset_description.json object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.

Short overall answer before I outline a possible design: I think errors associated with paths should be available (assigned or query-able) for those specific paths, and it should be possible to get all errors across all paths of the dataset (#943 relates, as it might need a tune-up depending on what we come up with here or there ;)).

ATM, upload allows skipping "invalid" files while uploading "valid" files. Originally this was done because validation was done "in-line" with upload -- it was taking time to validate, so we did not pre-validate all files to declare the entire dataset bad, but discovered/reported bad assets in the course of the upload. Moreover, validation of dandiset.yaml was done independently from all other files, although it is more of a "Dandiset-level validation". For BIDS we can still do the same, since the current validation can take an individual path at a time, I believe. But overall we might want to step away from that practice. With that in mind, even though in the short term we could associate errors with some superficial DatasetDescriptionAsset to signal that it is a BIDS dataset, we might be better off providing a more explicit structure.

Extracted as a separate issue now: https://github.com/dandi/dandi-cli/issues/1082. I thought I had described it here or elsewhere but failed to find it, so here is my thinking. I think we overall have to work with the following structure. I have not thought yet about how and whether it should relate to existing Dandiset interfaces; maybe consider them all below `Local` as in `files.py`:

- `Dandiset`
  - a `Dandiset` must have `dandiset.yaml` (`DandisetMetadataFile`), which we know how to validate; it represents metadata about the entire Dandiset
  - a `Dandiset` contains either a DANDIDataset or a BIDSDataset (AFAIK **cannot** be both, @satra please confirm that our thinking is aligned). I might even say we must **not** have any "abstract" Dataset -- they all should be DANDIDataset or BIDSDataset(s). Default (if not BIDS) is DANDIDataset.
  - I decided on "contains" instead of just subclassing from Dandiset because a nested BIDSDataset might not "contain" `dandiset.yaml`. But maybe those specific ones could be subclasses (thus `DANDIDandiset` and `BIDSDandiset`), with a PlainBIDSDataset implementing the Dandiset-independent logic (such as validation), and then `BIDSDandiset(Dandiset, PlainBIDSDataset)` to mix them in. For now I will continue with the `Dataset` delegated flavors below:
- `DANDIDataset` is the one `dandi organize` renames into, and the layout (directory/filenames) of which we do not yet validate AFAIK, although we should (I noted that in my reply at https://github.com/dandi/dandi-cli/issues/1074#issuecomment-1192866526). Errors will be associated with specific paths.
- `BIDSDataset` (like 000026) allows for nested BIDSDatasets underneath (e.g. under `derivatives/{name}` or `rawdata/` or somewhere under `sourcedata/`). Currently the code finds them anywhere (not necessarily in those locations) and validates them independently. Whether it is a BIDSDataset (and not a DANDIDataset) is decided by the presence of `dataset_description.json`.
- Each asset should be associated with some specific (not just an abstract Dataset) BIDSDataset or DANDIDataset, even if that asset is not "valid".
- A BIDSDataset might be part of another BIDSDataset; `BIDSDataset.subdatasets: dict` could provide the mapping.
- Each Dataset could have some validation errors associated purely with it, which cannot relate to specific asset(s). An example -- an error about a *missing* required asset (e.g. `README` with some extension, required in BIDS, is missing).
- Some validation errors might relate to groups of assets (e.g. some inconsistency across files). I am not sure yet how/where we should keep them; probably at the Dataset level, as just a record containing all those paths.
- We should be able to tell which errors are for a given asset.
- Given all the above, I feel that we should collect errors at the level of the Dataset and allow querying for errors associated with a specific sub-path (which might be within a subdataset in BIDS).

So I am thinking that for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (location of `dandiset.yaml`) and its Dataset (BIDS or DANDI), and subdatasets (if paths are provided, for those paths). If there is no Dataset -- nothing to upload. If no paths are given, it will be up to the type of the dataset to figure out which paths to pick up (like we pick up .nwb and video files for a DANDIDataset).

Then, before upload, `Dandiset.validate` would do its `self.DandisetMetadataFile-instance.validate()` followed by `self.Dataset-instance.validate()` (which for `DANDIDataset` would do nothing ATM; for BIDSDataset it might do something in the future, like maybe a full dataset validation). Some method like `.discover_local_assets(path: list[str] = None)` should be available to associate/discover optionally provided paths with the corresponding Datasets; it might be BIDS vs DANDI dependent, and would thus build up the tree. `Dandiset.validate_asset(path: str)` could lazily (unless full validation already ran and that path was marked "ok" or already has results associated) initiate the corresponding Dataset (if needed, e.g. for a nested BIDSDataset), trigger the corresponding (depending on the Dataset type and the type of the asset, e.g. NWBAsset) `LocalAsset.validate()`, store the result of the validation within the corresponding `Dataset`, and return the validity status for upload to use.

Relating to the WiP in #1076, `find_dandi_files` and `dandi_file` should become methods of that `Dandiset` (they already take an albeit optional `dandiset_path`; if we figure out the dandiset, which must exist, first, they can become methods). There it should figure out the `Dataset` corresponding to the path and associate it with that `Asset`. To keep all validation results in the corresponding Dataset, maybe `LocalAsset.validate` should request to update its record within its `Dataset`. Something like that. WDYT? (A rough sketch of this structure follows below.)
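
A rough, non-binding sketch of that containment structure (class and attribute names follow the prose above; none of this is existing dandi-cli code):

from dataclasses import dataclass, field
from pathlib import Path
from typing import Union

@dataclass
class Dataset:
    path: Path
    # errors that cannot be pinned to one asset (e.g. a missing required README)
    dataset_errors: list = field(default_factory=list)
    # errors keyed by relative asset path, query-able per path
    asset_errors: dict = field(default_factory=dict)

    def errors_for(self, relpath: str) -> list:
        return self.asset_errors.get(relpath, [])

@dataclass
class DANDIDataset(Dataset):
    pass                                     # layout produced by `dandi organize`

@dataclass
class BIDSDataset(Dataset):
    # nested BIDS datasets, e.g. under derivatives/{name}/ or rawdata/
    subdatasets: dict = field(default_factory=dict)   # relpath -> BIDSDataset

@dataclass
class Dandiset:
    path: Path
    dataset: Union[DANDIDataset, BIDSDataset, None] = None   # never both

    def validate(self) -> None:
        # validate dandiset.yaml first, then delegate to the contained dataset
        ...
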
yarikoptic commented 2 years ago

@yarikoptic @satra @TheChymera Also, the get_validation_errors() method of asset objects takes a schema_version argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?

don't worry about the BIDS version here! Looking at how we "use" schema_version in get_validation_errors, it seems that the ability to provide a version is "superficial", since we simply fail to validate if the specified version is not the current one... And in general we rely on dandischema to do a possible upgrade to the current dandischema version if the version in the record is outdated. So I feel that maybe the ability to specify a version is "legacy" (@satra -- do you remember anything on that?). For validate-bids I believe we provided the ability to specify a BIDS schema version, since we keep working on the BIDS spec to fit our needs and so sometimes "cheat" in that we provide another (our own) version of the BIDS schema to use.

jwodder commented 2 years ago

@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR. For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the dataset_description.json asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the dataset_description.json asset?

yarikoptic commented 2 years ago

@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR.

done - #1082

For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the dataset_description.json asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the dataset_description.json asset?

For consistency with how NWBs are handled, and since ATM it is allowed -- easiest is to do it at the point of getting to that file during upload. It would cause some avoidable duplicate "find the root of the BIDS dataset" work, but that could (if not already) be cached, so hopefully it shouldn't add too much of a penalty. In the course of the #1082 refactoring, as I have mentioned, for BIDS such validation might need to happen at the level of the BIDS dataset first, before any file from it gets considered for upload. We might even later introduce a policy/behavior to disallow upload of DANDI or BIDS datasets altogether by default if any file fails validation (it is not the current behavior).
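
A hedged sketch of that "validate lazily at the first asset, cached and thread-locked" idea raised above (the class and method names are illustrative, not the actual upload code):

import threading

class BIDSDatasetDescriptionAsset:           # hypothetical asset class
    def __init__(self, bids_root):
        self.bids_root = bids_root
        self._lock = threading.Lock()
        self._errors = None                  # populated on first use

    def get_asset_errors(self, relpath: str) -> list:
        """Validate the whole BIDS dataset once (thread-safe), then return
        the errors recorded for one particular asset path."""
        with self._lock:
            if self._errors is None:
                self._errors = self._run_bids_validation()
        return self._errors.get(relpath, [])

    def _run_bids_validation(self) -> dict:
        # would call dandi.validate.validate_bids(self.bids_root) and regroup
        # its results by asset path; returning {} keeps the sketch runnable
        return {}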

jwodder commented 2 years ago

@TheChymera Does the structure returned by validate_bids() currently contain any information on asset-specific validation errors, or are all the errors dataset-wide things like required files being missing? If the former, how do I extract the relevant information from the return value?

yarikoptic commented 2 years ago

ping @TheChymera. After all the current BIDS work is merged into BIDS, we really need to work out a more "accessible" interface to BIDS validation (proper "error records", #943) -- the currently returned dict is hard to "understand".

@jwodder - consider the following keys in the returned dict, and @TheChymera please expand/fix if I am incomplete/wrong in any of that

The deeper meaning of those keys can be gathered by looking at the reporting code at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/bids_validator_xs.py#L513

jwodder commented 2 years ago

@TheChymera @yarikoptic Should files listed in path_tracking be treated as failing validation? The code here implies that such a state is just a warning rather than an error.

yarikoptic commented 2 years ago

Yes, AFAIK -- treat it as an error.

TheChymera commented 2 years ago

@TheChymera please expand/fix if I am incomplete/wrong in any of that

the description is correct. Sorry if it seemed convoluted; what it's doing is eroding the "tracking" list with each match.
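
In other words, something like the following (a sketch that assumes the validate_bids() return dict discussed above, where any paths still left in "path_tracking" never matched a BIDS pattern):

from dandi.validate import validate_bids

result = validate_bids("/path/to/bids-dataset")
unmatched = result.get("path_tracking", [])
# per the discussion above, leftover paths are treated as errors, not warnings
for path in unmatched:
    print(f"ERROR: {path} does not match any BIDS filename pattern")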

jwodder commented 2 years ago

@TheChymera If I already have the paths of all the BIDS assets in the dataset, is there a preferred way to call the validation code with those paths so that it doesn't fetch the paths again, or do I have to modify what's already in bids_validator_xs.py?

TheChymera commented 2 years ago

@jwodder you can pass a list of files to validate_bids() under the bids_paths parameter:

https://github.com/dandi/dandi-cli/blob/468e39c9c3f7962e5078e9bdc4b6ad1f16fa4ee8/dandi/support/bids/validator.py#L730

The get-file logic will be run, but it will bypass most of the checks: https://github.com/dandi/dandi-cli/blob/468e39c9c3f7962e5078e9bdc4b6ad1f16fa4ee8/dandi/support/bids/validator.py#L61-L63

If you do not want to run the get-file logic at all (in which case you can also use paths which don't exist locally), you can use the dummy_path mechanic I described earlier.
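
So, roughly (a sketch assuming the bids_paths parameter referenced above is forwarded by validate_bids(); the example paths are made up and not checked against a particular release):

from dandi.validate import validate_bids

known_files = [
    "sub-Sub1/perf/sub-Sub1_asl.nii.gz",
    "sub-Sub1/anat/sub-Sub1_T1w.nii.gz",
]
# reuse already-collected paths instead of letting the validator re-scan
result = validate_bids("~/src/bids-examples/asl003", bids_paths=known_files)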

jwodder commented 2 years ago

I believe I've properly integrated validation in b83ba3f.

@TheChymera @satra @yarikoptic Could someone please explain exactly how to acquire BIDS metadata and integrate it into Dandi metadata?

TheChymera commented 2 years ago

@jwodder I had some basic support for that here:

https://github.com/dandi/dandi-cli/blob/1c947365311732943753e15199a57c9bfd2759bf/dandi/metadata.py#L124-L128

Basically the keys in the result["match_listing"] dictionaries returned by the bundled module are just changed to their DANDI equivalents.

jwodder commented 2 years ago

@TheChymera That involves running validate_bids() on just the individual path. If I've already run it on all the assets in the BIDS dataset at once, how do I get an individual asset's metadata out of the result?

TheChymera commented 2 years ago

@jwodder to do that you need a set of keys whose values unambiguously identify the file; path is of course the obvious one, though it can be a larger set as well.

chymera@decohost ~ $ cat /tmp/lala.py 
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")

print("These are the valid paths:")
for i in result["match_listing"]:
    print(i["path"])

selection = "/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz"
print(f"Now let's take one and get the info: {selection}")
print([i for i in result["match_listing"] if i["path"] == selection])
chymera@decohost ~ $ python /tmp/lala.py 
2022-07-27 16:15:57,799 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
These are the valid paths:
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Now let's take one and get the info: /home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
[{'subject': 'Sub1', 'session': None, 'acquisition': None, 'reconstruction': None, 'direction': None, 'run': None, 'path': '/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz'}]

jwodder commented 2 years ago

@TheChymera @yarikoptic @satra The code linked by Horea above only applies BIDS metadata to NWB files, and it replaces the normal NWB metadata. Are either of those things that we want to do/keep doing?

satra commented 2 years ago

the normal metadata for NWB should not be replaced nor should the bids metadata be applied to NWB files.

if it's bids, then there is the possibility of an NWB file inside it, so it should also do the related NWB processing for those files.

however, setting that aside, any bids path should result in metadata that augments all the standard metadata we have for an asset. so similar to nwb2asset, we will want a bidspath2asset. this latter function also needs to handle a participants.tsv file and, if present, sessions.tsv and samples.tsv files to fill in the relevant pieces of the asset metadata (a rough sketch follows below).

the original use of the metadata dictionary was to support organize, and that would be irrelevant for a bids dandiset.
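
As a rough illustration of what the start of such a bidspath2asset helper could look like (reading participants.tsv to enrich path-derived metadata; the function and field names here are illustrative, and the mapping into dandischema.BareAsset is deliberately left out):

import csv
from pathlib import Path

def load_participants(bids_root: str) -> dict:
    """Return {participant_id: row} from participants.tsv, if present."""
    tsv = Path(bids_root) / "participants.tsv"
    if not tsv.exists():
        return {}
    with tsv.open(newline="") as f:
        return {row["participant_id"]: row
                for row in csv.DictReader(f, delimiter="\t")}

def bidspath2metadata(bids_root: str, path_metadata: dict) -> dict:
    """Augment metadata parsed from a BIDS path with participants.tsv info."""
    meta = dict(path_metadata)
    subject = meta.get("subject")
    if subject:
        # BIDS participant_id values carry the "sub-" prefix
        meta.update(load_participants(bids_root).get(f"sub-{subject}", {}))
    return meta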

TheChymera commented 2 years ago

@jwodder

only applies BIDS metadata to NWB files,

From my local testing with ls it seemed to source the data of other files as well, e.g. .nii.gz.

it replaces the normal NWB metadata

Yes, this is true. It doesn't replace it, though; it just doesn't go on to source the NWB data at all if it's a BIDS dataset. Perhaps this isn't such a bad idea? I.e., if it's a BIDS dataset we source the information via the BIDS standard.

jwodder commented 2 years ago

@satra So:

jwodder commented 2 years ago

@TheChymera

From my local testing with ls it seemed to source the data of other files as well, e.g. .nii.gz.

dandi ls does not accurately reflect metadata for non-NWB files, as it uses the get_metadata() function, which the upload routines only use for NWBs.

satra commented 2 years ago

nwb files have had two notions of metadata: one that is extracted from the file, and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

for current flow: nwb --> metadata --> Asset metadata

for bids it would be:
bidspath --> metadata --> asset metadata
bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

perhaps the refactoring can simply create one metadata structure, the Asset structure, which can then be used for other things (ls, organize, validate, etc.).

jwodder commented 2 years ago

@satra

nwb file have had two notions of metadata, one that is extracted from the file and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

I'm talking about the final dandischema.BareAsset value.

for bids it would be:
bidspath --> metadata --> asset metadata
bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

Describe these steps in detail.

satra commented 2 years ago

Describe these steps in detail.

i would leave that to @TheChymera and i can review it, but describing those steps would be almost equivalent to writing the code. essentially one has to follow the details of the bids-specification here and map each field or piece of information into the bareasset structure.

yarikoptic commented 2 years ago

@satra So:

  • Given a BIDS asset that is not an NWB file, how should I determine the metadata?

This would be for the @TheChymera to clarify.

  • Given a BIDS asset that is also an NWB file, how should I determine the metadata?

Given the current (absent yet, overall) agreement in https://github.com/bids-standard/bids-specification/pull/761#issuecomment-1183710164 about overloading of metadata in side-car files (and thus in the filepath), I would say: for extraction (not validation), take the metadata contained in the .nwb file, and only complement it (fill in fields which do not already have a non-None value) with the metadata from the BIDS filename (we aren't even getting anything from the sidecar yet, right @TheChymera?)
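
I.e., something along these lines (just a sketch of that complement-only merge, not existing code):

def complement_with_bids(nwb_meta: dict, bids_meta: dict) -> dict:
    """NWB-extracted fields win; BIDS filename entities only fill fields
    that are missing or None."""
    merged = dict(nwb_meta)
    for key, value in bids_meta.items():
        if merged.get(key) is None:          # absent or explicitly None
            merged[key] = value
    return merged

# example: subject_id stays from the NWB file, session comes from the BIDS path
print(complement_with_bids(
    {"subject_id": "mouse001", "session_id": None},
    {"subject_id": "Sub1", "session_id": "20220719"},
))
# -> {'subject_id': 'mouse001', 'session_id': '20220719'}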

jwodder commented 2 years ago

@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?

yarikoptic commented 2 years ago

@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?

per my comment above, let's take it from .nwb for now for consistency (unless it is empty there). It would be for (our) validator to complain if there is an inconsistency, and it will be for .nwb overloading (WiP) to be used to "fix it" if needed.

jwodder commented 2 years ago

@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields:

Where exactly in a dandischema.BareAsset should these fields be mapped? Aside from the first three fields, none of these seem to have any presence in the Dandi schema, and I believe they're currently discarded when uploading.

TheChymera commented 2 years ago

@yarikoptic

we aren't even getting anything from sidecar yet,

yes, no files are being opened.

I don't have any strong opinions either way, really, but the reason why I suggested going BIDS-first (or rather path-first) is what I thought was consistency/usability. We don't get any metadata from the NIfTI headers either, so in addition to privileging NWB over BIDS we'd be privileging NWB over NIfTI. Not least of all, path metadata can easily be sourced without downloading any files... But if you disagree, don't mind me :) just explaining why I suggested BIDS first.

TheChymera commented 2 years ago

@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields: ...

Those are the fields for a file with that specific suffix.

In total BIDS files can have any of the following entities, though they might be None if none is given (most are optional) or not present as keys in the output if they would be invalid for the specific suffix.

chymera@darkhost ~/src/bids-specification/src/schema $ ag "^[a-z]*?:" objects/entities.yaml 
6:acquisition:
28:atlas:
38:ceagent:
49:chunk:
58:density:
69:description:
79:direction:
88:echo:
101:flip:
113:hemisphere:
125:inversion:
137:label:
148:modality:
157:mtransfer:
171:part:
195:processing:
209:reconstruction:
218:recording:
228:resolution:
239:run:
256:sample:
268:session:
289:space:
307:split:
329:stain:
341:subject:
348:task:
358:tracer:

yarikoptic commented 2 years ago

@jwodder I do not think we map much from BIDS to our metadata ATM. Just use what we have at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/metadata.py#L124

        _meta = validate_bids(path)
        meta = _meta["match_listing"][0]
        meta["bids_schema_version"] = _meta["bids_schema_version"]
        meta = _rename_bids_keys(meta)

Maybe RF it into a function get_bids_raw_metadata. Then eventually we would add more key mappings into that _rename_bids_keys, I guess. Also, extraction of sex is not there yet at all.
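
I.e., something like this sketch of the suggested get_bids_raw_metadata refactoring (validate_bids and _rename_bids_keys are the existing helpers referenced above; the import location of _rename_bids_keys is assumed):

from dandi.validate import validate_bids
from dandi.metadata import _rename_bids_keys   # assumed location of the helper

def get_bids_raw_metadata(path: str) -> dict:
    """Run the BIDS validator on a single path and return its metadata
    record with keys renamed to the DANDI nomenclature."""
    result = validate_bids(path)
    meta = result["match_listing"][0]
    meta["bids_schema_version"] = result["bids_schema_version"]
    return _rename_bids_keys(meta)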

satra commented 2 years ago

@yarikoptic - for mapping to bareasset the function should map subject, session, modality info. i would suggest a bids2asset function similar to the nwb2asset function.

jwodder commented 2 years ago

@yarikoptic Using that code will result in only subject and session being set, as those are the only keys that are mapped to something that prepare_metadata() (née metadata2asset()) recognizes. Are you sure that's what you want?

@satra

for mapping to bareasset the function should map subject, session, modality info.

Where/how is modality represented in the Dandi schema?

i would suggest a bids2asset function similar to the nwb2asset function.

Assuming this function returns a BareAsset instance, do you have a recommendation for how to combine a BIDS BareAsset and an NWB BareAsset in accordance with this comment of Yarik's?