upload doesn't seem to be checking for valid DANDI (when not BIDS) organization of files/assets

satra commented 1 year ago

it seems that some aspects of dandiset naming/layout may be falling through the cracks. here is another example: https://dandiarchive.org/dandiset/000168/draft

as part of healthcheck if may be worthwhile considering if files should have a different name, but doesn't. this could capture a few things: 1) dandisets like this one, 2) dandisets that were uploaded in parts, and may need name resolution before upload.

yarikoptic commented 1 year ago

Per our brief discussion and possibly with my biased view over the situation, the first step would be to add to validate procedure for files which are not in BIDS dandiset, to conform our DANDI files layout which is at large conforming the description within https://github.com/dandi/dandi-cli/blob/master/dandi/cli/cmd_organize.py#L67 docstring and implemented in https://github.com/dandi/dandi-cli/blob/master/dandi/organize.py .

At large it is the fact that asset path should conform to sub-{labelregex}/sub-{labelregex}(_{entity}-{labelregex})*_{suffix}{extension} with also a check (separate dedicated error record) if sub-{labelregex} is identical within folder and file path.

https://github.com/dandi/dandi-cli/blob/master/dandi/organize.py#L397 _sanitize_value is the function which shows which charaters aren't allowed in {labelregex}.

Assigning to @jwodder to make it woven it in current logic on deciding either it is within BIDS or DANDI layout.

satra commented 1 year ago

indeed, as a first pass a more generic regex would work and should be implemented.

jwodder commented 1 year ago

@yarikoptic

This check should only apply to NWB files that are not in BIDS datasets, correct?
What is the format for the {suffix} field in the pattern in your last comment?
To be clear, this check should produce two validation errors — one for not following the pattern, another for when the folder and filename start with different sub-{labelregex} segments — where the fields of the ValidationResult are:
- id: ???
- origin: ValidationOrigin(name="dandi", version=dandi.__version__)
- severity: Severity.ERROR
- scope: Scope.FILE
- path: filepath of the asset
- asset_paths: None
- within_asset_paths: None
- dandiset_path: path to root of Dandiset
- dataset_path: None ?
- message: ???
- metadata: ???
- path_regex: None

yarikoptic commented 1 year ago

@yarikoptic

This check should only apply to NWB files that are not in BIDS datasets, correct?

short answer: yes, for starters it would apply only to NWB files.

Longer answer: eventually it should apply not only to .nwb files but also other data types, e.g. those video files we allow being referenced within NWB files and referencing in VIDEO_FILE_EXTENSIONS, and possibly other file formats. I just don't know if we decided on the exact file naming for those yet.

What is the format for the {suffix} field in the pattern in your last comment?

in organize ATM we build it https://github.com/dandi/dandi-cli/blob/HEAD/dandi/organize.py#L377

                if isinstance(value, (list, tuple)):
                    value = "+".join(map(str, value))

so let's make it smth like [a-z]+(\+[a-z]+)* ? (I don't think we ATM allow/have any upper case or digits)

To be clear, this check should produce two validation errors — one for not following the pattern, another for when the folder and filename start with different sub-{labelregex} segments

yes...

FWIW - here is current ids we have for validation results (I am leaning toward standardizing them all to be upper case so there is no thought needed on how to case them). We might want to adjust them later as well in terms of names to make them more analogous across BIDS and DANDI

```shell ❯ git grep "\

so let's make 3 of them

DANDI.NON_DANDI_FILENAME - when file is not following the pattern,
DANDI.NON_DANDI_FOLDERNAME - when folder is either not present (file flat in the top) or wrong level (e.g. dir1/dir2/filename) or not following sub-{subject} regex
DANDI.METADATA_MISMATCH_SUBJECT -- I foresee more of similar METADATA_MISMATCH errors in the future. That is when filename subject is different from folder subject. If any previous error already in place for that path -- avoid this check.

— where the fields of the ValidationResult are:

id: ???

above 3 ids

origin: ValidationOrigin(name="dandi", version=dandi.__version__)

severity: Severity.ERROR

scope: Scope.FILE

Let's add Scope FOLDER for DANDI.NON_DANDI_FOLDERNAME

path: filepath of the asset

or folder if defined (not empty)

asset_paths: None

ok, but might later be all the files under folder for DANDI.NON_DANDI_FOLDERNAME

within_asset_paths: None

dandiset_path: path to root of Dandiset

dataset_path: None ?

message: ???

short description of the error

metadata: ???

parse similarly to how done for BIDS ones for DANDI.METADATA_MISMATCH_SUBJECT and DANDI.NON_DANDI_FILENAME

path_regex: None

I think so although for DANDI.NON_DANDI_FILENAME if you implement it via regex -- add that regex here

jwodder commented 1 year ago

@yarikoptic

Let's add Scope FOLDER for DANDI.NON_DANDI_FOLDERNAME

I don't think that makes much sense, as that would really only be appropriate if validation was run on individual folders, yet it's instead run on files (and Dandisets & datasets as a whole).

metadata: ???

parse similarly to how done for BIDS ones for DANDI.METADATA_MISMATCH_SUBJECT and DANDI.NON_DANDI_FILENAME

The only BIDS validation result that sets metadata is BIDS.MATCH, which gets its metadata from the match_listing field of the structure returned by bidsschematools.

jwodder commented 1 year ago

@yarikoptic Problem: The validate command now needs to determine the root of the Dandiset(s) being validated in order to properly calculate the assets' relative paths. How should this be done, taking into account that validate can accept many paths that (currently) need not belong to the same Dandiset?

yarikoptic commented 1 year ago

@yarikoptic

Let's add Scope FOLDER for DANDI.NON_DANDI_FOLDERNAME

I don't think that makes much sense, as that would really only be appropriate if validation was run on individual folders, yet it's instead run on files (and Dandisets & datasets as a whole).

The scope is not about what validation is ran on -- but about the level/entity what that error is applicable to. E.g. we can have error of scope of a DATASET when it is ran on dandiset which is BIDS dataset with subdatasets and some specific subdataset has an error. So in principle FOLDER scope does make sense to me. Indeed though we do not have notion of Folder in assets, but we do organize into folders and even return folders in listing. The question/problem is that we would have 1 such error for each asset under that folder. We might need to have some post processing function which would join multiple such errors of scope folder into one for that specific folder path, and thus appending all the asset_paths into a single list within that combined error.

metadata: ???

parse similarly to how done for BIDS ones for DANDI.METADATA_MISMATCH_SUBJECT and DANDI.NON_DANDI_FILENAME

The only BIDS validation result that sets metadata is BIDS.MATCH, which gets its metadata from the match_listing field of the structure returned by bidsschematools.

yeap, because it was the only process which relied on regexes. If you rely on regexes -- here you can tell which regex was not matched.

@yarikoptic Problem: The validate command now needs to determine the root of the Dandiset(s) being validated in order to properly calculate the assets' relative paths. How should this be done, taking into account that validate can accept many paths that (currently) need not belong to the same Dandiset?

up to you -- since we have dandiset field -- we can in principle get a list of validation results where they would "belong" to different dandisets. Or it is ok to just error out and require paths for a single dandiset for validate invocation if you see the need for that ATM.

jwodder commented 1 year ago

@yarikoptic I'm unclear what you want the metadata fields of the ValidationResults set to.

jwodder commented 1 year ago

@yarikoptic

up to you -- since we have dandiset field -- we can in principle get a list of validation results where they would "belong" to different dandisets.

If you're referring to the dandiset_path field in ValidationResult, that's currently not filled in by anything, as assets currently don't store the path to the root of their Dandiset.

Regardless, I think the simplest option is to use find_parent_directory_containing() to locate the Dandiset root for each path passed to validate, but the question then becomes, what should happen if a path is not inside a Dandiset? Should there be a dedicated ValidationResult error for that?

yarikoptic commented 1 year ago

@yarikoptic I'm unclear what you want the metadata fields of the ValidationResults set to.

ok, don't set to anything for now -- it seems we have not described/formalized what kind of metadata we want to see there so to not confuse more (since we already have all kinds of metadata) -- omit setting that one.

@yarikoptic

up to you -- since we have dandiset field -- we can in principle get a list of validation results where they would "belong" to different dandisets.

If you're referring to the dandiset_path field in ValidationResult, that's currently not filled in by anything, as assets currently don't store the path to the root of their Dandiset.

that is ok, code which calls validation would know dandiset those assets belong to, right? so that code could set dandiset_path in the provided to it records appropriately.

what should happen if a path is not inside a Dandiset? Should there be a dedicated ValidationResult error for that?

good question. We did allow for use of dandi validate on some individual .nwb files. I think if there is no dandiset boundary found (directory with dandiset.yaml) neither of the tests described in this issue are pertinent. We should just issue a ValidationResult record with id DANDI.NO_DANDISET_FOUND and skip testing any of those 3 tests, thus possibly just running tests pertinent to .nwb file (or BIDS dataset). Later we could add ability to skip/suppress specific errors (ids).

jwodder commented 1 year ago

@yarikoptic

I think if there is no dandiset boundary found (directory with dandiset.yaml) neither of the tests described in this issue are pertinent. We should just issue a ValidationResult record with id DANDI.NO_DANDISET_FOUND and skip testing any of those 3 tests, thus possibly just running tests pertinent to .nwb file (or BIDS dataset).

The code isn't structured in a way to make that reasonable. All validation of an asset happens in the DandiFile.get_validation_errors() method, and obtaining a DandiFile happens after the calling code either determines the Dandiset root or leaves it unset. A DandiFile automatically assumes that it belongs to a Dandiset and that its path attribute is the path to that asset from the root of the Dandiset.

yarikoptic commented 1 year ago

after the calling code either determines the Dandiset root or leaves it unset.

so here we go -- if there is no root - no need for this checks in the 'calling code'. If there is root -- do the checks and add that Dandiset root into returned records.

dandi / dandi-cli

upload doesn't seem to be checking for valid DANDI (when not BIDS) organization of files/assets #1163