dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Add support for validation of zarr filesets #1281

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

ATM I believe we are just testing whether we can open them, plus two custom checks (the group is not empty, and the hierarchy is not too deep). "Initial validate support, with --strict option" in ome-zarr-py was recently merged, so we should make use of it for our OME .zarrs.
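For illustration, the two custom checks could look roughly like the sketch below. This is not the dandi-cli implementation: the function name `check_tree`, the `MAX_DEPTH` value, and the choice to treat "no files at all" as empty are all assumptions, and it only handles a Zarr stored as a plain directory tree.

```python
import os

MAX_DEPTH = 5  # hypothetical limit, not the value dandi-cli actually uses


def check_tree(root, max_depth=MAX_DEPTH):
    """Return a list of problem strings for a directory-backed Zarr."""
    problems = []
    deepest = 0
    n_files = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Depth of this directory below root; root itself is depth 0.
        depth = 0 if rel == "." else len(rel.split(os.sep))
        deepest = max(deepest, depth)
        n_files += len(filenames)
    if n_files == 0:
        problems.append("empty: no files under root")
    if deepest > max_depth:
        problems.append(f"too deep: {deepest} > {max_depth}")
    return problems
```

Both checks only list directories, so they stay cheap regardless of how much array data the Zarr holds.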

jwodder commented 1 year ago

@yarikoptic

yarikoptic commented 1 year ago
  • What makes you say that ome-zarr-py PR was merged? It was clearly closed without being merged.

heh, not sure why I took "Closed" to mean "Merged" ;) Left a question on that PR about what the plan is there in terms of validation.

  • It appears that the given validation method loads all(?) of the Zarr data into memory, which will be a problem for arbitrarily large Zarrs.

that would really make it unlikely to be usable by default... where do they do it?

from glancing over https://github.com/ome/ome-zarr-py/pull/142/files#diff-b50d9715cc6e4017cfc055fd0ed73ecb5d9158e17f4d58ca5b3ba08b89c46657R206 I thought it would just validate structure/metadata against some jsonschema.

jwodder commented 1 year ago

@yarikoptic ome_zarr.utils.validate() calls visit(), which iterates over the return values of Reader.__call__(), which either descends through the node (I haven't yet found what's populating the "descend" structures) or (line 698) calls ZarrLocation.load(), which calls out to a third party library that I haven't looked at yet, but the name sure sounds like it's loading data.
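One way to sidestep the data-loading concern would be to read only the metadata files and never open a chunk. The sketch below is not how ome-zarr-py works; it assumes a Zarr v2 layout stored as a local directory, where metadata lives in `.zattrs`, `.zgroup`, and `.zarray` JSON files, and the helper name `load_metadata` is made up for this example.

```python
import json
import os

# Zarr v2 keeps metadata in these JSON files; everything else is chunk data.
METADATA_NAMES = {".zattrs", ".zgroup", ".zarray"}


def load_metadata(root):
    """Map relative path -> parsed JSON for every metadata file under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in METADATA_NAMES:
                path = os.path.join(dirpath, name)
                # Only metadata files are opened; chunk files are listed but never read.
                with open(path, encoding="utf-8") as fh:
                    found[os.path.relpath(path, root)] = json.load(fh)
    return found
```

The parsed dictionaries could then be handed to whatever schema validator is chosen, with memory use bounded by the size of the metadata rather than the arrays.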

yarikoptic commented 1 year ago

I have followed https://github.com/ome/ome-zarr-py/pull/142#issuecomment-1517024760 and ran `check-jsonschema --schemafile /home/dandi/proj/ngff/0.4/schemas/image.schema <(curl --silent "$url")`

coded within `/mnt/backup/dandi/dandizarrs/tools/jsonschema-check-zattrs` on drogon:

```shell
#!/bin/bash
# inspired by https://github.com/ome/ome-zarr-py/pull/142#issuecomment-1517024760
set -eu
#set -x

for z in "$@"; do
    zattrs="$z/.zattrs"
    if ! /bin/ls "$zattrs" &>/dev/null; then
        echo "$z - no .zattrs, skipping"
        continue
    fi
    url=$(git -C "$z" annex whereis .zattrs | grep https://dandiarchive | awk '{print $2;}' | head -n 1)
    echo "$z - $url"
    check-jsonschema --schemafile /home/dandi/proj/ngff/0.4/schemas/image.schema <(curl --silent "$url") | sed -e 's,^, ,'g
done
```

and got the following list of failures: http://www.oneukrainian.com/tmp/dandizarrs-jsonschema-checks.out. The majority of zarrs have

  Schema validation errors were encountered.
    /dev/fd/63::$.omero.channels[0].window: 'start' is a required property
    /dev/fd/63::$.omero.channels[0].window: 'end' is a required property

in fact, there are only 137 zarrs which pass validation and over 4k which do not.
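To make the failure concrete, here is a minimal stdlib-only sketch that looks for just this one problem in an already parsed `.zattrs` dictionary. It is not a substitute for the full NGFF schema: it checks only the `start` and `end` keys reported above, skips channels that have no `window` object at all, and the function name `missing_window_keys` is invented for the example.

```python
def missing_window_keys(zattrs, required=("start", "end")):
    """Return (channel index, missing key) pairs for omero.channels[*].window."""
    missing = []
    channels = zattrs.get("omero", {}).get("channels", [])
    for i, channel in enumerate(channels):
        window = channel.get("window")
        if not isinstance(window, dict):
            continue  # absent or malformed window is out of scope for this sketch
        for key in required:
            if key not in window:
                missing.append((i, key))
    return missing


# Example shaped like the failing files: window has min/max but no start/end.
example = {"omero": {"channels": [{"window": {"min": 0, "max": 255}}]}}
print(missing_window_keys(example))  # [(0, 'start'), (0, 'end')]
```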

@slaytonmarx could you please run a similar command (`check-jsonschema --schemafile https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/image.schema YOUR.zarr/.zattrs`) on the zarr files you have?

slaytonmarx commented 1 year ago

I'll check tomorrow morning and let you know!

slaytonmarx commented 1 year ago

I received the same validation errors as Yarik:

smarx@leviathan:/mnt/beegfs/Lee/dandi/sub-MITU01/ses-20211001h11m49s01/micr$ check-jsonschema --schemafile https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/image.schema sub-MITU01_ses-20211001h11m49s01_sample-103_stain-LEC_run-1_chunk-10_SPIM.ome.zarr/.zattrs
Schema validation errors were encountered.
  sub-MITU01_ses-20211001h11m49s01_sample-103_stain-LEC_run-1_chunk-10_SPIM.ome.zarr/.zattrs::$.omero.channels[0].window: 'start' is a required property
  sub-MITU01_ses-20211001h11m49s01_sample-103_stain-LEC_run-1_chunk-10_SPIM.ome.zarr/.zattrs::$.omero.channels[0].window: 'end' is a required property
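Since the same two errors show up across thousands of files, a small tally over the `check-jsonschema` output may help to see whether anything else is hiding in the list. The sketch below assumes the line format seen in the outputs in this thread (`<file>::<json path>: <message>`) and would mis-split a file name that itself contains `::`; the helper names are made up.

```python
from collections import Counter


def parse_error_line(line):
    """Split '<file>::<json path>: <message>' into its three parts, or return None."""
    location, sep, rest = line.strip().partition("::")
    if not sep:
        return None  # e.g. the "Schema validation errors were encountered." header
    json_path, sep, message = rest.partition(": ")
    if not sep:
        return None
    return location, json_path, message


def tally(lines):
    """Count failures per (json path, message), ignoring which file they came from."""
    counts = Counter()
    for line in lines:
        parsed = parse_error_line(line)
        if parsed is not None:
            _location, json_path, message = parsed
            counts[(json_path, message)] += 1
    return counts
```

Calling `tally(open(path))` on a saved output file and then `most_common()` on the result would list the distinct failures by frequency.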