dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0
5 stars 8 forks source link

Add asset "layout" metadata field, validate asset paths #157

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

in a follow up to https://github.com/dandi/dandi-cli/issues/1205#issuecomment-1421010945 of @satra

(@yarikoptic - this was done outside of dandi schema and also related to web-based validation, rather than local validation).

yes, that was done in https://github.com/dandi/dandi-cli/pull/1173 but it does not mean that it should stay here, at least as the "implementation logic".
In the past we removed dandi-cli from dependencies of dandi-archive IIRC to break the "dependency loop" since dandi-cli depends on API implementation of the archive. Hence we keep code worth reusing in both -cli and -archive elsewhere (dandi-schema, zarr_checksum) etc.

So may be we should move the validate_organized_path into dandischema? But again -- then for an asset it would matter on either it is a BIDS asset or some other (DANDI layout) asset.

looking at a sample asset from 000026 which is BIDS -- we do not provide such metadata ```shell ❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/assets/3a6cdce9-ab49-41dd-967f-29349d0cc341/' -H 'accept: application/json' | jq . { "id": "dandiasset:3a6cdce9-ab49-41dd-967f-29349d0cc341", "path": "sub-EXC022/ses-MRI/anat/sub-EXC022_ses-MRI-echo-2_flip-2_VFA.json", "access": [ { "status": "dandi:OpenAccess", "schemaKey": "AccessRequirements" } ], "digest": { "dandi:sha2-256": "f48296e2812476e813c8d11e59ab1947ddfc6a8330c32ace04f040cd5d7d213e", "dandi:dandi-etag": "a7aaf0935f14bdb4d686486f4512a299-1" }, "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "schemaKey": "Asset", "contentUrl": [ "https://api.dandiarchive.org/api/assets/3a6cdce9-ab49-41dd-967f-29349d0cc341/download/", "https://dandiarchive.s3.amazonaws.com/blobs/5d5/2c6/5d52c625-2efd-4ff2-a768-bd4f15e7da46" ], "identifier": "3a6cdce9-ab49-41dd-967f-29349d0cc341", "repository": "https://dandiarchive.org/", "contentSize": 247, "dateModified": "2021-10-05T14:50:57.025495-04:00", "schemaVersion": "0.6.0", "encodingFormat": "application/json", "wasGeneratedBy": [ { "id": "urn:uuid:7eec0bb1-668c-48da-a3be-a90df5485d8c", "name": "Metadata generation", "schemaKey": "Activity", "description": "Metadata generated by DANDI cli", "wasAssociatedWith": [ { "url": "https://github.com/dandi/dandi-cli", "name": "DANDI Command Line Interface", "version": "0.27.3", "schemaKey": "Software", "identifier": "RRID:SCR_019009" } ] } ], "blobDateModified": "2021-10-05T12:55:21.243891-04:00" } ```

so to validate the path we should first start annotating assets with "layout" (optional, default to DANDI, could be BIDS). Then validate either using our DANDI layout regex, or BIDS (using bidsschematools like done in dandi-cli) purely on the path (ref also #74 ).

satra commented 1 year ago

is there something that prevents detecting layout based on either local/remote dandiset.yaml + dataset_description.json ? if dataset_description exists, it's bids, if not it's dandi. since all assets are associated with a dandiset.

yarikoptic commented 1 year ago

Iirc we don't have association from asset to dandiset

satra commented 1 year ago

@yarikoptic - we do (see s3 bucket) and can expose it via api. at least at this point, assets still have a one to one correspondence with dandiset, but one to many with dandiset/version. so if assets are listed per dandiset, they have explicit association.

yarikoptic commented 1 year ago

@yarikoptic - we do (see s3 bucket) and can expose it via api. at least at this point, assets still have a one to one correspondence with dandiset, but one to many with dandiset/version. so if assets are listed per dandiset, they have explicit association.

I am asking since all such validations should be done within dandischema, not through some ad-hoc logic/API within dandi-cli if we would like dandi-archive to use them, hence this issue here in dandischema. So the question is really "where at the level of dandischema we do have such association?"

satra commented 1 year ago

i think the difference i am making is that in dandischema such a function could take type of layout as input, but that doesn't need to be in the asset model.

yarikoptic commented 1 year ago

Per brief monday discussion -- we might want/need to make validation of assets interface to be exposed to the notion of dandiset across the entire surface of the archive interfaces/implementation. But I would like to remind that dandi-schema is ATM the core logic for validation, so it might need to start here somehow and decide on how to expose/use it at dandi-cli/dandi-archive levels.

Might relate to https://github.com/dandi/dandi-archive/issues/1477 (by @bendichter ) where solution might be to establish regexes for URLs at dandi-schema level so they could be used/validated by both dandi-cli and dandi-archive UIs

yarikoptic commented 1 year ago

i think the difference i am making is that in dandischema such a function could take type of layout as input, but that doesn't need to be in the asset model.

yes. my only thinking is that we might need later some other aspect of knowing dandiset metadata or identity (e.g. to do validation in conjnction with with other assets), and then would need to change it again.

satra commented 1 year ago

why don't we start with copying the regex's over and adding a validation function to the asset model for the asset path.