dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

Validating assets calls GitHub repeatedly #153

Open · danlamanna opened this issue 1 year ago

danlamanna commented 1 year ago

When calling dandischema.metadata.validate on n assets, n requests are made to GitHub to fetch the schema. This makes validating assets take significantly longer than it should. The request also has no default timeout, so a call to validate can hang indefinitely.

https://github.com/dandi/dandi-schema/blob/d34658cc24c0e3c0c3a88e92bcd29b158241448e/dandischema/metadata.py#L184-L187

Can dandi-schema be modified to avoid relying on the network for validation, either by bundling the schemas from dandi/schema as package data, by allowing the caller of validate to pass a schema directly, or by some other means?

FWIW this problem appears to exist with migrate as well.
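Editor's note: a minimal sketch of the "pass a schema directly" idea, so the cost model is concrete. `validate_with_schema` is a hypothetical helper, not the current dandischema.metadata.validate API, and the release URL and validator class are assumptions for illustration only.

```python
# Sketch only: the caller fetches the schema once and reuses it for n assets.
# validate_with_schema and the URL below are illustrative assumptions.
import jsonschema
import requests

SCHEMA_URL = (
    "https://raw.githubusercontent.com/dandi/schema/master/"
    "releases/0.6.0/asset.json"  # assumed path, for illustration
)

def validate_with_schema(instances, schema):
    """Validate many assets against a schema that was fetched exactly once."""
    validator = jsonschema.Draft7Validator(schema)  # validator class assumed
    for instance in instances:
        validator.validate(instance)

# The caller makes the single network request and controls its timeout.
schema = requests.get(SCHEMA_URL, timeout=10).json()
# validate_with_schema(assets, schema)
```

With this shape, n assets cost one request instead of n, and a hung fetch is bounded by the caller's timeout.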

satra commented 1 year ago

@danlamanna - we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes or would a cache to keep a schema once downloaded work?

it's also the case that this download happens when an asset is using a different schema than the current one. this is true for many assets currently that were submitted a while back, but should not in theory be true for new assets being uploaded. i.e. the schema version should be the latest.

we have been planning to run a metadata update by processing the files with the latest extractor, but this hasn't been rolled into action.
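Editor's note: one way to realize the in-process cache suggested above, sketched with functools.lru_cache. `fetch_schema` and the URL template are assumptions, not code taken from dandischema; a cache like this only helps when validations share a process, which is the question satra raises.

```python
# Sketch of a per-process schema cache with a bounded request timeout.
# fetch_schema and SCHEMA_URL_TEMPLATE are illustrative assumptions.
from functools import lru_cache

import requests

SCHEMA_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/dandi/schema/master/"
    "releases/{version}/{filename}"
)

@lru_cache(maxsize=None)
def fetch_schema(version: str, filename: str) -> str:
    """Download a schema once per (version, filename) pair per process."""
    url = SCHEMA_URL_TEMPLATE.format(version=version, filename=filename)
    response = requests.get(url, timeout=10)  # bounded, unlike an un-timed get
    response.raise_for_status()
    return response.text  # cache the raw text; callers can json.loads it
```

If the requests run in isolated worker processes, each process still pays one fetch per schema version, so a package-data or caller-supplied schema remains the stronger fix.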

danlamanna commented 1 year ago

> we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes or would a cache to keep a schema once downloaded work?

Avoiding network requests altogether would be best for maximizing reliability. A cache combined with giving the caller control over how network requests are performed (timeouts, retries, etc) would be the next best option.
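Editor's note: the usual pattern for "caller control" is to let the caller inject a preconfigured requests.Session. The `fetch_schema` helper and the idea that validation code would accept such a session are assumptions about a possible API, not something dandischema currently offers.

```python
# Sketch: a caller-supplied session carrying retry and timeout policy.
# fetch_schema is a hypothetical helper, not part of dandischema.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Build a session with retries so a flaky GitHub fetch fails fast."""
    retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503, 504])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def fetch_schema(url: str, session: requests.Session) -> dict:
    """Fetch a schema using the caller's session and an explicit timeout."""
    response = session.get(url, timeout=(3.05, 10))  # (connect, read) seconds
    response.raise_for_status()
    return response.json()
```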

satra commented 1 year ago

dandischema in general requires access to online resources to carry out its general work, so it will never be a network-free library. but we can optimize it in some ways. we didn't want to make assumptions about availability of storage, persistence, etc. when we wrote that component, but i can try a few changes. @djarecka and @sooyounga - is this something you folks could take a stab at? happy to discuss details.

waxlamp commented 1 year ago

@satra, Dan's idea has a lot of merit: even if the goal is to always be validating against the newest schema version, we are not there yet, and keeping the allowed schema versions as static package data would gain us an immediate and obvious win (while we are still litigating, so to speak, schema autoupgrades etc.).

Dan can create a quick proof of concept so we can observe the benefits/drawbacks of the approach. He can coordinate this idea with whatever Dorota and Sooyoung are looking into as well.
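Editor's note: a sketch of the static-package-data approach described above, assuming the released schema JSON files were copied into a hypothetical dandischema/schemas/<version>/ directory at build time; the layout, version, and filename are all assumptions.

```python
# Sketch: load a bundled schema instead of hitting the network.
# The dandischema/schemas/<version>/ layout is a hypothetical packaging choice.
import json
from importlib.resources import files

def load_bundled_schema(version: str, filename: str) -> dict:
    """Read a schema shipped as package data; no network access required."""
    resource = files("dandischema") / "schemas" / version / filename
    return json.loads(resource.read_text())

# Example (hypothetical version and filename):
# schema = load_bundled_schema("0.6.0", "asset.json")
```

Bundling the allowed schema versions this way removes the per-asset GitHub dependency entirely for the versions the package knows about, while still permitting a network fallback for versions released after the package was built.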

satra commented 1 year ago

@waxlamp - i have no issues with a proof of concept.