dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

refactoring the schema to provide always "valid" models #204

Open satra opened 7 months ago

satra commented 7 months ago

there has been a lot of discussion about how to accept and serve valid models. let's use this issue to discuss the details:

  1. separating type validation from requirement validation. we often mix these as a result of our tools. let's start with everything being Optional. this will allow us to do type validation (for the most part; see the next point), and we can do requirement validation as a function/method that can be customized for different needs, similar to the nwb inspector (see the sketch after this list).
  2. there are several asynchronous processes in the system (e.g., generating a valid checksum). till the checksum is generated, the server should indicate something about the state/stage of the operation. this is an enumeration we can come up with that is semantically meaningful to the end user. it would require polling on draft dandisets, but that's ok.
  3. python uses None as a catch-all, and different systems encode missing versus explicit use of None in different ways. we should become consistent, and i suggest we come up with a plan that distinguishes None from MISSING and further subcategorizes MISSING with more specific elements from some ontology. this will also allow for valid dandisets when we impose new requirements: someone could say the information was MISSING, NOT ACQUIRED, or RESTRICTED, allowing us to encode sensitive fields as well.
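a minimal sketch of points 1 and 3, assuming pydantic (which dandischema already uses); the Missing enum, the field names, and the check_requirements helper are hypothetical illustrations, not existing dandischema API:

```python
from enum import Enum
from typing import Union

from pydantic import BaseModel


class Missing(Enum):
    # hypothetical qualifiers for why a value is absent (point 3);
    # real values could come from an ontology
    MISSING = "missing"
    NOT_ACQUIRED = "not-acquired"
    RESTRICTED = "restricted"


class Asset(BaseModel):
    # every field is optional (point 1), so type validation always
    # succeeds on partially filled records; None stays the catch-all,
    # while a Missing member records *why* a value is absent
    path: Union[str, Missing, None] = None
    digest: Union[str, Missing, None] = None
    contentSize: Union[int, Missing, None] = None


def check_requirements(asset: Asset, required: tuple) -> list:
    """requirement validation as a separate, customizable function:
    return the required fields that are absent or only qualified as
    missing."""
    return [
        name
        for name in required
        if getattr(asset, name) is None
        or isinstance(getattr(asset, name), Missing)
    ]


# a draft might require only a path, while publishing requires more:
check_requirements(
    Asset(path="sub-01/probe.nwb", digest=Missing.NOT_ACQUIRED),
    required=("path", "digest", "contentSize"),
)  # -> ["digest", "contentSize"]
```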

@jwodder @yarikoptic @AlmightyYakob @CodyCBakerPhD @candleindark

related issues: #127 #182

jwodder commented 7 months ago

I just filed #205 at the same time as this. Which do you think is better?

satra commented 7 months ago

@jwodder - i tried to highlight use cases here: type vs requirement validation, asynchronicity of information filling a model, and missingness of information for various reasons (including asynchronicity). perhaps the two issues are complementary.

if we were to make every attribute Optional, add a property/function that checks requirements against a versioned state, and add enumerated qualifiers for missingness, would that even help compress the classes? for example, could the publishable mixins already be added? i.e., do we need separate classes (typed checks) or properties/functions like is_published, is_valid_for, etc.?

jwodder commented 7 months ago

> type vs requirement validation

Type validation is part of requirement validation.

> asynchronicity of information filling a model, and missingness of information for various reasons

I think both of those can be addressed without an overhaul to what the models accept.

> do we need separate classes (typed checks) or properties/functions like is_published, is_valid_for, etc.?

I would say that separate classes are a good idea, as it would allow consumers of the library to construct a "published metadata" instance once, after which the type system (such as it is in Python) would ensure that any further code that needed valid published data was receiving it. In contrast, if we just had is_published() methods, any function that received a metadata instance that needed it to be publish-valid would have to call the method even if the calling code had previously called it. Compare "Parse, don't validate".
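A minimal sketch of that pattern, mirroring the existing Asset/PublishedAsset split (the publish() helper is hypothetical, and .dict() assumes pydantic v1):

```python
from typing import Optional

from pydantic import BaseModel


class Asset(BaseModel):
    # draft-stage record: fields may still be absent
    path: Optional[str] = None
    digest: Optional[str] = None


class PublishedAsset(Asset):
    # publish-stage record: the same fields, now required, so holding
    # a PublishedAsset instance *is* the proof of validity
    path: str
    digest: str


def publish(asset: Asset) -> PublishedAsset:
    # "parse, don't validate": construct the stricter type once at the
    # boundary; code typed to take a PublishedAsset needs no re-checks
    return PublishedAsset(**asset.dict())
```

If asset.digest is still None, publish() raises a ValidationError once at the boundary, instead of every downstream consumer calling is_published() defensively.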

satra commented 7 months ago

> Type validation is part of requirement validation.

yes, but not the other way around. a requirement is a semantic construct dandi imposes on the structure: we require this field for the good of neuroscience when one creates a dandiset. type validation ensures that a variable has the right type so computers can do the right thing with it, whether or not it's critical for neuroscience. the requirements for the good of neuroscience can change between now and later (at a higher frequency); types can also change, but typically at a lower frequency.

> publish-valid would have to call the method even if the calling code had previously called it.

i agree this makes it messy. we could write one-time properties, deal with dirty states, etc., to address the mess. however, i don't fully see how the end user/scripter gets around this through classes. they still have to do if isinstance(asset, SomeAssetStateClass) to know what state the Asset is in when they do a GET on the archive. isn't the issue that the requirements are different at different stages of the evolution of the Asset?

> I would say that separate classes are a good idea,

Currently we have BareAsset, Asset, and PublishedAsset in dandischema, but from a server-retrieval perspective it's always an Asset, even after publishing. if each stage were a different Asset class, would one expect the API to return AssetInvalid or AssetWithoutChecksum or PostedAsset between the Asset POST and the completion of the checksum or other tasks that inject info into the Asset record? would this then change to PublishableAsset after it's made valid, and PublishedAsset after it's published? and what if different tasks (checksum, NWB re-extraction, other scientific computation tasks, etc.) enter info into the asset record at different stages?

the question is whether we categorize state/stage as classes, properties, or functions. states of assets will not go away, as asset modification is an inherent feature of the system till the asset is published.
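for comparison, a sketch of the property/enum alternative, with a purely hypothetical stage enumeration (none of these names exist in dandischema):

```python
from enum import Enum


class AssetStage(str, Enum):
    # hypothetical lifecycle stages a server could report while
    # asynchronous tasks (checksum, NWB re-extraction, ...) fill in
    # the record
    POSTED = "posted"                        # metadata POSTed, tasks pending
    AWAITING_CHECKSUM = "awaiting-checksum"
    VALID = "valid"                          # requirement validation passed
    PUBLISHED = "published"                  # immutable from here on


def is_valid_for(stage: AssetStage, needed: AssetStage) -> bool:
    # with a single Asset class plus a stage property, callers branch
    # on an enum value instead of isinstance checks against per-stage
    # classes; this assumes a linear ordering of stages, which breaks
    # down if independent tasks fill the record in arbitrary order
    order = list(AssetStage)
    return order.index(stage) >= order.index(needed)


is_valid_for(AssetStage.VALID, AssetStage.AWAITING_CHECKSUM)  # True
```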