airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

schema pattern for array of objects is not followed for new Germline objects #669

Closed schristley closed 1 year ago

schristley commented 1 year ago

I noticed when coding for the new Germline objects, the schema is not structured to allow for an easy programming pattern, ie. creating new objects to add to an array, like we have in the docs. Thus mhc_alleles should be defined this way:

        mhc_alleles:
            type: array
            description: List of MHC alleles of the indicated mhc_class identified in an individual
            items:
                $ref: '#/MHCAllele'

I changed some objects on the vdjserver branch but I didn't do a full scan.

javh commented 1 year ago

Seems related to #667.

williamdlees commented 1 year ago

Scott, are you saying that arrays should not be defined as nullable, because the defined programming pattern creates a null array? I don't have an issue with that but want to make sure I understand exactly what needs to be changed to resolve this issue.

schristley commented 1 year ago

Hey William, no, null arrays are fine. This isn't about the schema structure but about how arrays and object definitions interact. I think most of the schema is okay but there are few definitions that should be reviewed. I'm also not sure how much of this is written down or just informal consensus, but one norm is that blank template objects have all of its properties defined in that object, the other norm is that the blank template object passes validation.

There's a certain sense to this. Imagine you are hand-editing a metadata file, having such a blank template in the file let's you see all of the properties, no jumping back and forth to the spec asking "what field..." Instead, just change the ones that you want and leave the others as their default.

Now consider coding in a similar scenario. Take the current mhc_alleles definition. If you wanted to programmatically fill the array you have to "know" how to create a valid object. In this case, you'd have to look at the spec to get the list of fields, check if any have a default value, etc. Your code looks like this:

data = subject[...]['mhc_alleles']
for i in range(5):
    obj = { }
    obj['allele_designation'] = ...
    obj['gene'] = ...
    obj['reference_set_ref'] = ...
    data.append(obj)

No big deal, but what if the schema changes and adds new fields, specifically required or have a default? You have to find all the places in your code and update them with the new fields. It's poor design from a maintainability standpoint.

Now if array of objects are always defined with $ref for the object, like above, there is a simple programming pattern using the AIRR template object

data = subject[...]['mhc_alleles']
for i in range(5):
    obj = airr.schema['MHCAllele'].template()
    data.append(obj)

The object returned by the template is valid, and it is guaranteed to have all properties defined in the schema with their default values.

In general, I think developers should be encouraged to use the template function to create new objects instead of hand-crafting them.

williamdlees commented 1 year ago

Hi Scott,

This approach works fine as long as all parameters have sensible defaults - but it's not so unusual to come across cases where there is no sensible default. This is where the constructor comes in. template() should take parameters for required fields that don't have defaults. When the schema changes, template() changes, and the developer may get a warning that a required argument to template() is missing. That's much better than having the code run without error after the schema is updated, but generating records that can't be properly processed by a recipient (but still pass validation, because of the compromises made in validation).

I appreciate that we have resource limits and may not be able to address this in the template object for the time being, but, if so, let's at least log it as something we should try to address later, and make sure we make users aware that, for the time being, invalid records may pass validation and they need to check for themselves that fields have sensible values because we are not giving guidance in the schema.

scharch commented 1 year ago

@williamdlees @schristley is this done now?

williamdlees commented 1 year ago

I believe it is.

javh edit: remove email headers