airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

genotype locus field has nullable:false #663

Closed schristley closed 11 months ago

schristley commented 1 year ago

This causes a validation error when a new blank template repertoire is created. Either this field needs to be nullable:true or given a default value.

Repertoire at array position 1 with validation error: Non-nullable field "subject.genotype.receptor_genotype_set.genotype_class_list.locus" is null or missing
schristley commented 1 year ago

Actually there are a bunch of fields in Genotype and MHCGenotype with this issue.

javh commented 1 year ago

Same issue as #665. Fields should be set nullable:true unless they are identifiers or mandatory references, though null values may be not recommended. No logical default for locus. @williamdlees , please confirm.

javh commented 1 year ago

Also, check all germline nullability fields.

williamdlees commented 1 year ago

Perhaps I have misunderstood, but I have a feeling that we are putting the cart before the horse here. The way it's defined, a genotype has to have one of the seven values listed for locus to mean anything. In a data file, finding a genotype with a null locus, or a locus with some value other than those listed, should trigger an error. To me, the validation being mentioned here is being done at the wrong point: it should be done when the record is complete, not when it is in the process of construction. Alternatively, the mandated fields could be passed as mandatory arguments to a constructor, or initialised in a dict when the object is first created. The implementor will know what the locus should be set to when the object is first created. Advising users to create blank records and then fill in the blanks is an open invitation to leave out necessary information, and weakening validation makes this worse.

If limitations in the way things have been implemented mean that this is unachievable, then the field can of course be changed so that it is not mandatory but I feel this shouldn't go without challenge, both because it weakens the validation quite considerably and secondly because it sends a message to implementors that the field is optional, when it really isn't.

schristley commented 1 year ago

In a data file, finding a genotype with a null locus, or a locus with some value other than those listed, should trigger an error. To me, the validation being mentioned here is being done at the wrong point: it should be done when the record is complete, not when it is in the process of construction.

This is valid point and one that we discussed when first introducing Repertoire and its identifiers. The same logic applies, if you have a data file with repertoires that don't have a repertoire_id, is it valid or invalid? Complicating the scenario is that there are numerous situations where the id's may be legitimately blank because they just haven't been assigned yet in some process. Now you need to make the decision, can data files sometimes be invalid and sometimes be valid? We decided that it's simpler for the user if they can always assume that data files should be valid. This then implies that blank template objects also need to be valid.

Now it just so happens that this seemed to work out okay for Repertoire and other objects. There might be specific fields with Germline that we need to think more carefully about.

If limitations in the way things have been implemented mean that this is unachievable, then the field can of course be changed so that it is not mandatory but I feel this shouldn't go without challenge, both because it weakens the validation quite considerably and secondly because it sends a message to implementors that the field is optional, when it really isn't.

I would make the argument that nullable: false isn't really the validation you want, at least when it comes to identifiers (and maybe other fields). If an identifier is null is the data object really invalid, or is it valid but just not identifiable/resolvable? My argument is that validation should really check the id values and insure 1) uniqueness, and 2) resolvable. Right now our validation routines don't do that, so right now if you have a nullable: false field, you can give it a bogus value and it's suddenly valid.

williamdlees commented 1 year ago

Thanks Scott.

This leads to a situation in which information that's required for processing can be left out without anyone noticing, and it is doubly hard to fix because there isn't even a way to say in the schema what information really is required: it relies on people working it out for themselves. Someone could, for example, produce a data file with a set of Studies, none of which contained an id, a title, or a description. It may be simple at the point of creation, but that's achieved at substantial risk to data quality downstream.

As in #669 I suggest we log this as an issue to be addressed in the future and make sure that people using the template API are aware of it, and I can remove all the nullable: False atributes from the Germline objects. But perhaps we should introduce an x-airr property that says, effectively, that this property requires a value even though it isn't currently checked in validation? That would at least let us alert users and start to prepare for the future.

javh commented 1 year ago

The purpose of the validation routines in the reference library is to check for a valid format/types, not valid values. I'd say enums are a grey area because they are "typed" to have specific values. And, at least for fields like locus, there's no clear default value, so there's a non-ideal compromise there. In that I think that it would be worse to assume locus: "IGH" than assume locus: null (not sure everyone agrees). locus would still be required though, so the validation would expect the field to be present and the value to have the correct type (one of the enum values or null).

This kind of dates back to the early days where we decided empty strings were valid entries for required fields, because not every alignment tool output v_cigar, germline_alignment, etc and we didn't want to setup major barriers to adoption. Having "nullable but required" fields is baked into the standards, for better or worse. So we do allow for files to be compliant with the schema while still having missing data / nonsense values that make them practically unusable.

As far as the reference library goes, we haven't been focused on making sure the file is good/useable, only that it fits the schema. So no checks on identifier uniqueness, correct ontology links, compliant gene name, making sure the locus and gene name match, etc. It might be worth having something that does that, but I don't know that we want to do it. Maybe we should have some sort of "level 2" validation or something.

javh commented 1 year ago

I set the various Germline enums to nullable: true in #674 to get over the current hurdle.

What do you think about adding an IARC or OGRDB specific x-airr attribute like either the miairr or adc-query-support attributes? The idea being to define what makes for a valid file in those contexts, beyond barebones analysis needs in other contexts. Eg, something like RearrangedSequence.derivation may be "essential" for IARC, but not useful for someone running an alignment.

miairr:
    type: string
    description: MiAIRR requirement level.
    enum:
        - essential
        - important
        - defined
    default: useful
adc-query-support:
    type: boolean
    description: >
        True if an ADC API implementation must support queries on the field.
        If false, query support for the field in ADC API implementations is optional.
    default: false

Alternatively, we could set enums to nullable: false, but then assign a non-informative default, like default: "undefined".

schristley commented 1 year ago

I can't really argue against William's point as I also tend to agree. I don't want to deal with these worthless "null" objects either. Plus, I realized that we are asking for a valid JSON property like nullable to be removed when that more faithfully represents the data, and somehow that seems wrong too.

So if we allow nullable: false, then we need to rethink what and how to generate a template. Maybe the template function checks something in x-airr instead. It would change our policy about templates being valid, which is ok practically, we can test that, then add the required fields and validate again. What else do we need to consider?

javh commented 1 year ago

If it's just the template maker, we could have it do something dumb (but valid) like pick the first item in the enum set and use that in the absence of a default for fields set nullable: false.

I think we have two separable issues here: (1) what to do about templates/validation for enums that have no default and should be set nullable: false and (2) what to do about fields that should probably be nullable: true in general cases but nullable: false under more strict regimes.

schristley commented 1 year ago

and (2) what to do about fields that should probably be nullable: true in general cases but nullable: false under more strict regimes.

we can take an approach similar to what we did with ADC using include_fields and define categories of fields. Do we do that with an x-airr attributes specific to IARC or OGRDB, or define general categories?

We can also define strictness regimes, like all identifiers must have valid values, null check is relaxed, etc.

williamdlees commented 1 year ago

I worked through all the germline objects in the v2 and v3 schema, changing

nullable: false

to

x-airr: nullable: false

in both v2 and v3 specs.

The schemas now fail validation because check-consistency-formats.py expects the x-airr qualifier in v2 but not in v3. This is enforced by the function translate_nullable().

Have I misunderstood what I should be doing, or is this an unexpected problem?

javh commented 1 year ago

@williamdlees, where are you making the changes? I can take a look and try to figure it out.

The v2 spec uses the syntax property.x-airr.nullable, while the v3 spec uses property.nullable. So if that's the difference, then, yes, that's an expected difference between v2 and v3.

williamdlees commented 1 year ago

Thanks Jason. I've created this pull request, which is failing checks: https://github.com/airr-community/airr-standards/pull/680.

If this is expected behaviour, the question is how you want me to treat those elements of the schema that should not be null in records, but can't be set to nullable: false because that causes the API code to fail. I had thought we were going to use x-airr attributes, but perhaps we can't use x-airr: nullable: false because of the way the consistency checker treats 'nullable'.

javh commented 1 year ago

property.x-airr.nullable (v2) and property.nullable (v3) are equivalent fields and only one should be used, appropriate to the version. No longer needing a custom (x-airr) nullable field is part of the benefit to v3. So it should be:

v2:

sequence_id:
    x-airr:
        nullable: false

v3:

sequence_id:
    nullable: false

That said, there are changes to nullable fields in #674, so we should probably include them in that PR to avoid merge conflicts.

Regarding this:

elements of the schema that should not be null in records, but can't be set to nullable: false

I think we need to discuss a solution. One possible route is a custom x-airr attribute, but that attribute would not be x-airr.nullable as that serves a different purpose already.

javh commented 1 year ago

Just as an example, to kick off discussion, we have x-airr.miarr to define expectations for MiAIRR compliance with more rigor than the schema otherwise requires:

miairr:
    type: string
    description: MiAIRR requirement level.
    enum:
        - essential
        - important
        - defined
    default: useful

We could make an equivalent for IARC.

williamdlees commented 1 year ago

We could use those levels as defined, I think.

javh edit: removed email headers.

javh commented 1 year ago

From the call:

  1. Regarding enum validation: Change the template code to default enums to empty string for type string. Fall back to first enum value if that doesn’t work for some. Also, understand use case for validating template file.
  2. Regarding nullability: Use miairr attribute as is (will resolve ownership issues later if we need to). Follow same guidelines of essential = required/non-nullable and important = required/nullable.
  3. Attributes.miarr.default should be defined (everything is useful).
javh commented 1 year ago

Second item above (nullability) is addressed. Need to take care of (1) still.

bussec commented 1 year ago

Third item was addressed in 2520bbd2

bussec commented 1 year ago

From the call: Item 1 is still open, needs fixing of validator code.

javh commented 11 months ago

Looking at the code for this, it looks like @williamdlees already fixed item 1 in #581.