Closed grst closed 3 years ago
This happens because repertoire_id
is an identifier, and from #391 missing identifier properties raises a ValidationError. Though it's not totally clear to me whether optional fields that are identifiers should be then considered as required. This probably needs to be clarified in the documentation.
Identifiers are not inherently required, though they are often listed in required
. We tried to simplify the rules in v1.3 so that the only determinant of whether a field is required or not, in the base schema, is what is listed in required
.
Hrm. Maybe we should issue some sort of warning for optional fields that are mandatory in certain contexts, but not required by the base schema? Ie, miairr: essential
and identifier: true
. Validation shouldn't fail if a non-required field is missing, but maybe it makes sense to notify the user they are missing an AIRR Data Model linker (identifier
) or MiAIRR field ('miairr: essential`).
ok, so it'd be more a fine-tuning of the #391 validation issue, right? just to confirm the actions would be:
1) remove ValidationError
s from non-required identifiers, and
2) replace them with ValidationWarning
s
If it's fine, I think we can add this to the R-Python validation sync. pinning @imkeller
Oh, man. Conflicting information. :)
This makes sense to me. @schristley, what do you think? Could we add an argument to control the validation level? Eg, base level throws an error only for missing required
and warnings for miairr: essential
and identifier: true
, whereas the full AIRR Data model validation level throws and error for everything.
Could we add an argument to control the validation level? Eg, base level throws an error only for missing
required
and warnings formiairr: essential
andidentifier: true
, whereas the full AIRR Data model validation level throws and error for everything.
Yes, I think we need a flag/parameter to indicate the level of validation.
It seems that this bug only affects repertoire_id
and sample_processing_id
for which the validation fails. Would it be enough to add the exception for those identifier fields that are nullable and not listed as required (see 9033340d13f), or should we go for different levels of validation with a flag? (in that case, #514 could still be a start (: )
the stderr for the example above would be:
Warning: Object has non-AIRR field that cannot be validated (c_sequence_start).
Warning: Object has non-AIRR field that cannot be validated (c_sequence_end).
Warning: Object has non-AIRR field that cannot be validated (is_cell).
Warning: Nullable identifier field repertoire_id is missing.
Warning: Nullable identifier field sample_processing_id is missing.
Warning: Nullable identifier field data_processing_id is missing.
but the validation fails if MiAIRR _id fields are not provided:
ValidationError: MiAIRR field cell_id is missing
All identifier: true
fields should be of the same class, that is, they are all identifiers for linking objects in the AIRR Data Model. I think they should all be treated the same, and not create exceptions for some versus the others. We were actually doing that before by trying to use a combination of flags, but converted to identifier: true
so we could treat them all the same. I think a simple validation flag to include/exclude these identifiers should be sufficient.
The flag for validation level is growing on me. Missing MiAIRR fields shouldn't result in an error for the default usage. I think we have 3 use cases:
required
only. Default behavior.required
plus miairr
.required
plus identifier
.Should the AIRR Data Model be all three instead? (required
+ identifier
+ miairr
). Technically, we could not check required
for MiAIRR compliance, but I think that sort of misses the point of validating files, which I think should conform to the schemas.
In the ADC API, we defined three levels for returning fields. So this is a little confusing in that the miairr
level suggests a set of fields that do not contain all of the required
. I guess the thought was that users might want the MiAIRR fields as a minimum, but not the additional required
fields for analysis. The airr-core
level would correspond to your 3).
Is it useful to check a file for MiAIRR compliance, even though it doesn't have all of the required
fields for analysis?
Is it useful to check a file for MiAIRR compliance, even though it doesn't have all of the required fields for analysis?
I don't think so. In the abstract, I suppose. But, if a required field is missing, then any tool that ingests the file would have to know that the file is a "special" non-compliant file. Not sure what value there is validating that a broken file is MiAIRR compliant.
I'm thinking of required
as more of a storage/transfer/interoperability requirement, and not as an analysis requirement.
I don't think so. In the abstract, I suppose. But, if a required field is missing, then any tool that ingests the file would have to know the file is a "special" non-compliant file. Not sure what value there is validating that a broken file is MiAIRR compliant.
I'm thinking of
required
as more of a storage/transfer/interoperability requirement, and not as an analysis requirement.
Ok, yeah, I'm thinking the same way. The only reason to have it is for "completeness". I think we intended that all MiAIRR fields would be required
so it would be a strict subset, but I think that's only true for repertoire metadata, not rearrangements.
I would be tempted to roll your 1) and 2) together, or leave out 2). Technically 2) is not MiAIRR compliance, it's MiAIRR plus other stuff, which might be confusing for users as it would complain about fields which aren't in MiAIRR.
I think we intended that all MiAIRR fields would be
required
so it would be a strict subset, but I think that's only true for repertoire metadata, not rearrangements.
I think there was some discussion about revisiting MiAIRR to reconcile the differences for rearrangements. Does that sound right to you @bussec ? Maybe the current differences intend to be resolved in the future...
I am wondering if it might be useful to have an option to NOT report extra fields as a problem?
Warning: Object has non-AIRR field that cannot be validated (c_sequence_start). Warning: Object has non-AIRR field that cannot be validated (c_sequence_end). Warning: Object has non-AIRR field that cannot be validated (is_cell).
This has the potential to be quite noisy and have actual problems get lost in that noise. If you want a MiAIRR validation, having a bunch of warnings about fields that don't break MiAIRR validation but are still reported on maybe isn't what we want?
We do want the user to be able to get these warning, but maybe they should be turned off when the user asks for a specific level of validation?
As required
is a property of a field itself, it does not make any statements about whether the field MUST/SHOULD/MAY contain meaningful information, garbage or NULL
. Independent of this, MiAIRR as a metadata standard -- of which x-airr.miairr
can be considered to be the formalized version -- defines the following rules:
miairr:essential
or miairr:important
MUST be presentmiairr:essential
MUST NOT contain NULL-LIKE data (i.e., garbage)miairr
field (i.e., including :defined
), it MUST be encoded in this field.For our discussion here, we can ignore (2) + (3), but I would like to mention that it would be nice to have a "NOT NULL-LIKE" validation for :essential
fields in the future.
What follows from (1) is that miairr:(essential|important)
fields -- not miairr
in general -- should be a subset of the required
fields in the Schema. Fields that are miairr:defined
may be required
but probably we should avoid this for clarity, especially as MiAIRR does not define identifiers
.
Looking at the current schema, there are a number of sections that are at variance with the principles stated above:
rearrangement.cell_id
is miairr:important
but not required
. It is an identifier
in addition.rearrangement.duplicate_count
is miairr:important
but not required
.Cell
are miairr:defined
and required
at the same timeIMO we can resolve all of these in the Schema, i.e., change their MiAIRR status if this avoids yet-another-flag for validation. @schristley: Yes, we had these discussions and I think we should continue it to see whether we should make MiAIRR a pure metadata (i.e., not analysis results) standard.
Looking at the current schema, there are a number of sections that are at variance with the principles stated above:
repertoire.cell_id
ismiairr:important
but notrequired
. It is anidentifier
in addition.repertoire.duplicate_count
ismiairr:important
but notrequired
.- Several properties of
Cell
aremiairr:defined
andrequired
at the same time
These are rearrangement.cell_id
and rearrangement.duplicate_count
, no? Just making sure I am not confused 8-)
These are rearrangement.cell_id and rearrangement.duplicate_count, no? Just making sure I am not confused 8-)
Upps... changed it in the post above.
Yes, we had these discussions and I think we should continue it to see whether we should make MiAIRR a pure metadata (i.e., not analysis results) standard.
Another important question is around its minimalism. Minimalism encourages use but reduces usefulness 8-)
I am wondering if it might be useful to have an option to NOT report extra fields as a problem?
Warning: Object has non-AIRR field that cannot be validated (c_sequence_start). Warning: Object has non-AIRR field that cannot be validated (c_sequence_end). Warning: Object has non-AIRR field that cannot be validated (is_cell).
This has the potential to be quite noisy and have actual problems get lost in that noise. If you want a MiAIRR validation, having a bunch of warnings about fields that don't break MiAIRR validation but are still reported on maybe isn't what we want?
We do want the user to be able to get these warning, but maybe they should be turned off when the user asks for a specific level of validation?
The function has a nonairr
parameter to turn off those warning.
As
required
is a property of a field itself, it does not make any statements about whether the field MUST/SHOULD/MAY contain meaningful information, garbage orNULL
. Independent of this, MiAIRR as a metadata standard -- of whichx-airr.miairr
can be considered to be the formalized version -- defines the following rules:
- Fields that are labeled as
miairr:essential
ormiairr:important
MUST be present- Fields labeled as
miairr:essential
MUST NOT contain NULL-LIKE data (i.e., garbage)- If metadata is provided that semantically matches the definition of ANY
miairr
field (i.e., including:defined
), it MUST be encoded in this field.For our discussion here, we can ignore (2) + (3), but I would like to mention that it would be nice to have a "NOT NULL-LIKE" validation for
:essential
fields in the future.What follows from (1) is that
miairr:(essential|important)
fields -- notmiairr
in general -- should be a subset of therequired
fields in the Schema. Fields that aremiairr:defined
may berequired
but probably we should avoid this for clarity, especially as MiAIRR does not defineidentifiers
.
Ok, so the combination of required
and nullable
gets us most of the way. I don't think we have a robust way to check for garbage NULL-LIKE values though...
I don't think we have a robust way to check for garbage NULL-LIKE values though...
No, but we could at least check for the keywords defined in the Docs, i.e., not_applicable
, not_collected
and missing
.
The function has a
nonairr
parameter to turn off those warning.
We could also do a typical level
argument: debug
, info
, warning
, error
in Python; message
, warning
, stop
in R. Ie, level="warning"
reports warnings and errors/stops. I don't know if we need info/message, but we could use it in place of debug
and then have the same levels across R and python.
I was playing a bit with the
RearrangmentSchema
validation and for an example chain, theRearrangementSchema.validate_object
fails with "Identifier field repertoire_id is missing". The repertoire_id is indeed missing, but according to the website, this is an optional field, and I would expect it to validate just fine. Am I missing something?The example chain is taken from an example dataset from the 10x website.