airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

RearrangementSchema.validate_object: Identifier field repertoire_id is missing #508

Closed grst closed 3 years ago

grst commented 3 years ago

I was playing a bit with the RearrangmentSchema validation and for an example chain, the RearrangementSchema.validate_object fails with "Identifier field repertoire_id is missing". The repertoire_id is indeed missing, but according to the website, this is an optional field, and I would expect it to validate just fine. Am I missing something?

The example chain is taken from an example dataset from the 10x website.

from airr import RearrangementSchema
chain_obj = {
    "cell_id": "AAACCTGAGAGACGAA-1",
    "clone_id": "",
    "sequence_id": "AAACCTGAGAGACGAA-1_contig_2",
    "sequence": "AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCATTTGGTGATCAGGACTGAACAGAGAGAACTCACCATGGAGTTTGGGCTGAGCTGGCTTTTTCTTGTGGCTATTTTAAAAGGTGTCCAGTGTGAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGAGTCACCTTTAACAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGCTAACACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATTTGCAAATGAACAGCCTGAGAGCCGAAGACACGGCCGTATATTACTGTGCCGCTACCCTAGGGACCCGTAGAGATGGTTACAATTTTGCATACTGGGGCCAGGAAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG",
    "sequence_aa": "MEFGLSWLFLVAILKGVQCEVQLLESGGGLVQPGGSLRLSCAASGVTFNSYAMSWVRQAPGKGLEWVSAISGSGANTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAATLGTRRDGYNFAYWGQETLVTVSSGSASAPTLFPLVSCENSPSDTSSV",
    "productive": True,
    "rev_comp": False,
    "v_call": "IGHV3-23",
    "v_cigar": "79S347M148S",
    "d_call": "IGHD5-24",
    "d_cigar": "443S20M111S",
    "j_call": "IGHJ4",
    "j_cigar": "455S48M71S",
    "c_call": "IGHM",
    "c_cigar": "503S71M",
    "sequence_alignment": "AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCATTTGGTGATCAGGACTGAACAGAGAGAACTCACCATGGAGTTTGGGCTGAGCTGGCTTTTTCTTGTGGCTATTTTAAAAGGTGTCCAGTGTGAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGAGTCACCTTTAACAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGCTAACACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATTTGCAAATGAACAGCCTGAGAGCCGAAGACACGGCCGTATATTACTGTGCCGCTACCCTAGGGACCCGTAGAGATGGTTACAATTTTGCATACTGGGGCCAGGAAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG",
    "germline_alignment": "AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCATTTGGTGATCAGGACTGAACAGAGAGAACTCACCATGGAGTTTGGGCTGAGCTGGCTTTTTCTTGTGGCTATTTTAAAAGGTGTCCAGTGTGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGAGTAGAGATGGCTACAATTACACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTGGCCGTTGGCTGCCTCGCACAGGACTTCCTTCCCGACTCCATCACTTTCTCCTGGAAATACAAGAACAACTCTGACATCAGCAGCACCCGGGGCTTCCCATCAGTCCTGAGAGGGGGCAAGTACGCAGCCACCTCACAGGTGCTGCTGCCTTCCAAGGACGTCATGCAGGGCACAGACGAACACGTGGTGTGCAAAGTCCAGCACCCCAACGGCAACAAAGAAAAGAACGTGCCTCTTCCAGTGATTGCTGAGCTGCCTCCCAAAGTGAGCGTCTTCGTCCCACCCCGCGACGGCTTCTTCGGCAACCCCCGCAAGTCCAAGCTCATCTGCCAGGCCACGGGTTTCAGTCCCCGGCAGATTCAGGTGTCCTGGCTGCGCGAGGGGAAGCAGGTGGGGTCTGGCGTCACCACGGACCAGGTGCAGGCTGAGGCCAAAGAGTCTGGGCCCACGACCTACAAGGTGACCAGCACACTGACCATCAAAGAGAGCGACTGGCTCGGCCAGAGCATGTTCACCTGCCGCGTGGATCACAGGGGCCTGACCTTCCAGCAGAATGCGTCCTCCATGTGTGTCCCCGATCAAGACACAGCCATCCGGGTCTTCGCCATCCCCCCATCCTTTGCCAGCATCTTCCTCACCAAGTCCACCAAGTTGACCTGCCTGGTCACAGACCTGACCACCTATGACAGCGTGACCATCTCCTGGACCCGCCAGAATGGCGAAGCTGTGAAAACCCACACCAACATCTCCGAGAGCCACCCCAATGCCACTTTCAGCGCCGTGGGTGAGGCCAGCATCTGCGAGGATGACTGGAATTCCGGGGAGAGGTTCACGTGCACCGTGACCCACACAGACCTGCCCTCGCCACTGAAGCAGACCATCTCCCGGCCCAAGGGGGTGGCCCTGCACAGGCCCGATGTCTACTTGCTGCCACCAGCCCGGGAGCAGCTGAACCTGCGGGAGTCGGCCACCATCACGTGCCTGGTGACGGGCTTCTCTCCCGCGGACGTCTTCGTGCAGTGGATGCAGAGGGGGCAGCCCTTGTCCCCGGAGAAGTATGTGACCAGCGCCCCAATGCCTGAGCCCCAGGCCCCAGGCCGGTACTTCGCCCACAGCATCCTGACCGTGTCCGAAGAGGAATGGAACACGGGGGAGACCTACACCTGCGTGGTGGCCCATGAGGCCCTGCCCAACAGGGTCACCGAGAGGACCGTGGACAAGTCCACCGGTAAACCCACCCTGTACAACGTGTCCCTGGTCATGTCCGACACAGCTGGCACCTGCTAC",
    "junction": "TGTGCCGCTACCCTAGGGACCCGTAGAGATGGTTACAATTTTGCATACTGG",
    "junction_aa": "CAATLGTRRDGYNFAYW",
    "junction_length": 51,
    "junction_aa_length": 17,
    "v_sequence_start": 79,
    "v_sequence_end": 426,
    "d_sequence_start": 443,
    "d_sequence_end": 463,
    "j_sequence_start": 455,
    "j_sequence_end": 503,
    "c_sequence_start": None,
    "c_sequence_end": "574",
    "consensus_count": 2201,
    "duplicate_count": 28,
    "is_cell": "T",
}
RearrangementSchema.validate_object(chain_obj)
Warning: Object has non-AIRR field that cannot be validated (c_sequence_start).
Warning: Object has non-AIRR field that cannot be validated (c_sequence_end).
Warning: Object has non-AIRR field that cannot be validated (is_cell).

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-2-2a62f77a45f4> in <module>
     34     "is_cell": "T",
     35 }
---> 36 RearrangementSchema.validate_object(chain_obj)

~/anaconda3/envs/sctcrpy2/lib/python3.8/site-packages/airr/schema.py in validate_object(self, obj, missing, nonairr, context)
    343             if xairr and xairr.get('identifier'):
    344                 if is_missing_key:
--> 345                     raise ValidationError('Identifier field %s is missing' %(full_field))
    346 
    347             # check nullable requirements

ValidationError: Identifier field repertoire_id is missing
franasa commented 3 years ago

This happens because repertoire_id is an identifier, and from #391 missing identifier properties raises a ValidationError. Though it's not totally clear to me whether optional fields that are identifiers should be then considered as required. This probably needs to be clarified in the documentation.

javh commented 3 years ago

Identifiers are not inherently required, though they are often listed in required. We tried to simplify the rules in v1.3 so that the only determinant of whether a field is required or not, in the base schema, is what is listed in required.

Hrm. Maybe we should issue some sort of warning for optional fields that are mandatory in certain contexts, but not required by the base schema? Ie, miairr: essential and identifier: true. Validation shouldn't fail if a non-required field is missing, but maybe it makes sense to notify the user they are missing an AIRR Data Model linker (identifier) or MiAIRR field ('miairr: essential`).

franasa commented 3 years ago

ok, so it'd be more a fine-tuning of the #391 validation issue, right? just to confirm the actions would be: 1) remove ValidationErrors from non-required identifiers, and
2) replace them with ValidationWarnings

If it's fine, I think we can add this to the R-Python validation sync. pinning @imkeller

javh commented 3 years ago

Oh, man. Conflicting information. :)

This makes sense to me. @schristley, what do you think? Could we add an argument to control the validation level? Eg, base level throws an error only for missing required and warnings for miairr: essential and identifier: true, whereas the full AIRR Data model validation level throws and error for everything.

schristley commented 3 years ago

Could we add an argument to control the validation level? Eg, base level throws an error only for missing required and warnings for miairr: essential and identifier: true, whereas the full AIRR Data model validation level throws and error for everything.

Yes, I think we need a flag/parameter to indicate the level of validation.

franasa commented 3 years ago

It seems that this bug only affects repertoire_id and sample_processing_id for which the validation fails. Would it be enough to add the exception for those identifier fields that are nullable and not listed as required (see 9033340d13f), or should we go for different levels of validation with a flag? (in that case, #514 could still be a start (: )

the stderr for the example above would be:

Warning: Object has non-AIRR field that cannot be validated (c_sequence_start).
Warning: Object has non-AIRR field that cannot be validated (c_sequence_end).
Warning: Object has non-AIRR field that cannot be validated (is_cell).
Warning: Nullable identifier field repertoire_id is missing.
Warning: Nullable identifier field sample_processing_id is missing.
Warning: Nullable identifier field data_processing_id is missing.

but the validation fails if MiAIRR _id fields are not provided:

ValidationError: MiAIRR field cell_id is missing
schristley commented 3 years ago

All identifier: true fields should be of the same class, that is, they are all identifiers for linking objects in the AIRR Data Model. I think they should all be treated the same, and not create exceptions for some versus the others. We were actually doing that before by trying to use a combination of flags, but converted to identifier: true so we could treat them all the same. I think a simple validation flag to include/exclude these identifiers should be sufficient.

javh commented 3 years ago

The flag for validation level is growing on me. Missing MiAIRR fields shouldn't result in an error for the default usage. I think we have 3 use cases:

  1. Schema compliance - check required only. Default behavior.
  2. MiAIRR compliance - check required plus miairr.
  3. AIRR Data Model support - check required plus identifier.

Should the AIRR Data Model be all three instead? (required + identifier + miairr). Technically, we could not check required for MiAIRR compliance, but I think that sort of misses the point of validating files, which I think should conform to the schemas.

schristley commented 3 years ago

In the ADC API, we defined three levels for returning fields. So this is a little confusing in that the miairr level suggests a set of fields that do not contain all of the required. I guess the thought was that users might want the MiAIRR fields as a minimum, but not the additional required fields for analysis. The airr-core level would correspond to your 3).

Is it useful to check a file for MiAIRR compliance, even though it doesn't have all of the required fields for analysis?

javh commented 3 years ago

Is it useful to check a file for MiAIRR compliance, even though it doesn't have all of the required fields for analysis?

I don't think so. In the abstract, I suppose. But, if a required field is missing, then any tool that ingests the file would have to know that the file is a "special" non-compliant file. Not sure what value there is validating that a broken file is MiAIRR compliant.

I'm thinking of required as more of a storage/transfer/interoperability requirement, and not as an analysis requirement.

schristley commented 3 years ago

I don't think so. In the abstract, I suppose. But, if a required field is missing, then any tool that ingests the file would have to know the file is a "special" non-compliant file. Not sure what value there is validating that a broken file is MiAIRR compliant.

I'm thinking of required as more of a storage/transfer/interoperability requirement, and not as an analysis requirement.

Ok, yeah, I'm thinking the same way. The only reason to have it is for "completeness". I think we intended that all MiAIRR fields would be required so it would be a strict subset, but I think that's only true for repertoire metadata, not rearrangements.

I would be tempted to roll your 1) and 2) together, or leave out 2). Technically 2) is not MiAIRR compliance, it's MiAIRR plus other stuff, which might be confusing for users as it would complain about fields which aren't in MiAIRR.

schristley commented 3 years ago

I think we intended that all MiAIRR fields would be required so it would be a strict subset, but I think that's only true for repertoire metadata, not rearrangements.

I think there was some discussion about revisiting MiAIRR to reconcile the differences for rearrangements. Does that sound right to you @bussec ? Maybe the current differences intend to be resolved in the future...

bcorrie commented 3 years ago

I am wondering if it might be useful to have an option to NOT report extra fields as a problem?

Warning: Object has non-AIRR field that cannot be validated (c_sequence_start). Warning: Object has non-AIRR field that cannot be validated (c_sequence_end). Warning: Object has non-AIRR field that cannot be validated (is_cell).

This has the potential to be quite noisy and have actual problems get lost in that noise. If you want a MiAIRR validation, having a bunch of warnings about fields that don't break MiAIRR validation but are still reported on maybe isn't what we want?

We do want the user to be able to get these warning, but maybe they should be turned off when the user asks for a specific level of validation?

bussec commented 3 years ago

As required is a property of a field itself, it does not make any statements about whether the field MUST/SHOULD/MAY contain meaningful information, garbage or NULL. Independent of this, MiAIRR as a metadata standard -- of which x-airr.miairr can be considered to be the formalized version -- defines the following rules:

  1. Fields that are labeled as miairr:essential or miairr:important MUST be present
  2. Fields labeled as miairr:essential MUST NOT contain NULL-LIKE data (i.e., garbage)
  3. If metadata is provided that semantically matches the definition of ANY miairr field (i.e., including :defined), it MUST be encoded in this field.

For our discussion here, we can ignore (2) + (3), but I would like to mention that it would be nice to have a "NOT NULL-LIKE" validation for :essential fields in the future.

What follows from (1) is that miairr:(essential|important) fields -- not miairr in general -- should be a subset of the required fields in the Schema. Fields that are miairr:defined may be required but probably we should avoid this for clarity, especially as MiAIRR does not define identifiers.

Looking at the current schema, there are a number of sections that are at variance with the principles stated above:

  1. rearrangement.cell_id is miairr:important but not required. It is an identifier in addition.
  2. rearrangement.duplicate_count is miairr:important but not required.
  3. Several properties of Cell are miairr:defined and required at the same time

IMO we can resolve all of these in the Schema, i.e., change their MiAIRR status if this avoids yet-another-flag for validation. @schristley: Yes, we had these discussions and I think we should continue it to see whether we should make MiAIRR a pure metadata (i.e., not analysis results) standard.

bcorrie commented 3 years ago

Looking at the current schema, there are a number of sections that are at variance with the principles stated above:

  1. repertoire.cell_id is miairr:important but not required. It is an identifier in addition.
  2. repertoire.duplicate_count is miairr:important but not required.
  3. Several properties of Cell are miairr:defined and required at the same time

These are rearrangement.cell_id and rearrangement.duplicate_count, no? Just making sure I am not confused 8-)

bussec commented 3 years ago

These are rearrangement.cell_id and rearrangement.duplicate_count, no? Just making sure I am not confused 8-)

Upps... changed it in the post above.

bcorrie commented 3 years ago

Yes, we had these discussions and I think we should continue it to see whether we should make MiAIRR a pure metadata (i.e., not analysis results) standard.

Another important question is around its minimalism. Minimalism encourages use but reduces usefulness 8-)

schristley commented 3 years ago

I am wondering if it might be useful to have an option to NOT report extra fields as a problem?

Warning: Object has non-AIRR field that cannot be validated (c_sequence_start). Warning: Object has non-AIRR field that cannot be validated (c_sequence_end). Warning: Object has non-AIRR field that cannot be validated (is_cell).

This has the potential to be quite noisy and have actual problems get lost in that noise. If you want a MiAIRR validation, having a bunch of warnings about fields that don't break MiAIRR validation but are still reported on maybe isn't what we want?

We do want the user to be able to get these warning, but maybe they should be turned off when the user asks for a specific level of validation?

The function has a nonairr parameter to turn off those warning.

schristley commented 3 years ago

As required is a property of a field itself, it does not make any statements about whether the field MUST/SHOULD/MAY contain meaningful information, garbage or NULL. Independent of this, MiAIRR as a metadata standard -- of which x-airr.miairr can be considered to be the formalized version -- defines the following rules:

  1. Fields that are labeled as miairr:essential or miairr:important MUST be present
  2. Fields labeled as miairr:essential MUST NOT contain NULL-LIKE data (i.e., garbage)
  3. If metadata is provided that semantically matches the definition of ANY miairr field (i.e., including :defined), it MUST be encoded in this field.

For our discussion here, we can ignore (2) + (3), but I would like to mention that it would be nice to have a "NOT NULL-LIKE" validation for :essential fields in the future.

What follows from (1) is that miairr:(essential|important) fields -- not miairr in general -- should be a subset of the required fields in the Schema. Fields that are miairr:defined may be required but probably we should avoid this for clarity, especially as MiAIRR does not define identifiers.

Ok, so the combination of required and nullable gets us most of the way. I don't think we have a robust way to check for garbage NULL-LIKE values though...

bussec commented 3 years ago

I don't think we have a robust way to check for garbage NULL-LIKE values though...

No, but we could at least check for the keywords defined in the Docs, i.e., not_applicable, not_collected and missing.

javh commented 3 years ago

The function has a nonairr parameter to turn off those warning.

We could also do a typical level argument: debug, info, warning, error in Python; message, warning, stop in R. Ie, level="warning" reports warnings and errors/stops. I don't know if we need info/message, but we could use it in place of debug and then have the same levels across R and python.