Closed bussec closed 4 years ago
Also:
x-airr: required
and the top-level required
attributed of each object? Is there one?false
if unspecified?It feels to me like these should be resolved for v1.3??? Added the tag...
In particular, top level required versus x-airr: required needs to sorted, as there are inconsistencies.
locus is x-airr: required but is not in the YAML required fields
Python validation checks YAML required fields but not x-airr: required.
Rearrangement is confusing... There is a long list of top level required fields, but many have no x-airr attribute. Is there any reason why we don't have:
required:
- sequence_id
Imply it is an error NOT to have:
sequence_id:
type: string
x-airr:
required: true
Or maybe get rid of one of them so we don't have both???
It looks to me like the base required
attribute is only on Alignment and Rearrangement, so it seems to me like this could be removed and we rely on x-airr
attributes...
I think I would recommend if we need another required
flag to mean something else, then it should be added to the x-airr
attributes like we did for adc-api-optional
What's the distinction between x-airr: required and the top-level required attributed of each object? Is there one?
The primary distinction is that the top-level required
is an integral part of the OpenAPI spec, so software/middleware will interpret and act upon that setting (for example, by disallowing data to be transferred/saved/etc because it is missing a required
field). While x-airr
is an extension object and needs to be manually interpreted.
The top-level required
only exists for rearrangements (and maybe alignment too) to indicate requirement for the AIRR TSV format (and correspondingly the minimum set of annotations fields required for downstream analysis). The x-airr: required
indicates that the field is either required for MiAIRR and/or required for ADC API.
There was an old discussion about "what is required for a rearrangement AIRR TSV file" is likely different from "what is required for a rearrangement according to the ADC API". We didn't know at the time what the ADC API really required, but now we have a much better idea. The ADC API has mostly followed MiAIRR except for a few fields that are non-MiAIRR.
So, x-airr: required
is entirely redundant with x-airr: miarr
+ x-airr: adc-api-optional
? Is that correct?
So, x-airr: required is entirely redundant with x-airr: miarr + x-airr: adc-api-optional? Is that correct?
No, x-airr: adc-api-optional
relates to the input query, x-airr: required
relates to the output data. ADC API uses x-airr: required
for #213 so users can choose for output data to be "valid" (i.e. it would pass AIRR validation with all the required fields) or not (because they want only a specific set of fields).
Ah, okay. Thanks, @schristley. So would renaming x-airr: required
to something like x-airr: adc-api-required
be appropriate?
(This would easily be confused with x-airr: adc-api-optional
, but I was already confused as is...)
Ah, okay. Thanks, @schristley. So would renaming
x-airr: required
to something likex-airr: adc-api-required
be appropriate?
I think x-airr: required
means it is required for both MiAIRR and the ADC API response. So this is playing a dual role at the moment so we don't want to change the name to be API specific.
(This would easily be confused with
x-airr: adc-api-optional
, but I was already confused as is...)
Perhaps we could change this attribute to x-airr: api-query-required
as I think that is the more precise meaning, correct @schristley
The top-level
required
only exists for rearrangements (and maybe alignment too) to indicate requirement for the AIRR TSV format (and correspondingly the minimum set of annotations fields required for downstream analysis). Thex-airr: required
indicates that the field is either required for MiAIRR and/or required for ADC API.
@schristley is there an example where the top-level required
(for AIRR TSV) would not be the same as what the ADC API response would be? It makes sense to me that the AIRR TSV and the ADC API rearrangement response should have the same required fields.
It also seems like this would almost certainly mean that what is required for MiAIRR as a minimal standard would be different than what the AIRR TSV and ADC API response require...
So it seems like the mapping that makes the most sense is:
x-airr: required
means required for MiAIRR. In this case this field doesn't have any direct meaning for the ADC API.required
means required for AIRR TSV and ADC API response. This should almost certainly include the x-airr: required
fields since if they are MiAIRR they should be in the AIRR TSV and ADC API response. If this was to exist as an x-airr extension or if the API response needed to be different than the AIRR TSV file fields, I would suggest it could be named x-airr: api-response-required
x-airr: adc-api-optional
means not required for the ADC API query. I suggest above perhaps renaming this to x-airr: api-query-required
and having the meaning change to it is required that it be supported in the ADC API queries. Note that this means that it be supported, not that it be present in all queries (correct @schristley )In all cases, if the field is missing, False is assumed...
In all cases, if the field is missing, False is assumed...
I'm very much in favor of this. This begs the question of whether nullable
should also be renamed to not-nullable
, making nullability the default assumption.
Another question... What's the value in having x-airr: miairr
in addition to x-airr: required
(if using the definition @bcorrie describes above)? Ie, why does it matter if a field is a MiAIRR defined data element if it's not required by MiAIRR? If the field is in the schema somewhere, does it matter if it's part of MiAIRR sets 1-5? (This distinction seems to completely collapse in set 6 and beyond.)
Could we just have a single x-airr: minimal
tag instead?
Edit:
not-nullable
?null
?x-airr: miairr
seems redundant with x-airr: set
. If x-airr: set
defaults to null
.@schristley is there an example where the top-level required (for AIRR TSV) would not be the same as what the ADC API response would be? It makes sense to me that the AIRR TSV and the ADC API rearrangement response should have the same required fields.
Yes, fields like repertoire_id
and rearrangement_id
, which were added as part of the API development, are not required by the AIRR TSV. It's worth considering if we can consolidate on a single required
for all scenarios.
I'm very much in favor of this. This begs the question of whether nullable should also be renamed to not-nullable, making nullability the default assumption.
OpenAPI V3 has a specific nullable
attribute which is missing from V2. This is why we added x-airr: nullable
. When we move to V3, we can get rid of x-airr: nullable
, so to make that transition smooth, I think we should keep the same semantics as the OpenAPI V3 nullable
.
I think we should keep the same semantics as the OpenAPI V3 nullable.
Seems wise. Is there a mechanism to set defaults on an object-wide basis for nullable
? It'd be nice to be able to set nullable: true
at the top level, and then only have to define nullable: false
for those fields that aren't nullable, rather than having to add nullable
to every field in the schema.
Is there a mechanism to set defaults on an object-wide basis for nullable?
Not that I'm aware of. The default value in V3 is nullable: false
.
How about a variant of what @bcorrie proposed above?
x-airr: minimal
: MiAIRR required field.required
(top level): The field must be present in any portable flat file (JSON/TSV).x-airr: api-response-required
: Field must be returned from an ADC API query.x-airr: api-query-required
: Field must be specified in an ADC API query.x-airr: nullable
: functionally equivalent to OpenAPI V3 nullable
attribute, which will replace it when moving to OpenAPI V3.With rules:
x-airr: minimal
: Defined as required by MiniStd.required
(top level): Defined as required by DataRep. Technically independent from x-airr: minimal
, but in practice required
will be a superset of x-airr: minimal
unless there is an unusual edge case.x-airr: api-response-required
: Defined as required by CommonRepo. Technically independent from x-airr: minimal
and required
, but in practice x-airr: api-response-required
will be a superset of required
(and hence x-airr: minimal
) unless there is an unusual edge case.x-airr: api-query-required
: Defined by CommonRepo.Not sure that I like api-
as prefix as that also implies the flat file parsers in the python/R libraries. Maybe rest-
?
I don't quite understand what that complexity gains us (and admittedly trying to avoid writing lots of complicated validation code). Here are issues that I see:
required
(top level): Because this is the OpenAPI keyword, these fields will be required under all circumstances, not just in files. Other codes (which we don't have control over) that read OpenAPI specs may apply that rule. So even if the ADC API or MiAIRR doesn't consider it required, it must be included.
I'm not sure how x-airr: api-response-required
helps. The point of the ADC API having a parameter to return required fields is so that output will pass AIRR validation, but if the validation routines only checks required
(top level) then it's missing the fields like repertoire_id
for linking to metadata.
We will need to modify all validate routines (also airr-tools command) to require a context for validation. Are you validating for a file, ADC API, MiAIRR, or something new down the road.
My preference is something simple like what @bussec documents in #319
required
(top level): Eliminate completely.x-airr: miairr
: MiAIRR field.x-airr: required
: AIRR required field, superset of MiAIRR. It can be AIRR required for numerous reasons: because it is a MiAIRR field, because it is needed for analysis, because it is needed to link objects (rearrangement -> repertoire) in the AIRR Data Model, because it is a DOI in the AIRR Data Commons, etc.x-airr: adc-api-optional
: There are no fields that "must be specified in an ADC API query". This flag indicates if the ADC API service must allow queries on that field. It has optional
in the name as the default is false
, meaning that no, the field is not optional, and the ADC API service must allow queries on the field. Having required
in the name means we have to set it to true
for pretty much every field in all the schemas.x-airr: nullable
: As is.I'm all for simplifying as much as possible. Some questions:
x-airr: miairr
? What's it used for? Can we just eliminate it and rely on context/set?x-airr
attribute specifying the field as required only in the context of the ADC API? Ie, how do we handle repertoire_id
?required
) or go with x-airr: required
. I agree that both of them shouldn't exist, ideally. I don't know that I have a preference - there's a tradeoff between following a standard and flexibility.Why do we need x-airr: miairr? What's it used for? Can we just eliminate it and rely on context/set?
To generate the MiAIRR table in the documentation, it's also reasonable that some tools might want to know which are MiAIRR fields, e.g. a data submission tool to NCBI.
How do we distinguish fields that are required by the ADC API but not in the TSV? Don't we need an x-airr attribute specifying the field as required only in the context of the ADC API? Ie, how do we handle repertoire_id?
My suggestion is to eliminate this distinction. Many of the required fields for the TSV were set because they were critical for downstream analysis. At the time, we only had rearrangements, so a simple sequence_id
was sufficient, but now that we've added other schema objects (repertoire metadata, clones, etc), I think we should consider the identifiers that link these objects together to be required as well. With the AIRR Data Commons, this has broadened the analysis capabilities that can be done with AIRR-seq data beyond just rearrangements.
I think the additional required fields we are considering for the ADC API are all identifiers. So you could make the argument it isn't really the ADC API that is requiring them, but the enhancements to the AIRR Data Model. Users may expect, for example, if they are given a rearrangement TSV file and a repertoire metadata file, that both files will contain identifiers needed to link the data.
Why do we need x-airr: miairr? What's it used for? Can we just eliminate it and rely on context/set?
set
by itself would only be a reliable indicator if everything that is assigned a set
is considered to be required by MiAIRR. However, if we consider MiAIRR not strictly as a minimal standard but as a slightly broader metadata core standard (which can include optional fields), then there will be fields that are assigned to a set
but are not required.
Hrm. I'm still a bit confused by x-airr: miairr
. Do we want this to mean "required by MiAIRR" or "defined by MiAIRR"? If the former, I'm struggling to see the difference from x-airr: required
. If the later, then it seems like it's just a convenience annotation, like set
/name
, that doesn't really do anything (which is fine, but is tangential to whether something is required).
Maybe we can discuss it on the call? Something's just not clicking for me.
@schristley:
I think the additional required fields we are considering for the ADC API are all identifiers. So you could make the argument it isn't really the ADC API that is requiring them, but the enhancements to the AIRR Data Model. Users may expect, for example, if they are given a rearrangement TSV file and a repertoire metadata file, that both files will contain identifiers needed to link the data.
I don't think we're there yet. In the TSV, the required fields may be nullable, but them being required is a very strong suggestion to populate them. They should be populated, but to facilitate adoption we decided to be more flexible. Adoption of the metadata schemas is going to be a long haul. I think the majority of use cases will still be a Rearrangement TSV without a Repertoire metadata file.
I see where you're coming from, but I don't think it is currently practical to expect tools like IgBLAST, IMGT/V-QUEST, MiXCR, etc to add empty metadata linker fields (or build metadata support into their tool). I suspect we may hit the same issue with the Clones/Trees schema.
Hrm. I'm still a bit confused by x-airr: miairr. Do we want this to mean "required by MiAIRR" or "defined by MiAIRR"? If the former, I'm struggling to see the difference from x-airr: required. If the later, then it seems like it's just a convenience annotation, like set/name, that doesn't really do anything (which is fine, but is tangential to whether something is required).
I understand it as the latter, i.e. these are the data fields that, at a minimum, should be reported when publishing a study.
I see where you're coming from, but I don't think it is currently practical to expect tools like IgBLAST, IMGT/V-QUEST, MiXCR, etc to add empty metadata linker fields (or build metadata support into their tool). I suspect we may hit the same issue with the Clones/Trees schema.
I hear ya. It's like my struggles with duplicate_count
, which isn't a required field, and IgBlast doesn't know how to provide it anyways, but yet it's required for half of the downstream analyses we perform. I have to write a post-script hack to put it in. Gonna have the same issue with repertoire_id
.
So maybe the terminology of required
is the wrong way to think about this, because you are right, if the field is there but always null
then it's just as useless as if it wasn't there. Maybe we need a different terminology that indicates, these fields are important for the AIRR Data Model, so if you have data for these fields, you are strongly encouraged to leave them in the data so that information isn't lost, but they aren't required per se.
Yeah. Maybe we can have some sort of attribute for recommended, not required in all contexts, but required for metadata specification. Which would encompass MiAIRR non-required fields and ADC API required fields.
@javh on the meaing of x-airr: miairr
:
set
, as every MiAIRR-defined field is assigned to one of those. However, there is a difference between defining and requiring a field (i.e. there can be fields that are assigned to a set
but are still OPTIONAL).false
-> OPTIONAL and true
-> RECOMMENDED. The meaning of the field might be more obvious if we would use a controlled vocabulary (required
, recommended
, optional
). The only redundancy this would created is that nullable
MUST be false
for required
fields.Thanks, @bussec. From #319:
Requirement level | required |
nullable |
---|---|---|
REQUIRED | true |
false |
RECOMMENDED | true |
true |
OPTIONAL | false |
false |
So the current x-airr: miairr
is then the same as the top level required
? (Field must be present, but may be null.)
Why is OPTIONAL nullable=false
? It would generally be better to just exclude an OPTIONAL field, but if you have a table (TSV) then nullability would be important as not all records may have data for the OPTIONAL field.
It would generally be better to just exclude an OPTIONAL field, but if you have a table (TSV) then nullability would be important as not all records may have data for the OPTIONAL field.
Agreed, I just didn't think about the TSV situation. So we change this to true
?
How do we distinguish fields that are required by the ADC API but not in the TSV? Don't we need an x-airr attribute specifying the field as required only in the context of the ADC API? Ie, how do we handle repertoire_id?
My suggestion is to eliminate this distinction. Many of the required fields for the TSV were set because they were critical for downstream analysis.
@schristley that is the same question I raised in https://github.com/airr-community/airr-standards/issues/297#issuecomment-579873992. You pointed out the repertoire_id
and rearrangement_id
were added for the API, but I agree that it feels like these things could and maybe even should be in a file...
- I'm not sure how
x-airr: api-response-required
helps. The point of the ADC API having a parameter to return required fields is so that output will pass AIRR validation, but if the validation routines only checksrequired
(top level) then it's missing the fields likerepertoire_id
for linking to metadata.
I agree as long as AIRR TSV required is the same as ADC API response required. If not then I think we need it.
x-airr: adc-api-optional
: There are no fields that "must be specified in an ADC API query". This flag indicates if the ADC API service must allow queries on that field. It hasoptional
in the name as the default isfalse
, meaning that no, the field is not optional, and the ADC API service must allow queries on the field. Havingrequired
in the name means we have to set it totrue
for pretty much every field in all the schemas.
I think I would prefer to have the phrase "query" in this attribute, as this applies only to ADC Web API queries. @javh also mentioned avoiding API generally as there are python and R application APIs. Is it crazy to use adc-web-api-query-optional
? I am not sure we really need the ADC so this could be web-api-query-optional
? Or if we accept that people know that web queries are done through an API we could go to web-query-optional
BTW I would prefer to avoid REST because in its truest form I am not sure our API is really RESTful.
WRT to the number of fields that the flag applies to, all of the fields in the Repertoire
schema query are adc-api-optional = false
and therefore this flag does not need to be set for those schema fields.
With that said, the number of fields in the Rearrangement
schema that need to be supported for queries is currently very small, and therefore many rearrangement fields (124 in fact) need to be set to adc-api-optional = true
.
One way or the other, most of the fields in one of either Rearrangement
or Repertoire
schemas have to have this field set...
Maybe we need a different terminology that indicates, these fields are important for the AIRR Data Model, so if you have data for these fields, you are strongly encouraged to leave them in the data so that information isn't lost, but they aren't required per se.
How about a tag of x-airr: pretty-please=true
We talked about this issue a lot on the DataRep call. My personal take away was that something like this might work:
Requirement levels
required
(top level): As is. The field must be present; single-object context (ie, required
in Rearrangement doesn't mandated required
in Clone).x-airr: linker
: Linker field (id, uuid) required by the full AIRR Data Model. Multi-object context (i.e., if you need repertoire_id
in Rearrangement, you will also need it in Clone).x-airr: adc-api-optional
: As is.x-airr: nullable
: As is.Annotations
x-airr: miairr
: Convenience annotation that labels the field as part of the MiAIRR sets.x-airr: miairr-required
: MiAIRR REQUIRED
(required=True, nullable=False)x-airr: miairr-recommended
: MiAIRR RECOMMENDED
(required=True, nullable=True)x-airr: miairr-optional
: MiAIRR OPTIONAL
(required=False, nullable=True). Probably doesn't need to exist, but could.Yay? Nay? (Not sure about the names.)
Do we have the x-airr: miairr-required
and others as actual attributes, or do we just document them based upon the required
and nullable
flags?
@javh So the miairr-*
annotations would be keys, i.e. there would have a boolean value assigned? Or is there something like an "tag" (i.e. something that is either there or not) in YAML that I don't know about?
Do we have the x-airr: miairr-required and others as actual attributes, or do we just document them based upon the required and nullable flags?
@javh So the miairr-* annotations would be keys, i.e. there would have a boolean value assigned? Or is there something like an "tag" (i.e. something that is either there or not) in YAML that I don't know about?
When I wrote it down I thought they'd be booleans, but looking over it again I think it would make more sense as a single field with a controlled vocabulary. Eg, x-airr: miairr = "required"
. That seems such simpler:
miarr:
type: string
description: MiAIRR requirement level.
enum:
- required
- recommended
- optional
default: optional
I think I'm good with what @javh presents. I haven't thought of any issues with it.
Other names for x-airr: linker
... x-airr: identifier
, x-airr: model-key
, x-airr: relation
. I think x-airr: identifier
sounds the most explicit.
Looking through rearrangements, shouldn't duplicate_count
be required? I don't think there is any way to derive it from the other required fields.
@bcorrie How about we rename adc-api-optional
to adc-query-optional
?
Looking through rearrangements, shouldn't duplicate_count be required? I don't think there is any way to derive it from the other required fields.
I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.
x-airr: identifier
seems the most explicit to me as well. But, do we anticipate anything that will be required for the full AIRR Data model that won't be a linking key?
Looking through rearrangements, shouldn't duplicate_count be required? I don't think there is any way to derive it from the other required fields.
I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.
I mean, that would just be a default value of 1, then, right?
I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.
Ok, I see. I do it pre-alignment but I see your point.
Ok, I see. I do it pre-alignment but I see your point.
I usually do it twice. Once before alignment to reduce the number of sequences input into the alignment. Then once after alignment (propagating the previous duplicate_count) to catch duplicates that are "corrected" during alignment (sequencing error indels, mismatches in leader/c-region, differences in priming position, etc).
@bcorrie How about we rename
adc-api-optional
toadc-query-optional
?
Yes, I think that is reasonable... I think I prefer adc-query-required
to match the other _required
attributes but it is a six of one, half dozen of the other situation so I am OK this way...
Yes, I think that is reasonable... I think I prefer adc-query-required to match the other _required attributes but it is a six of one, half dozen of the other situation so I am OK this way...
I think pretty much every field in the schema is going to have an x-airr
object anyways... So if this is our one change to it, might as well make it good :-D I just don't want the naive interpretation to be "I'm required to query with this field" versus the reality of "the API service must support queries on this field". What about adc-query-support
or adc-query-support-required
? Or is that just making it longer without actually helping?
I am OK to make it longer if it helps... I think this helps... 8-)
Added ADC API V1 tag because #335 needs this to be resolved
MiniStd discussed this again in its last call:
essential
, important
and defined
.essential
fields to be present and non-NULL and all important
fields to be present. If there are no further concerns or comments, I will commit this along with the respective documentation via #319.The three requirement levels for individual fields will be
essential
,important
anddefined
.
I updated the levels in #342.
Thanks @javh, I forgot about that PR. Is the plan to merge it into master
soonish and do the documentation stuff via #319?
Currently, all that's in #342 is changing what's defined in Attributes
. So we could merge that in now, which might be the easiest thing to do in terms of avoiding merge conflicts, but it'll leave the schema in master temporary incorrect (Attributes
won't match x-airr
tags on fields).
I say we leave the doc updates in #319.
We can update all the x-airr
tags in #342 or do all that in a separate PR if we want to break the task into chunks. Either way.
I don't like the idea of having invalid stuff on master
. Do we need new validation routines or just update all existing tags?
Given the recent discussion in #213, #272, #282 and other issues/PRs, would it be helpful if we would formalize the relations between the following
x-airr
fields?miairr
required
nullable
adc-api-optional
Where "formalizing" means
if-then
rules regarding the state of the fieldsMUST
,MAY
orMUST NOT
be included in the schemaThere is already some information in the docs and there is also a table from the ADC API perspective, but IMO they don't address all those point. If I missing something, please let me know.