airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Formalize relations between x-airr fields #297

Closed bussec closed 4 years ago

bussec commented 4 years ago

Given the recent discussion in #213, #272, #282 and other issues/PRs, would it be helpful if we would formalize the relations between the following x-airr fields?

Where "formalizing" means

  1. having a comprehensive set of if-then rules regarding the state of the fields
  2. mapping frequently encountered requirement groups to combination of states
  3. defining the default values and whether fields withs default values MUST, MAY or MUST NOT be included in the schema

There is already some information in the docs and there is also a table from the ADC API perspective, but IMO they don't address all those point. If I missing something, please let me know.

javh commented 4 years ago

Also:

  1. What's the distinction between x-airr: required and the top-level required attributed of each object? Is there one?
  2. Do these all default to false if unspecified?
bcorrie commented 4 years ago

It feels to me like these should be resolved for v1.3??? Added the tag...

bcorrie commented 4 years ago

In particular, top level required versus x-airr: required needs to sorted, as there are inconsistencies.

locus is x-airr: required but is not in the YAML required fields

Python validation checks YAML required fields but not x-airr: required.

bcorrie commented 4 years ago

Rearrangement is confusing... There is a long list of top level required fields, but many have no x-airr attribute. Is there any reason why we don't have:

    required:
        - sequence_id

Imply it is an error NOT to have:

        sequence_id:
            type: string
            x-airr:
                required: true
bcorrie commented 4 years ago

Or maybe get rid of one of them so we don't have both???

It looks to me like the base required attribute is only on Alignment and Rearrangement, so it seems to me like this could be removed and we rely on x-airr attributes...

I think I would recommend if we need another required flag to mean something else, then it should be added to the x-airr attributes like we did for adc-api-optional

schristley commented 4 years ago

What's the distinction between x-airr: required and the top-level required attributed of each object? Is there one?

The primary distinction is that the top-level required is an integral part of the OpenAPI spec, so software/middleware will interpret and act upon that setting (for example, by disallowing data to be transferred/saved/etc because it is missing a required field). While x-airr is an extension object and needs to be manually interpreted.

The top-level required only exists for rearrangements (and maybe alignment too) to indicate requirement for the AIRR TSV format (and correspondingly the minimum set of annotations fields required for downstream analysis). The x-airr: required indicates that the field is either required for MiAIRR and/or required for ADC API.

There was an old discussion about "what is required for a rearrangement AIRR TSV file" is likely different from "what is required for a rearrangement according to the ADC API". We didn't know at the time what the ADC API really required, but now we have a much better idea. The ADC API has mostly followed MiAIRR except for a few fields that are non-MiAIRR.

javh commented 4 years ago

So, x-airr: required is entirely redundant with x-airr: miarr + x-airr: adc-api-optional? Is that correct?

schristley commented 4 years ago

So, x-airr: required is entirely redundant with x-airr: miarr + x-airr: adc-api-optional? Is that correct?

No, x-airr: adc-api-optional relates to the input query, x-airr: required relates to the output data. ADC API uses x-airr: required for #213 so users can choose for output data to be "valid" (i.e. it would pass AIRR validation with all the required fields) or not (because they want only a specific set of fields).

javh commented 4 years ago

Ah, okay. Thanks, @schristley. So would renaming x-airr: required to something like x-airr: adc-api-required be appropriate?

(This would easily be confused with x-airr: adc-api-optional, but I was already confused as is...)

bcorrie commented 4 years ago

Ah, okay. Thanks, @schristley. So would renaming x-airr: required to something like x-airr: adc-api-required be appropriate?

I think x-airr: required means it is required for both MiAIRR and the ADC API response. So this is playing a dual role at the moment so we don't want to change the name to be API specific.

(This would easily be confused with x-airr: adc-api-optional, but I was already confused as is...)

Perhaps we could change this attribute to x-airr: api-query-required as I think that is the more precise meaning, correct @schristley

bcorrie commented 4 years ago

The top-level required only exists for rearrangements (and maybe alignment too) to indicate requirement for the AIRR TSV format (and correspondingly the minimum set of annotations fields required for downstream analysis). The x-airr: required indicates that the field is either required for MiAIRR and/or required for ADC API.

@schristley is there an example where the top-level required (for AIRR TSV) would not be the same as what the ADC API response would be? It makes sense to me that the AIRR TSV and the ADC API rearrangement response should have the same required fields.

It also seems like this would almost certainly mean that what is required for MiAIRR as a minimal standard would be different than what the AIRR TSV and ADC API response require...

So it seems like the mapping that makes the most sense is:

In all cases, if the field is missing, False is assumed...

javh commented 4 years ago

In all cases, if the field is missing, False is assumed...

I'm very much in favor of this. This begs the question of whether nullable should also be renamed to not-nullable, making nullability the default assumption.

Another question... What's the value in having x-airr: miairr in addition to x-airr: required (if using the definition @bcorrie describes above)? Ie, why does it matter if a field is a MiAIRR defined data element if it's not required by MiAIRR? If the field is in the schema somewhere, does it matter if it's part of MiAIRR sets 1-5? (This distinction seems to completely collapse in set 6 and beyond.)

Could we just have a single x-airr: minimal tag instead?

Edit:

schristley commented 4 years ago

@schristley is there an example where the top-level required (for AIRR TSV) would not be the same as what the ADC API response would be? It makes sense to me that the AIRR TSV and the ADC API rearrangement response should have the same required fields.

Yes, fields like repertoire_id and rearrangement_id, which were added as part of the API development, are not required by the AIRR TSV. It's worth considering if we can consolidate on a single required for all scenarios.

schristley commented 4 years ago

I'm very much in favor of this. This begs the question of whether nullable should also be renamed to not-nullable, making nullability the default assumption.

OpenAPI V3 has a specific nullable attribute which is missing from V2. This is why we added x-airr: nullable. When we move to V3, we can get rid of x-airr: nullable, so to make that transition smooth, I think we should keep the same semantics as the OpenAPI V3 nullable.

javh commented 4 years ago

I think we should keep the same semantics as the OpenAPI V3 nullable.

Seems wise. Is there a mechanism to set defaults on an object-wide basis for nullable? It'd be nice to be able to set nullable: true at the top level, and then only have to define nullable: false for those fields that aren't nullable, rather than having to add nullable to every field in the schema.

schristley commented 4 years ago

Is there a mechanism to set defaults on an object-wide basis for nullable?

Not that I'm aware of. The default value in V3 is nullable: false.

javh commented 4 years ago

How about a variant of what @bcorrie proposed above?

With rules:

Not sure that I like api- as prefix as that also implies the flat file parsers in the python/R libraries. Maybe rest-?

schristley commented 4 years ago

I don't quite understand what that complexity gains us (and admittedly trying to avoid writing lots of complicated validation code). Here are issues that I see:

My preference is something simple like what @bussec documents in #319

javh commented 4 years ago

I'm all for simplifying as much as possible. Some questions:

schristley commented 4 years ago

Why do we need x-airr: miairr? What's it used for? Can we just eliminate it and rely on context/set?

To generate the MiAIRR table in the documentation, it's also reasonable that some tools might want to know which are MiAIRR fields, e.g. a data submission tool to NCBI.

How do we distinguish fields that are required by the ADC API but not in the TSV? Don't we need an x-airr attribute specifying the field as required only in the context of the ADC API? Ie, how do we handle repertoire_id?

My suggestion is to eliminate this distinction. Many of the required fields for the TSV were set because they were critical for downstream analysis. At the time, we only had rearrangements, so a simple sequence_id was sufficient, but now that we've added other schema objects (repertoire metadata, clones, etc), I think we should consider the identifiers that link these objects together to be required as well. With the AIRR Data Commons, this has broadened the analysis capabilities that can be done with AIRR-seq data beyond just rearrangements.

I think the additional required fields we are considering for the ADC API are all identifiers. So you could make the argument it isn't really the ADC API that is requiring them, but the enhancements to the AIRR Data Model. Users may expect, for example, if they are given a rearrangement TSV file and a repertoire metadata file, that both files will contain identifiers needed to link the data.

bussec commented 4 years ago

Why do we need x-airr: miairr? What's it used for? Can we just eliminate it and rely on context/set?

set by itself would only be a reliable indicator if everything that is assigned a set is considered to be required by MiAIRR. However, if we consider MiAIRR not strictly as a minimal standard but as a slightly broader metadata core standard (which can include optional fields), then there will be fields that are assigned to a set but are not required.

javh commented 4 years ago

Hrm. I'm still a bit confused by x-airr: miairr. Do we want this to mean "required by MiAIRR" or "defined by MiAIRR"? If the former, I'm struggling to see the difference from x-airr: required. If the later, then it seems like it's just a convenience annotation, like set/name, that doesn't really do anything (which is fine, but is tangential to whether something is required).

Maybe we can discuss it on the call? Something's just not clicking for me.

@schristley:

I think the additional required fields we are considering for the ADC API are all identifiers. So you could make the argument it isn't really the ADC API that is requiring them, but the enhancements to the AIRR Data Model. Users may expect, for example, if they are given a rearrangement TSV file and a repertoire metadata file, that both files will contain identifiers needed to link the data.

I don't think we're there yet. In the TSV, the required fields may be nullable, but them being required is a very strong suggestion to populate them. They should be populated, but to facilitate adoption we decided to be more flexible. Adoption of the metadata schemas is going to be a long haul. I think the majority of use cases will still be a Rearrangement TSV without a Repertoire metadata file.

I see where you're coming from, but I don't think it is currently practical to expect tools like IgBLAST, IMGT/V-QUEST, MiXCR, etc to add empty metadata linker fields (or build metadata support into their tool). I suspect we may hit the same issue with the Clones/Trees schema.

schristley commented 4 years ago

Hrm. I'm still a bit confused by x-airr: miairr. Do we want this to mean "required by MiAIRR" or "defined by MiAIRR"? If the former, I'm struggling to see the difference from x-airr: required. If the later, then it seems like it's just a convenience annotation, like set/name, that doesn't really do anything (which is fine, but is tangential to whether something is required).

I understand it as the latter, i.e. these are the data fields that, at a minimum, should be reported when publishing a study.

schristley commented 4 years ago

I see where you're coming from, but I don't think it is currently practical to expect tools like IgBLAST, IMGT/V-QUEST, MiXCR, etc to add empty metadata linker fields (or build metadata support into their tool). I suspect we may hit the same issue with the Clones/Trees schema.

I hear ya. It's like my struggles with duplicate_count, which isn't a required field, and IgBlast doesn't know how to provide it anyways, but yet it's required for half of the downstream analyses we perform. I have to write a post-script hack to put it in. Gonna have the same issue with repertoire_id.

So maybe the terminology of required is the wrong way to think about this, because you are right, if the field is there but always null then it's just as useless as if it wasn't there. Maybe we need a different terminology that indicates, these fields are important for the AIRR Data Model, so if you have data for these fields, you are strongly encouraged to leave them in the data so that information isn't lost, but they aren't required per se.

javh commented 4 years ago

Yeah. Maybe we can have some sort of attribute for recommended, not required in all contexts, but required for metadata specification. Which would encompass MiAIRR non-required fields and ADC API required fields.

bussec commented 4 years ago

@javh on the meaing of x-airr: miairr:

javh commented 4 years ago

Thanks, @bussec. From #319:

Requirement level required nullable
REQUIRED true false
RECOMMENDED true true
OPTIONAL false false

So the current x-airr: miairr is then the same as the top level required? (Field must be present, but may be null.)

Why is OPTIONAL nullable=false? It would generally be better to just exclude an OPTIONAL field, but if you have a table (TSV) then nullability would be important as not all records may have data for the OPTIONAL field.

bussec commented 4 years ago

It would generally be better to just exclude an OPTIONAL field, but if you have a table (TSV) then nullability would be important as not all records may have data for the OPTIONAL field.

Agreed, I just didn't think about the TSV situation. So we change this to true?

bcorrie commented 4 years ago

How do we distinguish fields that are required by the ADC API but not in the TSV? Don't we need an x-airr attribute specifying the field as required only in the context of the ADC API? Ie, how do we handle repertoire_id?

My suggestion is to eliminate this distinction. Many of the required fields for the TSV were set because they were critical for downstream analysis.

@schristley that is the same question I raised in https://github.com/airr-community/airr-standards/issues/297#issuecomment-579873992. You pointed out the repertoire_id and rearrangement_id were added for the API, but I agree that it feels like these things could and maybe even should be in a file...

bcorrie commented 4 years ago
  • I'm not sure how x-airr: api-response-required helps. The point of the ADC API having a parameter to return required fields is so that output will pass AIRR validation, but if the validation routines only checks required (top level) then it's missing the fields like repertoire_id for linking to metadata.

I agree as long as AIRR TSV required is the same as ADC API response required. If not then I think we need it.

bcorrie commented 4 years ago
  • x-airr: adc-api-optional: There are no fields that "must be specified in an ADC API query". This flag indicates if the ADC API service must allow queries on that field. It has optional in the name as the default is false, meaning that no, the field is not optional, and the ADC API service must allow queries on the field. Having required in the name means we have to set it to true for pretty much every field in all the schemas.

I think I would prefer to have the phrase "query" in this attribute, as this applies only to ADC Web API queries. @javh also mentioned avoiding API generally as there are python and R application APIs. Is it crazy to use adc-web-api-query-optional? I am not sure we really need the ADC so this could be web-api-query-optional? Or if we accept that people know that web queries are done through an API we could go to web-query-optional

BTW I would prefer to avoid REST because in its truest form I am not sure our API is really RESTful.

WRT to the number of fields that the flag applies to, all of the fields in the Repertoire schema query are adc-api-optional = false and therefore this flag does not need to be set for those schema fields.

With that said, the number of fields in the Rearrangement schema that need to be supported for queries is currently very small, and therefore many rearrangement fields (124 in fact) need to be set to adc-api-optional = true.

One way or the other, most of the fields in one of either Rearrangement or Repertoire schemas have to have this field set...

bcorrie commented 4 years ago

Maybe we need a different terminology that indicates, these fields are important for the AIRR Data Model, so if you have data for these fields, you are strongly encouraged to leave them in the data so that information isn't lost, but they aren't required per se.

How about a tag of x-airr: pretty-please=true

javh commented 4 years ago

We talked about this issue a lot on the DataRep call. My personal take away was that something like this might work:

Requirement levels

Annotations

Yay? Nay? (Not sure about the names.)

schristley commented 4 years ago

Do we have the x-airr: miairr-required and others as actual attributes, or do we just document them based upon the required and nullable flags?

bussec commented 4 years ago

@javh So the miairr-* annotations would be keys, i.e. there would have a boolean value assigned? Or is there something like an "tag" (i.e. something that is either there or not) in YAML that I don't know about?

javh commented 4 years ago

Do we have the x-airr: miairr-required and others as actual attributes, or do we just document them based upon the required and nullable flags?

@javh So the miairr-* annotations would be keys, i.e. there would have a boolean value assigned? Or is there something like an "tag" (i.e. something that is either there or not) in YAML that I don't know about?

When I wrote it down I thought they'd be booleans, but looking over it again I think it would make more sense as a single field with a controlled vocabulary. Eg, x-airr: miairr = "required". That seems such simpler:

miarr:
    type: string
    description: MiAIRR requirement level.
    enum:
        - required
        - recommended
        - optional
    default: optional
schristley commented 4 years ago

I think I'm good with what @javh presents. I haven't thought of any issues with it.

Other names for x-airr: linker ... x-airr: identifier, x-airr: model-key, x-airr: relation. I think x-airr: identifier sounds the most explicit.

Looking through rearrangements, shouldn't duplicate_count be required? I don't think there is any way to derive it from the other required fields.

@bcorrie How about we rename adc-api-optional to adc-query-optional?

javh commented 4 years ago

Looking through rearrangements, shouldn't duplicate_count be required? I don't think there is any way to derive it from the other required fields.

I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.

x-airr: identifier seems the most explicit to me as well. But, do we anticipate anything that will be required for the full AIRR Data model that won't be a linking key?

scharch commented 4 years ago

Looking through rearrangements, shouldn't duplicate_count be required? I don't think there is any way to derive it from the other required fields.

I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.

I mean, that would just be a default value of 1, then, right?

schristley commented 4 years ago

I don't think so, just because it's a post-alignment processing step that isn't usually performed by alignment tools.

Ok, I see. I do it pre-alignment but I see your point.

javh commented 4 years ago

Ok, I see. I do it pre-alignment but I see your point.

I usually do it twice. Once before alignment to reduce the number of sequences input into the alignment. Then once after alignment (propagating the previous duplicate_count) to catch duplicates that are "corrected" during alignment (sequencing error indels, mismatches in leader/c-region, differences in priming position, etc).

bcorrie commented 4 years ago

@bcorrie How about we rename adc-api-optional to adc-query-optional?

Yes, I think that is reasonable... I think I prefer adc-query-required to match the other _required attributes but it is a six of one, half dozen of the other situation so I am OK this way...

schristley commented 4 years ago

Yes, I think that is reasonable... I think I prefer adc-query-required to match the other _required attributes but it is a six of one, half dozen of the other situation so I am OK this way...

I think pretty much every field in the schema is going to have an x-airr object anyways... So if this is our one change to it, might as well make it good :-D I just don't want the naive interpretation to be "I'm required to query with this field" versus the reality of "the API service must support queries on this field". What about adc-query-support or adc-query-support-required? Or is that just making it longer without actually helping?

bcorrie commented 4 years ago

I am OK to make it longer if it helps... I think this helps... 8-)

bcorrie commented 4 years ago

Added ADC API V1 tag because #335 needs this to be resolved

bussec commented 4 years ago

MiniStd discussed this again in its last call:

javh commented 4 years ago

The three requirement levels for individual fields will be essential, important and defined.

I updated the levels in #342.

bussec commented 4 years ago

Thanks @javh, I forgot about that PR. Is the plan to merge it into master soonish and do the documentation stuff via #319?

javh commented 4 years ago

Currently, all that's in #342 is changing what's defined in Attributes. So we could merge that in now, which might be the easiest thing to do in terms of avoiding merge conflicts, but it'll leave the schema in master temporary incorrect (Attributes won't match x-airr tags on fields).

I say we leave the doc updates in #319.

We can update all the x-airr tags in #342 or do all that in a separate PR if we want to break the task into chunks. Either way.

bussec commented 4 years ago

I don't like the idea of having invalid stuff on master. Do we need new validation routines or just update all existing tags?