airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Rearrangement::rearrangement_set_id element #82

Closed javh closed 6 years ago

javh commented 6 years ago

In the Rearrangement spec we have an element called rearrangement_set_id. Wasn't sure what this was, so I asked @schristley:

rearrangement_set_id is the link to study metadata. We called it sample_id in an earlier version but that was ambiguous as people may think it was the same as the sample_id in MiAIRR, which it isn’t. All rearrangements with the same rearrangement_set_id are part of the same repertoire, where the definition of a repertoire is defined by the study design.

If I'm understanding correctly, something like metadata_id would be a more appropriate name if it's meant to be a direct link to the metadata file. If it's meant to be descriptive of some grouping, then I think we should be explicit about what that grouping is and use fields that are already defined by MiAIRR, such as study_id, subject_id, sample_id, etc.

If it's the later, then I don't think it should be a mandatory field. I'm undecided on whether it should be mandatory if it's a link to the metadata file.

Thoughts? Is this what this field is for?

schristley commented 6 years ago

It cannot be explicit about what the grouping is, i.e. study_id, subject_id, as that grouping would not be valid for all studies. The name repertoire_id might be more meaningful as it's the set of rearrangements that compose a repertoire; this would be the most common usage. The name rearrangement_set_id is a bit more generic, to just indicate a set of rearrangements that are group together; the meaning of the grouping is outside the scope of the file.

I don't care about mandatory or not, as currently our notion of mandatory is very loose. I think of this field as mandatory only in the sense that typical use cases (i'm thinking CRWG queries here) should always have this field.

javh commented 6 years ago

Hrm. I'm still confused. If you can't explicitly define the grouping, then is that grouping useful? If one field can contain a study name or a subject name, then you can't compare the values in that field because they won't necessarily have the same meaning or scale.

If I understand correctly, which I probably don't, this sounds like an arbitrary database identifier to group records by their source file, with no real biological or analytical meaning. Which, if that's the case, I think we should support it, but not as a required field. We probably need a few such database convenience fields for CWRG. Maybe with a consistent naming scheme? I think repertoire_id might be misleading as that is just as biologically explicit as subject_id.

If it's in the required section then it'll appear in the output file, but end up populated with empty values if that information is missing. If it's not applicable to all data, then we shouldn't have it under required so we don't clutter the file with an empty column.

schristley commented 6 years ago

There's no requirement that all groupings have the same semantics. One can be a B-cell repertoire and another is a T-cell repertoire. Comparing those two groups directly might not make sense, but the groupings are still useful. So yes, people running analysis have to know what their groups are, and they probably always have some biological meaning associated with them, but you don't know that just from looking at the value (or the name of the field).

While a VDJ assignment tool probably doesn't know this field's value, if anything it just might be a passthrough. The actual value might not be assigned until later in the analysis workflow. We have a conflict in requirements here. For CRWG, this rearrangement_set_id is a mandatory field, the API must return this field. However, tools in the analysis workflow may not utilize it, nor know what it is. We need to decide how to handle that. There may be a technical solution with the API spec...

javh commented 6 years ago

Yeah, okay, so this doesn't sound appropriate for an analysis file.

I like the idea of addressing this in the spec. Can we add a new child to Rearrangement for these types of fields? Like:

Rearrangement:
   repository:
      - rearrangement_id
      - rearrangement_set_id

Wherein repository appends to required in a database context, but is treated as optional (ignored) in a file context?

schristley commented 6 years ago

Can we added a new child

That isn't directly supported in OpenAPI, though you can compose objects and extend objects. This gets us back to the scenario of having multiple rearrangements object schema. In pseudocode, something like this, whereCRWG_Rearrangement inherits and extends Rearrangement

Rearrangment:
   - sequence
   - v_call

CRWG_Rearrangement:
  - allOf Rearrangement
  - rearrangement_id
  - rearrangement_set_id

The other possibility is to use custom fields like we are doing for x-miairr. In this case, the AIRR library would ignore the required tag which the API needs, and instead looks at a custom field x-mandatory (not sure what a good name is).

javh commented 6 years ago

The required tag is an oddity because it only exists in the Rearrangement definition anyway.

Maybe we should remove required entirely in favor of adding a custom mandatory/required boolean to each field. Then alter the API accordingly. It's more verbose, unless we can default to `false', but probably easier to maintain. Then we'll never accidentally get the required list and the actual field names out of sync.

schristley commented 6 years ago

required is a keyword in the OpenAPI spec, that is the exact definition to define required fields, we don't want to define our own usage for API purposes, as various API middleware software won't recognize our custom usage.

javh commented 6 years ago

Ah, okay.

Well then, I have no real opinion on whether the custom label or inheritance approach is better. Whatever works best for the CRWG.

schristley commented 6 years ago

We could get rid of the idea of mandatory for the AIRR library. I think the original idea was to give downstream tools some expectation of fields they could expect to be there, but if a field is always blank, then what's the point?

javh commented 6 years ago

I'm okay with mandatory fields, but I'm inclined to think mandatory fields should be defined in the context of files. Just because I think that'll be the vast majority of use cases.

I feel like database support needs a lot of extra stuff and we should tailor to it explicitly somehow instead of trying to meet the needs of flat files and databases with the same set of requirements. I have this sneaking suspicion CRWG is going to need to extend the schema further as y'all work through the process.

bcorrie commented 6 years ago

A comment on inheritance VS custom label. My opinion is that the inheritance approach is more elegant and rigorous, but I think it is also less flexible. If you think of MiAIRR as the baseline and then the file format and the API are extending that in some way, then we have divergence when we really want to maintain convergence (I think) as much as practical. If we start to inherit I think it might start to get difficult to diverge a bit (if CRWG think they need X that isn't in the file format) and then converge again four months later (when they realize that the file format actually is doing the same thing).

In that regard, I think the custom label might be better - although I think I would like to see what this looks like before making that decision...

Brian

schristley commented 6 years ago

And just to throw more fuel on the fire, CRWG has passed on using MiAIRR as a baseline. As far as they are concerned, MiAIRR is designed for publication and isn't the best starting point for repository design. In practical purposes, this means the schema objects in definitions.yaml for MiAIRR entities aren't being used anywhere, except in the iReceptor API.

Your point is well-taken Brian, I'm trying to avoid divergence, but it seems really hard at the moment because we are not able to satisfy all the requirements, especially when WGs are still trying to figure out what they are! I think we need to just bite the bullet and let them diverge for awhile, then once everything settles, look at how we can converge.

bcorrie commented 6 years ago

I have to admit, I don't quite understand what is meant by the statement the "CRWG has passed on using MiAIRR". I think we should be a bit careful and define what we mean by that statement - I'm not sure everyone on CRWG agrees - myself included... 8-)

I would certainly agree that MiAIRR doesn't have everything that is needed for CRWG - but as in the name, it is Minimal after all... It is completely understandable if CRWG needs more than what MiAIRR has. If CRWG needs more fields, or a more precise definition of a field, or needs to specify onotologies for that preciseness then that is all fine. But that is quite different than saying that you are not going to use the MiAIRR standard at all...

schristley commented 6 years ago

Me neither but it sounded like good fuel, "using" is nicely ambiguous ;-D My sense of the discussion is actually that MiAIRR is too much stuff, more than what is necessary for a useful common repository.

And currently CRWG is proposing an entity repertoire which doesn't exist in MiAIRR.

javh commented 6 years ago

I'm going to mark this as resolved in #89. Though, I know the question of how to denote CRWG required fields will come up again, that seems like a larger separate discussion.