Closed javh closed 6 years ago
It cannot be explicit about what the grouping is, i.e. study_id
, subject_id
, as that grouping would not be valid for all studies. The name repertoire_id
might be more meaningful as it's the set of rearrangements that compose a repertoire
; this would be the most common usage. The name rearrangement_set_id
is a bit more generic, to just indicate a set of rearrangements that are group together; the meaning of the grouping is outside the scope of the file.
I don't care about mandatory or not, as currently our notion of mandatory is very loose. I think of this field as mandatory only in the sense that typical use cases (i'm thinking CRWG queries here) should always have this field.
Hrm. I'm still confused. If you can't explicitly define the grouping, then is that grouping useful? If one field can contain a study name or a subject name, then you can't compare the values in that field because they won't necessarily have the same meaning or scale.
If I understand correctly, which I probably don't, this sounds like an arbitrary database identifier to group records by their source file, with no real biological or analytical meaning. Which, if that's the case, I think we should support it, but not as a required field. We probably need a few such database convenience fields for CWRG. Maybe with a consistent naming scheme? I think repertoire_id
might be misleading as that is just as biologically explicit as subject_id
.
If it's in the required
section then it'll appear in the output file, but end up populated with empty values if that information is missing. If it's not applicable to all data, then we shouldn't have it under required
so we don't clutter the file with an empty column.
There's no requirement that all groupings have the same semantics. One can be a B-cell repertoire and another is a T-cell repertoire. Comparing those two groups directly might not make sense, but the groupings are still useful. So yes, people running analysis have to know what their groups are, and they probably always have some biological meaning associated with them, but you don't know that just from looking at the value (or the name of the field).
While a VDJ assignment tool probably doesn't know this field's value, if anything it just might be a passthrough. The actual value might not be assigned until later in the analysis workflow. We have a conflict in requirements here. For CRWG, this rearrangement_set_id
is a mandatory field, the API must return this field. However, tools in the analysis workflow may not utilize it, nor know what it is. We need to decide how to handle that. There may be a technical solution with the API spec...
Yeah, okay, so this doesn't sound appropriate for an analysis file.
I like the idea of addressing this in the spec. Can we add a new child to Rearrangement
for these types of fields? Like:
Rearrangement:
repository:
- rearrangement_id
- rearrangement_set_id
Wherein repository
appends to required
in a database context, but is treated as optional (ignored) in a file context?
Can we added a new child
That isn't directly supported in OpenAPI, though you can compose objects and extend objects. This gets us back to the scenario of having multiple rearrangements object schema. In pseudocode, something like this, whereCRWG_Rearrangement
inherits and extends Rearrangement
Rearrangment:
- sequence
- v_call
CRWG_Rearrangement:
- allOf Rearrangement
- rearrangement_id
- rearrangement_set_id
The other possibility is to use custom fields like we are doing for x-miairr
. In this case, the AIRR library would ignore the required
tag which the API needs, and instead looks at a custom field x-mandatory
(not sure what a good name is).
The required
tag is an oddity because it only exists in the Rearrangement
definition anyway.
Maybe we should remove required
entirely in favor of adding a custom mandatory
/required
boolean to each field. Then alter the API accordingly. It's more verbose, unless we can default to `false', but probably easier to maintain. Then we'll never accidentally get the required list and the actual field names out of sync.
required
is a keyword in the OpenAPI spec, that is the exact definition to define required fields, we don't want to define our own usage for API purposes, as various API middleware software won't recognize our custom usage.
Ah, okay.
Well then, I have no real opinion on whether the custom label or inheritance approach is better. Whatever works best for the CRWG.
We could get rid of the idea of mandatory for the AIRR library. I think the original idea was to give downstream tools some expectation of fields they could expect to be there, but if a field is always blank, then what's the point?
I'm okay with mandatory fields, but I'm inclined to think mandatory fields should be defined in the context of files. Just because I think that'll be the vast majority of use cases.
I feel like database support needs a lot of extra stuff and we should tailor to it explicitly somehow instead of trying to meet the needs of flat files and databases with the same set of requirements. I have this sneaking suspicion CRWG is going to need to extend the schema further as y'all work through the process.
A comment on inheritance VS custom label. My opinion is that the inheritance approach is more elegant and rigorous, but I think it is also less flexible. If you think of MiAIRR as the baseline and then the file format and the API are extending that in some way, then we have divergence when we really want to maintain convergence (I think) as much as practical. If we start to inherit I think it might start to get difficult to diverge a bit (if CRWG think they need X that isn't in the file format) and then converge again four months later (when they realize that the file format actually is doing the same thing).
In that regard, I think the custom label might be better - although I think I would like to see what this looks like before making that decision...
Brian
And just to throw more fuel on the fire, CRWG has passed on using MiAIRR as a baseline. As far as they are concerned, MiAIRR is designed for publication and isn't the best starting point for repository design. In practical purposes, this means the schema objects in definitions.yaml
for MiAIRR entities aren't being used anywhere, except in the iReceptor API.
Your point is well-taken Brian, I'm trying to avoid divergence, but it seems really hard at the moment because we are not able to satisfy all the requirements, especially when WGs are still trying to figure out what they are! I think we need to just bite the bullet and let them diverge for awhile, then once everything settles, look at how we can converge.
I have to admit, I don't quite understand what is meant by the statement the "CRWG has passed on using MiAIRR". I think we should be a bit careful and define what we mean by that statement - I'm not sure everyone on CRWG agrees - myself included... 8-)
I would certainly agree that MiAIRR doesn't have everything that is needed for CRWG - but as in the name, it is Minimal after all... It is completely understandable if CRWG needs more than what MiAIRR has. If CRWG needs more fields, or a more precise definition of a field, or needs to specify onotologies for that preciseness then that is all fine. But that is quite different than saying that you are not going to use the MiAIRR standard at all...
Me neither but it sounded like good fuel, "using" is nicely ambiguous ;-D My sense of the discussion is actually that MiAIRR is too much stuff, more than what is necessary for a useful common repository.
And currently CRWG is proposing an entity repertoire
which doesn't exist in MiAIRR.
I'm going to mark this as resolved in #89. Though, I know the question of how to denote CRWG required fields will come up again, that seems like a larger separate discussion.
In the Rearrangement spec we have an element called
rearrangement_set_id
. Wasn't sure what this was, so I asked @schristley:If I'm understanding correctly, something like
metadata_id
would be a more appropriate name if it's meant to be a direct link to the metadata file. If it's meant to be descriptive of some grouping, then I think we should be explicit about what that grouping is and use fields that are already defined by MiAIRR, such asstudy_id
,subject_id
,sample_id
, etc.If it's the later, then I don't think it should be a mandatory field. I'm undecided on whether it should be mandatory if it's a link to the metadata file.
Thoughts? Is this what this field is for?