Closed javh closed 3 weeks ago
Both Genotype and MHCGenotype are nested more deeply than typical for the schema.
I had the same sense too, it's little things like for example instead of this in Subject
genotype:
type: object
nullable: true
description: Genotype for this subject, if known
properties:
receptor_genotype_set:
nullable: true
$ref: '#/GenotypeSet'
description: Immune receptor genotype set for this subject.
mhc_genotype_set:
nullable: true
$ref: '#/MHCGenotypeSet'
description: MHC genotype set for this subject.
just name them two different fields in Subject
receptor_genotype_set:
nullable: true
$ref: '#/GenotypeSet'
description: Immune receptor genotype set for this subject.
mhc_genotype_set:
nullable: true
$ref: '#/MHCGenotypeSet'
description: MHC genotype set for this subject.
Making receptor_genotype_set and mhc_genotype_set independent properties of Subject makes sense to me as these are independent objects.
I am less sure of the logic behind splitting documented_alleles, undocumented_alleles and deleted_genes into high-level objects., These are inherent related components of the genotype. Is there a need to reduce the depth of nesting to a specific level? If so could you articulate the reason please, so that we can consider options.
The reason is mostly stylistic. In the rest of the schema, objects live in other objects via $ref:
not by nesting the definition. They would still be inherent related components if their definition was at the top level. But, it should be easier to validate if we use the same style throughout.
Aside from that, we might be able to apply some abstraction to documented_alleles
, undocumented_alleles
, and deleted_genes
. Not sure on that. We'd have to discuss.
Thanks Jason. As with the other issues I've commented on this morning, this feels like a restriction on validation that we have to live with because the amount of resource we can put in to the validation code is limited. I think it will become problematic as, inevitably, the schema objects become more complex. Perhaps again we should log it as restriction that we would like to overcome in the future, and for the time being I can change the Germline objects to live with it.
Thanks, @williamdlees. I think for this particular issue, we can just focus on the style fix and maybe the allele abstraction question. The validation should follow from that. If the style is right, the validation should be fine, but, if not, then we'll need to fix the validation.
From the call:
documented_alleles
, undocumented_alleles
, deleted_genes
are sufficiently distinct to leave them as separate objects when de-nesting.undocumented_alleles.allele_name
should be renamed to label
, but we'll need to check if this breaks backwards compatibility in the ADC and therefore needs to wait until v2.0 (likely yes).Subject.genotype
: Fine to de-nest by moving receptor_genotype_set
and mhc_genotype_set
up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).
undocumented_alleles.allele_name
should be renamed tolabel
, but we'll need to check if this breaks backwards compatibility in the ADC and therefore needs to wait until v2.0 (likely yes).
@javh Can you deprecate and add label
in with v1.4.2? That way I can start using label
now.
Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...
Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...
Right, yeah minor release should just be software and minor fixes. Could be a v1.5.x then. I wasn't sure if saying v2.0 because the change required a new version of MiAIRR, or there's just an expectation to get v2.0 released before a v1.5...
- De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.
As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.
So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
should not change after the de-nesting.
The above is how one queries for this field, so if the path changes then we are breaking compatibility...
A query using the API looks like this:
{
"filters": {
"op":"=",
"content": {
"field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class",
"value":"MHC-I"
}
}
}
So this query should work after any changes:
curl -d '{"filters":{"op":"=", "content": { "field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class", "value":"MHC-I"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire
From the call:
label
and receptor_genotype_*
field renames for milestone v1.5.0.
- De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.
As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.
The term de-nesting seems to be used in two contexts. For Subject.genotype
, the de-nesting is changing the "path"
So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
--> subject.mhc_genotype_set.mhc_genotype_list.mhc_class
and thus queries would change.
The other de-nesting is making stylistic changes to the schema, e.g. #669 and these shouldn't affect the "path"
So where do we stand on this for v2.0?
Subject.genotype
: Fine to de-nest by movingreceptor_genotype_set
andmhc_genotype_set
up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).
Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the subject.genotype
but I am not highly invested either way.
Subject.genotype
: Fine to de-nest by movingreceptor_genotype_set
andmhc_genotype_set
up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).
It doesn't make sense to me to have subject genotype logically stored only at the repertoire level since biologically they are related to the subject. We feel strongly that we need the ability to store both MHC and Receptor genotype at the subject level if they are known. That is what these fields are for. If you don't know subject level genotype you don't have to store anything!
I admit our experience is more with MHC Genotype for the studies that are looking at MHC/peptide binding, but it is always subject specific so it would not make any sense to group it with a repertoire.
It is my understanding that in the ideal world, we would know the Subject
receptor_genotype_set
and that is what would be used for sequence annotation to get the most accurate annotation of a Repertoire for that subject. The experimenter can store their best estimate of what the receptor_genotype_set
is for this subject if it has been determined. To me that is the intent of these subject
MHC and Receptor Genotype fields.
I am pretty sure I have suggested this before, but if we need the ability to store a receptor_genotype_set
as inferred from a Repertoire
with that Repertoire
, and that can not be considered the Subject
receptor_genotype_set
then I would suggest we need another field. Maybe another data collection as @schristley suggests, It is this type of receptor_genotype_set
that is downstream of DataProcessing
(inference_process = repertoire_sequencing
). I am not saying this isn't needed...
But that should not detract from the fact that if the experimenter has determined a Subject
receptor_genotype_set
(e.g. using inference_process = genomic_seqeuncing
) they should have the ability to store that with the Subject
.
So
subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
-->subject.mhc_genotype_set.mhc_genotype_list.mhc_class
and thus queries would change.
Up one level. So still a property of Subject
as per @schristley's comment above
So where do we stand on this for v2.0?
Subject.genotype
: Fine to de-nest by movingreceptor_genotype_set
andmhc_genotype_set
up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the
subject.genotype
but I am not highly invested either way.
"better" only in that the data structure is simpler, and yes it is a minor detail about syntax. The semantics aren't in question. I would ask the reverse question, how are these two field paths:
subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.genotype.receptor_genotype_set.genotype_class_list.locus
better than these two field paths?
subject.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.receptor_genotype_set.genotype_class_list.locus
There doesn't seem to be any specific advantage to the deeper structure with the longer field path.
In other place in the schema, we try to only create these composite objects when it creates a meaningful composite. As William indicated, these two concepts are independent enough that composing them into genotype
doesn't seem to add anything.
Ultimately any genotype is going to be derived from experimental data and may, to a greater or lesser extent, be partial. I think it’s important that the underlying data is explicitly referenced. This could be a reference to one or more repertoires, and/or to genomic sequencing, or even a snp array perhaps, together with some indication of the protocol and scope. I’m ok with the idea of storing all genotypes at the subject level. If a genotype has been used to annotate a repertoire, the repertoire should point back to the genotype that was used. I think we should try to avoid any implication that a particular repertoire is comprehensive or otherwise and stick to the description of how it was obtained, as standards will continue to change over time.
javh edit: remove email headers.
apologies, I meant to say “any implication that a particular genotype is comprehensive…”, not “any particular repertoire”.
javh edit: remove email headers.
subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.genotype.receptor_genotype_set.genotype_class_list.locus
better than these two field paths?
subject.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.receptor_genotype_set.genotype_class_list.locus
There doesn't seem to be any specific advantage to the deeper structure with the longer field path.
I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.
I also feel that logically they belong together since they are both genotype. I will bow to your wisdom in terms of whether biologically this is an important concept that is worthy of capturing in the schema. It was my understanding that subject genotype was important.
You can see the redundancy in the naming, which you don't really need in either case - e.g. you could have:
subject.genotype.mhc_set.class_list.class
subject.genotype.receptor_set.class_list.locus
Bottom line is I am fine however the group want to do this (I have made my case). I would suggest that you clean up some of the redundancy in the naming in either case when you make the changes.
subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.genotype.receptor_genotype_set.genotype_class_list.locus
better than these two field paths?subject.mhc_genotype_set.mhc_genotype_list.mhc_class
subject.receptor_genotype_set.genotype_class_list.locus
There doesn't seem to be any specific advantage to the deeper structure with the longer field path.I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.
Yeah, it's already "out in the wild". If not I'd push to clean it up but now I'm fine with leaving it as the churn isn't likely worth it. Structurally and semantically it is fine, just a little verbose. Besides, it wouldn't be a real spec without a few quirks! ;-D
So have we decided to leave this as is for AIRR 2.0?
From the call:
Both Genotype and MHCGenotype are nested more deeply than typical for the schema.
MHCGenotype.mhc_alleles
and andGenotype.*_alleles
could be pulled out into top level objects with refs in Genotype and MHCGenotype.