airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Genotype and MHCGenotype are too deeply nested #667

Closed javh closed 3 weeks ago

javh commented 1 year ago

Both Genotype and MHCGenotype are nested more deeply than typical for the schema. MHCGenotype.mhc_alleles and and Genotype.*_alleles could be pulled out into top level objects with refs in Genotype and MHCGenotype.

schristley commented 1 year ago

Both Genotype and MHCGenotype are nested more deeply than typical for the schema.

I had the same sense too, it's little things like for example instead of this in Subject

        genotype:
            type: object
            nullable: true
            description: Genotype for this subject, if known
            properties:
                receptor_genotype_set:
                    nullable: true
                    $ref: '#/GenotypeSet'
                    description: Immune receptor genotype set for this subject.
                mhc_genotype_set:
                    nullable: true
                    $ref: '#/MHCGenotypeSet'
                    description: MHC genotype set for this subject.

just name them two different fields in Subject

                receptor_genotype_set:
                    nullable: true
                    $ref: '#/GenotypeSet'
                    description: Immune receptor genotype set for this subject.
                mhc_genotype_set:
                    nullable: true
                    $ref: '#/MHCGenotypeSet'
                    description: MHC genotype set for this subject.
williamdlees commented 1 year ago

Making receptor_genotype_set and mhc_genotype_set independent properties of Subject makes sense to me as these are independent objects.

I am less sure of the logic behind splitting documented_alleles, undocumented_alleles and deleted_genes into high-level objects., These are inherent related components of the genotype. Is there a need to reduce the depth of nesting to a specific level? If so could you articulate the reason please, so that we can consider options.

javh commented 1 year ago

The reason is mostly stylistic. In the rest of the schema, objects live in other objects via $ref: not by nesting the definition. They would still be inherent related components if their definition was at the top level. But, it should be easier to validate if we use the same style throughout.

Aside from that, we might be able to apply some abstraction to documented_alleles, undocumented_alleles, and deleted_genes. Not sure on that. We'd have to discuss.

williamdlees commented 1 year ago

Thanks Jason. As with the other issues I've commented on this morning, this feels like a restriction on validation that we have to live with because the amount of resource we can put in to the validation code is limited. I think it will become problematic as, inevitably, the schema objects become more complex. Perhaps again we should log it as restriction that we would like to overcome in the future, and for the time being I can change the Germline objects to live with it.

javh commented 1 year ago

Thanks, @williamdlees. I think for this particular issue, we can just focus on the style fix and maybe the allele abstraction question. The validation should follow from that. If the style is right, the validation should be fine, but, if not, then we'll need to fix the validation.

javh commented 1 year ago

From the call:

schristley commented 1 year ago
  • undocumented_alleles.allele_name should be renamed to label, but we'll need to check if this breaks backwards compatibility in the ADC and therefore needs to wait until v2.0 (likely yes).

@javh Can you deprecate and add label in with v1.4.2? That way I can start using label now.

bcorrie commented 1 year ago

Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...

schristley commented 1 year ago

Deprecation and changing a label breaks backwards compatibility so it shouldn't be part of v1.4.2, no? Any field name/type changes shouldn't be a part of a minor release...

Right, yeah minor release should just be software and minor fixes. Could be a v1.5.x then. I wasn't sure if saying v2.0 because the change required a new version of MiAIRR, or there's just an expectation to get v2.0 released before a v1.5...

bcorrie commented 1 year ago
  • De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.

As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class should not change after the de-nesting.

The above is how one queries for this field, so if the path changes then we are breaking compatibility...

A query using the API looks like this:

{
  "filters": {
    "op":"=",
    "content": {
      "field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class",
      "value":"MHC-I"
    }
  }
}

So this query should work after any changes:

curl -d '{"filters":{"op":"=", "content": { "field":"subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class", "value":"MHC-I"}}}' https://t1d-1.ireceptor.org/airr/v1/repertoire
javh commented 1 year ago

From the call:

schristley commented 1 year ago
  • De-nesting Germline schema: De-nesting shouldn't break anything, so we can include this is v1.4.2.

As long as the "path" to the object doesn't change - which de-nesting should not do - then things should be fine.

The term de-nesting seems to be used in two contexts. For Subject.genotype, the de-nesting is changing the "path"

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class --> subject.mhc_genotype_set.mhc_genotype_list.mhc_class and thus queries would change.

The other de-nesting is making stylistic changes to the schema, e.g. #669 and these shouldn't affect the "path"

bcorrie commented 8 months ago

So where do we stand on this for v2.0?

  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the subject.genotype but I am not highly invested either way.

bcorrie commented 8 months ago
  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

It doesn't make sense to me to have subject genotype logically stored only at the repertoire level since biologically they are related to the subject. We feel strongly that we need the ability to store both MHC and Receptor genotype at the subject level if they are known. That is what these fields are for. If you don't know subject level genotype you don't have to store anything!

I admit our experience is more with MHC Genotype for the studies that are looking at MHC/peptide binding, but it is always subject specific so it would not make any sense to group it with a repertoire.

It is my understanding that in the ideal world, we would know the Subject receptor_genotype_set and that is what would be used for sequence annotation to get the most accurate annotation of a Repertoire for that subject. The experimenter can store their best estimate of what the receptor_genotype_set is for this subject if it has been determined. To me that is the intent of these subject MHC and Receptor Genotype fields.

I am pretty sure I have suggested this before, but if we need the ability to store a receptor_genotype_set as inferred from a Repertoire with that Repertoire, and that can not be considered the Subject receptor_genotype_set then I would suggest we need another field. Maybe another data collection as @schristley suggests, It is this type of receptor_genotype_set that is downstream of DataProcessing (inference_process = repertoire_sequencing). I am not saying this isn't needed...

But that should not detract from the fact that if the experimenter has determined a Subject receptor_genotype_set (e.g. using inference_process = genomic_seqeuncing) they should have the ability to store that with the Subject.

scharch commented 8 months ago

So subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class --> subject.mhc_genotype_set.mhc_genotype_list.mhc_class and thus queries would change.

Up one level. So still a property of Subject as per @schristley's comment above

schristley commented 7 months ago

So where do we stand on this for v2.0?

  • Subject.genotype: Fine to de-nest by moving receptor_genotype_set and mhc_genotype_set up a level. However, this will break backwards compatibility so it needs to wait until v2.0. Additional issues to address in v2.0: (1) should these be arrays and (2) should they be in Repertoire instead of Subject (as they are downstream from DataProcessing).

Personally, I am not sure I understand why we want to move these "up a level". Logically they are both genotype of a subject, so why do we not want to have them represented that way? This is a minor detail, and I don't really care that much, but it seems odd that this is viewed as "better". For v2.0 I would lean toward keeping the subject.genotype but I am not highly invested either way.

"better" only in that the data structure is simpler, and yes it is a minor detail about syntax. The semantics aren't in question. I would ask the reverse question, how are these two field paths:

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus

better than these two field paths?

subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus

There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

In other place in the schema, we try to only create these composite objects when it creates a meaningful composite. As William indicated, these two concepts are independent enough that composing them into genotype doesn't seem to add anything.

williamdlees commented 7 months ago

Ultimately any genotype is going to be derived from experimental data and may, to a greater or lesser extent, be partial. I think it’s important that the underlying data is explicitly referenced. This could be a reference to one or more repertoires, and/or to genomic sequencing, or even a snp array perhaps, together with some indication of the protocol and scope. I’m ok with the idea of storing all genotypes at the subject level. If a genotype has been used to annotate a repertoire, the repertoire should point back to the genotype that was used. I think we should try to avoid any implication that a particular repertoire is comprehensive or otherwise and stick to the description of how it was obtained, as standards will continue to change over time.

javh edit: remove email headers.

williamdlees commented 7 months ago

apologies, I meant to say “any implication that a particular genotype is comprehensive…”, not “any particular repertoire”.

javh edit: remove email headers.

bcorrie commented 7 months ago

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus

better than these two field paths?

subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus

There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.

I also feel that logically they belong together since they are both genotype. I will bow to your wisdom in terms of whether biologically this is an important concept that is worthy of capturing in the schema. It was my understanding that subject genotype was important.

You can see the redundancy in the naming, which you don't really need in either case - e.g. you could have: subject.genotype.mhc_set.class_list.class subject.genotype.receptor_set.class_list.locus

Bottom line is I am fine however the group want to do this (I have made my case). I would suggest that you clean up some of the redundancy in the naming in either case when you make the changes.

schristley commented 7 months ago

subject.genotype.mhc_genotype_set.mhc_genotype_list.mhc_class subject.genotype.receptor_genotype_set.genotype_class_list.locus better than these two field paths? subject.mhc_genotype_set.mhc_genotype_list.mhc_class subject.receptor_genotype_set.genotype_class_list.locus There doesn't seem to be any specific advantage to the deeper structure with the longer field path.

I don't see the need for the change - which requires work 8-) It is not insignificant work in that it breaks backwards compatibility. AIRR v1 and AIRR v2 files with genotype will no longer be compatible. Repositories need to change, the Gateway needs to change, etc.

Yeah, it's already "out in the wild". If not I'd push to clean it up but now I'm fine with leaving it as the churn isn't likely worth it. Structurally and semantically it is fine, just a little verbose. Besides, it wouldn't be a real spec without a few quirks! ;-D

bcorrie commented 1 month ago

So have we decided to leave this as is for AIRR 2.0?

javh commented 3 weeks ago

From the call: