airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Add Receptor fixes #705

Closed bcorrie closed 4 months ago

bcorrie commented 11 months ago

Fixes #704

javh commented 8 months ago

From the call:

bussec commented 6 months ago

From the call: @bcorrie and @bussec are in general ok with this PR, @bussec will make some final clarification in the description of Cell.reactivity_measurement. Will put this up for a final discussion in January, then merge.

javh commented 6 months ago

From the call:

bcorrie commented 6 months ago

From the call:

Sorry I missed the call, looks like my calendar entry ended at the end of 2023. I am still traveling, returning home tomorrow.

  • Need to review, but seems mostly fine, except we don't think we need reactivity_measurements defined in Cell. There wouldn't be a prohibition on it, but it would not be the preferred schema path.

reactivity_measurements in both Cell and Receptor are arrays of IDs that refer to ReceptorReactivity objects (essentially the ReceptorReactivity.receptor_reactivity_id of the object). So reactivity_measurements are not defined in the Cell object per se.

My understanding of Cell having a receptor_reactivity is that this would capture when a specific experiment measured a reactivity for an epitope for a specific Cell (e.g. in a 10X experiment). See https://github.com/airr-community/airr-standards/issues/704#issuecomment-1641133894

Is that the general understanding? Would this not be the preferred thing to do in such an experiment?

javh commented 6 months ago

We were thinking the normal usage would be Cell -> Receptor -> ReceptorReactivity. If you wanted to jump over Receptor and put a reactivity_measurements array into Cell you could, in the same way that you can add sample_id to Rearrangement (semantics are clear even though there's no sample_id defined in Rearrangement), but it wouldn't be the model.
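
To make the Cell -> Receptor -> ReceptorReactivity path concrete, here is a minimal Python sketch of the ID-based lookup described above. The in-memory dictionaries, the `antigen` key, the `receptors` key on the cell, and the helper name are assumptions for illustration only; they are not taken from the schema files in this PR.

```python
# Hypothetical in-memory stores keyed by ID. Only the idea of
# "arrays of IDs that refer to ReceptorReactivity objects" comes from the thread.
receptor_reactivities = {
    "rr1": {"receptor_reactivity_id": "rr1", "antigen": "epitope-A"},
    "rr2": {"receptor_reactivity_id": "rr2", "antigen": "epitope-B"},
}
receptors = {
    "r1": {"receptor_id": "r1", "reactivity_measurements": ["rr1", "rr2"]},
}
cells = {
    "c1": {"cell_id": "c1", "receptors": ["r1"]},
}


def reactivities_for_cell(cell_id):
    """Follow Cell -> Receptor -> ReceptorReactivity and return the records."""
    found = []
    for receptor_id in cells[cell_id]["receptors"]:
        for rr_id in receptors[receptor_id]["reactivity_measurements"]:
            found.append(receptor_reactivities[rr_id])
    return found


print([r["antigen"] for r in reactivities_for_cell("c1")])
```

In this layout the Cell never names a reactivity record itself; everything it can reach comes through the Receptor.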

krishnaroskin commented 5 months ago

Hey gang,

I wanted to toss in some thoughts from a B cell perspective. The above discussion seems very focused on T cells, so I'm trying to figure out if we need another Object to cover Antibody specificity or make a more general one.

Random thoughts:

lgcowell commented 5 months ago

I agree that reactivity doesn’t make sense for Abs. I haven’t kept up with all the discussion on this, but doesn’t ReceptorSpecificity make more sense and cover BCR, TCR, and Ab (even though Abs aren’t receptors)?

krishnaroskin commented 5 months ago

I agree that reactivity doesn’t make sense for Abs. I haven’t kept up with all the discussion on this, but doesn’t ReceptorSpecificity make more sense and cover BCR, TCR, and Ab (even though Abs aren’t receptors)?

I think that name change covers my name issue. The only thing I could see falling outside that is if we had protein sequence from protein sequencing of antibodies, like Georgiou does. In IRAD we've thought about going after that data type because there is a lot of it in the patent Genbank. But, as far as I know, all our tool chains are very focused on nucleotide sequences, so that's an issue.

bussec commented 5 months ago

@krishnaroskin @lgcowell The Receptor object was developed with both AB/BCR and TCR in mind. In general, it can exist independent of a Cell, i.e., you can use it to annotate reactivities of antibodies that were not observed in cells derived from a Subject, e.g., a recombinantly expressed antibody in which somatic hypermutations were added or removed. However, as @javh described, we assume that the typical way it would be used is in the context of a single-cell experiment.

ReceptorReactivity is more of a T cell concept since I think of "reactivity" as a (T) cell's response when its receptor binds something. For antibodies the concept of reactivity is not on-point since there is no cell.

This is basically a question about the semantics of the term "reactivity". For me, reactivity implies that the receptor or antibody can interact with a given target in a way that can be experimentally detected. Using this definition, activation of the cell that bears a given receptor/antibody is a sufficient, but not a necessary criterion. The reason why we use reactivity instead of specificity is that the latter - similar to its use in statistics - is relative to something else. Polyreactive antibodies for example are exactly that: They bind everything from polystyrene to shoe laces (i.e. they are "reactive") but they are not specific for any of these things.

The above, and the fact that BCRs (on B cells) can also interact with antigen (like T cells), begs the question of whether we should have the distinction between antibodies and BCRs on B cells. We sequence B cells but often test binding/affinity on antibodies. Maybe that distinction should be encapsulated in the description of how the BCR/antibody specificity was measured?

In general this is the idea behind the Receptor/ReceptorReactivity split: A given receptor should always have the same intrinsic reactivity (otherwise there would be no chance of reproducible results), but different assays will measure it by different means. For example, take an experiment where a fluorescently-labeled protein is used as bait to select for target-specific B cells during sorting. The sorted cells are then subjected to single-cell sequencing and the sequence information of the IGH/IGK transcripts is then used to produce a recombinant antibody that can be tested in vitro (e.g. ELISA). In this case a single Receptor record would have one or more ReceptorReactivity records that describe the results obtained using the recombinant antibody and one that describes the binding in the initial fluorescent bait assay. The latter ReceptorReactivity record could be directly referenced from a Cell record (@bcorrie's use-case), while the former ones would not. The reactivity_method and reactivity_readout properties can be used to provide further information on the experimental approach.

peptide_aa_string is something that doesn't have much applicability in the B cell world.

Fully agree, therefore the schema does not require you to provide this property.

bcorrie commented 5 months ago

We were thinking the normal usage would be Cell -> Receptor -> ReceptorReactivity. If you wanted to jump over Receptor and put a reactivity_measurements array into Cell you could, in the same way that you can add sample_id to Rearrangement (semantics are clear even though there's no sample_id defined in Rearrangement), but it wouldn't be the model.

Is this implying that Cell.receptor_reactivity should be removed from the object schema? If so I think that is a problem.

When receptor_reactivity is measured in an experiment, that reactivity is associated with a specific Cell. If we go Cell -> Receptor -> ReceptorReactivity (there is no direct link in Cell to the receptor_reactivity measured) we lose the information that that specific receptor_reactivity is associated with that Cell. We also lose information about the Study in which the receptor_reactivity was measured.

Remember that since Receptor is a "global object" (there is only one Receptor with a specific paired VDJ/CDR3) it is quite likely for a specific Receptor to have more than one measured receptor_reactivity. This is even likely to happen in a single experiment if I am not mistaken, but it would certainly happen across multiple experiments. Multiple Cells will be assigned the same Receptor and have different measured receptor_reactivity. In the above model a single Receptor will have multiple receptor_reactivity measurements, but it won't be possible to tell which receptor_reactivity measurement came from which Cell. This would be even worse if the receptor_reactivity measures came from different experiments. In this case we lose information about which Study a receptor_reactivity measure came from, including the methodology for the study and how receptor_reactivity was measured.

Bottom line is I believe we need to be able to capture a link between Cell and a specific receptor_reactivity directly in the model.

There are two ways of doing this: having an array of Cell.receptor_reactivity IDs that point to the reactivity information for that Cell, or having a ReceptorReactivity.cell_id so that the measured receptor_reactivity points back to the Cell that produced it. We went with the array in the Cell as that is the more logical way of thinking about it (in an experiment Cells have measured reactivity) - See https://github.com/airr-community/airr-standards/issues/704#issuecomment-1641133894
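
The loss of provenance and the two ways of restoring it can be sketched in a few lines of Python. This is only an illustration under assumed data: the record contents, the `study` key, and the helper names are hypothetical, and the two "options" are simplified stand-ins for Cell.receptor_reactivity and ReceptorReactivity.cell_id.

```python
# Two cells from different studies share the same global Receptor.
receptors = {"r1": {"receptor_id": "r1", "reactivity_measurements": ["rr1", "rr2"]}}
cells = [
    {"cell_id": "c1", "study": "study-A", "receptors": ["r1"]},
    {"cell_id": "c2", "study": "study-B", "receptors": ["r1"]},
]


def via_receptor(cell):
    """Cell -> Receptor -> ReceptorReactivity: every cell sees every measurement."""
    ids = []
    for receptor_id in cell["receptors"]:
        ids.extend(receptors[receptor_id]["reactivity_measurements"])
    return ids


# Option 1: the Cell carries its own array of reactivity IDs.
cell_arrays = {"c1": ["rr1"], "c2": ["rr2"]}

# Option 2: each reactivity record points back to the Cell that produced it.
reactivity_cell_id = {"rr1": "c1", "rr2": "c2"}


def via_back_reference(cell_id):
    """Collect the reactivity IDs whose cell_id matches the given Cell."""
    return [rr for rr, c in reactivity_cell_id.items() if c == cell_id]


for cell in cells:
    cid = cell["cell_id"]
    print(cid, via_receptor(cell), cell_arrays[cid], via_back_reference(cid))
```

Through the Receptor alone both cells resolve to the same two measurements, so the per-cell (and per-study) origin cannot be recovered; either direct link keeps it.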

bcorrie commented 5 months ago

Hmm, in thinking about this a bit more, the AIRR Standard is primarily about capturing how an experiment is done, no? So a direct link from the Cell to the ReceptorReactivity that was measured in the experiment is actually the critical path. It is what is actually measured in the experiment.

Mapping Cells to Receptors is a processing step that may or may not be done in an experiment.

schristley commented 5 months ago

Hmm, in thinking about this a bit more, the AIRR Standard is primarily about capturing how an experiment is done, no?

I agree with this sentiment, which makes me wonder how to get the set of receptor reactivities that were done in a single experiment?

{"digression rant": true}

The underlying issue that I feel we continually run into with these discussions is how to represent relations. How do we resolve the tension between an efficient data structure and our semantic model of the biology (with the ADC often as a third wrinkle)? This is where JSON schema fails us because it doesn't have an explicit representation for relations (unlike say SQL DDL); instead it relies on implicit semantics (like ids and arrays) without any syntactic enforcement.

I tend to always advocate for the efficient data structure. That means in 1-to-n relations, I advocate that the "link" always goes on the "n" side where a single field can be used to point to the "1" object. Versus if you put the relation on the "1" side then you need an array holding "n" objects/ids.

Given this, having reactivity_measurements in Receptor is "wrong" (inefficient), even though semantically it sounds just fine: "a receptor has a set of reactivities". Instead ReceptorReactivity should have a receptor_id field that points to its one and only Receptor. With the current model, every time there is a reactivity measurement then the "global Receptor object" needs to be updated with a new object added to the array. Who does that? How do we even know when to do that? Sounds messy to me. As an extreme analogy, imagine instead of Rearrangement pointing to its V allele, the V allele in the germline set had an array of rearrangements.

So I understand the current discussion is around Cell. A Cell can have multiple Receptors, so it's an n-to-n relation with ReceptorReactivity. However if you consider a Cell/Receptor combo, then nominally it's a 1-to-n relation. Using the "efficiency" logic, the ReceptorReactivity should have a cell_id that points to the Cell. If the ReceptorReactivity does not involve a cell then that field is null. With ReceptorReactivity having both receptor_id and cell_id then I believe you can represent the true n-to-n relation by having multiple ReceptorReactivity objects with various combos of receptor_id and cell_id values.
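
As a rough sketch of the "link on the n side" idea, the Python below models ReceptorReactivity with both a receptor_id and a nullable cell_id. The dataclass layout and the two helper functions are assumptions for illustration, not the schema as merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReceptorReactivity:
    receptor_reactivity_id: str
    receptor_id: Optional[str]  # link on the "n" side to the one Receptor
    cell_id: Optional[str]      # None when no cell was involved


# The n-to-n relation is expressed as several records with different combos.
measurements = [
    ReceptorReactivity("rr1", "r1", "c1"),
    ReceptorReactivity("rr2", "r1", "c2"),
    ReceptorReactivity("rr3", "r2", "c2"),  # c2 carries two receptors
    ReceptorReactivity("rr4", "r1", None),  # e.g. a recombinant assay, no cell
]


def for_receptor(receptor_id):
    """All measurement IDs that point at the given Receptor."""
    return [m.receptor_reactivity_id for m in measurements if m.receptor_id == receptor_id]


def for_cell(cell_id):
    """All measurement IDs that point at the given Cell."""
    return [m.receptor_reactivity_id for m in measurements if m.cell_id == cell_id]


print(for_receptor("r1"))
print(for_cell("c2"))
```

Adding a new measurement only appends one record; neither the Receptor nor the Cell object has to be touched.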

Finally, coming back to where I started, maybe we should consider adding some formalism, like some JSON-LD attributes, that more explicitly defines the relations in the schema. Not really sure if that helps though, though maybe it helps to document the arity of the relation. Looking at some of the new draft objects, I feel we've introduced a number of "data inefficient" relations.

{"digression rant": false}

bcorrie commented 5 months ago

Using the "efficiency" logic, the ReceptorReactivity should have a cell_id that points to the Cell. If the ReceptorReactivity does not involve a cell then that field is null. With ReceptorReactivity having both receptor_id and cell_id then I believe you can represent the true n-to-n relation by having multiple ReceptorReactivity objects with various combos of receptor_id and cell_id values.

I don't object to this implementation if we feel that is a better way to go. As long as we maintain the ability to have a link from ReceptorReactivity to both Cell and Receptor if desired I think we are fine. Both can be null if the experiment does not have such links.

The array stems from the original Receptor model where there was an array of reactivity objects embedded in the Receptor object. This was changed to have a separate ReceptorReactivity object:

https://github.com/airr-community/airr-standards/pull/674/commits/ba626d21ad957c32259341b022f07a99986c67ec

And then to an array of ReceptorReactivity IDs instead of the actual embedded objects. This was also when we added an array of ReceptorReactivity IDs to Cell

https://github.com/airr-community/airr-standards/commit/5a2915f1acead70561adb571f3431d23402f3df9

I am OK to change ReceptorReactivity to have a cell_id and receptor_id to make these links. We can then remove the receptor_reactivity arrays from Cell and Receptor.

bcorrie commented 5 months ago

With the current model, every time there is a reactivity measurement then the "global Receptor object" needs to be updated with a new object added to the array. Who does that? How do we even know when to do that? Sounds messy to me.

This is a really good reason to change our model as @schristley suggests...

bussec commented 5 months ago

I just pushed a set of changes to the schema that should address some of the points raised.

Some explanations on these changes and thoughts on the recent discussion regarding the relations between the objects, with a focus on experimental aspects:

From my point of view, there are two distinct types of reactivity information that we are trying to annotate:

  1. Reactivity of recombinant receptors: In this case the receptor is reconstructed based on the sequences from an scAIRR-seq experiment and recombinantly expressed, e.g., as a soluble antibody or in a T cell line expressing it as a TCR. In such a setup we look at one defined receptor at a time, so we can simplify the Cell:Receptor:Reactivity relation as N:1:N, i.e., multiple cells can have the same receptor and a given receptor can have multiple reactivity measurements. Following our discussion on the representation of 1:N relations, this means that both Cell and ReceptorReactivity should reference Receptor (Note: The fact that a cell can have multiple receptors would IMO not change this decision, as the number of receptors per cell is usually 1 or 2 (and rarely larger than 3), while cells per receptor can easily go into the range of hundreds. However, it means that Cell needs to contain an array of receptor IDs (as it does)). Importantly, in such an experiment there is no direct relation between the cell observed in the scAIRR-seq experiment and the reactivity measurement, so there is no reason to have a direct reference between the respective records here. This leads to the situation that the origin of a ReceptorReactivity is lost, which is something that - as @bcorrie criticized - we need to avoid. However, I don't see any obvious entry point below the level of the study, which is why my edits contain the addition of a study_id property to ReceptorReactivity. This also addresses the problem of receptors that are tested within a study, but are not observed in the scAIRR-seq data (e.g., antibodies in which some or all hypermutations are reverted).

  2. Reactivity of individual cells: In this case cells are incubated with a bait (e.g., a protein or carbohydrate for B cells or an MHC:peptide multimer for T cells) that is labeled, e.g., with a fluorescent or DNA barcode tag. The presence of the tag on the cells (measured, e.g., as fluorescence or as barcode counts in the sequencing data to stick with the previous examples) is then interpreted as reactivity that can be attributed to the Ig/TCR expressed by the cell. We therefore had the idea to use ReceptorReactivity to capture this information, which then gave rise to the notion that there should be a direct reference between Cell and ReceptorReactivity as otherwise it would be impossible to attribute a given ReceptorReactivity to a Cell (given the N:1:N relation described above). After careful consideration, I have decided against adding a cell_id property to ReceptorReactivity and I would like us to reconsider the original decision to use ReceptorReactivity for this type of information, as I think that the ways these experiments are usually conducted do not allow for a proper attribution of reactivity information to an individual receptor:

    1. If multiple receptors are expressed, binding cannot be directly attributed to any of them
    2. Binding experiments are primarily used to select high-binding cells, not to measure reactivity. Therefore the level of receptor surface expression is often not measured, but would be necessary for normalization
    3. Combining the previous two points, multiple receptors might be expressed at different levels but it is very challenging with the methods described above to normalize for this
    4. Finally, the exact amount of background binding of the bait by an individual cell is unknown as it is difficult to control for it

    Therefore, while not impossible, most experiments will provide data that should not be considered to be receptor reactivity data, but rather cell reactivity data.

The most obvious solution would IMO be to leave Receptor and ReceptorReactivity as it is now, and add a separate CellReactivity object, but I am open to other suggestions.
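
One way to picture the proposed split is the Python sketch below: Cell and ReceptorReactivity both reference Receptor (the N:1:N relation), ReceptorReactivity carries a study_id, and bait-binding results go into a separate CellReactivity record. The dataclass layout, the `cell_reactivity_id` name, the example IDs, and the helpers are hypothetical; only the direction of the references and the study_id property follow the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    cell_id: str
    receptors: List[str] = field(default_factory=list)  # usually 1 or 2 IDs


@dataclass
class ReceptorReactivity:
    receptor_reactivity_id: str
    receptor_id: str  # the one Receptor that was tested
    study_id: str     # keeps the origin of the measurement


@dataclass
class CellReactivity:
    cell_reactivity_id: str  # hypothetical field name
    cell_id: str             # bait binding observed on this cell


cells = [Cell("c1", ["r1"]), Cell("c2", ["r1"]), Cell("c3", ["r1", "r2"])]
receptor_reactivities = [
    ReceptorReactivity("rr1", "r1", "study-X"),
    ReceptorReactivity("rr2", "r1", "study-Y"),
    # Receptor tested in a study but never observed in any cell,
    # e.g. an antibody with reverted hypermutations.
    ReceptorReactivity("rr3", "r_reverted", "study-X"),
]
cell_reactivities = [CellReactivity("cr1", "c3")]


def cells_per_receptor(receptor_id):
    """Cells that reference the given Receptor (the N side on the left)."""
    return [c.cell_id for c in cells if receptor_id in c.receptors]


def studies_for_receptor(receptor_id):
    """Studies that contributed measurements for the Receptor (N side on the right)."""
    return sorted({r.study_id for r in receptor_reactivities if r.receptor_id == receptor_id})


print(cells_per_receptor("r1"), studies_for_receptor("r1"))
```

Note that nothing in this layout ties a ReceptorReactivity to a particular Cell; the cell-level observation lives only in CellReactivity.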

bcorrie commented 5 months ago

@bussec as a non-expert, your argument makes sense to me.

Am I correct in stating that:

If I am not mistaken you have not made that change yet, correct?

For the record, we are trying to curate a study that has CellReactivity using MHC:peptide multimers. That is where issue #704 "Receptor object issues when used in real life..." came from. There is no Receptor inference done in this study and the reactivity is to the Cell. So to capture this study accurately from the experimental perspective we would have Cell and CellReactivity but no Receptor and ReceptorReactivity.

If the study actually inferred that Cell C was actually an instance of Receptor R, we would then create a Receptor object and link it to the Cell.

So I think your rationale of adding CellReactivity and having two different types of "Reactivity" makes sense to me.

bussec commented 5 months ago

@bcorrie Yes, your two points are correct. I think that a separate CellReactivity record is the cleaner solution, although it will be very similar to the ReceptorReactivity record. I haven't made these changes yet, but we would need to clean up some of the experimental keywords in addition.

schristley commented 5 months ago

Thank you for the detailed description @bussec

  1. Reactivity of recombinant receptors:

A somewhat naive question. This overlaps (same, different?) with the annotation that IEDB is doing?

schristley commented 5 months ago

  2. Reactivity of individual cells: In this case cells ...

Hi @bpeters42, this is an interesting case that Christian brings up. Is this something that IEDB is currently annotating in some way? While the reactivity is not tied to one specific receptor, it is limited to a small possible set.

@bussec Are you aware of any publications that have reported this type of data?

bcorrie commented 5 months ago

| Reactivity of recombinant receptors:

| A somewhat naive question. This overlaps (same, different?) with the annotation that IEDB is doing?

See my use case here about how one might annotate a Cell with two paired chain Rearrangements with Receptor information that is linked to IEDB: https://github.com/airr-community/airr-standards/issues/704#issuecomment-1891063546

I think the basic idea for Receptor is not to reproduce what IEDB and other similar tools do but instead link to them (through the receptor_ref field), which would typically contain an IEDB (or other relevant repository) reference for the Receptor.

bcorrie commented 5 months ago

@bussec Are you aware of any publications that have reported this type of data?

We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.

We have loaded the Rearrangement and Cell data for this study already (https://gateway.ireceptor.org/samples?query_id=93707) and are working on the Cell/CellReactivity linking currently. We also have similar specificity data for some of our T1D studies that we are working on.

In fact, it is the curation of this study in the ADC that caused me to create #704 "Receptor object issues when used in real life...". To my knowledge this is the first "real life" use of the AIRR Cell/Receptor/ReceptorReactivity/CellReactivity data model - hence the fact that we are coming across issues 8-)

In this case, with the new model, we will be creating CellReactivity objects and NOT ReceptorReactivity objects for this study.

If some of the Cells in this study are known Receptors that have a known epitope specificity in IEDB, and we identify these Cells, we could create Receptor objects for those Cells and then link that Receptor object to the relevant object in IEDB. If we were keeners, we could then replicate the specificity information from IEDB into a set of ReceptorReactivity objects.

At that point, the end user of the ADC could search for Cells with reactivity against specific epitopes or antigens that were acquired through direct experiment (CellReactivity) AND through Receptor association (ReceptorReactivity). I think that is the end game goal here. The end user wants to say, give me all of the Cells that have SARS-CoV2 specific reactivity. They ideally would get all such Cells that the experiments in the ADC identified as well as any other Cells in the ADC that have such specificity through known Receptor reactivity.
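
The end-user query described here could look roughly like the Python sketch below, which returns Cells that are reactive to an epitope either by direct measurement (CellReactivity) or through a Receptor association (ReceptorReactivity). The tuple-based stores, the epitope labels, and the function name are placeholders for illustration and do not reflect the ADC API.

```python
# cell_id -> receptor IDs assigned to that cell (hypothetical data)
cells = {"c1": ["r1"], "c2": ["r2"], "c3": []}

# (cell_id, epitope): measured directly in the experiment
cell_reactivity = [("c3", "epitope-S")]

# (receptor_id, epitope): known for the Receptor, e.g. from an external resource
receptor_reactivity = [("r1", "epitope-S"), ("r2", "epitope-N")]


def cells_reactive_to(epitope):
    """Return cells reactive to the epitope, split by source of evidence."""
    direct = {c for c, e in cell_reactivity if e == epitope}
    reactive_receptors = {r for r, e in receptor_reactivity if e == epitope}
    via_receptor = {c for c, rs in cells.items() if reactive_receptors & set(rs)}
    return {"direct": sorted(direct), "via_receptor": sorted(via_receptor)}


print(cells_reactive_to("epitope-S"))
```

Keeping the two sources separate in the result lets the user see which hits were measured in an experiment and which were inferred from the Receptor.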

bcorrie commented 5 months ago

@bussec I have added a start of CellReactivity. Has a cell_id rather than study_id, and I removed receptor_hash as I don't think it applies??? I also removed receptor_reactivity from Cell.

scharch commented 5 months ago

@bussec Are you aware of any publications that have reported this type of data?

Lots of this going on in the B cell world: https://pubmed.ncbi.nlm.nih.gov/34848871/ https://pubmed.ncbi.nlm.nih.gov/37935199/ https://pubmed.ncbi.nlm.nih.gov/36708513/ https://pubmed.ncbi.nlm.nih.gov/37440409/

...though many of those don't do more than an SRA deposit, in terms of reporting.

javh commented 5 months ago

I'm a bit confused here. If there's reactivity data without any associated AIRR-seq data, then isn't this out of scope for the AIRR schema?

The COVID-19 paper with the oligo tagged MHC:peptide assay is measuring both the TCR and the reactivity of the TCR simultaneously (via the 10x protocol). The purpose of these assays is to determine the reactivity of individual TCRs. Wouldn't this be a case for Receptor -> ReceptorReactivity?

scharch commented 5 months ago

2. Reactivity of individual cells: In this case cells are incubated with a bait (e.g., a protein or carbohydrate for B cells or an MHC:peptide multimer for T cells) that is labeled, e.g., with a fluorescent or DNA barcode tag. The presence of the tag on the cells (measured, e.g., as fluorescence or as barcode counts in the sequencing data to stick with the previous examples) is then interpreted as reactivity that can be attributed to the Ig/TCR expressed by the cell. We therefore had the idea to use ReceptorReactivity to capture this information,

@bussec @bcorrie isn't this already covered by CellExpression though? In particular, we have "antigen_bait_binding_by_fluorescence_intensity" and "antigen_bait_binding_by_dna_barcode_count" as recommended values for CellExpression.property_type.

I could see how it might make sense to separate this into a new CellReactivity object, since antigens are not "expressed," but (a) we should make the boundaries/use cases clear and (b) the above should be added to CellReactivity.reactivity_method along with corresponding changes to readout.

scharch commented 5 months ago

I'm a bit confused here. If there's reactivity data without any associated AIRR-seq data, then isn't this out of scope for the AIRR schema?

The COVID-19 paper with the oligo tagged MHC:peptide assay is measuring both the TCR and the reactivity of the TCR simultaneously (via the 10x protocol). The purpose of these assays is to determine the reactivity of individual TCRs. Wouldn't this be a case for Receptor -> ReceptorReactivity?

I guess that "measuring both the TCR and the reactivity of the TCR simultaneously" is not quite true since (a) you can have a reactivity measurement for a cell from which a TCR was not captured (and vice versa) and (b) an allelic inclusion (or cell doublet) would result in a single reactivity measurement but >1 possible receptor.
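
A small Python sketch of the two caveats: a cell-level readout can only be attributed to a receptor when exactly one receptor was captured for that cell. The function name and the returned labels are made up for illustration.

```python
def attribution(cell_receptor_ids, has_reactivity):
    """Classify how a cell-level reactivity readout relates to captured receptors."""
    if not has_reactivity:
        return "no reactivity measured"
    if len(cell_receptor_ids) == 0:
        # case (a): bait signal, but no TCR/BCR was captured for this cell
        return "reactivity without a captured receptor"
    if len(cell_receptor_ids) == 1:
        return "single candidate receptor"
    # case (b): allelic inclusion or a cell doublet
    return "ambiguous: more than one candidate receptor"


print(attribution([], True))
print(attribution(["r1"], True))
print(attribution(["r1", "r2"], True))
```

Only the middle case would support a straightforward Receptor -> ReceptorReactivity record; the other two are naturally described at the Cell level.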

schristley commented 5 months ago

@bussec Are you aware of any publications that have reported this type of data?

We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.

This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?

javh commented 5 months ago

I guess that "measuring both the TCR and the reactivity of the TCR simultaneously" is not quite true since (a) you can have a reactivity measurement for a cell from which a TCR was not captured (and vice versa) and (b) an allelic inclusion (or cell doublet) would result in a single reactivity measurement but >1 possible receptor.

@scharch good point. Hrm. Is the main challenge here that the global Receptor object is too ambitious? Hard to update, will require inferences that aren't simple, etc? In which case, I think the proposed CellReactivity would end up being the default route people take because of convenience.

bcorrie commented 5 months ago

@bussec Are you aware of any publications that have reported this type of data?

We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.

This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?

The AIRR-seq, Cell, and GEX data from the study.

This seems like a perfect use case for linking the ADC and IEDB since the study is currently curated in both. Currently the AIRR-seq/Cell/Gex data is in the ADC and the Epitope specificity data is in IEDB. There is currently no way to determine that there is related data in both repositories except through searching for the publication in each platform. In fact I didn't know it was curated in IEDB. Isn't that the crux of the problem 8-)

I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 Receptors and link the Cells we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related Receptor objects, which in turn would have a link to the related IEDB object. I suspect there is an N to 1 relationship between Cell and Receptor in this study and it is likely that all Cells will have a Receptor as we are only curating the epitope specific Cells from the study. So they should all have made it into IEDB.

I suppose that points out in a bit more detail what is different between IEDB and the ADC. You can go from Receptor back to the Cells that provided the evidence, and then look at either the more detailed sequence annotation and/or the gene expression of those Cells. IEDB on the other hand has all the relevant information about the specific Receptor including any other studies that provide evidence for the given Receptor specificity.

bcorrie commented 5 months ago

I could see how it might make sense to separate this into a new CellReactivity object, since antigens are not "expressed," but (a) we should make the boundaries/use cases clear and (b) the above should be added to CellReactivity.reactivity_method along with corresponding changes to readout.

Yes, so far we have just copied the fields in ReceptorReactivity to CellReactivity and have not revisited the changes that @bussec suggested here: https://github.com/airr-community/airr-standards/pull/705#issuecomment-1891205323

bcorrie commented 5 months ago

I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 Receptors and link the Cells we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related Receptor objects, which in turn would have a link to the related IEDB object. I suspect there is an N to 1 relationship between Cell and Receptor and it is likely that all Cells will have a Receptor as we are only curating the epitope specific Cells from the study.

So for this study's reactivity data it seems like the minimal would be:

In this way:

This is non-trivial, but at least "semantically" distinct???

In the AKC world, this is all going to be trivial, right 8-)

bcorrie commented 5 months ago

  • One can find reactivity data in the ADC at the Receptor level with a more complex query. You would need to do something like "give me the set of Cells C that are linked to Receptor R and then give me all the CellReactivity for all the Cells in set C"

I wonder if CellReactivity should have a receptor_id so that a Cell->Epitope reaction can be associated with both a Cell and a Receptor, which would avoid the above issue... It can be null if no Receptor is created, but if there is a Cell->Receptor association, should we maintain this in the CellReactivity? Possibly not, as this makes internal consistency challenging.
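
The two-step query, and the consistency problem that an optional receptor_id on CellReactivity would introduce, can be sketched in Python as follows. The record layout and helper names are assumptions; in particular, the `receptor_id` key on CellReactivity is the hypothetical field being debated here, not something in the schema.

```python
cells = [
    {"cell_id": "c1", "receptors": ["r1"]},
    {"cell_id": "c2", "receptors": ["r1", "r2"]},
    {"cell_id": "c3", "receptors": ["r2"]},
]
cell_reactivities = [
    {"cell_reactivity_id": "cr1", "cell_id": "c1", "receptor_id": None},
    {"cell_reactivity_id": "cr2", "cell_id": "c2", "receptor_id": "r1"},
    {"cell_reactivity_id": "cr3", "cell_id": "c3", "receptor_id": "r1"},  # stale link
]


def cell_reactivity_for_receptor(receptor_id):
    """Step 1: cells linked to the Receptor. Step 2: CellReactivity for those cells."""
    cell_ids = {c["cell_id"] for c in cells if receptor_id in c["receptors"]}
    return [cr["cell_reactivity_id"] for cr in cell_reactivities if cr["cell_id"] in cell_ids]


def inconsistent_receptor_links():
    """CellReactivity records whose receptor_id is not among their Cell's receptors."""
    receptors_by_cell = {c["cell_id"]: set(c["receptors"]) for c in cells}
    return [
        cr["cell_reactivity_id"]
        for cr in cell_reactivities
        if cr["receptor_id"] is not None
        and cr["receptor_id"] not in receptors_by_cell[cr["cell_id"]]
    ]


print(cell_reactivity_for_receptor("r1"))
print(inconsistent_receptor_links())
```

The second helper shows the cost of the shortcut: once the same association is stored in two places, something has to check that they still agree.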

bcorrie commented 5 months ago

I have added repertoire_id and data_processing_id to CellReactivity similar to how we have it with CellExpression. I think we need to link the reactivity back to the Repertoire, no?

bcorrie commented 5 months ago

@scharch good point. Hrm. Is the main challenge here that the global Receptor object is too ambitious? Hard to update, will require inferences that aren't simple, etc? In which case, I think the proposed CellReactivity would end up being the default route people take because of convenience.

It seems to me that the experiments are measuring CellReactivity. Isn't the fact that there is ReceptorReactivity inferred in some sense? If we define a Receptor in this case aren't we inferring that the reactivity is between the Receptor and the epitope?

schristley commented 5 months ago

@bussec Are you aware of any publications that have reported this type of data?

We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.

This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?

The AIRR-seq, Cell, and GEX data from the study.

This seems like a perfect use case for linking the ADC and IEDB since the study is currently curated in both. Currently the AIRR-seq/Cell/Gex data is in the ADC and the Epitope specificity data is in IEDB. There is currently no way to determine that there is related data in both repositories except through searching for the publication in each platform. In fact I didn't know it was curated in IEDB. Isn't that the crux of the problem 8-)

Yes, though I meant specifically about reactivity. My concern was duplication of curation between the two resources about receptors and reactivity; if there isn't coordination, you get two separate curations, which confuses users. The link that you want between the two resources doesn't happen automatically, so yeah, that's the crux of the problem, but ADC doing its own curation that duplicates IEDB doesn't sound like a good idea.

I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 Receptors and link the Cells we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related Receptor objects, which in turn would have a link to the related IEDB object.

Or you can just point the AIRR Study to the IEDB Study and then you know there is information in both. Or have the AIRR Rearrangements point to the IEDB records. Or have AIRR Cells point to IEDB records. Or what you said... There are many options, it all depends upon what your goals are based upon some driving use cases.

schristley commented 5 months ago

In the AKC world, this is all going to be trivial, right 8-)

If not trivial then at least easier and more consistent. Part of my confusion with this whole thread is understanding where the boundaries are. Many things seem to be exactly what the AKC data harmonization effort is doing. If AIRR is trying to create standards for the integration effort and data harmonization that AKC is doing, then it feels a bit like the cart before the horse.

bcorrie commented 5 months ago

@schristley said:

Or you can just point the AIRR Study to the IEDB Study and then you know there is information in both. Or have the AIRR Rearrangements point to the IEDB records. Or have AIRR Cells point to IEDB records. Or what you said... There are many options, it all depends upon what your goals are based upon some driving use cases.

We (the iReceptor team) chose the Minervina et al. study because it provides a good scientific use case that tested all of the new AIRR Standard constructs - Subject HLA, Rearrangement, Cell, CellExpression, Receptor, and ReceptorReactivity that were available after the release of AIRR v1.4/iReceptor v4.0. Thus in a way it was chosen to help us both poke holes in the AIRR model as well as to understand the boundaries between the ADC and other resources better.

In order to poke holes in the AIRR model we wanted to curate the study as completely as possible. So we want to curate all of the HLA, Rearrangement, Cell, CellExpression, Receptor, and CellReactivity data that was produced in the study in the appropriate ADC constructs. Although we have done all of the others, we have only recently been trying to curate the Minervina Receptor, ReceptorReactivity, or CellReactivity data - and that is what started this whole discussion 8-)

@schristley also said:

If not trivial then at least easier and more consistent. Part of my confusion with this whole thread is understanding where the boundaries are. Many things seem to be exactly what the AKC data harmonization effort is doing. If AIRR is trying to create standards for the integration effort and data harmonization that AKC is doing, then it feels a bit like the cart before the horse.

For me the AIRR Receptor and Reactivity objects are designed to enable data harmonization and integration, not necessarily perform that integration. That is why Receptor has receptor_refs which point to external sources of information (such as IEDB) that might have information about that object that doesn't exist in the ADC. There is no intent to duplicate data needlessly.

The reason there is overlap in terms of Reactivity data between the ADC and IEDB in this case is that the data from the Minervina study has been reused for both purposes. This is a good thing, and to me these uses are complementary but different. The links between Cell, CellExpression, and CellReactivity are all part of the Minervina immune response study, and are only found in the ADC, so I do believe that the CellReactivity (linked back to the actual Cell) from the study should be captured in the ADC.

This study also provides evidence for the specificity of known Receptors in IEDB, so the data has also been curated and captured there.

Our challenge, as usual, is to figure out how to link the two so that the user doesn't get confused... 8-)

bcorrie commented 5 months ago

Minor tweak in CellReactivity and ReceptorReactivity - shouldn't reactivity_unit be a Unit Ontology (UO) field?

        reactivity_unit:
            type: string
            description: The unit of the measurement
            example: pg/ml
            x-airr:
                nullable: false
                adc-query-support: true

http://purl.obolibrary.org/obo/UO_0010070

javh commented 5 months ago

From the call:

bcorrie commented 5 months ago

Minor tweak in CellReactivity and ReceptorReactivity - shouldn't reactivity_unit be a Unit Ontology (UO) field?

Answering this myself. I don't think it can, because there is a need to have a Reactivity as a true/false, which I can't find in UO. But I am also not sure how to represent the boolean "this cell binds to this antigen/epitope/peptide".

Do we really know the fields reactivity_readout and reactivity_method well enough to have these as enums? Should they be strings with a recommended limited vocabulary instead?
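To make the boolean question concrete, here is a minimal sketch of one possible shape. It assumes a hypothetical `ReactivityValue` container (not part of the AIRR schema) in which the unit is optional, so a yes/no readout carries no unit while a quantitative readout must have one. The `pg/ml` value is the example from the schema fragment above; everything else is illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ReactivityValue:
    """Hypothetical container for illustration; not an AIRR schema object."""

    value: Union[bool, float]
    unit: Optional[str] = None  # a unit string or UO term; None for a yes/no readout

    def __post_init__(self):
        is_binary = isinstance(self.value, bool)
        if is_binary and self.unit is not None:
            raise ValueError("a binary readout should not carry a unit")
        if not is_binary and self.unit is None:
            raise ValueError("a quantitative readout needs a unit")


# "This cell binds" as a plain boolean, with no unit attached.
binds = ReactivityValue(value=True)

# A quantitative measurement, with the unit from the schema example.
titer = ReactivityValue(value=12.5, unit="pg/ml")
```

Note that this would require reactivity_unit to be nullable, which is not how the field is currently defined.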

bcorrie commented 5 months ago

From the call:

  • Keep lean version of Receptor.
  • Remove ReceptorReactivity.

Done

  • Receptor.reactivity_measurements to reactivity_ref.

This was already removed. There is now only a Receptor.receptor_ref

It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".

  • Can we rename CellReactivity to Reactivity.

This seems more appropriately named CellReactivity to me, since it is actually linked to the Cell. I don't think this is particularly critical since there is a cell_id field that makes the implicit link to Cell clear.

But it seems to me that since we have CellExpression as an object it seems like CellReactivity is more consistent.

bussec commented 4 months ago

Note to self: Changes in the docs are still pending, will wait for #628 to complete.

javh commented 4 months ago

But it seems to me that since we have CellExpression as an object it seems like CellReactivity is more consistent.

No objection from me.

It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".

I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?

bcorrie commented 4 months ago

I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?

The Receptor object now points to a bunch of external entities that contain information about that Receptor, noted with CURIEs. Those entities (e.g. IEDB_RECEPTOR:10) contain information about receptor reactivity, but they currently are not pointing to different reactivity records, you are correct.

        receptor_ref:
            type: array
            description: Array of receptor identifiers defined for the Receptor object
            title: Receptor cross-references
            items:
                type: string
            example: ["IEDB_RECEPTOR:10"]
            x-airr:
                nullable: true
                adc-query-support: true

I think we could link to individual external reactivity records, but that would be moving more towards replicating information that is in those other repositories. The above IEDB Receptor (https://www.iedb.org/receptor/10) has three assays that show positive results:

https://www.iedb.org/assay/1650226
https://www.iedb.org/assay/1650227
https://www.iedb.org/assay/1755178

So we could have:

        receptor_ref:
            type: array
            description: Array of receptor identifiers defined for the Receptor object
            title: Receptor cross-references
            items:
                type: string
            example: ["IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227", "IEDB_ASSAY:1755178"]
            x-airr:
                nullable: true
                adc-query-support: true

But as I say, you can get that from IEDB by linking to a single record (and then asking IEDB for more info), and you are now replicating all of the links to specific assay data that is curated in IEDB. Some Receptors would have MANY assays, so that would be painful to manage on the ADC side.

It makes more sense to me to be linking to the external object at the same conceptual level (e.g. Receptor).
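As a rough illustration of what a consumer would do with these CURIEs, here is a minimal sketch that expands a `PREFIX:local_id` string to a URL. The `CURIE_URL_PREFIXES` map and the `resolve_curie` helper are hypothetical names for this example; the two URL patterns are only the ones that appear in this comment, and `IEDB_ASSAY` is assumed to be usable as a prefix.

```python
# Assumed prefix-to-URL map; only the two URL patterns seen in this thread.
CURIE_URL_PREFIXES = {
    "IEDB_RECEPTOR": "https://www.iedb.org/receptor/",
    "IEDB_ASSAY": "https://www.iedb.org/assay/",
}


def resolve_curie(curie: str) -> str:
    """Split 'PREFIX:local_id' and expand it to a URL using the map above."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in CURIE_URL_PREFIXES:
        raise KeyError(f"unknown CURIE prefix: {prefix}")
    return CURIE_URL_PREFIXES[prefix] + local_id


# A single receptor-level link is enough to reach the external record.
receptor_ref = ["IEDB_RECEPTOR:10"]
urls = [resolve_curie(curie) for curie in receptor_ref]
```

With the receptor-level link, the list of assays stays on the IEDB side and does not have to be mirrored in the ADC record.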

bussec commented 4 months ago

It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".

I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?

The way I interpreted our discussions in the last call was to replace the (AIRR-C-defined) ReceptorReactivity object with links to objects defined and hosted by other repositories. This is different from removing reactivity information completely from the Receptor object. IMO there is value for users to be able to directly assess how much (or how little) information there is about a given receptor, without having to do lookups with other repos.

bcorrie commented 4 months ago

The way I interpreted our discussions in the last call was to replace the (AIRR-C-defined) ReceptorReactivity object with links to objects defined and hosted by other repositories. This is different from removing reactivity information completely from the Receptor object. IMO there is value for users to be able to directly assess how much (or how little) information there is about a given receptor, without having to do lookups with other repos.

So I am not sure what this means in terms of the specification... 8-)

My understanding was that we were removing the ReceptorReactivity object, and we were creating a field in Receptor that linked to related external objects. Is that correct?

Both of my suggestions in https://github.com/airr-community/airr-standards/pull/705#issuecomment-1937835071 meet this requirement. One links to external Receptors while one links to external Assays that provide Receptor Reactivity information.

@bussec do you like the Assay link better? Do we want both? Personally I think the link to "IEDB_RECEPTOR:10" is pretty critical so I don't think we want to get rid of that (that is one of the key points of the Receptor object). Do we want another field that links to something like the IEDB Assay also ("IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227", "IEDB_ASSAY:1755178")?

So do we have Receptor.receptor_refs and Receptor.receptor_reactivity_refs?

I personally think this is a bit of a can of worms, as keeping the two in sync is scary... When an external repository (e.g. IEDB) adds more Assays for an external Receptor, the data in my repository is no longer in sync. At this point we are duplicating data in other repositories that is very challenging to keep in sync.

To me, this seems out of scope for the AIRR Standard.
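The sync problem can be sketched in a few lines. This is a hypothetical scenario: the local list is assumed to have been curated when only two of the three assays existed, and the external list is hard-coded here rather than fetched from IEDB. `stale_refs` is an illustrative helper, not an existing API.

```python
def stale_refs(local_refs, external_refs):
    """Return (missing_locally, no_longer_external) as sorted lists."""
    local, external = set(local_refs), set(external_refs)
    return sorted(external - local), sorted(local - external)


# Hypothetical: what the ADC repository stored at curation time.
local = ["IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227"]

# Hypothetical: what the external repository lists today.
external = ["IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227", "IEDB_ASSAY:1755178"]

missing, removed = stale_refs(local, external)
```

Every repository holding assay-level links would need to run a check like this against the external source on a regular basis, for every Receptor.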

bussec commented 4 months ago

@bcorrie @javh I had another look at this and did some minor fixes with regard to enums of experimental properties in CellReactivity. Otherwise:

  1. I would be ok with removing the reactivity_ref property from the Receptor object and only retain the links to a "receptor-like" object in an external database. While I would still prefer to have the reactivity information in there, I also think that @bcorrie's concerns with regard to logical structure and update frequency of reactivity information are valid.
  2. In which scenario do we need CellReactivity.repertoire_id and CellReactivity.data_processing_id, given that this information is already present in the Cell object?
  3. If we agree on dropping reactivity information altogether, the Docs need some more work-up to reflect the changes regarding *Reactivity objects.
  4. The original ReceptorReactivity object was designed with a lower-throughput experimental setup in mind than what we will get from experiments that would use a CellReactivity record (i.e., 10,000-1,000,000 cells stained with the same set of baits). Irrespective of normalization in the backend database of a repo, do we need to split CellReactivity into a "reagent" and a "measurement" object? Or will file compression save us?

As the last point is not directly connected to the original topic of this PR, I would like to suggest that we make the changes for 1-3 and merge and then have a new issue for 4.
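For point 4, a minimal sketch of what a "reagent"/"measurement" split could look like. The field names `antigen`, `reactivity_value`, and `reagent_id`, as well as the `split_records` helper, are hypothetical; only `cell_id` and `reactivity_method` are taken from the current discussion.

```python
def split_records(flat_records):
    """Factor the repeated bait description out of per-cell records."""
    reagent_ids = {}  # (antigen, method) -> generated reagent_id
    measurements = []
    for rec in flat_records:
        key = (rec["antigen"], rec["reactivity_method"])
        if key not in reagent_ids:
            reagent_ids[key] = f"reagent_{len(reagent_ids) + 1}"
        measurements.append({
            "cell_id": rec["cell_id"],
            "reagent_id": reagent_ids[key],
            "reactivity_value": rec["reactivity_value"],
        })
    reagents = [
        {"reagent_id": rid, "antigen": antigen, "reactivity_method": method}
        for (antigen, method), rid in reagent_ids.items()
    ]
    return reagents, measurements


# Hypothetical flat records: two cells stained with the same bait, one with another.
flat = [
    {"cell_id": "c1", "antigen": "A", "reactivity_method": "m1", "reactivity_value": 3.0},
    {"cell_id": "c2", "antigen": "A", "reactivity_method": "m1", "reactivity_value": 0.0},
    {"cell_id": "c3", "antigen": "B", "reactivity_method": "m1", "reactivity_value": 7.5},
]
reagents, measurements = split_records(flat)
```

The bait description is stored once per reagent instead of once per cell, which is the saving that would otherwise be left to file compression.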

bcorrie commented 4 months ago

  1. I would be ok with removing the reactivity_ref property from the Receptor object and only retain the links to a "receptor-like" object in an external database. While I would still prefer to have the reactivity information in there, I also think that @bcorrie's concerns with regard to logical structure and update frequency of reactivity information are valid.

This is currently the situation. I had not yet added reactivity_ref into the Receptor object, so there is currently only a receptor_ref. So this is "done" unless we decide otherwise.

bcorrie commented 4 months ago

2. In which scenario do we need CellReactivity.repertoire_id and CellReactivity.data_processing_id, given that this information is already present in the Cell object?

Currently, unless I am mistaken (gasp), all observed objects are connected to their respective Repertoire through the relevant _id fields.

Although the use case might not be obvious, it seems like a good idea to let the data consumer find all CellReactivity information related to a Repertoire without having to find all Cells and then search CellReactivity for a large number of Cells. It future-proofs us against use cases that we might not expect to be common.

In fact, now that I think about it from an ADC perspective, one use case would be I want to download all of the Cells, CellExpression, and CellReactivity data from a specific Sample. I would do this by processing the related repertoire_id fields from that Sample and query the appropriate end point once.
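A minimal in-memory sketch of the difference, using made-up records rather than a real ADC query. The two helper functions are hypothetical; the point is only that the direct route is a single filter on `repertoire_id`, while the indirect route has to collect the Cells first.

```python
# Made-up records for illustration only.
cells = [
    {"cell_id": "c1", "repertoire_id": "r1"},
    {"cell_id": "c2", "repertoire_id": "r1"},
    {"cell_id": "c3", "repertoire_id": "r2"},
]
cell_reactivity = [
    {"cell_id": "c1", "repertoire_id": "r1"},
    {"cell_id": "c3", "repertoire_id": "r2"},
]


def reactivity_via_cells(repertoire_id):
    # Two steps: collect the Cells first, then match reactivity by cell_id.
    cell_ids = {c["cell_id"] for c in cells if c["repertoire_id"] == repertoire_id}
    return [r for r in cell_reactivity if r["cell_id"] in cell_ids]


def reactivity_direct(repertoire_id):
    # One step: filter on the repertoire_id stored on the record itself.
    return [r for r in cell_reactivity if r["repertoire_id"] == repertoire_id]
```

Both routes return the same records; the direct one avoids passing a potentially very large list of cell_id values to the CellReactivity query.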

bcorrie commented 4 months ago

As the last point is not directly connected to the original topic of this PR, I would like to suggest that we make the changes for 1-3 and merge and then have a new issue for 4.

Good idea, since I don't understand 4 8-)