Closed bcorrie closed 4 months ago
From the call: @bcorrie and @bussec are in general ok with this PR, @bussec will make some final clarification in the description of `Cell.reactivity_measurement`. Will put this up for a final discussion in January, then merge.
From the call:
- Need to review, but seems mostly fine, except we don't think we need `reactivity_measurements` defined in `Cell`. There wouldn't be a prohibition on it, but it would not be the preferred schema path.
Sorry I missed the call, looks like my calendar entry ended end of 2023. I am still traveling, returning home tomorrow.
- Need to review, but seems mostly fine, except we don't think we need `reactivity_measurements` defined in `Cell`. There wouldn't be a prohibition on it, but it would not be the preferred schema path.
`reactivity_measurements` in both `Cell` and `Receptor` are arrays of IDs that refer to `ReceptorReactivity` objects (essentially the `ReceptorReactivity.receptor_reactivity_id` of the object). So `reactivity_measurements` are not defined in the Cell object per se.
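A minimal sketch of what "arrays of IDs" means in practice, using plain Python dicts as stand-ins for the records. Only the field names `reactivity_measurements` and `receptor_reactivity_id` come from the discussion; the ID values and the `resolve` helper are made up for illustration.

```python
# Hypothetical records: the ID values are invented for this sketch.
reactivities = {
    "rr-1": {"receptor_reactivity_id": "rr-1"},
    "rr-2": {"receptor_reactivity_id": "rr-2"},
}
receptor = {"receptor_id": "rec-1", "reactivity_measurements": ["rr-1", "rr-2"]}
cell = {"cell_id": "cell-1", "reactivity_measurements": ["rr-2"]}


def resolve(record, index):
    """Follow a record's ID array to the referenced ReceptorReactivity objects."""
    return [index[rr_id] for rr_id in record["reactivity_measurements"]]


print([r["receptor_reactivity_id"] for r in resolve(cell, reactivities)])
print([r["receptor_reactivity_id"] for r in resolve(receptor, reactivities)])
```

Neither `Cell` nor `Receptor` embeds the reactivity object here; both only hold references that have to be resolved against a separate collection.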
My understanding of `Cell` having a `receptor_reactivity` is that this would capture when a specific experiment measured a reactivity for an epitope for a specific `Cell` (e.g. in a 10X experiment). See https://github.com/airr-community/airr-standards/issues/704#issuecomment-1641133894
Is that the general understanding? Would this not be the preferred thing to do in such an experiment?
We were thinking the normal usage would be Cell -> Receptor -> ReceptorReactivity. If you wanted to jump over Receptor and put a `reactivity_measurements` array into Cell you could, in the same way that you can add `sample_id` to Rearrangement (semantics are clear even though there's no `sample_id` defined in Rearrangement), but it wouldn't be the model.
Hey gang,
I wanted to toss in some thoughts from a B cell perspective. The above discussion seems very focused on T cells, so I'm trying to figure out whether we need another object to cover antibody specificity or should make this one more general.
Random thoughts:
I agree that reactivity doesn’t make sense for Abs. I haven’t kept up with all the discussion on this, but doesn’t ReceptorSpecificity make more sense and cover BCR, TCR, and Ab (even though Abs aren’t receptors)?
I think that name change covers my name issue. The only thing I could see falling outside that is if we had protein sequence from protein sequencing of antibodies, like Georgiou does. In IRAD we've thought about going after that data type because there is a lot of it in the patent Genbank. But, as far as I know, all our tool chains are very focused on nucleotide sequences, so that's an issue.
@krishnaroskin @lgcowell The `Receptor` object was developed with both AB/BCR and TCR in mind. In general, it can exist independent of a `Cell`, i.e., you can use it to annotate reactivities of antibodies that were not observed in cells derived from a `Subject`, e.g., a recombinantly expressed antibody in which somatic hypermutations were added or removed. However, as @javh described, we assume that the typical way it would be used is in the context of a single-cell experiment.
ReceptorReactivity is more of a T cell concept since I think of "reactivity" as a (T) cell's response when its receptor binds something. For antibodies the concept of reactivity is not on-point since there is no cell.
This is basically a question about the semantics of the term "reactivity". For me, reactivity implies that the receptor or antibody can interact with a given target in a way that can be experimentally detected. Using this definition, activation of the cell that bears a given receptor/antibody is a sufficient, but not a necessary criterion. The reason why we use reactivity instead of specificity is that the latter - similar to its use in statistics - is relative to something else. Polyreactive antibodies for example are exactly that: They bind everything from polystyrol to shoe laces (i.e. they are "reactive") but they are not specific for any of these things.
The above, and the fact that BCRs (on B cells) can also interact with antigen (like T cells), raises the question of whether we should have the distinction between antibodies and BCRs on B cells. We sequence B cells but often test binding/affinity on antibodies. Maybe that distinction should be encapsulated in the description of how the BCR/antibody specificity was measured?
In general this is the idea behind the `Receptor`/`ReceptorReactivity` split: A given receptor should always have the same intrinsic reactivity (otherwise there would be no chance of reproducible results), but different assays will measure it by different means. For example, take an experiment where a fluorescently-labeled protein is used as bait to select for target-specific B cells during sorting. The sorted cells are then subjected to single-cell sequencing and the sequence information of the IGH/IGK transcripts is then used to produce a recombinant antibody that can be tested in vitro (e.g. ELISA). In this case a single `Receptor` record would have one or more `ReceptorReactivity` records that describe the results obtained using the recombinant antibody and one that describes the binding in the initial fluorescent bait assay. The latter `ReactivityRecord` could be directly referenced from a `Cell` record (@bcorrie's use-case), while the former ones would not. The `reactivity_method` and `reactivity_readout` properties can be used to provide further information on the experimental approach.
peptide_aa_string is something that doesn't have much applicability in the B cell world.
Fully agree, therefore the schema does not require you to provide this property.
We were thinking the normal usage would be Cell -> Receptor -> ReceptorReactivity. If you wanted to jump over Receptor and put a `reactivity_measurements` array into Cell you could, in the same way that you can add `sample_id` to Rearrangement (semantics are clear even though there's no `sample_id` defined in Rearrangement), but it wouldn't be the model.
Is this implying that `Cell.receptor_reactivity` should be removed from the object schema? If so I think that is a problem.
When `receptor_reactivity` is measured in an experiment, that reactivity is associated with a specific Cell. If we go `Cell -> Receptor -> ReceptorReactivity` (there is no direct link in `Cell` to the `receptor_reactivity` measured) we lose the information that that specific `receptor_reactivity` is associated with that Cell. We also lose information about the `Study` in which the `receptor_reactivity` was measured.
Remember that since `Receptor` is a "global object" (there is only one Receptor with a specific paired VDJ/CDR3) it is quite likely for a specific `Receptor` to have more than one measured `receptor_reactivity`. This is even likely to happen in a single experiment if I am not mistaken, but it would certainly happen across multiple experiments. Multiple `Cells` will be assigned the same `Receptor` and have different measured `receptor_reactivity`. In the above model a single `Receptor` will have multiple `receptor_reactivity` measurements, but it won't be possible to tell which `receptor_reactivity` measurement came from which `Cell`. This would be even worse if the `receptor_reactivity` measures came from different experiments. In this case we lose information about which `Study` a `receptor_reactivity` measure came from, including the methodology for the study and how `receptor_reactivity` was measured.
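To make the concern concrete, here is a small sketch of the Cell -> Receptor -> ReceptorReactivity path with two cells sharing one receptor. The records are hypothetical, and `receptors` as the name of the receptor-ID array in `Cell` is an assumption made for this example.

```python
# Hypothetical data: rr-A was measured on cell-1 and rr-B on cell-2, but both
# cells were assigned the same global Receptor, which collects both IDs.
receptor = {"receptor_id": "rec-1", "reactivity_measurements": ["rr-A", "rr-B"]}
cells = [
    {"cell_id": "cell-1", "receptors": ["rec-1"]},
    {"cell_id": "cell-2", "receptors": ["rec-1"]},
]
receptor_index = {receptor["receptor_id"]: receptor}


def reactivities_via_receptor(cell):
    """Walk Cell -> Receptor -> ReceptorReactivity and collect reactivity IDs."""
    found = []
    for receptor_id in cell["receptors"]:
        found.extend(receptor_index[receptor_id]["reactivity_measurements"])
    return found


for cell in cells:
    print(cell["cell_id"], reactivities_via_receptor(cell))
```

Both cells resolve to the same two reactivity IDs, so the path through the receptor alone cannot say which measurement was made on which cell.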
Bottom line is I believe we need to be able to capture a link between `Cell` and a specific `receptor_reactivity` directly in the model.
There are two ways of doing this: having an array of `Cell.receptor_reactivity` IDs that point to the reactivity information for that `Cell`, or having a `ReceptorReactivity.cell_id` so that the measured `receptor_reactivity` points back to the `Cell` that produced it. We went with the array in the `Cell` as that is the more logical way of thinking about it (in an experiment `Cells` have measured reactivity) - See https://github.com/airr-community/airr-standards/issues/704#issuecomment-1641133894
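The two options carry the same information and can be converted into each other. A rough sketch with hypothetical records (the ID values and the `to_back_references` helper are invented for illustration):

```python
# Option 1: each Cell carries an array of reactivity IDs.
cells = [
    {"cell_id": "cell-1", "receptor_reactivity": ["rr-A"]},
    {"cell_id": "cell-2", "receptor_reactivity": ["rr-B", "rr-C"]},
]


def to_back_references(cells):
    """Option 2: one reactivity stub per ID, pointing back to its Cell via cell_id."""
    return [
        {"receptor_reactivity_id": rr_id, "cell_id": cell["cell_id"]}
        for cell in cells
        for rr_id in cell["receptor_reactivity"]
    ]


back_refs = to_back_references(cells)
print(back_refs)
```

The difference is where updates happen: with option 1 a new measurement means editing the `Cell` record, with option 2 it only means adding a new reactivity record.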
Hmm, in thinking about this a bit more, the AIRR Standard is primarily about capturing how an experiment is done, no? So a direct link from the `Cell` to the `ReceptorReactivity` that was measured in the experiment is actually the critical path. It is what is actually measured in the experiment.
Mapping `Cells` to `Receptors` is a processing step that may or may not be done in an experiment.
Hmm, in thinking about this a bit more, the AIRR Standard is primarily about capturing how an experiment is done, no?
I agree with this sentiment, which makes me wonder how to get the set of receptor reactivities that were done in a single experiment?
{"digression rant": true}
The underlying issue that I feel we continually run into with these discussions is how to represent relations. How do we resolve the tension between an efficient data structure and our semantic model of the biology (with the ADC often as a third wrinkle). This is where JSON schema fails us because it doesn't have an explicit representation for relations (unlike say SQL DDL), instead it is implicit semantics (like ids and arrays) without any syntactic enforcement.
I tend to always advocate for the efficient data structure. That means in 1-to-n relations, I advocate that the "link" always goes on the "n" side where a single field can be used to point to the "1" object. Versus if you put the relation on the "1" side then you need an array holding "n" objects/ids.
Given this, having `reactivity_measurements` in `Receptor` is "wrong" (inefficient), even though semantically it sounds just fine: "a receptor has a set of reactivities". Instead `ReceptorReactivity` should have a `receptor_id` field that points to its one and only `Receptor`. With the current model, every time there is a reactivity measurement then the "global Receptor object" needs to be updated with a new object added to the array. Who does that? How do we even know when to do that? Sounds messy to me. As an extreme analogy, imagine instead of `Rearrangement` pointing to its V allele, the V allele in the germline set had an array of rearrangements.
So I understand the current discussion is around `Cell`. A `Cell` can have multiple `Receptors`, so it's an n-to-n relation with `ReceptorReactivity`. However if you consider a Cell/Receptor combo, then nominally it's a 1-to-n relation. Using the "efficiency" logic, the `ReceptorReactivity` should have a `cell_id` that points to the `Cell`. If the `ReceptorReactivity` does not involve a cell then that field is null. With `ReceptorReactivity` having both `receptor_id` and `cell_id` then I believe you can represent the true n-to-n relation by having multiple `ReceptorReactivity` objects with various combos of `receptor_id` and `cell_id` values.
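A minimal sketch of that layout, assuming a `ReceptorReactivity` that carries nullable `receptor_id` and `cell_id` fields. The dataclass, the ID values and the two lookup helpers are illustrative only and not part of the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReceptorReactivity:
    receptor_reactivity_id: str
    receptor_id: Optional[str] = None  # points to the one Receptor
    cell_id: Optional[str] = None      # None when no cell is involved


# The n-to-n relation is expressed as several objects with different ID combos.
measurements = [
    ReceptorReactivity("rr-1", receptor_id="rec-1", cell_id="cell-1"),
    ReceptorReactivity("rr-2", receptor_id="rec-1", cell_id="cell-2"),
    ReceptorReactivity("rr-3", receptor_id="rec-2", cell_id="cell-2"),
    ReceptorReactivity("rr-4", receptor_id="rec-1"),  # no cell involved
]


def for_receptor(receptor_id):
    """All reactivity IDs that point at the given receptor."""
    return [m.receptor_reactivity_id for m in measurements if m.receptor_id == receptor_id]


def for_cell(cell_id):
    """All reactivity IDs that point at the given cell."""
    return [m.receptor_reactivity_id for m in measurements if m.cell_id == cell_id]


print(for_receptor("rec-1"))
print(for_cell("cell-2"))
```

Adding a measurement only appends a new object; neither the `Receptor` nor the `Cell` record has to be touched.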
Finally, coming back to where I started, maybe we should consider adding some formalism, like some JSON-LD attributes, that more explicitly defines the relations in the schema. Not really sure if that helps though, though maybe it helps to document the arity of the relation. Looking at some of the new draft objects, I feel we've introduced a number of "data inefficient" relations.
{"digression rant": false}
Using the "efficiency" logic, the `ReceptorReactivity` should have a `cell_id` that points to the `Cell`. If the `ReceptorReactivity` does not involve a cell then that field is null. With `ReceptorReactivity` having both `receptor_id` and `cell_id` then I believe you can represent the true n-to-n relation by having multiple `ReceptorReactivity` objects with various combos of `receptor_id` and `cell_id` values.
I don't object to this implementation if we feel that is a better way to go. As long as we maintain the ability to link `ReceptorReactivity` to both `Cell` and `Receptor` if desired I think we are fine. Both can be null if the experiment does not have such links.
The array stems from the original `Receptor` model where there was an array of reactivity objects embedded in the `Receptor` object. This was changed to have a separate `ReceptorReactivity` object:
And then to an array of `ReceptorReactivity` IDs instead of the actual embedded objects. This was also when we added an array of `ReceptorReactivity` IDs to `Cell` https://github.com/airr-community/airr-standards/commit/5a2915f1acead70561adb571f3431d23402f3df9
I am OK to change `ReceptorReactivity` to have a `cell_id` and `receptor_id` to make these links. We can then remove the arrays `receptor_reactivity` from `Cell` and `Receptor`.
With the current model, every time there is a reactivity measurement then the "global Receptor object" needs to be updated with a new object added to the array. Who does that? How do we even know when to do that? Sounds messy to me.
This is a really good reason to change our model as @schristley suggests...
I just pushed a set of changes to the schema that should address some of the points raised.
Some explanations on these changes and thoughts on the recent discussion regarding the relations between the objects, with a focus on experimental aspects:
From my point of view, there are two distinct types of reactivity information that we are trying to annotate:
Reactivity of recombinant receptors: In this case the receptor is reconstructed based on the sequences from an scAIRR-seq experiment and recombinantly expressed, e.g., as a soluble antibody or a T cell line expressing it as TCR. In such a setup we look at one defined receptor at a time, so we can simplify the Cell:Receptor:Reactivity relation as N:1:N, i.e., multiple cells can have the same receptor and a given receptor can have multiple reactivity measurements. Following our discussion on the representation of 1:N relations, this means that both `Cell` and `ReceptorReactivity` should reference `Receptor` (Note: The fact that a cell can have multiple receptors would IMO not change this decision, as the number of receptors per cell is usually 1 or 2 (and rarely larger than 3), while cells per receptor can easily go into the range of hundreds. However, it means that `Cell` needs to contain an array of receptor IDs (as it does)). Importantly, in such an experiment there is no direct relation between the cell observed in the scAIRR-seq experiment and the reactivity measurement, so there is no reason to have a direct reference between the respective records here. This leads to the situation that the origin of a `ReceptorReactivity` is lost, which is something that - as @bcorrie criticized - we need to avoid. However, I don't see any obvious entry point below the level of the study, which is why my edits contain the addition of a `study_id` property to `ReceptorReactivity`. This also addresses the problem of receptors that are tested within a study, but are not observed in the scAIRR-seq data (e.g., antibodies in which some or all hypermutations are reverted).
Reactivity of individual cells: In this case cells are incubated with a bait (e.g., a protein or carbohydrate for B cells or an MHC:peptide multimer for T cells) that is labeled, e.g., with a fluorescent or DNA barcode tag. The presence of the tag on the cells (measured, e.g., as fluorescence or as barcode counts in the sequencing data to stick with the previous examples) is then interpreted as reactivity that can be attributed to the Ig/TCR expressed by the cell. We therefore had the idea to use `ReceptorReactivity` to capture this information, which then gave rise to the notion that there should be a direct reference between `Cell` and `ReceptorReactivity` as otherwise it would be impossible to attribute a given `ReceptorReactivity` to a `Cell` (given the N:1:N relation described above). After careful consideration, I have decided against adding a `cell_id` property to `ReceptorReactivity` and I would like us to reconsider the original decision to use `ReceptorReactivity` for this type of information, as I think that the ways these experiments are usually conducted do not allow for a proper attribution of reactivity information to an individual receptor:
Therefore, while not impossible, most experiments will provide data that should not be considered to be receptor reactivity data, but rather cell reactivity data.
The most obvious solution would IMO be to leave `Receptor` and `ReceptorReactivity` as they are now, and add a separate `CellReactivity` object, but I am open to other suggestions.
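As a rough sketch of the N:1:N layout described for the recombinant-receptor case: `Cell` references `Receptor` through an array of receptor IDs, while `ReceptorReactivity` references `Receptor` and carries the proposed `study_id`. The records are hypothetical, and `receptors` as the name of the array in `Cell` is an assumption of this example.

```python
from collections import defaultdict

# Hypothetical records; all ID values are invented for illustration.
cells = [
    {"cell_id": "cell-1", "receptors": ["rec-1"]},
    {"cell_id": "cell-2", "receptors": ["rec-1"]},
]
receptor_reactivities = [
    {"receptor_reactivity_id": "rr-1", "receptor_id": "rec-1", "study_id": "study-A"},
    {"receptor_reactivity_id": "rr-2", "receptor_id": "rec-1", "study_id": "study-B"},
]


def reactivities_by_study(records):
    """Group reactivity IDs by the study that measured them."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["study_id"]].append(record["receptor_reactivity_id"])
    return dict(grouped)


def cells_with_receptor(receptor_id):
    """All cells whose receptor-ID array contains the given receptor."""
    return [c["cell_id"] for c in cells if receptor_id in c["receptors"]]


print(reactivities_by_study(receptor_reactivities))
print(cells_with_receptor("rec-1"))
```

With `study_id` on each record, the set of reactivities measured in one study can be retrieved without any link to a `Cell`.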
@bussec as a non-expert, your argument makes sense to me.
Am I correct in stating that:
- `ReceptorReactivity`: to represent your case 2 above, we could simply add a `cell_id` and it would work. But in this case we are "confusing" or "conflating" the two types of reactivity when you are suggesting we probably shouldn't.
- `CellReactivity`: it would have similar fields to `ReceptorReactivity` but would not be related at all to the `Receptor`. It would have a `cell_id` to link it to the `Cell`, but would not have a link to a `Receptor`. We would then remove the array of `receptor_reactivity` in `Cell` also.

If I am not mistaken you have not made that change yet, correct?
For the record, we are trying to curate a study that has `CellReactivity` using MHC:peptide multimers. That is where issue #704 "Receptor object issues when used in real life..." came from. There is no `Receptor` inference done in this study and the reactivity is to the `Cell`. So to capture this study accurately from the experimental perspective we would have `Cell` and `CellReactivity` but no `Receptor` and `ReceptorReactivity`.
If the study inferred that `Cell` C was actually an instance of `Receptor` R, we would then create a `Receptor` object and link it to the `Cell`.
So I think your rationale of adding `CellReactivity` and having two different types of "Reactivity" makes sense to me.
@bcorrie Yes, your two points are correct. I think that a separate `CellReactivity` record is the cleaner solution, although it will be very similar to the `ReceptorReactivity` record. I haven't made these changes yet, but we would need to clean up some of the experimental keywords in addition.
Thank you for the detailed description @bussec
- Reactivity of recombinant receptors:
A somewhat naive question. This overlaps (same, different?) with the annotation that IEDB is doing?
- Reactivity of individual cells: In this case cells ...
Hi @bpeters42, this is an interesting case that Christian brings up. Is this something that IEDB is currently annotating in some way? While the reactivity is not tied to one specific receptor, it is limited to a small possible set.
@bussec Are you aware of any publications that have reported this type of data?
| Reactivity of recombinant receptors:
| A somewhat naive question. This overlaps (same, different?) with the annotation that IEDB is doing?
See my use case here about how one might annotate a `Cell` with two paired chain `Rearrangements` with `Receptor` information that is linked to IEDB: https://github.com/airr-community/airr-standards/issues/704#issuecomment-1891063546
I think the basic idea for `Receptor` is not to reproduce what IEDB and other similar tools do but instead link to them (through the `receptor_ref` field), which would typically contain an IEDB (or other relevant repository) reference for the `Receptor`.
@bussec Are you aware of any publications that have reported this type of data?
We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.
We have loaded the `Rearrangement` and `Cell` data for this study already (https://gateway.ireceptor.org/samples?query_id=93707) and are working on the `Cell`/`CellReactivity` linking currently. We also have similar specificity data for some of our T1D studies that we are working on.
In fact, it is the curation of this study in the ADC that caused me to create #704 "Receptor object issues when used in real life...". To my knowledge this is the first "real life" use of the AIRR `Cell/Receptor/ReceptorReactivity/CellReactivity` data model - hence the fact that we are coming across issues 8-)
In this case, with the new model, we will be creating `CellReactivity` objects and NOT `ReceptorReactivity` objects for this study.
If some of the `Cells` in this study are known `Receptors` that have a known epitope specificity in IEDB, and we identify these `Cells`, we could create `Receptor` objects for those `Cells` and then link that `Receptor` object to the relevant object in IEDB. If we were keeners, we could then replicate the specificity information from IEDB into a set of `ReceptorReactivity` objects.
At that point, the end user of the ADC could search for Cells with reactivity against specific epitopes or antigens that were acquired through direct experiment (`CellReactivity`) AND through `Receptor` association (`ReceptorReactivity`). I think that is the end game goal here. The end user wants to say, give me all of the `Cells` that have SARS-CoV2 specific reactivity. They ideally would get all such `Cells` that the experiments in the ADC identified as well as any other `Cells` in the ADC that have such specificity through known `Receptor` reactivity.
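The "end game" query could look roughly like the following: the union of cells with a directly measured `CellReactivity` and cells linked to a `Receptor` with a matching `ReceptorReactivity`. All records are invented, the `antigen` field is a placeholder for whatever property ends up describing the target, and `receptors` as the array name in `Cell` is an assumption.

```python
# Hypothetical records for illustration only.
cells = [
    {"cell_id": "cell-1", "receptors": []},
    {"cell_id": "cell-2", "receptors": ["rec-1"]},
    {"cell_id": "cell-3", "receptors": ["rec-2"]},
]
cell_reactivities = [{"cell_id": "cell-1", "antigen": "SARS-CoV2"}]
receptor_reactivities = [
    {"receptor_id": "rec-1", "antigen": "SARS-CoV2"},
    {"receptor_id": "rec-2", "antigen": "other-antigen"},
]


def cells_reactive_to(antigen):
    """Cells with direct (CellReactivity) or receptor-derived (ReceptorReactivity) evidence."""
    direct = {r["cell_id"] for r in cell_reactivities if r["antigen"] == antigen}
    reactive_receptors = {
        r["receptor_id"] for r in receptor_reactivities if r["antigen"] == antigen
    }
    via_receptor = {
        c["cell_id"] for c in cells if reactive_receptors.intersection(c["receptors"])
    }
    return sorted(direct | via_receptor)


print(cells_reactive_to("SARS-CoV2"))
```

Keeping the two sets separate before the union also lets a user see which kind of evidence supports each hit.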
@bussec I have added a start of CellReactivity. Has a cell_id rather than study_id, and I removed receptor_hash as I don't think it applies??? I also removed receptor_reactivity from Cell.
@bussec Are you aware of any publications that have reported this type of data?
Lots of this going on in the B cell world: https://pubmed.ncbi.nlm.nih.gov/34848871/ https://pubmed.ncbi.nlm.nih.gov/37935199/ https://pubmed.ncbi.nlm.nih.gov/36708513/ https://pubmed.ncbi.nlm.nih.gov/37440409/
...though many of those don't do more than an SRA deposit, in terms of reporting.
I'm a bit confused here. If there's reactivity data without any associated AIRR-seq data, then isn't this out of scope for the AIRR schema?
The COVID-19 paper with the oligo tagged MHC:peptide assay is measuring both the TCR and the reactivity of the TCR simultaneously (via the 10x protocol). The purpose of these assays is to determine the reactivity of individual TCRs. Wouldn't this be a case for Receptor -> ReceptorReactivity?
2. Reactivity of individual cells: In this case cells are incubated with a bait (e.g., a protein or carbohydrate for B cells or an MHC:peptide multimer for T cells) that is labeled, e.g., with a fluorescent or DNA barcode tag. The presence of the tag on the cells (measured, e.g., as fluorescence or as barcode counts in the sequencing data to stick with the previous examples) is then interpreted as reactivity that can be attributed to the Ig/TCR expressed by the cell. We therefore had the idea to use `ReceptorReactivity` to capture this information,
@bussec @bcorrie isn't this already covered by `CellExpression` though? In particular, we have "antigen_bait_binding_by_fluorescence_intensity" and "antigen_bait_binding_by_dna_barcode_count" as recommended values for `CellExpression.property_type`.
I could see how it might make sense to separate this into a new `CellReactivity` object, since antigens are not "expressed," but (a) we should make the boundaries/use cases clear and (b) the above should be added to `CellReactivity.reactivity_method` along with corresponding changes to readout.
I'm a bit confused here. If there's reactivity data without any associated AIRR-seq data, then isn't this out of scope for the AIRR schema?
The COVID-19 paper with the oligo tagged MHC:peptide assay is measuring both the TCR and the reactivity of the TCR simultaneously (via the 10x protocol). The purpose of these assays is to determine the reactivity of individual TCRs. Wouldn't this be a case for Receptor -> ReceptorReactivity?
I guess that "measuring both the TCR and the reactivity of the TCR simultaneously" is not quite true since (a) you can have a reactivity measurement for a cell from which a TCR was not captured (and vice versa) and (b) an allelic inclusion (or cell doublet) would result in a single reactivity measurement but >1 possible receptor.
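Points (a) and (b) can be sketched as a simple attribution rule: a cell-level measurement maps onto a receptor only when exactly one receptor was captured for that cell. The helper and the records are hypothetical, and `receptors` as the name of the receptor-ID array in `Cell` is an assumption.

```python
def attributable_receptor(cell):
    """Return the receptor ID a cell-level measurement can be attributed to, or None.

    None covers both problem cases: no receptor captured for the cell, and
    more than one candidate receptor (e.g. allelic inclusion or a doublet).
    """
    receptors = cell.get("receptors", [])
    return receptors[0] if len(receptors) == 1 else None


no_tcr_captured = {"cell_id": "cell-1", "receptors": []}
single_receptor = {"cell_id": "cell-2", "receptors": ["rec-1"]}
two_candidates = {"cell_id": "cell-3", "receptors": ["rec-1", "rec-2"]}

for cell in (no_tcr_captured, single_receptor, two_candidates):
    print(cell["cell_id"], attributable_receptor(cell))
```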
@bussec Are you aware of any publications that have reported this type of data?
We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.
This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?
I guess that "measuring both the TCR and the reactivity of the TCR simultaneously" is not quite true since (a) you can have a reactivity measurement for a cell from which a TCR was not captured (and vice versa) and (b) an allelic inclusion (or cell doublet) would result in a single reactivity measurement but >1 possible receptor.
@scharch good point. Hrm. Is the main challenge here that the global Receptor object is too ambitious? Hard to update, will require inferences that aren't simple, etc? In which case, I think the proposed CellReactivity would end up being the default route people take because of convenience.
@bussec Are you aware of any publications that have reported this type of data?
We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.
This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?
The AIRR-seq, Cell, and GEX data from the study.
This seems like a perfect use case for linking the ADC and IEDB since the study is currently curated in both. Currently the AIRR-seq/Cell/Gex data is in the ADC and the Epitope specificity data is in IEDB. There is currently no way to determine that there is related data in both repositories except through searching for the publication in each platform. In fact I didn't know it was curated in IEDB. Isn't that the crux of the problem 8-)
I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 `Receptors` and link the `Cells` we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related `Receptor` objects, which in turn would have a link to the related IEDB object. I suspect there is an N to 1 relationship between `Cell` and `Receptor` in this study and it is likely that all `Cells` will have a `Receptor` as we are only curating the epitope specific `Cells` from the study. So they should all have made it into IEDB.
I suppose that points out in a bit more detail what is different between IEDB and the ADC. In the ADC you can go from `Receptor` back to the `Cells` that provided the evidence, and then look at either the more detailed sequence annotation and/or the gene expression of those `Cells`. IEDB on the other hand has all the relevant information about the specific `Receptor` including any other studies that provide evidence for the given `Receptor` specificity.
I could see how it might make sense to separate this into a new `CellReactivity` object, since antigens are not "expressed," but (a) we should make the boundaries/use cases clear and (b) the above should be added to `CellReactivity.reactivity_method` along with corresponding changes to readout.
Yes, so far we have just copied the fields in `ReceptorReactivity` to `CellReactivity` and have not revisited the changes that @bussec suggested here: https://github.com/airr-community/airr-standards/pull/705#issuecomment-1891205323
I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 `Receptors` and link the `Cells` we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related `Receptor` object, which in turn would have a link to the related IEDB object. I suspect there is an N to 1 relationship between `Cell` and `Receptor` and it is likely that all `Cells` will have a `Receptor` as we are only curating the epitope specific `Cells` from the study.
So for this study's reactivity data it seems like the minimal would be:
- A `CellReactivity` object for each measured reactivity between a `Cell` and an epitope target. This is linked back to the `Cell` through the `cell_id` field. These are the actual measured experimental results.

In this way:
- One can find reactivity data in the ADC at the `Cell` level by looking at `CellReactivity`.
- One can find reactivity data in the ADC at the Receptor level with a more complex query. You would need to do something like "give me the set of `Cells` C that are linked to Receptor R and then give me all the `CellReactivity` for all the `Cells` in set C".
- ... `ReceptorReactivity`.

This is non-trivial, but at least "semantically" distinct???
In the AKC world, this is all going to be trivial, right 8-)
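The "more complex query" at the Receptor level could be sketched as a two-step lookup: first collect the set of cells C linked to receptor R, then collect the `CellReactivity` records for those cells. The records are hypothetical; `receptors` (array of receptor IDs in `Cell`) and `cell_reactivity_id` are assumed field names.

```python
# Hypothetical records for illustration only.
cells = [
    {"cell_id": "cell-1", "receptors": ["rec-R"]},
    {"cell_id": "cell-2", "receptors": ["rec-R"]},
    {"cell_id": "cell-3", "receptors": ["rec-other"]},
]
cell_reactivities = [
    {"cell_reactivity_id": "cr-1", "cell_id": "cell-1"},
    {"cell_reactivity_id": "cr-2", "cell_id": "cell-2"},
    {"cell_reactivity_id": "cr-3", "cell_id": "cell-3"},
]


def cell_reactivities_for_receptor(receptor_id):
    """Step 1: cells linked to the receptor. Step 2: their CellReactivity IDs."""
    cell_ids = {c["cell_id"] for c in cells if receptor_id in c["receptors"]}
    return [
        r["cell_reactivity_id"] for r in cell_reactivities if r["cell_id"] in cell_ids
    ]


print(cell_reactivities_for_receptor("rec-R"))
```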
- One can find reactivity data in the ADC at the Receptor level with a more complex query. You would need to do something like "give me the set of `Cells` C that are linked to Receptor R and then give me all the `CellReactivity` for all the `Cells` in set C"
I wonder if `CellReactivity` should have a `receptor_id` so that a `Cell->Epitope` reaction can be associated with both a `Cell` and a `Receptor`, which would avoid the above issue... It can be null if no `Receptor` is created, but if there is a `Cell->Receptor` association, should we maintain this in the `CellReactivity`? Possibly not, as this makes internal consistency challenging.
I have added repertoire_id and data_processing_id to `CellReactivity` similar to how we have it with `CellExpression`. I think we need to link the reactivity back to the `Repertoire`, no?
@scharch good point. Hrm. Is the main challenge here that the global Receptor object is too ambitious? Hard to update, will require inferences that aren't simple, etc? In which case, I think the proposed CellReactivity would end up being the default route people take because of convenience.
It seems to me that the experiments are measuring `CellReactivity`. Isn't the fact that there is `ReceptorReactivity` inferred in some sense? If we define a `Receptor` in this case aren't we inferring that the reactivity is between the Receptor and the epitope?
@bussec Are you aware of any publications that have reported this type of data?
We are working on curating this paper: https://doi.org/10.1038/s41590-022-01184-4 which uses MHC/peptide dextramer tags to determine cells that have specificity to specific MHC/peptide complexes.
This study is curated in IEDB. It has 891 receptors, 44 assays, 18 epitopes and 5 antigens. What is AIRR trying to curate that isn't already in IEDB?
The AIRR-seq, Cell, and GEX data from the study.
This seems like a perfect use case for linking the ADC and IEDB since the study is currently curated in both. Currently the AIRR-seq/Cell/Gex data is in the ADC and the Epitope specificity data is in IEDB. There is currently no way to determine that there is related data in both repositories except through searching for the publication in each platform. In fact I didn't know it was curated in IEDB. Isn't that the crux of the problem 8-)
Yes, though I meant specifically about reactivity. My concern was duplication of curation between the two resources about receptors and reactivity, especially if there isn't coordination, then you get two separate curations, leading to confusion for users. The link that you want between the two resources doesn't happen automatically, so yeah that's the crux of the problem, but ADC doing its own curation that duplicates IEDB doesn't sound like a good idea.
I suppose that this would mean that if we wanted to be complete in the curation of this study, we would need to create 891 `Receptors` and link the `Cells` we have loaded (https://gateway.ireceptor.org/samples/cell?query_id=93707) to each of the related `Receptor` object, which in turn would have a link to the related IEDB object.
Or you can just point the AIRR Study to the IEDB Study and then you know there is information in both. Or have the AIRR Rearrangements point to the IEDB records. Or have AIRR Cells point to IEDB records. Or what you said... There are many options, it all depends upon what your goals are based upon some driving use cases.
In the AKC world, this is all going to be trivial, right 8-)
If not trivial then at least easier and more consistent. Part of my confusion with this whole thread is understanding where the boundaries are. Many things seem to be exactly what the AKC data harmonization effort is doing. If AIRR is trying to create standards for the integration effort and data harmonization that AKC is doing, then it feels a bit like the cart before the horse.
@schristley said:
Or you can just point the AIRR Study to the IEDB Study and then you know there is information in both. Or have the AIRR Rearrangements point to the IEDB records. Or have AIRR Cells point to IEDB records. Or what you said... There are many options, it all depends upon what your goals are based upon some driving use cases.
We (the iReceptor team) chose the Minervina et al. study because it provides a good scientific use case that tested all of the new AIRR Standard constructs - Subject HLA, Rearrangement, Cell, CellExpression, Receptor, and ReceptorReactivity that were available after the release of AIRR v1.4/iReceptor v4.0. Thus in a way it was chosen to help us both poke holes in the AIRR model as well as to understand the boundaries between the ADC and other resources better.
In order to poke holes in the AIRR model we wanted to curate the study as completely as possible. So we want to curate all of the HLA, Rearrangement, Cell, CellExpression, Receptor, and CellReactivity data that was produced in the study in the appropriate ADC constructs. Although we have done all of the others, we have only recently been trying to curate the Minervina Receptor, ReceptorReactivity, or CellReactivity data - and that is what started this whole discussion 8-)
@schristley also said:
If not trivial then at least easier and more consistent. Part of my confusion with this whole thread is understanding where the boundaries are. Many things seem to be exactly what the AKC data harmonization effort is doing. If AIRR is trying to create standards for the integration effort and data harmonization that AKC is doing, then it feels a bit like the cart before the horse.
For me the AIRR Receptor and Reactivity objects are designed to enable data harmonization and integration, not necessarily perform that integration. That is why Receptor has receptor_refs which point to external sources of information (such as IEDB) that might have information about that object that doesn't exist in the ADC. There is no intent to duplicate data needlessly.
The reason there is overlap in terms of Reactivity data between the ADC and IEDB in this case is that the data from the Minervina study has been reused for both purposes. This is a good thing, and to me these uses are complementary but different. The links between Cell, CellExpression, and CellReactivity are all part of the Minervina immune response study, and are only found in the ADC, so I do believe that the CellReactivity (linked back to the actual Cell) from the study should be captured in the ADC.
This study also provides evidence for the specificity of known Receptors in IEDB, so the data has also been curated and captured there.
Our challenge, like usual, is to figure out how to link the two so that the user doesn't get confused... 8-)
Minor tweak in CellReactivity and ReceptorReactivity - shouldn't reactivity_unit be a Unit Ontology (UO) field?
reactivity_unit:
    type: string
    description: The unit of the measurement
    example: pg/ml
    x-airr:
        nullable: false
        adc-query-support: true
From the call:
- Keep lean version of Receptor.
- Remove ReceptorReactivity.
- Receptor.reactivity_measurements to reactivity_ref.
- Can we rename CellReactivity to Reactivity.
Minor tweak in CellReactivity and ReceptorReactivity - shouldn't reactivity_unit be a Unit Ontology (UO) field?
Answering this myself. I don't think it can because there is a need to have a Reactivity as a true/false - which I can't find in UO. But I am also not sure how to represent the boolean "this cell binds to this antigen/epitope/peptide".
Do we really know the fields reactivity_readout and reactivity_method well enough to have these as enums? Should they be strings with a recommended limited vocabulary instead?
From the call:
- Keep lean version of Receptor.
- Remove ReceptorReactivity.
Done
- Receptor.reactivity_measurements to reactivity_ref.
This was already removed. There is now only a Receptor.receptor_ref.
It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".
- Can we rename CellReactivity to Reactivity.
This seems more appropriately named CellReactivity to me, since it is actually linked to the Cell. I don't think this is particularly critical since there is a cell_id field that makes the implicit link to Cell clear.
But it seems to me that since we have CellExpression as an object it seems like CellReactivity is more consistent.
Note to self: Changes in the docs are still pending, will wait for #628 to complete.
But it seems to me that since we have CellExpression as an object it seems like CellReactivity is more consistent.
No objection from me.
It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".
I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?
I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?
The Receptor object now points to a bunch of external entities that contain information about that Receptor, noted with CURIEs. Those entities (e.g. IEDB_RECEPTOR:10) contain information about receptor reactivity, but they currently are not pointing to different reactivity records, you are correct.
receptor_ref:
    type: array
    description: Array of receptor identifiers defined for the Receptor object
    title: Receptor cross-references
    items:
        type: string
    example: ["IEDB_RECEPTOR:10"]
    x-airr:
        nullable: true
        adc-query-support: true
I think we could link to individual external reactivity records, but that would be moving more towards replicating information that is in those other repositories. The above IEDB Receptor (https://www.iedb.org/receptor/10) has three assays that show positive results:
https://www.iedb.org/assay/1650226 https://www.iedb.org/assay/1650227 https://www.iedb.org/assay/1755178
So we could have:
receptor_ref:
    type: array
    description: Array of receptor identifiers defined for the Receptor object
    title: Receptor cross-references
    items:
        type: string
    example: ["IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227", "IEDB_ASSAY:1755178"]
    x-airr:
        nullable: true
        adc-query-support: true
But as I say, you can get that from IEDB by linking to a single record (and then asking IEDB for more info), and you are now replicating all of the links to specific assay data that is curated in IEDB. Some Receptors would have MANY assays, so that would be painful to manage on the ADC side.
It makes more sense to me to be linking to the external object at the same conceptual level (e.g. Receptor).
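To make the "link at the same conceptual level" idea concrete, here is a minimal sketch of how a consumer might expand the CURIEs in receptor_ref into IEDB URLs. The prefix map and the `expand_curie` helper are hypothetical (they are not part of the AIRR Standard or any library), and the two URL patterns are inferred only from the IEDB links quoted in this thread.

```python
# Hypothetical prefix map, inferred from the IEDB URLs quoted in this thread.
CURIE_PREFIXES = {
    "IEDB_RECEPTOR": "https://www.iedb.org/receptor/",
    "IEDB_ASSAY": "https://www.iedb.org/assay/",
}


def expand_curie(curie: str) -> str:
    """Expand a 'PREFIX:local_id' CURIE into a URL using CURIE_PREFIXES."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in CURIE_PREFIXES:
        raise KeyError(f"unknown CURIE prefix: {prefix!r}")
    return CURIE_PREFIXES[prefix] + local_id


# Illustrative Receptor record; only receptor_ref follows the YAML above.
receptor = {"receptor_id": "example-receptor", "receptor_ref": ["IEDB_RECEPTOR:10"]}
urls = [expand_curie(ref) for ref in receptor["receptor_ref"]]
print(urls)
```

With a single Receptor-level reference, the ADC record stays unchanged when IEDB adds assays; the consumer follows the one link and asks IEDB for the current assay list.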
It seems to me like receptor_ref is a better name for this field. At least when it points to an IEDB record, it is pointing to an IEDB.Receptor object which likely records many documented reactivity records for that Receptor. I think the intent here is we are pointing to external definitions of the Receptor that provide more info like "receptor reactivity".
I don't think that was the goal. I thought we were replacing the functionality of the reactivity object itself. @bussec, am I mistaken?
The way I interpreted our discussions in the last call was to replace the (AIRR-C-defined) ReceptorReactivity object with links to objects defined and hosted by other repositories. This is different from removing reactivity information completely from the Receptor object. IMO there is value for users to be able to directly assess how much (or how little) information there is about a given receptor, without having to do lookups with other repos.
The way I interpreted our discussions in the last call was to replace the (AIRR-C-defined) ReceptorReactivity object with links to objects defined and hosted by other repositories. This is different from removing reactivity information completely from the Receptor object. IMO there is value for users to be able to directly assess how much (or how little) information there is about a given receptor, without having to do lookups with other repos.
So I am not sure what this means in terms of the specification... 8-)
My understanding was that we were removing the ReceptorReactivity object, and we were creating a field in Receptor that linked to related external objects. Is that correct?
Both of my suggestions in https://github.com/airr-community/airr-standards/pull/705#issuecomment-1937835071 meet this requirement. One links to external Receptors while one links to external Assays that provide Receptor Reactivity information.
@bussec do you like the Assay link better? Do we want both? Personally I think the link to "IEDB_RECEPTOR:10" is pretty critical so I don't think we want to get rid of that (that is one of the key points of the Receptor object). Do we want another field that links to something like the IEDB Assay also ("IEDB_ASSAY:1650226", "IEDB_ASSAY:1650227", "IEDB_ASSAY:1755178")?
So do we have Receptor.receptor_refs and Receptor.receptor_reactivity_refs?
I personally think this is a bit of a can of worms, as keeping the two in sync is scary... When an external repository (e.g. IEDB) adds more Assays for an external Receptor, the data in my repository is no longer in sync. At this point we are duplicating data in other repositories that is very challenging to keep in sync.
To me, this seems out of scope for the AIRR Standard.
@bcorrie @javh I had another look at this and did some minor fixes with regard to enums of experimental properties in CellReactivity. Otherwise:
1. I would be ok with removing the reactivity_ref property from the Receptor object and only retain the links to a "receptor-like" object in an external database. While I would still prefer to have the reactivity information in there, I also think that @bcorrie's concerns with regard to logical structure and update frequency of reactivity information are valid.
2. In which scenario do we need CellReactivity.repertoire_id and CellReactivity.data_processing_id, given that this information is already present in the Cell object?
3. *Reactivity objects.
4. ReceptorReactivity object was designed with a lower-throughput experimental setup in mind than what we will get from experiments that would use a CellReactivity record (i.e., 10,000-1,000,000 cells stained with the same set of baits). Irrespective of normalization in the backend database of a repo, do we need to split CellReactivity into a "reagent" and a "measurement" object? Or will file compression save us?
As the last point is not directly connected to the original topic of this PR, I would like to suggest that we make the changes for 1-3 and merge and then have a new issue for 4.
- I would be ok with removing the reactivity_ref property from the Receptor object and only retain the links to a "receptor-like" object in an external database. While I would still prefer to have the reactivity information in there, I also think that @bcorrie's concerns with regard to logical structure and update frequency of reactivity information are valid.
This is currently the situation, I had not yet added reactivity_ref into the Receptor object, so there is currently only a receptor_ref. So this is "done" unless we decide otherwise.
2. In which scenario do we need CellReactivity.repertoire_id and CellReactivity.data_processing_id, given that this information is already present in the Cell object?
Currently, unless I am mistaken (gasp), all observed objects are connected to their respective Repertoire and the relative _id fields.
Although the use case might not be obvious, making it possible for the data consumer to find all CellReactivity information related to a Repertoire without having to find all Cells and then search CellReactivity for a large number of Cells seems like a good idea to me. It future-proofs us against things that we might not think would be common.
In fact, now that I think about it from an ADC perspective, one use case would be that I want to download all of the Cell, CellExpression, and CellReactivity data from a specific Sample. I would do this by processing the related repertoire_id fields from that Sample and querying the appropriate endpoint once.
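A rough in-memory sketch of the difference, in case it helps the discussion. The records and ID values below are made up for illustration, `cell_reactivity_id` is an assumed field name, and both helper functions are hypothetical; this is not ADC API code.

```python
# Hand-made illustrative records; IDs and values are hypothetical.
cells = [
    {"cell_id": "c1", "repertoire_id": "r1"},
    {"cell_id": "c2", "repertoire_id": "r1"},
    {"cell_id": "c3", "repertoire_id": "r2"},
]
cell_reactivity = [
    {"cell_reactivity_id": "x1", "cell_id": "c1", "repertoire_id": "r1"},
    {"cell_reactivity_id": "x2", "cell_id": "c2", "repertoire_id": "r1"},
    {"cell_reactivity_id": "x3", "cell_id": "c3", "repertoire_id": "r2"},
]


def reactivity_direct(reactivity, repertoire_id):
    """One pass: filter CellReactivity on its own repertoire_id field."""
    return [r for r in reactivity if r["repertoire_id"] == repertoire_id]


def reactivity_via_cells(cells, reactivity, repertoire_id):
    """Two steps: find the Cells first, then search CellReactivity by cell_id."""
    cell_ids = {c["cell_id"] for c in cells if c["repertoire_id"] == repertoire_id}
    return [r for r in reactivity if r["cell_id"] in cell_ids]


direct = reactivity_direct(cell_reactivity, "r1")
joined = reactivity_via_cells(cells, cell_reactivity, "r1")
print([r["cell_reactivity_id"] for r in direct])
```

Both paths return the same records here, but without CellReactivity.repertoire_id only the second is available, and against a repository that would mean a Cell query followed by a CellReactivity query over a potentially very large set of cell_id values.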
As the last point is not directly connected to the original topic of this PR, I would like to suggest that we make the changes for 1-3 and merge and then have a new issue for 4.
Good idea, since I don't understand 4 8-)
Fixes #704