Closed sharikrish closed 1 year ago
> Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc.
@srlak These should be properties of the sample, not of the cell. Similarly, `rearrangement_id` alone isn't enough to recover the sequences. To solve both, `Cell` should include required, non-nullable fields for `repertoire_id` and `sample_id`.
> Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc.
> @srlak These should be properties of the sample, not of the cell.
@scharch (assuming that this comment referred to `race` and `ethnicity`): They are actually at the `Subject` level, but IMO were just there as an example for the HPG-XT (#318) and the (currently unresolved) question of how to determine whether an XT is active or not.
> Similarly, rearrangement_id alone isn't enough to recover the sequences. To solve both, Cell should include required, non-nullable fields for repertoire_id and sample_id.
Because `rearrangement_id` is not a globally unique identifier? No objections to adding the other fields, which then would be:
* `sample_id`: 1-to-1 reference, as an individual cell can only come from a single tube
* `repertoire_id`: 1-to-n reference, as our repertoire definition allows overlapping repertoires (e.g. `IgG class-switched B cells` and `CD27+ B cells`).
Correct?
@bussec Correct.
@bussec @srlak in the original single cell extension proposal here:
There were fields around DOIs, keywords, and expression value/expression marker data...
Should we have place holders for this in this spec?
One question as to the `rearrangements` array in the `Cell` object. If each rearrangement has a `cell_id` field, and this presumably matches the `cell_id` in the `Cell` object, do we need to have a list of `rearrangements` at the `Cell` level? Now each `Cell` presumably would only ever be linked with a small number of `rearrangements`, so maybe this is not the end of the world from a data size perspective, but linking both ways can lead to data consistency issues...
We can accomplish this link by having one or the other, but we don't need both... We might want both for efficiency, but we don't need them...
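To make the consistency concern concrete, here is a minimal sketch of a check that compares the two directions of the link. It assumes records are plain dictionaries with the `cell_id`, `rearrangements` and `rearrangement_id` fields discussed here; the function name `find_link_mismatches` is hypothetical and not part of any AIRR library.

```python
def find_link_mismatches(cells, rearrangements):
    """Report disagreements between Cell.rearrangements and Rearrangement.cell_id."""
    # Map each rearrangement_id to the cell_id stored on the rearrangement itself.
    cell_of = {r["rearrangement_id"]: r.get("cell_id") for r in rearrangements}
    problems = []
    # Forward direction: every rearrangement a cell lists must point back to that cell.
    for cell in cells:
        for rid in cell["rearrangements"]:
            if rid not in cell_of:
                problems.append(f"{cell['cell_id']} lists unknown rearrangement {rid}")
            elif cell_of[rid] != cell["cell_id"]:
                problems.append(
                    f"{rid} points to {cell_of[rid]}, but is listed by {cell['cell_id']}"
                )
    # Reverse direction: a rearrangement naming a cell must be listed by that cell.
    listed = {(c["cell_id"], rid) for c in cells for rid in c["rearrangements"]}
    for rid, cid in cell_of.items():
        if cid is not None and (cid, rid) not in listed:
            problems.append(f"{rid} points to {cid}, which does not list it")
    return problems


cells = [{"cell_id": "1086", "rearrangements": ["507", "678"]}]
rearrangements = [
    {"rearrangement_id": "507", "cell_id": "1086"},
    {"rearrangement_id": "678", "cell_id": "2000"},  # deliberately inconsistent
]
mismatches = find_link_mismatches(cells, rearrangements)
```

If only one direction of the link were stored, a check like this would be unnecessary, which is the trade-off raised above.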
What if the `Rearrangement` object, instead of having a duplicate of the `cell_id`, inherits the whole defined `Cell` object? This would then also be triggered only when the single cell extension is required. Does this make sense, and could it solve the potential consistency issues? It's true that for small lists of `rearrangement_id`s it might not make a big difference, but as far as I understand from @srlak's advances on the API (correct me if I am wrong) it gets more tricky when we start defining how to include other single-cell specifics such as flow cytometry markers.
> sample_id: 1-to-1 reference, as an individual cell can only come from a single tube
`sample_id` or `sample_processing_id`? `sample_id` refers to just the MiAIRR sample object excluding cell processing down to sequencing, while `sample_processing_id` refers to the set of objects from sample down to sequencing.
> If each rearrangement has a cell_id field, and this presumably matches the cell_id in the Cell object, do we need to have a list of rearrangements at the Cell level?
@bussec I guess we need to deprecate `cell_id` in rearrangements?
@srlak Should there be a `receptors` array with the list of receptors in the cell?
The Cell object isn't compatible with storing Rearrangement records in a TSV, so you would still need the `cell_id` in Rearrangement for both the flat/simple representation of single-cell and as a way to link from Rearrangement to Cell records (outside of a json/yaml/database representation of Rearrangement records).
> the cell_id in Rearrangement for both the flat/simple representation of single-cell and as a way to link from Rearrangement to Cell records.
I was thinking we wanted to deprecate `cell_id` in rearrangements because it allows that rearrangement to only be associated with one cell. Based upon @bussec's comment, it seems like we need to support an n-to-n relationship.
> I was thinking we wanted to deprecate `cell_id` in rearrangements because it allows that rearrangement to only be associated with one cell. Based upon @bussec's comment, it seems like we need to support an n-to-n relationship.
I don't think we want to do that. That would mean dropping single-cell support from the TSV (and other simple tabular representations). We need to support an n-to-n relationship, but not require it, because n-to-n is not the most common use case.
I'm viewing Cell like Alignment: a solution for when a single reference alignment result (which is what Rearrangement supports) isn't sufficient.
> * `repertoire_id`: 1-to-n reference, as our repertoire definition allows overlapping repertoires (e.g. `IgG class-switched B cells` and `CD27+ B cells`).
@bussec @scharch Would this then mean that a `cell_id` is associated with multiple `repertoire_id`s, each of which then points to the associated metadata for that `repertoire_id`?
> There were fields around DOIs, keywords, and expression value/expression marker data...
> Should we have place holders for this in this spec?
@bcorrie: yes, you are right and we will include it, but first we need to sort out `cell_id` mapping and linkage (explained below). Excluding that, we have updated the spec as below.
In addition to this, consider a case where we have two repertoires from a single sample that contains different types of B cells (e.g. IgG+ B cells and CD27+ B cells). These would contain the same rearrangements, just duplicated with different `rearrangement_id`s, which would then be linked to a `cell_id` with their expression data etc. The question is how do we link all of them to the `Cell` object? Referring to #181, we have updated the spec below.
Cell:
    discriminator: AIRR
    type: object
    required:
        - cell_id  # redefined cell_id > how to centralize it in the yaml
        - rearrangements
        - virtual
    properties:
        cell_id:
            type: string
            description: >
                Identifier defining the cell of origin for the query sequence.
            example: W06_046_091
            x-airr:
                miairr: false
                required: true
                nullable: false
                adc-api-optional: false
                name: Cell index
        rearrangements:
            type: array
            description: >
                Array of rearrangement identifiers defined for the Rearrangement object
            items:
                type: string
            example: [id1, id2]  # empty vs NULL?
            x-airr:
                miairr: false
                required: true
                nullable: true
                adc-api-optional: false
        raw_data:
            type: object
            description:
            properties:
                study_method:
                    type: string
                    enum:
                        - flow cytometry
                        - single-cell transcriptome
                    description: >
                        Keyword describing the methodology used to assess expression. The values for this field MUST come from a controlled vocabulary.
                doi:
                    type: string
                    description: >
                        DOI of raw data set containing the current event
                index:
                    type: string
                    description: >
                        Index addressing the current event within the raw data set.
        expression:
            type: object
            description: >
                Expression definitions for single-cell
            properties:
                expression_marker:
                    type: string
                    description: >
                        Standardized designation of the transcript or epitope
                    example: CD27
                expression_value:
                    type: integer
                    description: >
                        Transformed and normalized expression level.
                    example: 14567
        virtual:
            type: boolean
            description: >
                Boolean to indicate if pairing was inferred.
            x-airr:
                miairr: false
                required: true
                nullable: false  # assuming only done for sc experiments, otherwise does not exist
                adc-api-optional: true
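As a rough illustration of what the `required` and `nullable` settings in this draft imply for a record, the following sketch checks a candidate `Cell` dictionary against them. The `REQUIRED` table and `check_cell` function are hypothetical helpers written for this example; they are not part of the AIRR reference library, and the draft schema itself may still change.

```python
# Required fields from the draft above, mapped to whether they may be null.
REQUIRED = {"cell_id": False, "rearrangements": True, "virtual": False}


def check_cell(record):
    """Return a list of problems found in a draft Cell record (empty if none)."""
    errors = []
    for field, nullable in REQUIRED.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif record[field] is None and not nullable:
            errors.append(f"field must not be null: {field}")
    # rearrangements is an array of rearrangement_id strings (or null).
    rearr = record.get("rearrangements")
    if rearr is not None and not (
        isinstance(rearr, list) and all(isinstance(x, str) for x in rearr)
    ):
        errors.append("rearrangements must be an array of strings")
    # virtual is a boolean flag for inferred pairing.
    virtual = record.get("virtual")
    if virtual is not None and not isinstance(virtual, bool):
        errors.append("virtual must be a boolean")
    return errors


ok = check_cell(
    {"cell_id": "W06_046_091", "rearrangements": ["id1", "id2"], "virtual": False}
)
bad = check_cell({"cell_id": None, "rearrangements": None})
```

Note that a null `rearrangements` passes here because the draft marks it `nullable: true`; whether an empty array or null should be used is the open question flagged in the spec comment.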
> If each rearrangement has a `cell_id` field, and this presumably matches the cell_id in the `Cell` object, do we need to have a list of `rearrangements` at the `Cell` level?
@bcorrie: Yes. This makes sense. A naive question: `cell_id` is not a required field in the `Rearrangement` object. Shouldn't it be a required field if the record refers to a single-cell study?
> @srlak Should there be a `receptors` array with the list of receptors in the cell?
@schristley Indeed, we should then have a list of `receptor_id` in the cell.
> * `repertoire_id`: 1-to-n reference, as our repertoire definition allows overlapping repertoires (e.g. `IgG class-switched B cells` and `CD27+ B cells`).
> @bussec @scharch Would this then mean that a `cell_id` is associated with multiple `repertoire_id`s, each of which then points to the associated metadata for that `repertoire_id`?
Oh, ugh, this could actually get really messy: a `rearrangement_id` is supposed to be universally unique, but we haven't (AFAIK) allowed for a single `Rearrangement` object to be linked to multiple `Repertoire`s (i.e. `Rearrangement` includes only a single `repertoire_id` field). If so, then a `Cell` that is linked to multiple `Repertoire`s might also require `rearrangement_id`s within each one?? I guess the uniqueness would at least mean that we don't have to explicitly specify which `rearrangement_id`s go with which `Repertoire`...
Overall, I think it would be better to restrict each `Cell` to a single `Repertoire`, even if that might mean a proliferation of duplicate `Cell` objects; it's at least consistent with how we expect `Rearrangements` to be handled in a similar case.
> @bcorrie: Yes. This makes sense. A naive question: `cell_id` is not a required field in the `Rearrangement` object. Shouldn't it be a required field if the record refers to a single-cell study?
@srlak in principle, sure, but I think we want to avoid "contingently-required" fields...
@srlak `raw_data` and (especially) `expression` seem like they need to be arrays.
> Overall, I think it would be better to restrict each `Cell` to a single `Repertoire`, even if that might mean a proliferation of duplicate `Cell` objects; it's at least consistent with how we expect `Rearrangements` to be handled in a similar case.
Let's say we want to evaluate CD27 and IgG expression using flow cytometry; four regions were delineated: CD27+ (let's say A), CD27+IgG+ (B), CD27-IgG+ (C) and CD27-IgG- (D). Total RNA is extracted from the sorted populations [A, B and C], followed by sequencing. In order to screen for CD27+ in the sequenced data, populations A and B are pooled, and similarly B and C are pooled together to screen for IgG+.
Assume the current `Repertoire` definitions, where `rearrangement_id`s are globally unique and each `Cell` is restricted to a single `Repertoire`.
In this case, we assume A is associated with two `rearrangement_id`s [e.g. 101, 102], B with two `rearrangement_id`s [201, 202] and C with two `rearrangement_id`s [301, 302]. For simplicity we assign just two `rearrangement_id`s to each, but there could be more than two. In this case CD27+ will end up having four `rearrangement_id`s, say [401, 402, 403, 405], that hold the same records as [101, 102, 201, 202], because `rearrangement_id`s should be unique. Similarly IgG+ will be a combination of [201, 202, 301, 302] but with different identifiers, say [501, 502, 503, 504]. Is this assumption correct?
Again for simplicity, let's consider `rearrangement_id`s to be the same as `cell_id`. In this study we will have only [101, 102, 201, 202, 301, 302], and 401 and 501 will have 101 as `cell_id` as they come from that same cell. Similarly, 402 and 502 will have 201 as `cell_id`, and so on and so forth. Is this correct?
However, a potential problem with these definitions arises if a user queries for a particular `cell_id`, let's say 101, and the rearrangements associated with 101 are 101, 401 and 501, which are triplicates containing exactly the same redundant information. Wouldn't this mean multiple records holding the same information, artificially inflating the amount of information we provide per `cell_id`?
@srlak I think that's exactly my point, yes, but I want to rework your nomenclature a bit to make sure we are really talking about the same thing:
Population A (CD27+IgG-) contains cells (not `cell_id`s) A1, A2... and similarly for populations B, C, and D.
Cell A1 contains rearrangements (not `rearrangement_id`s) A1H and A1L. Similarly for cells A2, B1, and so on.
Repertoire R1 is defined as CD27+ (populations A+B) and repertoire R2 is defined as IgG+ (so B+C). Thus cells B1, B2... and rearrangements B1H, B1L, B2H, B2L... are included in both repertoires.
Under the current `Rearrangement` schema, there are therefore otherwise identical `Rearrangement` objects with `rearrangement_id`s B1H.R1 and B1H.R2 (or B1L.R1/B1L.R2, B2H.R1/B2H.R2 etc.). However, under the proposed `Cell` schema, there is a single `Cell` object with `cell_id`=B1.global and `repertoire_id`=[R1,R2]. What, then, should be the value of `rearrangement_id` for this `Cell`? It seems like it would have to be [B1H.R1,B1H.R2,B1L.R1,B1L.R2], which, as you point out, is redundant and potentially confusing. I also think there is probably potential for inefficiencies in data retrieval, since you can't know ahead of time which `rearrangement_id`s are in which `Repertoire` (at least without further complicating the `Cell` data structure).
The most elegant solution, of course, would be to have a single `Rearrangement` object for B1H (etc.) with `rearrangement_id`=B1H.global and `repertoire_id`=[R1,R2]. I worry, though, that that would irreparably break the TSV format. (Although maybe not, if `repertoire_id` is mostly considered an ADC API field, @javh?) I could also see it causing problems with the value of `repertoire_id` needing to be updated for potentially millions of records if I later create a new `Repertoire` R3 to look at all four populations together (and then maybe R4, R5, and R6 when I decide I need more data and sort another vial from the same donor and time point). Even if the update itself isn't a problem, it would probably be very hard to keep track of/know when an update is in order.
Barring that, then, my proposal is to instead emend the `Cell` schema so that we have "duplicate" `Cell` objects in the same way that there are "duplicate" `Rearrangement`s. So you would have
{ "cell_id":"B1.R1", "repertoire_id":"R1", "rearrangement_id":["B1H.R1","B1L.R1"]}
{ "cell_id":"B1.R2", "repertoire_id":"R2", "rearrangement_id":["B1H.R2","B1L.R2"]}
I think that takes care of most of the problems. It is true that if you queried the API for (say) all IgG+ `Cell`s, the response would contain these duplicates, but that's already true if you ask for e.g. all VH1-69-JH5 `Rearrangement`s: you would get both B1H.R1 and B1H.R2. The `sample_processing_id`s can be compared as a partial filter, but I'm not sure there's a great answer.
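The per-repertoire duplication proposed here can be sketched in a few lines. This is only an illustration of the naming used in this comment: the `name.repertoire` identifier pattern is the notation from the example, not a proposed identifier format, and `cells_per_repertoire` is a hypothetical helper.

```python
def cells_per_repertoire(cell_name, chains, repertoires):
    """Build one Cell object per repertoire, mirroring the duplicated Rearrangements."""
    return [
        {
            "cell_id": f"{cell_name}.{rep}",
            "repertoire_id": rep,
            # Each duplicate Cell only lists the rearrangements of its own repertoire.
            "rearrangement_id": [f"{chain}.{rep}" for chain in chains],
        }
        for rep in repertoires
    ]


cells = cells_per_repertoire("B1", ["B1H", "B1L"], ["R1", "R2"])
```

Each resulting object carries a single `repertoire_id`, so a consumer never has to work out which `rearrangement_id` belongs to which repertoire, at the cost of one `Cell` record per repertoire.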
@srlak @scharch One thing that I've been trying to do when thinking about these (clone, cell, receptor, etc.) schema enhancements is avoiding solutions that require duplicating rearrangements. The rearrangement data is large, and duplicating that data (that is, all annotation values are identical except for a field like `cell_id` or `clone_id`) is a waste of time and space. Now in some cases, this duplication is created by the researcher, e.g. one repertoire contains sequencing data A while a second repertoire contains sequencing data A+B, but that's their decision and I hope we can avoid "requiring/forcing" users to create duplication.
I currently don't see why a cell needs to be restricted to a single repertoire, but let me walk through this analysis first to see where it leads. Let's stick with the four biological populations (A, B, C, D). Now here you have the choices:
Sounds like we are mainly talking about 2 (otherwise why sort just to re-combine). That means we sequence A, B, C, D separately. Now we have choices for the repertoires:
Hopefully you can see that with 1, there is no duplication of rearrangements. There are always 4 repertoires for the 4 populations, analysis is done on a single repertoire or a list of repertoires. With 2, there is the possibility of duplicate rearrangements because multiple repertoires reference the same sequence data.
I don't consider either 1 or 2 to be right or wrong, as I think you can come up with valid use cases for both scenarios, but going back to my first paragraph, I prefer 1 over 2 because it represents the same concept (pooling data) but doesn't entail creating extra rearrangement records in order to implement that pooling/grouping.
Now, let's consider what we actually require for cells. In the biology, a cell is a single entity and should in theory be represented with a single inferred cell (`cell_id`), though lots of things (experimental and/or computational) may break that 1-to-1 relationship. An open question is how to handle cells that are found in multiple repertoires, e.g. repertoires A, [A, B] or A+B.
3. [...] `cell_id`?
4. [...] `cell_id`?
This question is irrespective of how we pool repertoires. Option 4 is an easy yes; this is where the "cell inference algorithm" treats the repertoires as independent, and any correspondence in `cell_id` would just be coincidence. Option 3 is a harder yes, as it requires the "cell inference algorithm" to know about those multiple repertoires (and possibly process them together) to ensure the same `cell_id` is used across them for the same cell.
Note that option 3 is pretty similar to the clone argument I made, i.e. the desire to track a clone across multiple repertoires, which is facilitated by having the same clone_id. Presumably there might be a similar desire to track a cell across multiple repertoires.
These options imply different things for the `Cell` schema. Option 4 is easier: because each `cell_id` is unique, the `Cell` schema only needs to store information for a single data processing run of the "cell inference algorithm". I think the original `Cell` schema above satisfies this.
Option 3 is harder because the single `Cell` schema with that single `cell_id` must store information about (possibly) multiple data processing runs of the "cell inference algorithm". Here it is important to know which repertoires were processed together. Let's say a cell is in repertoire A, and it would also be in pooled repertoire A+B and [A, B]. If the `Cell` object just listed a set of repertoires, that wouldn't inherently give processing information about which rearrangements were used. I think we need to determine if Option 3 is important enough that we design the schema to handle it.
Finally, back to the pooling/grouping options. For option 4, if pool option 1 is used, then there is no duplication of rearrangements, so multiple cells may point to the same rearrangement. But of course, the rearrangement's `cell_id` cannot point back. If pool option 2 is used, then each repertoire has its own set of rearrangements, and thus multiple cells point to different rearrangements. In this case, the rearrangement's `cell_id` can point back but only because we created all of the extra rearrangement records.
Okay, after walking through this, I feel that users can actually use either pool option 1 or pool option 2 with the schema. I still prefer that we encourage users to use option 1, and I don't see any reason why we need to restrict a cell to a single repertoire.
@schristley In your example, I was proposing to answer option 3 as "no" under the assumption that pool option 2 would be the default, making implementation of 3 hard. I can see the merits of encouraging pool option 1, but I only just realized that it's already in the documentation... I wonder how many people read carefully past "`Repertoire` 1-to-n with `Sample`" at the beginning of the line.
Is there a case under pooling option 1 in which a `Sample` would be included in multiple `Repertoire`s (as opposed to a list of `Repertoire`s)? If not, then the answer to 3 is still no, since a `Cell` must by definition be associated with a single `Sample`. (As opposed to a `Receptor`, which could be associated with multiple `Cell`s both within and across `Repertoire`s...)
> I can see the merits of encouraging pool option 1, but I only just realized that it's already in the documentation. I wonder how many people read carefully past "Repertoire 1-to-n with Sample" at the beginning of the line.
@scharch Do you mean pool option 2? I consider that to be more like pool option 2. We don't have anything in the AIRR Data Model currently that explicitly supports pool option 1, though the topic was discussed a little with the clones schema. I consider pool option 1 to be like defining repertoire groups (control, treatment, etc) for intra- and inter-group comparisons.
But I think I understand your point: people not reading carefully may assume that `array` is the way to put "samples" together for comparison purposes. Even worse, a tool might write code that solidifies that assumption, causing endless confusion. I think if we had a standard way to define repertoire groups (i.e. pool option 1) that might eliminate that confusion.
And even so, when reading it, I realize that it is still imprecise. Using the word `Sample` isn't exactly correct; it should probably be `SampleProcessing` to signify the whole sequence of steps from `Sample` to `SequencingRun`.
> Is there a case under pooling option 1 in which a Sample would be included in multiple Repertoires (as opposed to a list of Repertoires)?
There is but I think it would only be a case where the researcher explicitly does that, versus where the AIRR Data Model requires that. The simplest use case I can imagine is when you have replicates, say you pull 3 aliquots from the same tube and sequence each separately. You might have 3 repertoires for those 3 replicates so you can analyze them separately and compare, but then you might put those 3 replicates together to get a more complete "repertoire" for other analysis. The combined repertoire will duplicate rearrangements, and a cell that resides in a sample would appear in two repertoires.
I'm a bit behind on this, but a couple earlier things...
(1) I didn't notice the `raw_data` and `expression` bits earlier. What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.
However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a `gene x cell` sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.
(2) A rearrangement is supposed to be an observation. Whether or not you have observations that qualify as duplicate, by whatever criteria, is something I think we should avoid trying to design around. You can algorithmically collapse duplicate entries then assign those collapsed observations a new `rearrangement_id` to reduce storage requirements. Trying to map each Rearrangement record to multiple Repertoires seems like a rabbit hole.
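A minimal sketch of the "collapse, then assign a new `rearrangement_id`" idea in point (2) might look like the following. The identity criterion used here (identical `sequence`, `v_call` and `j_call`) is only an assumption for illustration, as are the `collapsed_N` naming and the `source_ids` field; the thread does not fix any of these.

```python
from collections import defaultdict


def collapse_duplicates(rearrangements, key_fields=("sequence", "v_call", "j_call")):
    """Merge records that agree on key_fields and give each group a new identifier."""
    groups = defaultdict(list)
    for record in rearrangements:
        groups[tuple(record[f] for f in key_fields)].append(record)

    collapsed = []
    for n, (key, members) in enumerate(groups.items(), start=1):
        merged = dict(zip(key_fields, key))
        merged["rearrangement_id"] = f"collapsed_{n}"  # hypothetical naming scheme
        merged["duplicate_count"] = len(members)
        # Keep the original identifiers so the collapse can be traced back.
        merged["source_ids"] = [m["rearrangement_id"] for m in members]
        collapsed.append(merged)
    return collapsed


records = [
    {"rearrangement_id": "B1H.R1", "sequence": "ACGT", "v_call": "IGHV1-69", "j_call": "IGHJ5"},
    {"rearrangement_id": "B1H.R2", "sequence": "ACGT", "v_call": "IGHV1-69", "j_call": "IGHJ5"},
    {"rearrangement_id": "A1H.R1", "sequence": "TTGA", "v_call": "IGHV3-23", "j_call": "IGHJ4"},
]
collapsed = collapse_duplicates(records)
```

The collapsed records no longer carry a `repertoire_id` or `cell_id`, which is exactly why such a step changes the cardinality questions debated in this thread.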
> What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.
> However, if the intent is to store a large number of features, then I don't think this will work
Hmm I was thinking like you sorted on a panel of 4-12 markers and you want to store the MFI for each probe for each cell. But yeah, for something like transcriptome data it should link out to a DOI. Not sure exactly where to draw the line...
> (1) I didn't notice the `raw_data` and `expression` bits earlier. What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.
> However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a `gene x cell` sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.
I think this is an important question... Perhaps we should create a separate issue for this discussion. I think we will be able to get some input from 10X on this in a couple of weeks (they are busy at the moment). The sparse matrix approach seems to me to be required at least as part of the solution (if we want to support platforms like 10X), probably referring to external data through a DOI (too large to store as part of the actual AIRR-seq data)???
Yeah, it's probably a separate topic. The cartoon here is a good representation of the data structure typically used for analysis of RNA-seq, scRNA-seq, etc: https://www.bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html
Where "Samples" are cells. So repertoire data would be some sort of column data (sample/cell annotations). Not necessarily obeying 1-to-1 rules, because that can be worked around with the implementation as long as there is a mechanism to reduce the data into 1-to-1 Cell:Receptor relationships or slice 1-to-many out somehow.
> However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a `gene x cell` sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.
Indeed. Storing it via sparse matrix seems to be a good approach and as @bcorrie mentioned the idea was to link to a DOI for external data.
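For readers less familiar with the `gene x cell` layout, here is a small pure-Python sketch of a sparse matrix whose columns are keyed on `cell_id`. It stores only non-zero entries as coordinate triplets; real scRNA-seq tooling would use a dedicated sparse matrix library, and the function names and sample values here are made up for illustration.

```python
def build_sparse(observations):
    """Build a feature x cell sparse matrix from (cell_id, feature, value) triplets."""
    feature_index, cell_index, entries = {}, {}, {}
    for cell_id, feature, value in observations:
        if value == 0:
            continue  # zeros are implicit in a sparse representation
        row = feature_index.setdefault(feature, len(feature_index))
        col = cell_index.setdefault(cell_id, len(cell_index))
        entries[(row, col)] = value
    return feature_index, cell_index, entries


def lookup(matrix, cell_id, feature):
    """Return the stored value for (feature, cell_id), or 0 if nothing is stored."""
    feature_index, cell_index, entries = matrix
    if feature not in feature_index or cell_id not in cell_index:
        return 0
    return entries.get((feature_index[feature], cell_index[cell_id]), 0)


matrix = build_sparse(
    [("1086", "CD27", 14567), ("1086", "IGHG1", 0), ("1087", "CD27", 3)]
)
cd27_in_1086 = lookup(matrix, "1086", "CD27")
```

Because the column index is the `cell_id`, a `Cell` record would only need a reference (for example a DOI) to the external matrix rather than embedding expression values itself.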
Hey @bussec, we discussed this a little bit in the recent CRWG call, but I wanted to follow up with some more detail.
One of the reasons for the design of `Repertoire` as a composite object was so that an ADC API query could return all of that relevant data in a single request. CRWG had initially thought about having endpoints like `/study`, `/subject`, `/sample` and so on, which you can see would mean the user would have to do many separate queries to get the same data. Plus it introduces problems with identifiers like `subject_id`, `sample_id`, and so on, as they are not guaranteed to be unique...
We discussed that with these new objects (`Cell`, `Receptor`, `Clone`) we can/should go with a normal-form design. The implication is that multiple query requests will need to be performed to get "all" the data. For example, let's take the query example above. As the cell object only holds the rearrangement identifiers, you would need to do additional URL requests to get the rearrangement data:
# individual requests
curl https://host/airr/v1/rearrangement/507
curl https://host/airr/v1/rearrangement/678
# or a combined request
curl --data '{"filters":{"op":"in","content":{"field":"rearrangement_id","value":["507","678"]}}}' https://host/airr/v1/rearrangement
Now I'm fine with this, and maybe you already realized this, so we are all good. The same applies for `repertoire_id`, `receptor_id`, etc.
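The follow-up request for a given cell can be derived mechanically from its `rearrangements` list. The sketch below only builds the JSON body used in the combined `curl` request above; it does not contact any server, and `rearrangement_filter` is a hypothetical helper name.

```python
import json


def rearrangement_filter(cell):
    """Build the JSON body for fetching all rearrangements listed by a Cell record."""
    payload = {
        "filters": {
            "op": "in",
            "content": {
                "field": "rearrangement_id",
                "value": list(cell["rearrangements"]),
            },
        }
    }
    return json.dumps(payload)


body = rearrangement_filter(
    {"cell_id": "1086", "rearrangements": ["507", "678"], "virtual": False}
)
```

The resulting string can be passed to `curl --data` against the `/rearrangement` endpoint, as in the combined request shown above.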
When writing this I noticed something: how do we know where (i.e. the `host`) the rearrangements are stored so that we can do those additional requests?
> When writing this I noticed something: how do we know where (i.e. the `host`) the rearrangements are stored so that we can do those additional requests?
Presumably if one does
curl https://host/airr/v1/cell
and gets back something like:
{ "Cells": [ { "cell_id": "1086", "rearrangements": ["507","678"], "virtual": false } ]}
One would go to the same host...
curl --data '{"filters":{"op":"in","content":{"field":"rearrangement_id","value":["507","678"]}}}' https://host/airr/v1/rearrangement
> One would go to the same host...
I guess my question wasn't clear. What if the rearrangements are stored in another data repository? Does this mean that I'm not allowed to download a study from IPA, do some new cell analysis, then load that cell analysis into VDJServer? Or if I do, I have to duplicate the whole study into VDJServer so that links can be followed?
I was aware of this issue before and didn't give it much thought but I think it might be an issue in iReceptor+. If a user queries some data, which comes from multiple data repositories, then does some analysis, some downstream process (visualization, load/publish of results, etc) may need to follow identifiers. I can see how information about the original data repository might get lost.
We would like to discuss the relations between the different entities for the cell schema.
* Cell is n-to-1 with Repertoire: A direct/observed repertoire holds multiple cells, and a cell can be found in only one direct/observed repertoire
* Cell is n-to-n with Rearrangement: A rearrangement may be observed in multiple cells and a cell can contain multiple rearrangements
* Cell is n-to-1 with Clone: A clone can be represented by multiple cells, and a cell represents a single clone
* Cell is n-to-n with Receptor: A cell can contain multiple receptors, and a receptor can be present in multiple cells
These are the best entity relationships we could define; let us know if you see any discrepancies.
Link for complete documentation on cell object specification : https://docs.google.com/document/d/1vOOxk2-gvw8fKMs9MqSJ_M6a5_RBxz57I1Q3CvF9jwI/edit?usp=sharing
> Cell is n-to-n with Rearrangement: A rearrangement may be observed in multiple cells and a cell can contain multiple rearrangements
I don't think this is correct is it. Remember that a rearrangement is an observed sequence from a specific experiment that is annotated in a certain way. Therefore, is a rearrangement not 1:1 with a Cell? That is a specific observation of a sequence must come from a specific cell and each rearrangement record has a single cell_id that identifies the cell from which it was observed???
Am I confused???
Sorry, I think that should be N to 1... That is a Cell may contain more than one rearrangement, but a rearrangement can only be observed from one Cell.
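Whether a given data set actually satisfies the N-to-1 reading (each rearrangement observed in exactly one cell) can be checked directly. The sketch below assumes `Cell` records are dictionaries with `cell_id` and `rearrangements` fields as in the draft spec; the function name is hypothetical.

```python
from collections import defaultdict


def rearrangements_in_multiple_cells(cells):
    """Return {rearrangement_id: [cell_ids]} for rearrangements listed by several cells."""
    owners = defaultdict(set)
    for cell in cells:
        for rid in cell["rearrangements"]:
            owners[rid].add(cell["cell_id"])
    # An empty result means the cells partition the rearrangements (N-to-1 holds).
    return {rid: sorted(ids) for rid, ids in owners.items() if len(ids) > 1}


shared = rearrangements_in_multiple_cells(
    [
        {"cell_id": "A", "rearrangements": ["r1", "r2"]},
        {"cell_id": "B", "rearrangements": ["r2", "r3"]},
    ]
)
```

A non-empty result does not by itself say the data are wrong; it only shows that the records follow the n-to-n interpretation rather than the N-to-1 one.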
@bcorrie we have all sorts of use cases where raw reads are grouped and manipulated together in various ways (e.g. UMI consensus generation). It would be perfectly reasonable/valid to also collapse identical `Rearrangements` from multiple cells into a single line/object, resulting in the n-to-n relationship. See also #340:
- Tools should not assume that sequence_id contains a value that references a sequence in the raw sequencing files.
> Sorry, I think that should be N to 1... That is a Cell may contain more than one rearrangement, but a rearrangement can only be observed from one Cell.
I agree with this point: at an abstract level we first observe that a cell has multiple rearrangements, and each rearrangement is associated with a single cell. However, as @scharch mentioned, when we group the reads we do observe identical rearrangements from multiple cells, and that's why it is N:N.
> @bcorrie we have all sorts of use cases where raw reads are grouped and manipulated together in various ways (e.g. UMI consensus generation). It would be perfectly reasonable/valid to also collapse identical `Rearrangements` from multiple cells into a single line/object, resulting in the n-to-n relationship. See also #340:
I may be confused, but...
I would question whether that grouping is still a `Rearrangement` or not... That sounds like a different type of object that is grouping rearrangements logically across cells based on some "identity" criterion into something that has a complex N-N relationship. Would this group be from within the same `Repertoire` or could the grouping span `Repertoires`? And what would a typical "identity" criterion be (how do you determine if two `Rearrangements` are identical)?
As you say, we already represent "collapsed" rearrangements (e.g. merge reads by removing duplicates, building UMI consensus sequences, or aggregating clonotypes as per https://github.com/airr-community/airr-standards/issues/246#issuecomment-592728685) but this seems different. When you collapse to get a consensus sequence, you essentially throw away the lower level rearrangement information and replace it with a consensus rearrangement that represents N more "fundamental" rearrangements. You observed 1 thing N times, so you represent it as 1 thing with a count of N. This grouping makes sense because of the nature of how the sequencing is performed.
The grouping we are talking about here seems more biological to me (I know, it is scary when I start trying to talk biology). My understanding (which may be totally wrong) is that in this context you want to have a grouping of Rearrangements where a given Rearrangement (an annotation of a sequence) from a specific Repertoire is grouped with other Rearrangements from other Repertoires (or maybe just the same Repertoire???) based on some "identity" criterion. From my understanding, this isn't really a single Rearrangement that comes from N cells, but rather a conceptual object (a Chain?) that captures the essence of the rearrangement (the identity criterion). You then want to be able to say that you saw the equivalent of that object (Chain C) in N cells and that they are all the same (based on an identity criterion). It seems like this is more similar to the Receptor concept to me, without the paired nature of the Receptor, where perhaps a Receptor is composed of two Chains???
I think the bottom line for me is that it seems like there should be some value in saying that this Rearrangement came from cell A and that Rearrangement came from that cell B, and they are considered "identical" and here is that thing that defines their identity (the Chain)...
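A minimal sketch of the grouping discussed above, under a hypothetical identity criterion (same v_call, j_call and junction_aa). The chain_key function, the "Chain" structure and the records are made up for illustration and are not part of the AIRR schema. Each Rearrangement keeps its own cell_id, and the grouping object only records which rearrangements and cells share the identity key.

```python
from collections import defaultdict

# Toy rearrangement records. Field names follow the AIRR Rearrangement
# schema; the values are invented for illustration.
rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "A", "v_call": "IGHV1-2*02",
     "j_call": "IGHJ4*02", "junction_aa": "CARDYW"},
    {"rearrangement_id": "r2", "cell_id": "B", "v_call": "IGHV1-2*02",
     "j_call": "IGHJ4*02", "junction_aa": "CARDYW"},
    {"rearrangement_id": "r3", "cell_id": "B", "v_call": "IGKV1-5*03",
     "j_call": "IGKJ1*01", "junction_aa": "CQQYNSYW"},
]


def chain_key(rearrangement):
    """Hypothetical identity criterion: same V call, J call and junction."""
    return (rearrangement["v_call"], rearrangement["j_call"],
            rearrangement["junction_aa"])


def group_into_chains(records):
    """Group rearrangements by identity key without discarding any of them."""
    chains = defaultdict(lambda: {"rearrangement_ids": [], "cell_ids": []})
    for record in records:
        chain = chains[chain_key(record)]
        chain["rearrangement_ids"].append(record["rearrangement_id"])
        # A cell is listed once per chain, even if it has several matches.
        if record["cell_id"] not in chain["cell_ids"]:
            chain["cell_ids"].append(record["cell_id"])
    return dict(chains)


chains = group_into_chains(rearrangements)
```

With this layout the Cell:Rearrangement link stays 1:N, and the N:N relationship lives only in the grouping object.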
@bcorrie Two notes from the single-cell side that might resolve some of your concerns:
cell will always partition a set of Rearrangements, as we assume that
a. each experimentally observed nucleic acid MUST be derived from a single cell
b. we have experimental means (e.g. barcodes) to unambiguously perform this assignment.
Therefore cell:rearrangement should be 1:N.
Repertoire for the schema to work. We discussed e.g. overlapping cell populations above, and for the current schema, overlapping populations (i.e. Repertoires) MUST NOT exist (we likely will have to introduce a new object to perform grouping). Otherwise either pretty much everything becomes N:N or we would have to start copying large amounts of cell and rearrangement objects, which I consider nonsensical as single-cell is a lot about preserving these identities.
I'm experiencing much ambivalence. I'm inclined to agree with @bcorrie. I think @bussec's comment might address the concern, but... It looks like there are two competing use cases defined in the schema:
Genes x Cells/Samples and a data frame that is keyed by Cells/Samples with N Rearrangement/Receptor annotations per cell.
Rearrangement/Receptor x Cells/Samples.
In the first case, Cells is 1:N with Rearrangement/Receptor and the individual Rearrangements are observations within cells.
In the second case, Rearrangement/Receptor is 1:N with Cells/Samples/Repertoires, because the question you are trying to answer is how often do I see this feature (unique Receptor sequence/pair) in multiple cells/samples. Super common and very important, but not what I thought the purpose of the schema was.
Is that correct? Are we trying to capture both these cases in the same schema?
@javh: only scenario 1. But that still allows for N-to-N mapping (sort of). The idea is that the rows of the Rearrangement/Receptor data frame that you describe could be duplicated between N cells. However, it's not a full mapping, because we really only expect 1 cell_id per Rearrangement/Receptor in those TSVs. So I think in practice, it would still typically be N-to-1 as @bcorrie wants, the only question is whether to allow N-to-N if the user decides that the rearrangement-to-cell mapping isn't important ...which I think means I've talked myself into agreeing with @bcorrie after all, though for a somewhat different reason.
However, it's not a full mapping, because we really only expect 1 cell_id per Rearrangement/Receptor in those TSVs. So I think in practice, it would still typically be N-to-1 as @bcorrie wants, the only question is whether to allow N-to-N if the user decides that the rearrangement-to-cell mapping isn't important ..
I agree, and as everyone pointed out we always expect 1 cell_id per Rearrangement, but on grouping we do observe N:N (unique rearrangements in multiple cells). So, as @scharch pointed out, for us it would be important to know if we need to allow the N:N Cell:Rearrangement relationship in the Schema?
it would be important to know if we need to allow the N:N Cell:Rearrangement relationship in the Schema?
A simple (though maybe uncommon) scenario where multiple cells can reference the same rearrangement is running the "cell inference algorithm" multiple times (e.g. with different parameters) on the same repertoire.
The proposed Cell schema does support N:N and I think we should keep it. The singular cell_id in rearrangements doesn't support N:N but is fine for the more common 1:N scenario. Tools will just need to be aware of that and use the rearrangements array in Cell when necessary.
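Since the link is stored in both directions, a tool may want to check that the two agree. The sketch below is one way to do that, under assumptions: the function name find_link_mismatches and the toy records are hypothetical, and only the cell_id, rearrangement_id and rearrangements fields are taken from the discussion. Reported pairs are either genuine inconsistencies or N:N links that the singular cell_id cannot express.

```python
def find_link_mismatches(cells, rearrangements):
    """Return (cell_id, rearrangement_id) pairs where the two directions disagree.

    A pair is reported when a Cell lists a rearrangement whose own cell_id
    names a different cell, or when a Rearrangement names a cell that does
    not list it. Rearrangements with a null cell_id are skipped.
    """
    listed = {
        (cell["cell_id"], rearrangement_id)
        for cell in cells
        for rearrangement_id in cell["rearrangements"]
    }
    declared = {
        (r["cell_id"], r["rearrangement_id"])
        for r in rearrangements
        if r["cell_id"] is not None
    }
    # Symmetric difference: links present in exactly one direction.
    return sorted(listed ^ declared)


# Toy data: r2 is listed by both cells but can only name one cell itself.
cells = [
    {"cell_id": "A", "rearrangements": ["r1", "r2"]},
    {"cell_id": "B", "rearrangements": ["r2", "r3"]},
]
rearrangements = [
    {"rearrangement_id": "r1", "cell_id": "A"},
    {"rearrangement_id": "r2", "cell_id": "A"},
    {"rearrangement_id": "r3", "cell_id": "B"},
]

mismatches = find_link_mismatches(cells, rearrangements)
```

Here the only reported pair is the link from cell B to r2, which exists in the Cell array but not in the rearrangement's own cell_id.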
CCing myself into this to get some 10x involvement and get caught up to speed @wyattmcdonnell
@wyattmcdonnell good luck, it is a long conversation 8-)
Documentation for the proposed extension, with an early idea on the contents of the Cell object, is on the single_cell_ext branch here:
@sharikrish As there has been quite a bit of discussion so far, it would probably be a good time to put the Cell schema into airr-schema.yaml on a branch that incorporates all the comments, then people can see a more fully realized schema.
@schristley Regarding your post above: Yes, this is also the way we think about it. Is this (as a general approach of the ADC API) documented somewhere?
Concerning the question of the host, we assumed that this would be the same FQDN and any type of redirection would be transparent to the user. But for a larger federation of repos like iR+, it would be worthwhile to think about mechanisms for this. But this is probably a generic issue and not only related to single-cell.
Is this (as a general approach of the ADC API) documented somewhere?
Not really, except maybe somewhat in the CRWG minutes where it was decided early on to have only two API entrypoints. What drove a lot of that decision was ease-of-use for the end user. When we got to talking about multiple entrypoints and having a "clean, conceptual" interface for each object (study, sample, etc), a non-technical person would invariably ask, "so I can download everything in one file right? I don't have to download a bunch of files and figure out how they all go together?" Furthermore, when we collected a bunch of use case queries, it became clear that many combined fields from different objects (e.g. TCR [pcr_target object] in humans [subject object]).
From there it became a discussion about how to structure all the MiAIRR metadata to support that simple API, balanced against a schema that would be easy to use by analysis tools, and by data entry screens. For me, the composite design pattern seemed the most appropriate, as it allowed for all the objects in MiAIRR to be treated uniformly (for query and access). One could argue the ADC API design follows the facade design pattern, which hides the complexity of the MiAIRR metadata and its relationships behind a simple interface.
I don't know if I want to take a general AIRR stance, that it should be this way or that way (whatever "this" and "that" are) for all AIRR schema. I'm fine with going on a case-by-case basis. I do think it's important to take 3 stakeholders into account: the users doing queries, the analysis tools, and data entry, when thinking about how the data will be used. When it comes to these new objects (cell, receptor, etc.), my intuition tells me that packing more and more into the repertoire object will cause us problems later, so better for these to be loosely coupled. But when it comes to how each of those objects is organized, I feel there are trade-offs to be considered.
Imagine this, you can make a valid argument about what a user wants when querying for cells, that 1) a user rarely queries for a single cell object, but almost always wants a set of cells, and 2) they will invariably always need both the repertoire metadata and the rearrangement annotations. Is 1) and 2) true? If yes, then maybe the normal-form is a poor design for that common query behavior. It implies that the user will need to perform many (thousands?) of additional URL requests to gather all of the information they need. Analysis tools may not care, because they likely assume there are just two files, one with cells and one with rearrangements, and linking the two is no problem.
Here's another example. Would we expect that users would like to answer queries such as "give me all the cells for a specific V gene?" A query like this on a /cell entrypoint might be:
{
  "filters": {
    "op": "=",
    "content": {
      "field": "rearrangements.v_call",
      "value": "IGHV6-foo*bar"
    }
  }
}
which isn't immediately supported by the proposed structure. Or does it mean that Cell, like Clone, should have some of those rearrangements fields promoted up to the Cell object?
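One way such a query could be answered without promoting fields is to resolve it in two steps: select the matching rearrangements first, then the cells that list them. The sketch below illustrates this on in-memory records. The function cells_matching and the toy tables are hypothetical, only the "=" operator on a rearrangements.* field is handled, and nothing here reflects how an actual repository implements the /cell entrypoint.

```python
query = {
    "filters": {
        "op": "=",
        "content": {"field": "rearrangements.v_call", "value": "IGHV6-foo*bar"},
    }
}

# Toy tables: Cell records hold rearrangement IDs, annotations live elsewhere.
cells = [
    {"cell_id": "c1", "rearrangements": ["r1", "r2"]},
    {"cell_id": "c2", "rearrangements": ["r3"]},
]
rearrangements = [
    {"rearrangement_id": "r1", "v_call": "IGHV6-foo*bar"},
    {"rearrangement_id": "r2", "v_call": "IGKV1-foo*bar"},
    {"rearrangement_id": "r3", "v_call": "IGHV1-foo*bar"},
]


def cells_matching(query, cells, rearrangements):
    """Return the cell_id of every cell with a rearrangement matching the filter."""
    filters = query["filters"]
    content = filters["content"]
    prefix, _, field = content["field"].partition(".")
    if filters["op"] != "=" or prefix != "rearrangements":
        raise ValueError("this sketch only supports '=' on rearrangements.* fields")
    # Step 1: rearrangements whose field equals the requested value.
    hits = {
        r["rearrangement_id"]
        for r in rearrangements
        if r.get(field) == content["value"]
    }
    # Step 2: cells that reference at least one of those rearrangements.
    return [c["cell_id"] for c in cells if hits.intersection(c["rearrangements"])]


matching_cells = cells_matching(query, cells, rearrangements)
```

The cost of this design is the extra join on every such query, which is the trade-off against copying rearrangement fields into Cell.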
So in a long-winded way (can you tell I'm in quarantine ;-D), I don't think the design is set, we will probably want more discussions about the trade-offs. It might be very useful for CRWG to gather query use cases for these new objects.
@sharikrish As there has been quite a bit of discussion so far, it would probably be a good time to put the Cell schema into airr-schema.yaml on a branch that incorporate all the comments, then people can see a more fully realized schema.
@schristley : Proposed schema is available now for review in #358
Imagine this, you can make a valid argument about what a user wants when querying for cells, that 1) a user rarely queries for a single cell object, but almost always wants a set of cells, and 2) they will invariably always need both the repertoire metadata and the rearrangement annotations. Is 1) and 2) true?
Yes
Here's another example. Would we expect that users would like to answer queries such as "give me all the cells for a specific V gene?"
Definitely!
Had another look at the definitions in the schema, I think there are still a couple of details that we need to hash out before we can close this:
- Can we refer to a cell_id definition, instead of copying it (like for SampleProcessing)?
- miairr status of the other "_id" fields should be reassessed. Do they require an identifier flag?
- Can we tolerate a nullable status for cell_id?
- array type and record structure for expression_raw and expression_tabular field.
- Can we refer to a cell_id definition, instead of copying it (like for SampleProcessing)?
Do you mean this definition: https://github.com/airr-community/airr-standards/blob/b0202934e816e181d624163aafe97bcbe28f2d5b/specs/airr-schema.yaml#L2551
And by copy do you mean having the same YAML description?
When sample_processing_id is defined for both Repertoire and Rearrangement the YAML is duplicated, so not sure what you mean here???
- Can we tolerate a nullable status for cell_id?
Again, is this cell_id in the Cell object? I would think that a Cell isn't a Cell without a cell_id (nullable:false), but not 100% sure about that. It also seems logical that cell_id in Rearrangement is nullable:true. I don't think they need to be the same, do they?
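To make the asymmetry concrete, here is a small validation sketch that assumes the view expressed above is adopted (cell_id required in Cell, nullable in Rearrangement). The CELL_ID_NULLABLE table and the cell_id_errors function are hypothetical and are not taken from airr-schema.yaml.

```python
# Assumed per-object rules, mirroring the opinion above:
# cell_id is required in Cell but may be null in Rearrangement.
CELL_ID_NULLABLE = {"Cell": False, "Rearrangement": True}


def cell_id_errors(object_type, record):
    """Return a list of problems with the cell_id field of one record."""
    value = record.get("cell_id")
    if value is None and not CELL_ID_NULLABLE[object_type]:
        return [f"{object_type}: cell_id must not be null"]
    return []


# A bulk-sequencing rearrangement without a cell is fine; a Cell is not.
bulk_rearrangement_errors = cell_id_errors("Rearrangement", {"cell_id": None})
anonymous_cell_errors = cell_id_errors("Cell", {"cell_id": None})
```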
Following a discussion with @franasa and @bussec, we have come up with the schema below for the Cell object.
From an API perspective, we have implemented a cell endpoint which provides all cell-associated properties such as rearrangements, the virtual flag, etc., as per the discussion in #273.
For instance, the cell endpoint would return a JSON upon a query, and it is represented as below:
curl -i -H 'Content-Type: application/json' -X POST 'http://127.0.0.1:5000/api/cells?cell_id=1086'
Some of the questions that we would like to address are as follows:
In the Cell object specification, miairr is set to false, which is different from the cell_id properties defined in the Rearrangement object. In addition, single-cell experiments obviously have a cell_id associated with them, and hence it cannot have a null value. Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc. in the Human population genetics extension #318, to trigger the properties defined in the extension if a study has a cell_id. The question would be how to centralize cell_id properties with the original yaml file?
rearrangements defined in the Cell object would contain a list of rearrangement_id associated with the cell_id. In a single-cell context, a cell can have one, several, or sometimes no rearrangements. For instance, when one queries for a cell_id, all the rearrangements associated with that cell_id need to be retrieved, and one of the best ways to define rearrangements would be as a list containing all rearrangement_id, as defined for the clone object #294 [Line:2359]