Closed bcorrie closed 4 years ago
Hmm, in rereading this, I think this is actually more precisely defining and is helping to disambiguate sequence_id rather than pair_id.
I must've missed something along the way... That definition of pair_id
seems rather odd to me.
Having pair_id
work exactly the same as clone_id
and cell_id
makes more sense to me (just a grouping variable). In most protocols, I would expect pair_id
and cell_id
to be identical.
Seems like it'd be a lot easier to find collisions too, because you'd just have to look for pair_id
values that map to >2 rows.
Hi Brian, I will take the chance here to add a point to the issue. I was also about to open a ticket concerning pair_id
. I'm not sure if this is still valid but in #169 you defined it as:
pair_id: identifier for rearrangements that have a chain pairing, unique ID within the repertoire for each unique pair. cell_id: identifier for rearrangements that come from a single cell, unique ID within the repertoire for each unique cell.
a "Valid sequence_id that was determined by experimental or computational means [...]" sounds also a bit strange to me, although I'm not yet an expert. Can it be that the definition from #169 was to be updated in the spec? (sorry if this was already discussed, I am still diving a bit adrift in the ocean of tickets)
About my question:
as Jason points, in the single cell context, one cell will normally have two sequences for one rearrangement, thus one cell will have a unique cell_id
and a unique pair_id
shared by two sequences within one cell. However, not so infrequently, we also see the presence of three sequences (one heavy and one kappa and one lambda) and in some cases, even four meaning two or three different pair_id
's for a unique cell_id
.
given that each row in our data is a single cell, for creating a unique pair_id
corresponding to its belonging cell and avoid repeating each row "pair_id
-x" number of times, we wonder if it would make sense to have an extra ?table? with the sequence in question, the cell_id
and the pair_id
that maps to the original row?
as Jason points, in the single cell context, one cell will normally have two sequences for one rearrangement, thus one cell will have a unique
cell_id
and a uniquepair_id
shared by two sequences within one cell. However, not so infrequently, we also see the presence of three sequences (one heavy and one kappa and one lambda) and in some cases, even four meaning two or three differentpair_id
's for a uniquecell_id
.
Ah, yeah, good point. Ignoring things like allelic inclusion and doublets, you can still have six biologically valid receptor sequences in a normal B cell. 1 productive heavy, 1 productive light, and 4 unproductive (1 heavy, 3 light). To cover that with the current pair_id
definition you'd need to force a pair of "master" pair_id
values for the entire set or do some sort of categorical clustering thingy to track down all members of a given cell.
I'm a bit fuzzy on why we needed pair_id
in addition to cell_id
. Something to do with semantics being wrong for pairing done by PCR ligation protocols? Not sure. I can't recall off the top of my head... I'll have to look back through the issues.
a "Valid sequence_id that was determined by experimental or computational means [...]" sounds also a bit strange to me, although I'm not yet an expert. Can it be that the definition from #169 was to be updated in the spec? (sorry if this was already discussed, I am still diving a bit adrift in the ocean of tickets)
Actually, that was my mistake. The current spec has "Valid sequence_id that was determined by experimental or computational means [...]" and I made up the other definition because I actually thought it was the same as for cell_id (as @javh suggested). Sorry for the confusion. The current definition is for the sequence_id, but I agree with @javh that it makes sense for that to be changed unless there is a reason not to...
I'm a bit fuzzy on why we needed
pair_id
in addition tocell_id
. Something to do with semantics being wrong for pairing done by PCR ligation protocols? Not sure. I can't recall off the top of my head... I'll have to look back through the issues.
Would this make sense. For a single B cell with six receptor sequences, all would have the same cell_id. In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???
In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???
Yes, I think so. But then the heavy chain from a cell with an allelic inclusion should actually have two pair_id
s: one for the H+K pairing and one for the H+L.
As a historical matter, I think pair_id
was introduced for Adaptive's pairing protocol, which does not involve single cells. Still, I think the nuances we are talking about here are worth being able to distinguish anyway.
Would this make sense. For a single B cell with six receptor sequences, all would have the same cell_id. In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???
This would make pair_id:null
entirely redundant with productive:False
. Which is not awful, but I think not a reason to have pair_id
.
As a historical matter, I think pair_id was introduced for Adaptive's pairing protocol, which does not involve single cells. Still, I think the nuances we are talking about here are worth being able to distinguish anyway.
They are still inferred a cell association though, correct? If so, cell_id
seems appropriate to me.
We have discussed this issue today, and came up with some points. I've tried to illustrate them with these two example tables from our sciReptor data. I think this does not solve the question about the uniquenes of the pair_id
but it might help to bring some ideas.
in the single cell context:
pair_id
is not a pointer to one sequence_id
but rather it groups different sequence_id
's belonging to one cell creating end to end relationships to its originall cell_id
[table 2 mapping to table 1].as a very rare example we see in cell_id
= 296, a relation between 2 H to 1 K and 1 L,all are productive and therefore we would assign pair_id
s grouping those sequnces [ids 5 to 8]
[less rare are i.e. 115 and 119 with one productive H and two productive L (Kappa and/or Lambda)].
pair_id
is a representation of a receptor that has other meaningful biological information [#207 ], in our case it is the starting point to annotate reactivity.cell_id | sequence_id | consensus_rank | locus | length | igblast_productive |
---|---|---|---|---|---|
115 | 1249 | 2 | K | 419 | 1 |
115 | 165 | 1 | K | 436 | 1 |
115 | 887 | 1 | H | 539 | 1 |
119 | 1460 | 1 | K | 420 | 1 |
119 | 846 | 2 | L | 442 | 1 |
119 | 679 | 1 | L | 443 | 0 |
119 | 1258 | 1 | H | 548 | 1 |
296 | 1228 | 1 | K | 430 | 1 |
296 | 970 | 1 | L | 442 | 1 |
296 | 1044 | 2 | H | 543 | 1 |
296 | 845 | 1 | H | 545 | 1 |
The second table is what we have come up with, and something that i wanted to bring for discussion as a solution to map pair_id
's to cell_id
's for each single cell database. note that the cell_id
column is not necesary [is here for visualization purposes as the sequence_id_
column names] as each sequence_id
from the second table will lead to the cell_id
of the first one.
pair_id | sequence_id_heavy | sequence_id_light | cell_id |
---|---|---|---|
1 | 887 | 165 | 115 |
2 | 887 | 1249 | 115 |
3 | 1258 | 846 | 119 |
4 | 1258 | 1460 | 119 |
5 | 845 | 1228 | 296 |
6 | 845 | 970 | 296 |
7 | 1044 | 1228 | 296 |
8 | 1044 | 970 | 296 |
pair_id
and cell_id
in paralell do not create complete redundancy in some experiments. With this, non productive sequences can still be under a single cell_id
but won't have a pair_id
as in the case of sequence_id
= 679 from the first table.This would make pair_id:null entirely redundant with productive:False. Which is not awful, but I think not a reason to have pair_id.
No, because you could have a cell with only the heavy chain detected, in which case pair_id:null but productive:True. Or any bulk application, where all of the rearrangements have pair_id:null
They are still inferred a cell association though, correct? If so,
cell_id
seems appropriate to me.
I dug up the following:
- We probably need an
pair_id
field inRearrangement
to represent chain pairing separate fromcell_id
. Corresponds withis_paired=True
(wherever we put that - at or downstream fromSoftProcessing
).
Originally posted by @javh in https://github.com/airr-community/airr-standards/issues/169#issuecomment-452063537
Unfortunately, the minutes from that meeting, in their entirety, are:
We exclusively discussed GitHub issue #169. Complications across different types of protocols led us to conclude the following: The metadata should have a boolean flag for whether the rearrangement data is paired at the SoftwareProcessing level (is_paired), along with the current "single_cell" boolean flag at the CellProcessing level for whether it's single-cell. Each of these flags lets the user know whether "pair_id" (a new rearrangement field) and/or "cell_id" are meaningfully set in the associated data file.
Maybe @laserson remembers more?
I also found this:
With the wisdom of hindsight: See #169 and #206 => Use of
cell_id
to annotate chain linkage has been deprecated, instead an additionalpaired_id
field was introduced (as single cell and chain pairing are not always associated).
Originally posted by @bussec in https://github.com/airr-community/airr-standards/issues/165#issuecomment-541961406
So it sounds like this is already integrated into the spec...
@franasa I think in your example data, cells 119 and 296 should not have pair IDs, because it is unclear which is the biological pairing. (I'm assuming that such a scenario is a doublet; if you can really prove that they are true single cells with both heavy and light chain inclusions then your assignments are correct.)
Either way, the AIRR Rearrangement
s format would have:
sequence_id | productive | locus | pair_id | cell_id |
---|---|---|---|---|
1249 | T | IGK | 1 | 115 |
165 | T | IGK | 2 | 115 |
887 | T | IGH | 1,2 | 115 |
etc, where I've left out non-AIRR "custom" fields and converted productive
and locus
to match the schema. The primary identifier here is sequence_id
, from which you can get both pair_id
and cell_id
.
Going back to @bcorrie's original question, I believe that sequence_id
is unique within a RearrangementSet
, and a rearrangement_set_id
is unique within a Repertoire
. Or maybe sequence_id
is directly unique within a Repertoire
. Either way, it seems to me that pair_id
and cell_id
should follow that same rule. So if you have just the pair_id
and you want to get cell_id
, you'd query the Repertoire
for all Rearrangement
s matching the former field and then look at the latter. Does that meet your needs?
Thanks, @scharch. That's immensely helpful.
No, because you could have a cell with only the heavy chain detected, in which case pair_id:null but productive:True. Or any bulk application, where all of the rearrangements have pair_id:null
Ah, that's a good point. So if I'm following the intent correctly here, then pair_id:null
is basically meant to serve the same purpose as is_paired:False
, but at the individual sequence level? If that's correct, then we should make that explicit in the schema docs.
Either way, it seems to me that pair_id and cell_id should follow that same rule.
I agree, and I believe that rule should be they are a grouping variable with a user-defined value, which should be unique within a DataProcessing
for a Repertoire
.
In other words, I think the pair_id
description should be changed from this:
Valid sequence_id that was determined by experimental or computational means to
be associated with the current Rearrangement on the cellular level.
to something more like this similar to the cell_id
description
Identifier defining paired rearrangements that was determined by
experimental or computational means.
A discussion in Haifa introduced a potential problem. It was mentioned that in a single cell, an IGH may form a pair with an IGL, but that the same IGH could form a pair with an IGK. This implies that a single rearrangement record can be in more than one pair, and thus pair_id
is not a single value but potentially a list of values. I guess looking back through the comments, I see mention of this, as @scharch shows 1,2
as a pair_id
value. The problem with using a list is that simple queries cannot be used to extract the pairs. Likewise, we've been trying to avoid lists in the AIRR TSV format.
For what it’s worth, this is also observed in alpha beta T cells where they can express two alpha chains, so it would be good to have this worked out for both cell types.
Maybe in such cases it makes sense to list the heavy/beta chain twice? So my previous example would become:
sequence_id | productive | locus | pair_id | cell_id |
---|---|---|---|---|
1249 | T | IGK | 1 | 115 |
165 | T | IGK | 2 | 115 |
887 | T | IGH | 1 | 115 |
887 | T | IGH | 2 | 115 |
...except that I assume we'd have to give the duplicated heavy/beta chains separate sequence_id
s, which may probably raises a different set of issues...
...except that I assume we'd have to give the duplicated heavy/beta chains separate sequence_ids, which may probably raises a different set of issues...
Yeah, unfortunately, duplicating the rearrangement records will raise other issues, especially with calculating quantities and counts...
Maybe we should start allowing arrays to be represented in some select fields, like pair_id
, where we have greater ability to specify the content. Having 1,2
as the value in the TSV seems a lot simpler than other ideas.
Following @schristley, do we agree that we could represent some fields as arrays and if yes does the below suggestion for pair_id
spec looks reasonable?
pair_id:
type: array
description: Valid list of sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level.
example:
- ABC314159
- ABC314160
x-airr:
miairr: true
required: true
nullable: true
adc-api-optional: false
set: 6
subset: data (processed sequence)
name: Paired-chain index
Hrm. I don't think we have consensus per se, but I'm strongly opposed to pair_id
mapping to sequence_id
as that's rather anti-tidy. You can't perform a group-by operation on the column and it would differ from behavior of other *_id
fields in the spec. You could certainly enforce the convention in your specific dataset that the pair_id
is the same as the sequence_id
of the heavy/beta chain in the pair_id
set, but I don't think that should be part of the spec.
As far as the list goes, officially supporting lists (nested columns) in the TSV would break the rules. Currently, we say:
Nested delimiters are not supported by the schema explicitly and should be avoided. However, if multiple values must be reported in a single column for an application specific reason, then the use of a comma as the delimiter is recommended.
http://docs.airr-community.org/en/latest/datarep/format.html#formatspecification
We discourage nested delimiters because it's a bit of an anti-pattern. Changing that would be a very large change, in my opinion. If we truly need a nested representation for pairing, and we might, then I would first reach for a supplemental TSV or yaml to define those pairings before changing the TSV spec.
So... I guess the big question for me is, what does a list of pair_id
get you in terms of a practical use case? To use @scharch's example from above, how does this:
sequence_id | productive | locus | pair_id | cell_id |
---|---|---|---|---|
1249 | T | IGK | 1 | 115 |
165 | T | IGK | 2 | 115 |
887 | T | IGH | 1,2 | 115 |
Facilitate analyses in a way that the following does not?
sequence_id | productive | locus | pair_id | cell_id |
---|---|---|---|---|
1249 | T | IGK | 1 | 115 |
165 | T | IGK | 1 | 115 |
887 | T | IGH | 1 | 115 |
Some downstream determination needs to be made about how to analyze the VH:VL pairs. Is it somehow made easier by the former?
(You could set some rule that the first table represents doublets or allelic inclusion and the later a bispecific antibody, but that seems confusing and error prone.)
Hrm. I don't think we have consensus per se, but I'm strongly opposed to
pair_id
mapping tosequence_id
as that's rather anti-tidy. You can't perform a group-by operation on the column and it would differ from behavior of other*_id
fields in the spec. You could certainly enforce the convention in your specific dataset that thepair_id
is the same as thesequence_id
of the heavy/beta chain in thepair_id
set, but I don't think that should be part of the spec.
Yes, I think we were leaning towards @javh's approach where the pair_id is "just" an ID that groups things.
We discourage nested delimiters because it's a bit of an anti-pattern. Changing that would be a very large change, in my opinion.
I agree with this, and think we would only want to do that if absolutely necessary... In particular, if as a result of the single cell extension pull request (#211) we define a Cell object (and maybe a Receptor object?) that can capture this relationship much more accurately, I would be very hesitant to make such a fundamental change to the Rearrangement spec without having the other entities sorted out first.
It seems like the wrong approach to force what seem to be increasingly complex relationships into the Rearrangement object.
As far as the list goes, officially supporting lists (nested columns) in the TSV would break the rules.
Actually, I think the proposed usage might fit within the rules. Allelic inclusion is expected to be uncommon, and representing it could perhaps be considered "multiple values [that] must be reported in a single column for an application specific reason"...?
Facilitate analyses in a way that the following does not?
I think the main objection would be that it sacrifices clarity/explicitness. It also makes it closer to redundant with cell_id
, since I think the current non-single-cell pairing approaches can't see inclusions.
Actually, I think the proposed usage might fit within the rules. Allelic inclusion is expected to be uncommon, and representing it could perhaps be considered "multiple values [that] must be reported in a single column for an application specific reason"...?
What we meant by "application specific reason" was something along the lines of "if your tool really needs to do this then go ahead and use a comma, but our API isn't going to parse it and other people's tools are under no obligation to do so either".
For example, we support nested delimiters in the v_call
, d_call
and j_call
columns in changeo for ambiguous alignments, but that's just one of those historical requirements of the tool, not part of the spec.
Should've put this in the same comment above. Oh well, typing faster than I could think...
We could take the same attitude as the *_call
fields with pair_id
. If the use case for a list is rare enough, then just have the official spec be a single id value, but maybe the occasional tool would use a list in a non-official manner.
We could take the same attitude as the
*_call
fields withpair_id
Yeah, that's exactly what I had in mind. Not wedded to it, but seems easier than anything else we've discussed.
TL;DR Let's trash pair_id
and use a "virtual" cell
object instead.
Given the pains that pair_id
is creating, @srlak, @franasa and me had a brainstroming session, the outcome of which I will try to summarize below.
Markup: biological entity and data object
cell
object that has a unique cell_id
identifier.
d. A cell can have additional observable properties (e.g., flow profile, transcriptome), which should be attributable to the cell
object.receptor
object that has a unique receptor_id
identifier.
d. A receptor can have additional observable properties (e.g., affinity against an antigen), which should be attributable to the receptor
object.cell
and receptor
objects:
a. A cell
MAY be associated with one or more receptor
objects.
b. A receptor
MAY be associated with one or more cell
objects.
c. Associating a cell
and a receptor
is a valid operation if and only if the cell
is associated with all rearrangement
objects required to encode the receptor
.Importantly, these assumption do NOT REQUIRE that the cell has to be observed directly (e.g. by flow or microscopy). It is sufficient if the experimental setup creates a trackable single-cell context (e.g., microfluidics with unique barcodes for each droplet) which allows the subsequent association of an observed rearrangement
with a cell
.
One of the reasons why we introduced the pair_id
, was to be able to annotate stochastic pairing data, like PAIRseq. In these experimental setups, rearrangements
are inferred to be associated with each other, based on the correlated presence and absence of the rearrangements
in multiple sub-samples under limiting dilution conditions. As these setup do not create a single-cell context, we did not want to use the cell_id
field to represent chain linkage, as we thought that this should be reserved for "true" single-cell setups. However, as noted above, most of our single-cell data will not come from directly observed single cells, but from workflows that only preferentially create a single-cell context and subsequently attempt to filter out doublets etc. Finally, the technical reason why stochastic pairing works, is because the DNA/RNA molecules are in the same cell, not because they encode components of the same receptor. Therefore, it will also allow pairing of non-productive chains (although some software might filter those out).
Therefore, we would like to pitch the idea that stochastic pairing should not be represented by pair_id
but instead via a "virtual" cell
. These cell
objects would contain a virtual: true
key to indicate that they were inferred. As we now can represent all chain paired data via cell
(rearrangement
would link n:1 to cell
), pair_id
could be deprecated. In this model, receptor
would be reserved for data that is based on direct observation and would not be linked from rearrangement
.
Therefore, we would like to pitch the idea that stochastic pairing should not be represented by
pair_id
but instead via a "virtual"cell
. Thesecell
objects would contain avirtual: true
key to indicate that they were inferred. As we now can represent all chain paired data viacell
(rearrangement
would link n:1 tocell
),pair_id
could be deprecated. In this model,receptor
would be reserved for data that is based on direct observation and would not be linked fromrearrangement
.
If I understand correctly you are suggesting that at the Rearrangement
object level we only need cell_id
and we no longer need pair_id
? That is cell_id
is sufficient to capture both uses? I think this is right, but wanted to confirm...
@bussec I think this is suggesting that with further hindsight, the decision we made to have both (see https://github.com/airr-community/airr-standards/issues/165#issuecomment-541961406) was not necessary, correct?
@bcorrie Yes, we only need cell_id
. The information that the cell was not directly observed is moved to the cell
object. Therefore pair_id
is not necessary anymore. Any yes, our hindsight (as in #165) is getting close to 20/20.
TL;DR Let's trash
pair_id
and use a "virtual"cell
object instead.
@bussec Very good, this looks good to me. It should work well from an API perspective as well with the obvious creation of cell
and receptor
endpoints.
The only thing I'm unclear about it this, can you describe a use case where this will occur?
b. A
receptor
MAY be associated with one or morecell
objects.
The only thing I'm unclear about it this, can you describe a use case where this will occur?
b. A
receptor
MAY be associated with one or morecell
objects.
Clonally expanded T cells. I just wanted to state that this is a situation that we need to be able to represent, when designing the cell
and receptor
objects. This should not be in conflict with clone_id
as the latter one captures a sequence (not receptor) based grouping of cells.
Clonally expanded T cells.
Ah I see, so you are thinking of receptors abstractly, that is a single receptor object can represent multiple physical receptor entities (assuming the sequence is identical, and actually I guess the additional observable properties as well)?
Correct.
Ok, how far can that extend? Can that single receptor object be used across multiple repertoires, multiple studies?
From an API perspective, the id
is globally (or repository) unique, so you can do a directly query to return the information. So technically it is possible to allow a receptor object to be very broad.
/receptor/:receptor_id
The relationship you defined between cell
and receptor
is n-to-n, so I'm mainly trying to think about how that will be represented and where it will be stored.
If this requirement was changed, that is each receptor object represents a single physical receptor, that means it could only belong to a single cell, the relationship becomes 1-to-n and is easier to store and manage.
I understand that you would like to avoid n-to-n if possible, but in this case we need it, at least for single-cell. Otherwise we are buying the simplicity of an 1-to-n relation with an eventually non-normalized data structure, which I would like to avoid.
Regarding the scope of receptor_id
: Yes, theoretically we can make it global. We just need to be aware that once we move beyond the subject
, we need support for multiple MHC haplotypes when annotating reactivity.
From the call:
pair_id
never seems to have made it into a release, so we can delete it without having to deprecate it.Given the above, do you think we can close this issue. Paired discussion will continue on in the Cell/Receptor discussion and we are removing pair_id from rearrangements... I would like to close this issue unless anyone objects...
I am ok with closing and moving the discussion to #320.
Are we going to remove pair_id
?
Yes, at least that was my understanding from the January DataRep call (see @javh's comment above).
Okay, I missed that call so wanted to check.
We are starting to look at paired sequencing data and I am starting to wonder how one determines uniqueness of a pair of sequences. This is related to #246 but I thought it worthy of a separate discussion so that we might be able to get a quick resolution on the issue.
From the spec it says the pair_id is a "Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level".
I see that sequence_id is supposed to be unique within a "file". I assume this means the uniqueness comes from the sequence_id assignment provided at the fasta file level when sequencing was done. Is that correct? So I believe it is expected that sequence_id is not unique within a repository???
If this is the case, I am wondering about our definition of pair_id. pair_id is a unique sequence_id to which the given sequence has been determined to be a paired chain. If as above, I assume that the sequence_id is not unique within the repository, then it is possible to have collisions on pair_id. What would be the disambiguation rule for a pair_id? It seems like one would require a sample_processing_id (which is where the fasta level sequence_id would be assigned) and a data_processing_id (as one might have a paired chain from a single SampleProcessing annotated with two different annotation tools).
So the query to find a paired sequence for a given sequence S is:
This is somewhat complex, and hard to explain in documentation - I have inferred this from my understanding which may or may not be correct.
Given that the relationship between two sequences in a paired chain is similar in scope to single cell relationships, I wonder if we should have a similar definition for pair_id as we do for cell_id, which is defined as follows: "Identifier defining the cell of origin for the query sequence." Basically have a unique identifier for each pair in the paired chain?
I am not sure this solves the problem, in that we still have the uniqueness issue that needs to be sorted out.