airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

pair_id disambiguation #273

Closed bcorrie closed 4 years ago

bcorrie commented 4 years ago

We are starting to look at paired sequencing data and I am starting to wonder how one determines uniqueness of a pair of sequences. This is related to #246 but I thought it worthy of a separate discussion so that we might be able to get a quick resolution on the issue.

From the spec it says the pair_id is a "Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level".

I see that sequence_id is supposed to be unique within a "file". I assume this means the uniqueness comes from the sequence_id assignment provided at the fasta file level when sequencing was done. Is that correct? So I believe it is expected that sequence_id is not unique within a repository???

If this is the case, I am wondering about our definition of pair_id. pair_id is a unique sequence_id to which the given sequence has been determined to be a paired chain. If as above, I assume that the sequence_id is not unique within the repository, then it is possible to have collisions on pair_id. What would be the disambiguation rule for a pair_id? It seems like one would require a sample_processing_id (which is where the fasta level sequence_id would be assigned) and a data_processing_id (as one might have a paired chain from a single SampleProcessing annotated with two different annotation tools).

So the query to find a paired sequence for a given sequence S is:

       "op":"and",
        "content": [
            { "op":"=","content": {"field":"sequence_id","value":S.pair_id}},
            { "op":"=","content": {"field":"data_processing_id","value":S.data_processing_id}},
            { "op":"=","content": {"field":"sample_processing_id","value":S.sample_processing_id}},
    ]

This is somewhat complex, and hard to explain in documentation - I have inferred this from my understanding which may or may not be correct.

Given that the relationship between two sequences in a paired chain is similar in scope to single cell relationships, I wonder if we should have a similar definition for pair_id as we do for cell_id, which is defined as follows: "Identifier defining the cell of origin for the query sequence." Basically have a unique identifier for each pair in the paired chain?

I am not sure this solves the problem, in that we still have the uniqueness issue that needs to be sorted out.

bcorrie commented 4 years ago

Hmm, in rereading this, I think this is actually more precisely defining and is helping to disambiguate sequence_id rather than pair_id.

javh commented 4 years ago

I must've missed something along the way... That definition of pair_id seems rather odd to me.

Having pair_id work exactly the same as clone_id and cell_id makes more sense to me (just a grouping variable). In most protocols, I would expect pair_id and cell_id to be identical.

Seems like it'd be a lot easier to find collisions too, because you'd just have to look for pair_id values that map to >2 rows.

franasa commented 4 years ago

Hi Brian, I will take the chance here to add a point to the issue. I was also about to open a ticket concerning pair_id. I'm not sure if this is still valid but in #169 you defined it as:

pair_id: identifier for rearrangements that have a chain pairing, unique ID within the repertoire for each unique pair. cell_id: identifier for rearrangements that come from a single cell, unique ID within the repertoire for each unique cell.

a "Valid sequence_id that was determined by experimental or computational means [...]" sounds also a bit strange to me, although I'm not yet an expert. Can it be that the definition from #169 was to be updated in the spec? (sorry if this was already discussed, I am still diving a bit adrift in the ocean of tickets)

About my question: as Jason points, in the single cell context, one cell will normally have two sequences for one rearrangement, thus one cell will have a unique cell_id and a unique pair_id shared by two sequences within one cell. However, not so infrequently, we also see the presence of three sequences (one heavy and one kappa and one lambda) and in some cases, even four meaning two or three different pair_id's for a unique cell_id.

given that each row in our data is a single cell, for creating a unique pair_id corresponding to its belonging cell and avoid repeating each row "pair_id-x" number of times, we wonder if it would make sense to have an extra ?table? with the sequence in question, the cell_id and the pair_id that maps to the original row?

javh commented 4 years ago

as Jason points, in the single cell context, one cell will normally have two sequences for one rearrangement, thus one cell will have a unique cell_id and a unique pair_id shared by two sequences within one cell. However, not so infrequently, we also see the presence of three sequences (one heavy and one kappa and one lambda) and in some cases, even four meaning two or three different pair_id's for a unique cell_id.

Ah, yeah, good point. Ignoring things like allelic inclusion and doublets, you can still have six biologically valid receptor sequences in a normal B cell. 1 productive heavy, 1 productive light, and 4 unproductive (1 heavy, 3 light). To cover that with the current pair_id definition you'd need to force a pair of "master" pair_id values for the entire set or do some sort of categorical clustering thingy to track down all members of a given cell.

I'm a bit fuzzy on why we needed pair_id in addition to cell_id. Something to do with semantics being wrong for pairing done by PCR ligation protocols? Not sure. I can't recall off the top of my head... I'll have to look back through the issues.

bcorrie commented 4 years ago

a "Valid sequence_id that was determined by experimental or computational means [...]" sounds also a bit strange to me, although I'm not yet an expert. Can it be that the definition from #169 was to be updated in the spec? (sorry if this was already discussed, I am still diving a bit adrift in the ocean of tickets)

Actually, that was my mistake. The current spec has "Valid sequence_id that was determined by experimental or computational means [...]" and I made up the other definition because I actually thought it was the same as for cell_id (as @javh suggested). Sorry for the confusion. The current definition is for the sequence_id, but I agree with @javh that it makes sense for that to be changed unless there is a reason not to...

bcorrie commented 4 years ago

I'm a bit fuzzy on why we needed pair_id in addition to cell_id. Something to do with semantics being wrong for pairing done by PCR ligation protocols? Not sure. I can't recall off the top of my head... I'll have to look back through the issues.

Would this make sense. For a single B cell with six receptor sequences, all would have the same cell_id. In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???

scharch commented 4 years ago

In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???

Yes, I think so. But then the heavy chain from a cell with an allelic inclusion should actually have two pair_ids: one for the H+K pairing and one for the H+L.

As a historical matter, I think pair_id was introduced for Adaptive's pairing protocol, which does not involve single cells. Still, I think the nuances we are talking about here are worth being able to distinguish anyway.

javh commented 4 years ago

Would this make sense. For a single B cell with six receptor sequences, all would have the same cell_id. In this case, I assume the productive heavy and light are the paired chains, so they would have the same pair_id, but the other four unproductive would not have a pair_id???

This would make pair_id:null entirely redundant with productive:False. Which is not awful, but I think not a reason to have pair_id.

As a historical matter, I think pair_id was introduced for Adaptive's pairing protocol, which does not involve single cells. Still, I think the nuances we are talking about here are worth being able to distinguish anyway.

They are still inferred a cell association though, correct? If so, cell_id seems appropriate to me.

franasa commented 4 years ago

We have discussed this issue today, and came up with some points. I've tried to illustrate them with these two example tables from our sciReptor data. I think this does not solve the question about the uniquenes of the pair_id but it might help to bring some ideas.

in the single cell context:

as a very rare example we see in cell_id = 296, a relation between 2 H to 1 K and 1 L,all are productive and therefore we would assign pair_ids grouping those sequnces [ids 5 to 8] [less rare are i.e. 115 and 119 with one productive H and two productive L (Kappa and/or Lambda)].

cell_id sequence_id consensus_rank locus length igblast_productive
115 1249 2 K 419 1
115 165 1 K 436 1
115 887 1 H 539 1
119 1460 1 K 420 1
119 846 2 L 442 1
119 679 1 L 443 0
119 1258 1 H 548 1
296 1228 1 K 430 1
296 970 1 L 442 1
296 1044 2 H 543 1
296 845 1 H 545 1

The second table is what we have come up with, and something that i wanted to bring for discussion as a solution to map pair_id's to cell_id's for each single cell database. note that the cell_id column is not necesary [is here for visualization purposes as the sequence_id_ column names] as each sequence_id from the second table will lead to the cell_id of the first one.

pair_id sequence_id_heavy sequence_id_light cell_id
1 887 165 115
2 887 1249 115
3 1258 846 119
4 1258 1460 119
5 845 1228 296
6 845 970 296
7 1044 1228 296
8 1044 970 296
scharch commented 4 years ago

This would make pair_id:null entirely redundant with productive:False. Which is not awful, but I think not a reason to have pair_id.

No, because you could have a cell with only the heavy chain detected, in which case pair_id:null but productive:True. Or any bulk application, where all of the rearrangements have pair_id:null

They are still inferred a cell association though, correct? If so, cell_id seems appropriate to me.

I dug up the following:

  • We probably need an pair_id field in Rearrangement to represent chain pairing separate from cell_id. Corresponds with is_paired=True (wherever we put that - at or downstream from SoftProcessing).

Originally posted by @javh in https://github.com/airr-community/airr-standards/issues/169#issuecomment-452063537

Unfortunately, the minutes from that meeting, in their entirety, are:

We exclusively discussed GitHub issue #169. Complications across different types of protocols led us to conclude the following: The metadata should have a boolean flag for whether the rearrangement data is paired at the SoftwareProcessing level (is_paired), along with the current "single_cell" boolean flag at the CellProcessing level for whether it's single-cell. Each of these flags lets the user know whether "pair_id" (a new rearrangement field) and/or "cell_id" are meaningfully set in the associated data file.

Maybe @laserson remembers more?

I also found this:

With the wisdom of hindsight: See #169 and #206 => Use of cell_id to annotate chain linkage has been deprecated, instead an additional paired_id field was introduced (as single cell and chain pairing are not always associated).

Originally posted by @bussec in https://github.com/airr-community/airr-standards/issues/165#issuecomment-541961406

So it sounds like this is already integrated into the spec...

scharch commented 4 years ago

@franasa I think in your example data, cells 119 and 296 should not have pair IDs, because it is unclear which is the biological pairing. (I'm assuming that such a scenario is a doublet; if you can really prove that they are true single cells with both heavy and light chain inclusions then your assignments are correct.)

Either way, the AIRR Rearrangements format would have:

sequence_id productive locus pair_id cell_id
1249 T IGK 1 115
165 T IGK 2 115
887 T IGH 1,2 115

etc, where I've left out non-AIRR "custom" fields and converted productive and locus to match the schema. The primary identifier here is sequence_id, from which you can get both pair_id and cell_id.

Going back to @bcorrie's original question, I believe that sequence_id is unique within a RearrangementSet, and a rearrangement_set_id is unique within a Repertoire. Or maybe sequence_id is directly unique within a Repertoire. Either way, it seems to me that pair_id and cell_id should follow that same rule. So if you have just the pair_id and you want to get cell_id, you'd query the Repertoire for all Rearrangements matching the former field and then look at the latter. Does that meet your needs?

javh commented 4 years ago

Thanks, @scharch. That's immensely helpful.

No, because you could have a cell with only the heavy chain detected, in which case pair_id:null but productive:True. Or any bulk application, where all of the rearrangements have pair_id:null

Ah, that's a good point. So if I'm following the intent correctly here, then pair_id:null is basically meant to serve the same purpose as is_paired:False, but at the individual sequence level? If that's correct, then we should make that explicit in the schema docs.

schristley commented 4 years ago

Either way, it seems to me that pair_id and cell_id should follow that same rule.

I agree, and I believe that rule should be they are a grouping variable with a user-defined value, which should be unique within a DataProcessing for a Repertoire.

In other words, I think the pair_id description should be changed from this:

                Valid sequence_id that was determined by experimental or computational means to
                be associated with the current Rearrangement on the cellular level.

to something more like this similar to the cell_id description

                Identifier defining paired rearrangements that was determined by
                experimental or computational means.
schristley commented 4 years ago

A discussion in Haifa introduced a potential problem. It was mentioned that in a single cell, an IGH may form a pair with an IGL, but that the same IGH could form a pair with an IGK. This implies that a single rearrangement record can be in more than one pair, and thus pair_id is not a single value but potentially a list of values. I guess looking back through the comments, I see mention of this, as @scharch shows 1,2 as a pair_id value. The problem with using a list is that simple queries cannot be used to extract the pairs. Likewise, we've been trying to avoid lists in the AIRR TSV format.

lgcowell commented 4 years ago

For what it’s worth, this is also observed in alpha beta T cells where they can express two alpha chains, so it would be good to have this worked out for both cell types.

scharch commented 4 years ago

Maybe in such cases it makes sense to list the heavy/beta chain twice? So my previous example would become:

sequence_id productive locus pair_id cell_id
1249 T IGK 1 115
165 T IGK 2 115
887 T IGH 1 115
887 T IGH 2 115

...except that I assume we'd have to give the duplicated heavy/beta chains separate sequence_ids, which may probably raises a different set of issues...

schristley commented 4 years ago

...except that I assume we'd have to give the duplicated heavy/beta chains separate sequence_ids, which may probably raises a different set of issues...

Yeah, unfortunately, duplicating the rearrangement records will raise other issues, especially with calculating quantities and counts...

Maybe we should start allowing arrays to be represented in some select fields, like pair_id, where we have greater ability to specify the content. Having 1,2 as the value in the TSV seems a lot simpler than other ideas.

sharikrish commented 4 years ago

Following @schristley, do we agree that we could represent some fields as arrays and if yes does the below suggestion for pair_id spec looks reasonable?

pair_id:
            type: array
            description: Valid list of sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level.
            example: 
                - ABC314159
                - ABC314160
            x-airr:
                miairr: true
                required: true
                nullable: true
                adc-api-optional: false
                set: 6
                subset: data (processed sequence)
                name: Paired-chain index
javh commented 4 years ago

Hrm. I don't think we have consensus per se, but I'm strongly opposed to pair_id mapping to sequence_id as that's rather anti-tidy. You can't perform a group-by operation on the column and it would differ from behavior of other *_id fields in the spec. You could certainly enforce the convention in your specific dataset that the pair_id is the same as the sequence_id of the heavy/beta chain in the pair_id set, but I don't think that should be part of the spec.

As far as the list goes, officially supporting lists (nested columns) in the TSV would break the rules. Currently, we say:

Nested delimiters are not supported by the schema explicitly and should be avoided. However, if multiple values must be reported in a single column for an application specific reason, then the use of a comma as the delimiter is recommended.

http://docs.airr-community.org/en/latest/datarep/format.html#formatspecification

We discourage nested delimiters because it's a bit of an anti-pattern. Changing that would be a very large change, in my opinion. If we truly need a nested representation for pairing, and we might, then I would first reach for a supplemental TSV or yaml to define those pairings before changing the TSV spec.

So... I guess the big question for me is, what does a list of pair_id get you in terms of a practical use case? To use @scharch's example from above, how does this:

sequence_id productive locus pair_id cell_id
1249 T IGK 1 115
165 T IGK 2 115
887 T IGH 1,2 115

Facilitate analyses in a way that the following does not?

sequence_id productive locus pair_id cell_id
1249 T IGK 1 115
165 T IGK 1 115
887 T IGH 1 115

Some downstream determination needs to be made about how to analyze the VH:VL pairs. Is it somehow made easier by the former?

(You could set some rule that the first table represents doublets or allelic inclusion and the later a bispecific antibody, but that seems confusing and error prone.)

bcorrie commented 4 years ago

Hrm. I don't think we have consensus per se, but I'm strongly opposed to pair_id mapping to sequence_id as that's rather anti-tidy. You can't perform a group-by operation on the column and it would differ from behavior of other *_id fields in the spec. You could certainly enforce the convention in your specific dataset that the pair_id is the same as the sequence_id of the heavy/beta chain in the pair_id set, but I don't think that should be part of the spec.

Yes, I think we were leaning towards @javh's approach where the pair_id is "just" an ID that groups things.

bcorrie commented 4 years ago

We discourage nested delimiters because it's a bit of an anti-pattern. Changing that would be a very large change, in my opinion.

I agree with this, and think we would only want to do that if absolutely necessary... In particular, if as a result of the single cell extension pull request (#211) we define a Cell object (and maybe a Receptor object?) that can capture this relationship much more accurately, I would be very hesitant to make such a fundamental change to the Rearrangement spec without having the other entities sorted out first.

It seems like the wrong approach to force what seem to be increasingly complex relationships into the Rearrangement object.

scharch commented 4 years ago

As far as the list goes, officially supporting lists (nested columns) in the TSV would break the rules.

Actually, I think the proposed usage might fit within the rules. Allelic inclusion is expected to be uncommon, and representing it could perhaps be considered "multiple values [that] must be reported in a single column for an application specific reason"...?

Facilitate analyses in a way that the following does not?

I think the main objection would be that it sacrifices clarity/explicitness. It also makes it closer to redundant with cell_id, since I think the current non-single-cell pairing approaches can't see inclusions.

javh commented 4 years ago

Actually, I think the proposed usage might fit within the rules. Allelic inclusion is expected to be uncommon, and representing it could perhaps be considered "multiple values [that] must be reported in a single column for an application specific reason"...?

What we meant by "application specific reason" was something along the lines of "if your tool really needs to do this then go ahead and use a comma, but our API isn't going to parse it and other people's tools are under no obligation to do so either".

For example, we support nested delimiters in the v_call, d_call and j_call columns in changeo for ambiguous alignments, but that's just one of those historical requirements of the tool, not part of the spec.

javh commented 4 years ago

Should've put this in the same comment above. Oh well, typing faster than I could think...

We could take the same attitude as the *_call fields with pair_id. If the use case for a list is rare enough, then just have the official spec be a single id value, but maybe the occasional tool would use a list in a non-official manner.

scharch commented 4 years ago

We could take the same attitude as the *_call fields with pair_id

Yeah, that's exactly what I had in mind. Not wedded to it, but seems easier than anything else we've discussed.

bussec commented 4 years ago

TL;DR Let's trash pair_id and use a "virtual" cell object instead.

Given the pains that pair_id is creating, @srlak, @franasa and me had a brainstroming session, the outcome of which I will try to summarize below.

Markup: biological entity and data object

Basic assumptions

  1. The two biological entities that we want to represent for single-cell(-like) data are cell and receptor.
  2. Properties of a cell: a. A cell is modeled as a bag that contains DNA/RNA molecules encoding Ig/TCR chains. b. A cell can express one or more receptor species. c. A cell is represented by a cell object that has a unique cell_id identifier. d. A cell can have additional observable properties (e.g., flow profile, transcriptome), which should be attributable to the cell object.
  3. Properties of a receptor: a. A receptor is modeled as the combination of two non-identical types of Ig/TCR polypeptide chains (i.e., ignoring #297) b. As polypeptides, the chains must be encoded by productively rearranged sequences. c. A receptor is represented by a receptor object that has a unique receptor_id identifier. d. A receptor can have additional observable properties (e.g., affinity against an antigen), which should be attributable to the receptor object.
  4. Relationships between cell and receptor objects: a. A cell MAY be associated with one or more receptor objects. b. A receptor MAY be associated with one or more cell objects. c. Associating a cell and a receptor is a valid operation if and only if the cell is associated with all rearrangement objects required to encode the receptor.

Importantly, these assumption do NOT REQUIRE that the cell has to be observed directly (e.g. by flow or microscopy). It is sufficient if the experimental setup creates a trackable single-cell context (e.g., microfluidics with unique barcodes for each droplet) which allows the subsequent association of an observed rearrangement with a cell.

One of the reasons why we introduced the pair_id, was to be able to annotate stochastic pairing data, like PAIRseq. In these experimental setups, rearrangements are inferred to be associated with each other, based on the correlated presence and absence of the rearrangements in multiple sub-samples under limiting dilution conditions. As these setup do not create a single-cell context, we did not want to use the cell_id field to represent chain linkage, as we thought that this should be reserved for "true" single-cell setups. However, as noted above, most of our single-cell data will not come from directly observed single cells, but from workflows that only preferentially create a single-cell context and subsequently attempt to filter out doublets etc. Finally, the technical reason why stochastic pairing works, is because the DNA/RNA molecules are in the same cell, not because they encode components of the same receptor. Therefore, it will also allow pairing of non-productive chains (although some software might filter those out).

Therefore, we would like to pitch the idea that stochastic pairing should not be represented by pair_id but instead via a "virtual" cell. These cell objects would contain a virtual: true key to indicate that they were inferred. As we now can represent all chain paired data via cell (rearrangement would link n:1 to cell), pair_id could be deprecated. In this model, receptor would be reserved for data that is based on direct observation and would not be linked from rearrangement.

bcorrie commented 4 years ago

Therefore, we would like to pitch the idea that stochastic pairing should not be represented by pair_id but instead via a "virtual" cell. These cell objects would contain a virtual: true key to indicate that they were inferred. As we now can represent all chain paired data via cell (rearrangement would link n:1 to cell), pair_id could be deprecated. In this model, receptor would be reserved for data that is based on direct observation and would not be linked from rearrangement.

If I understand correctly you are suggesting that at the Rearrangement object level we only need cell_id and we no longer need pair_id? That is cell_id is sufficient to capture both uses? I think this is right, but wanted to confirm...

bcorrie commented 4 years ago

@bussec I think this is suggesting that with further hindsight, the decision we made to have both (see https://github.com/airr-community/airr-standards/issues/165#issuecomment-541961406) was not necessary, correct?

bussec commented 4 years ago

@bcorrie Yes, we only need cell_id. The information that the cell was not directly observed is moved to the cell object. Therefore pair_id is not necessary anymore. Any yes, our hindsight (as in #165) is getting close to 20/20.

schristley commented 4 years ago

TL;DR Let's trash pair_id and use a "virtual" cell object instead.

@bussec Very good, this looks good to me. It should work well from an API perspective as well with the obvious creation of cell and receptor endpoints.

The only thing I'm unclear about it this, can you describe a use case where this will occur?

b. A receptor MAY be associated with one or more cell objects.

bussec commented 4 years ago

The only thing I'm unclear about it this, can you describe a use case where this will occur?

b. A receptor MAY be associated with one or more cell objects.

Clonally expanded T cells. I just wanted to state that this is a situation that we need to be able to represent, when designing the cell and receptor objects. This should not be in conflict with clone_id as the latter one captures a sequence (not receptor) based grouping of cells.

schristley commented 4 years ago

Clonally expanded T cells.

Ah I see, so you are thinking of receptors abstractly, that is a single receptor object can represent multiple physical receptor entities (assuming the sequence is identical, and actually I guess the additional observable properties as well)?

bussec commented 4 years ago

Correct.

schristley commented 4 years ago

Ok, how far can that extend? Can that single receptor object be used across multiple repertoires, multiple studies?

From an API perspective, the id is globally (or repository) unique, so you can do a directly query to return the information. So technically it is possible to allow a receptor object to be very broad.

/receptor/:receptor_id

The relationship you defined between cell and receptor is n-to-n, so I'm mainly trying to think about how that will be represented and where it will be stored.

If this requirement was changed, that is each receptor object represents a single physical receptor, that means it could only belong to a single cell, the relationship becomes 1-to-n and is easier to store and manage.

bussec commented 4 years ago

I understand that you would like to avoid n-to-n if possible, but in this case we need it, at least for single-cell. Otherwise we are buying the simplicity of an 1-to-n relation with an eventually non-normalized data structure, which I would like to avoid.

Regarding the scope of receptor_id: Yes, theoretically we can make it global. We just need to be aware that once we move beyond the subject, we need support for multiple MHC haplotypes when annotating reactivity.

javh commented 4 years ago

From the call:

bcorrie commented 4 years ago

Given the above, do you think we can close this issue. Paired discussion will continue on in the Cell/Receptor discussion and we are removing pair_id from rearrangements... I would like to close this issue unless anyone objects...

bussec commented 4 years ago

I am ok with closing and moving the discussion to #320.

schristley commented 4 years ago

Are we going to remove pair_id?

bussec commented 4 years ago

Yes, at least that was my understanding from the January DataRep call (see @javh's comment above).

schristley commented 4 years ago

Okay, I missed that call so wanted to check.