Open scharch opened 4 years ago
Should the Clone definition also contain both chains? Right now it seems to support only one.
@schristley, I think it will have to support germline_alignment (and all the related fields) as an array of some sort, yes.
In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly.
It also is hard to come up with a terminology that covers both T and B cells.
This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?
I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?
Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB?
Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?
@schristley I was just coming here to suggest essentially the same thing.
It'll still get complicated, though: if each chain is a dict with keys something like {id, type, is_productive}, then a Cell would be an array of those and the "members" of Clone ends up being an array of arrays of dicts. Does that seem workable?
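To make the "array of arrays of dicts" shape concrete, here is a minimal Python sketch. The key names come from the comment above; the ids, the chain type labels and the helper function are hypothetical and not part of the AIRR schema.

```python
# Hypothetical sketch: ids, type labels and the helper are illustrative only.
clone_members = [
    # Cell 1: one heavy chain and one light chain
    [
        {"id": "sequence_id1", "type": "heavy_chain", "is_productive": True},
        {"id": "sequence_id2", "type": "light_chain", "is_productive": True},
    ],
    # Cell 2: one heavy chain plus a productive and a non-productive light chain
    [
        {"id": "sequence_id3", "type": "heavy_chain", "is_productive": True},
        {"id": "sequence_id4", "type": "light_chain", "is_productive": True},
        {"id": "sequence_id5", "type": "light_chain", "is_productive": False},
    ],
]


def productive_chain_ids(members, chain_type):
    """Collect the ids of productive chains of one type across all cells."""
    return [
        chain["id"]
        for cell in members          # outer array: one entry per Cell
        for chain in cell            # inner array: one dict per chain
        if chain["type"] == chain_type and chain["is_productive"]
    ]


light_ids = productive_chain_ids(clone_members, "light_chain")
```

The nesting is workable in code, but every consumer has to walk two levels to reach a chain, which is the cost being asked about.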
At the same time, we should make it easy to access the primary annotations directly.
Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?
Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?
This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:
@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.
Anyway, I think that having a type field would help with the parsing you are concerned about.
{
  cell: 'cell_id',
  type: 'b_cell',
  heavy_chain: ['sequence_id1'],
  light_chain: ['sequence_id2', 'sequence_id3']
}
But probably even better would be something like
{
  cell: 'cell_id',
  type: 'b_cell',
  chains: [
    { sequence: 'sequence_id1', type: 'heavy_chain', ... },
    { sequence: 'sequence_id2', type: 'light_chain', ... },
    { sequence: 'sequence_id3', type: 'light_chain', ... },
  ]
}
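As a rough illustration of why the second layout is easier to parse, the Python sketch below groups a cell's chains by their type tag without checking for the presence of per-chain fields. The record mirrors the example above; the function name is hypothetical.

```python
from collections import defaultdict

# Mirrors the second example above; elided fields ("...") are simply left out.
cell = {
    "cell": "cell_id",
    "type": "b_cell",
    "chains": [
        {"sequence": "sequence_id1", "type": "heavy_chain"},
        {"sequence": "sequence_id2", "type": "light_chain"},
        {"sequence": "sequence_id3", "type": "light_chain"},
    ],
}


def chains_by_type(cell_record):
    """Group sequence ids by chain type; works for any vocabulary of types."""
    grouped = defaultdict(list)
    for chain in cell_record["chains"]:
        grouped[chain["type"]].append(chain["sequence"])
    return dict(grouped)


grouped_chains = chains_by_type(cell)
```

The same function would handle alpha_chain or beta_chain entries unchanged, since the chain type is data rather than a field name.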
Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?
This is hard to use (have to check every object for field presence), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:
I still need to think through the Cell-Clone relationship, but focussing purely on Clone right now, we could still have explicit field names, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this:
v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA
This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.
@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.
I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use clone_id as the row key and exclude the sequences field. If you need the individual sequence level data, you'd then search the Rearrangement data by clone_id. Then it's just a clone summary table. But, that's without considering Cell.
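A minimal sketch of that naive conversion, assuming clone records are plain dicts loaded from a clones JSON file. The two records and their gene calls are made up for illustration, and the function name is hypothetical.

```python
import csv
import io

# Made-up clone records; only clone_id and sequences are taken from the discussion.
clones = [
    {"clone_id": "c1", "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02", "sequences": ["s1", "s2"]},
    {"clone_id": "c2", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02", "sequences": ["s3"]},
]


def clones_to_tsv(clone_records, exclude=("sequences",)):
    """Write one row per clone, keyed by clone_id, dropping array-valued fields."""
    fields = [key for key in clone_records[0] if key not in exclude]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter="\t",
        extrasaction="ignore",   # silently skip the excluded keys
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(clone_records)
    return buffer.getvalue()


clone_table = clones_to_tsv(clones)
```

Sequence-level data would then be fetched from the Rearrangement table by clone_id. This sketch does not address Cell, which is the open question above.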
Some sort of type field seems like it might be a solution. Though, you'd still have to do a check of some kind, but it would be a simpler check.
The way Clone is set up right now seems really geared towards IGH/TRB/TRD data only. Hrm.
Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?
I'm still thinking through this. A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells. Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?
And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this
I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.
A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.
Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...
Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?
Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing.
And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this
I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.
An optional extended data structure like you suggested above for providing additional chains.
A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.
Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...
"better" only in a data structure sense. As a Cell belongs to one Clone, it could be represented with a single field clone_id, while a Clone containing many Cells would require an array of cell_ids.
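The two representations carry the same information when a cell belongs to exactly one clone, which the Python sketch below tries to show. The records are made up and the function name is hypothetical.

```python
from collections import defaultdict

# Link stored on the N side: each Cell carries a single clone_id.
cells = [
    {"cell_id": "cell1", "clone_id": "cloneA"},
    {"cell_id": "cell2", "clone_id": "cloneA"},
    {"cell_id": "cell3", "clone_id": "cloneB"},
]


def cells_by_clone(cell_records):
    """Derive the clone -> [cell_id, ...] arrays from the per-cell clone_id."""
    members = defaultdict(list)
    for cell in cell_records:
        members[cell["clone_id"]].append(cell["cell_id"])
    return dict(members)


clone_to_cells = cells_by_clone(cells)
```

The array form can always be recomputed from the single-field form, so storing the link on the Cell avoids keeping a long array inside each Clone.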
OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression.
We can currently load everything in principle and practice, based on the current AIRR Spec.
The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of one processing run - when everything blows up.
I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-)
In the 10X case you get:
- Each clone_id has multiple chains. I have seen two and three chains thus far for a single clone_id.
- The Clone object is focused on a single chain only.
- The fields in the Clone object that describe the clone (VDJ calls, junction, alignment, sequences) need to be different for each of the chains in the clone (not just the VDJ calls as discussed above). I count 18 fields based on a quick count.
So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a Manifest to tie them together...
This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-)
So something like this
v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA
This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.
I am not sure this would work, given that I think there are on the order of 18 fields that would need different values for multiple chains...
It seems to me that a Clone object should be an array of N CloneChains (where N is small (1-3?) but flexible) with each of the 18 fields that describe the "inferred ancestor of the clone" in the CloneChain object???
I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements.
Alternatively we could leave the Clone object as is, but treat it as a CloneChain object, and store multiple CloneChain objects with the same clone_id. You would then link multiple chains that are associated with the same Clone through the clone_id.
This is how we are going to load data for now, as this is really the only way to link multiple chains...
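A minimal sketch of that interim approach, assuming each stored record is one chain of a clone and that records for the same clone share a clone_id. The field values are made up, and the use of a locus field to tell chains apart is an assumption for illustration.

```python
from collections import defaultdict

# Each record is treated as a CloneChain; values are illustrative only.
clone_chain_records = [
    {"clone_id": "clone1", "locus": "TRB", "v_call": "TRBV_example_1"},
    {"clone_id": "clone1", "locus": "TRA", "v_call": "TRAV_example_1"},
    {"clone_id": "clone1", "locus": "TRA", "v_call": "TRAV_example_2"},
    {"clone_id": "clone2", "locus": "TRB", "v_call": "TRBV_example_2"},
]


def assemble_clones(chain_records):
    """Link chains into clones through their shared clone_id."""
    assembled = defaultdict(list)
    for record in chain_records:
        assembled[record["clone_id"]].append(record)
    return dict(assembled)


clones_with_chains = assemble_clones(clone_chain_records)
loci_clone1 = [chain["locus"] for chain in clones_with_chains["clone1"]]
```

Nothing limits the number of chains per clone here, so a three-chain clone with two TRA chains is represented the same way as a two-chain clone.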
I don't have a great suggestion for this right now, but I think this intersects with how we might want to think about Receptor. There's a couple things going on in the current Clone schema - properties of specific observed sequences (sequences, junction, etc) and properties of the naive ancestor that are common to all the observed members of a clone (v_call, germline_alignment, etc). The latter seems like something we can separate out into an object and use for both Clone and Receptor and then nest under primary_chain and secondary_chain.
TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?
TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?
Yes, I have seen some data from 10X that have multiple TRAs (3 chains per clone) - that is where I started down this rabbit hole of trying to figure out how we should curate this. I think the array of CloneChain could be pretty flexible and maybe handle this...
Good point about separating out the info of the naive ancestor from the observed properties...
I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements. Then each cell can hold (point to)? an arbitrary number of rearrangements as needed. The inferred_ancestor also becomes a cell. And all this is more biologically "correct," too.
I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements
No, you can't, or at least you're not guaranteed to be able to. If your Clone of interest is from the original/primary analysis, it might work, but if it's a secondary or reanalysis the cell's clone_id will point forever more only to the first one.
I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.
I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.
I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.
It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?
I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.
The clone_count description has been updated to also mention cells...
It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?
If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?
The clone_count description has been updated to also mention cells...
OK, I wasn't reading it like that, but you're right. @javh does this way of doing it satisfy you?
If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?
I guess, but I thought we've been trying to avoid that kind of two-step look up...
I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.
I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.
I like it too. Our challenge is how to handle the identifiers. Right now Clone has a sequences field which is all the rearrangement IDs. I have repertoires where there are thousands upon thousands of rearrangement records that make up a clone. Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.
Now I suppose Clone could have a cells field which references the cell IDs, and you might make the argument that there will be fewer cell records... But honestly, I'm seeing single cell experiments that do upwards of 100K cells, so how long will that hold?
It's the same situation with a 1-N relationship between clone and cell, so it makes sense to put the clone_id inside of Cell instead of having a list of cell IDs in clone.
If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?
I guess, but I thought we've been trying to avoid that kind of two-step look up...
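My perception is if you were working in a single cell context, your workflow might be like this:
- Query the studies/repertoires of interest.
- Query the cells based upon the repertoire IDs.
- If clone data is wanted for cells, get using the clone_id in Cell.
- If rearrangement data is wanted for cells, query rearrangements using cell_id.
- If receptor data is wanted for cells, get using the receptor_id in Cell.
So you will be working through the Cell objects to get to other data.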
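The Cell-centric workflow described here can be sketched in Python over in-memory stand-ins for the repository collections. Every record below is made up, no repository API is called, and the function name is hypothetical.

```python
# In-memory stand-ins for the repertoire, cell, clone and rearrangement collections.
repertoires = [{"repertoire_id": "rep1", "study_id": "study1"}]
cells = [
    {"cell_id": "cell1", "repertoire_id": "rep1", "clone_id": "cloneA", "receptor_id": "rec1"},
    {"cell_id": "cell2", "repertoire_id": "rep1", "clone_id": "cloneA", "receptor_id": "rec2"},
]
clones = {"cloneA": {"clone_id": "cloneA"}}
rearrangements = [
    {"sequence_id": "TRB123", "cell_id": "cell1"},
    {"sequence_id": "TRA456", "cell_id": "cell1"},
    {"sequence_id": "TRB098", "cell_id": "cell2"},
]


def gather_cell_data(repertoire_ids):
    """Start from cells, then follow clone_id, cell_id and receptor_id outward."""
    result = []
    for cell in cells:
        if cell["repertoire_id"] not in repertoire_ids:
            continue
        result.append({
            "cell_id": cell["cell_id"],
            # Clone data is reached through the clone_id stored in Cell.
            "clone": clones.get(cell["clone_id"]),
            # Rearrangements are queried by cell_id.
            "sequence_ids": [
                r["sequence_id"] for r in rearrangements if r["cell_id"] == cell["cell_id"]
            ],
            # Receptor data would be fetched using this identifier.
            "receptor_id": cell["receptor_id"],
        })
    return result


selected = {r["repertoire_id"] for r in repertoires if r["study_id"] == "study1"}
summary = gather_cell_data(selected)
```

In this shape every lookup begins at a Cell record, so a tool working this way already knows it is in a single-cell context.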
Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.
I get it, but Cells can be members of multiple Clones and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...
My perception is if you were working in a single cell context, your workflow might be like this:
- Query the studies/repertoires of interest.
- Query the cells based upon the repertoire IDs.
- If clone data is wanted for cells, get using the clone_id in Cell.
- If rearrangement data is wanted for cells, query rearrangements using cell_id.
- If receptor data is wanted for cells, get using the receptor_id in Cell.
So you will be working through the Cell objects to get to other data.
I have to think more about this, but as a general matter you are right that I will know if I'm working in a Cell context or a Rearrangement context. Is that enough, though?
I get it, but Cells can be members of multiple Clones
Hmm, that's challenging as that implies an N-N relationship, but this isn't the biology right as a cell only belongs to one clone. So the multiple Clones come from running different analyses?
and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...
I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone. This doesn't handle all possibilities though, it is essentially turning the N-N relationship back into 1-N, but if we truly need N-N regardless of data processing, then we don't have much choice.
we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC.
I wonder if you mean immutable or if you mean a singleton? Nothing in the ADC is really immutable, any of the records could be updated with additional identifiers if new data processing is performed, which is then loaded into the ADC.
But I think I understand your point, if you pull out Cells from ADC then do a re-analysis of clones, it makes sense that you are generating clone data de novo while the Cell data stays the same...
isn't the biology right as a cell only belongs to one clone. So the multiple Clones come from running different analyses?
Yes, but it also could be a meta analysis combining samples (from the same individual, obviously) that were previously processed separately.
I wonder if you mean immutable or if you mean a singleton?
Yeah, you're right, I meant singleton - that a single biological/physical cell results in (at most) one Cell object.
I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone.
I don't follow, could you please explain more?
isn't the biology right as a cell only belongs to one clone. So the multiple Clones come from running different analyses?
Yes, but it also could be a meta analysis combining samples (from the same individual, obviously) that were previously processed separately.
Would a concrete example be where you had two time points? Maybe the first time point was collected and analyzed, then (say) 1 month later another time point is collected, now you do analysis over both because you expect the same Clone to be present in both time points, though they are different cells?
The reason I ask, and this is a digression, is I've been thinking about this type of time course analysis, and if a tool links clone(s) across time points, how can the tool signal that so the links are maintained when the data is loaded into the ADC?
I wonder if you mean immutable or if you mean a singleton?
Yeah, you're right, I meant singleton - that a single biological/physical cell results in (at most) one Cell object.
I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone.
I don't follow, could you please explain more?
Ok, yeah, so in general, we could take the biological relations, like a cell belongs to one clone, and just implement that directly in the data structures, and the relationships between data objects matches the relationships between their associated biological objects. But as soon as we introduce the idea of multiple data processings, that throws a wrench into the whole design. What happens is many 1-N relationships get turned into N-N relationships, as with the cell/clone relationship.
The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationships into M number of 1-N relationships. M here being the number of different data processings. So how does that work concretely, well every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements, imagine processing with IgBlast and Mixcr as two separate data processings, they can be stored together yet separated by their different data_processing_ids.
So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)
clones: [{clone_id:123, data_processing_id:456}, {clone_id:abc1, data_processing_id:567}]
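A minimal Python sketch of how the compound identifier restores a 1-N lookup: within one data processing, a cell resolves to at most one clone. The identifiers are those from the example above, written as strings; the function name and the error handling are hypothetical.

```python
cell = {
    "cell_id": "cell1",
    "clones": [
        {"clone_id": "123", "data_processing_id": "456"},
        {"clone_id": "abc1", "data_processing_id": "567"},
    ],
}


def clone_for_processing(cell_record, data_processing_id):
    """Return the clone_id assigned to this cell by one data processing, if any."""
    matches = [
        link["clone_id"]
        for link in cell_record["clones"]
        if link["data_processing_id"] == data_processing_id
    ]
    if len(matches) > 1:
        # Within a single data processing a cell should belong to one clone.
        raise ValueError("cell assigned to several clones by one data processing")
    return matches[0] if matches else None


primary = clone_for_processing(cell, "456")
reanalysis = clone_for_processing(cell, "567")
```

A re-analysis then only appends one more (clone_id, data_processing_id) pair to the list rather than overwriting the existing link.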
Would a concrete example be where you had two time points? Maybe the first time point was collected and analyzed, then (say) 1 month later another time point is collected, now you do analysis over both because you expect the same Clone to be present in both time points, though they are different cells?
The reason I ask, and this is a digression, is I've been thinking about this type of time course analysis, and if a tool links clone(s) across time points, how can the tool signal that so the links are maintained when the data is loaded into the ADC?
Sure, see eg these recent papers from the Nussenzweig group at Rockefeller: https://pubmed.ncbi.nlm.nih.gov/33461210/ https://pubmed.ncbi.nlm.nih.gov/34126625/
Or, from our work: https://www.ncbi.nlm.nih.gov/pubmed/24590074/ https://pubmed.ncbi.nlm.nih.gov/26468542/
And there are other similar cases.
I think Bryan Briney is working on a follow-up to this that will include some of the same donors. May or may not be different time points, not sure, but the goal is a little different than the previous examples, anyway.
The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationships into M number of 1-N relationships. M here being the number of different data processings. So how does that work concretely, well every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements, imagine processing with IgBlast and Mixcr as two separate data processings, they can be stored together yet separated by their different data_processing_ids.
So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)
clones: [{clone_id:123, data_processing_id:456}, {clone_id:abc1, data_processing_id:567}]
I see. This makes sense to me, you'd just update the Cell record(s) in the repository to add a new compound identifier to the list. Seems reasonable enough...
@scharch @javh It's been awhile since the last discussion burst. Do you think we've enough concrete ideas to adjust the draft objects?
There's going to be a large set of single-cell studies coming down the pipe and going into the ADC, it would be good to implement some of these ideas and see how they work.
FYI we have loaded one 10X single cell study into the ADC already (with rearrangements, clones, cells, and GEX), and our clone compromise was to choose one of the chains for clone_id, create a single clone, and store consensus clone data (VDJ+Junction) from one chain. You can find the rearrangement for the other chain using the clone_id in the Rearrangement collection, but due to limits in our clone object we store only a single VDJ/Junction.
When you load the data into a repository you choose which chain you want to focus on.
Far from ideal but seemed like a decent compromise.
This one seems like a big one - I think we need to decide as to whether this gets fixed as part of v2.0 or is noted as a weakness/gap in the standard that is not currently addressed.
Yeah this is one of the ones on my personal to-do list...
We now have 4 single-cell 10X studies in ADC, and each of the study's Clone data is loaded with a single chain only (as described above), even though the clone in this case is a paired chain clone.
It would be nice if we could fix this (although it means I would need to update a bunch of data) 8-)
15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data. The backlog of studies is still steadily growing. I think this is one of the high priority items that is needed if the ADC is to grow beyond just rearrangement data.
OK I'll try to put a PR together for discussion on the March call...
15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data.
Yes, I meant there are 4 10X studies with loaded Clone data - with that data loaded in an "unsatisfactory" way because Clone is not oriented towards paired chains.
If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...
If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...
@scharch If I understand what you mean, this is already there with cell_id in the rearrangement object. That lets you pull out the rearrangements for a specific Cell.
No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around).
No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around).
Sorry, I'm still not understanding. Is the "meaning" of the link to say that those are the "equivalent" chains in two different Cells? If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection? I mean, if two Cells are in the same Clone, the TRB gene should be the same in both Cells. Likewise for the alpha chain. I can see there might be some ambiguity with B cells and SHM.
I guess another way to ask the question is how would you use that link? What problem would it solve for you?
If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection?
For the researcher looking at the data? Almost certainly. The question is if we need to make it easy to do by code.
how would you use that link? What problem would it solve for you?
Dunno. I was asking if it was something worth designing around when I'm trying to figure out an updated Clone schema. If no one has a use case, then that's my answer :)
@scharch Even though it might not be in our list of requirements, I'll note that Clone is perfectly amenable to a TSV format if only that pesky sequences array is dealt with. There could be considerable benefit and uptake to the Clone spec if toolchains like Immcantation and Repcalc, which are already processing clone TSV files (I may not be completely correct about that), don't need significant retooling to support AIRR. Bonus points in that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change.
if only that pesky sequences array is dealt with
that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change
It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.
BUT! I think #769 can solve this, too. There, I am proposing representing inferred naive ancestors as "nonphysical" Rearrangements (or Cells). So if I am understanding correctly what you want, you could just filter for nonphysical==True et voila!
Edit: It's probably not that simple. You'd probably have to iterate through Clone objects and extract the naive_ancestor from each. But the point is that it would still be present as a nonphysical rearrangement, and I'd bet it would be relatively straightforward to tweak that a little to help your use case.
It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.
Hmm, probably, for T cells at least this is essentially a collapse of (potentially many) rearrangement records into a single record (with a count). Yeah, for B cells, is it a naive ancestral sequence? Or is it a consensus sequence? In either case you are right, it is a computationally inferred sequence (nonphysical might not be the right descriptor) versus being an observed sequence.
That's even assuming I care about the sequence. I'm likely thinking about it wrong, as a smaller, more compact representation of data in the rearrangements (though still potentially large), while you are thinking about it as actual biology.
Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see https://github.com/airr-community/airr-standards/issues/273#issuecomment-568649516), instead of (or maybe in addition to) Rearrangements.
Just a marker for now, need to think more about what kind of representation would make sense...
Issues to be resolved: