Extend Clone to single-cell context

scharch commented 4 years ago

Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see https://github.com/airr-community/airr-standards/issues/273#issuecomment-568649516), instead of (or maybe in addition to) Rearrangements.

Just a marker for now, need to think more about what kind of representation would make sense...

Issues to be resolved:

[ ] How to represent multiple chains? Are they embedded in a single Clone object, do we have multiple Clone rows (which introduces other problems), do we create a separate CloneChain object, or something else?
[ ] What are the key relationships with other AIRR objects and how/where are the identifiers stored?

schristley commented 4 years ago

Should the Clone definition also contain both chains? Right now it seems to support only one.

scharch commented 4 years ago

@schristley, I think it will have to support germline_alignment (and all the related fields) as an array, of some sort, yes.

schristley commented 4 years ago

In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly.

javh commented 4 years ago

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

schristley commented 4 years ago

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB?

scharch commented 4 years ago

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

@schristley I was just coming here to suggest essentially the same thing.

It'll still get complicated, though: if each chain is a dict with keys something like {id, type, is_productive}, then a Cell would be an array of those and the "members" of Clone ends up being an array of arrays of dicts. Does that seem workable?

At the same time, we should make it easy to access the primary annotations directly.

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

javh commented 4 years ago

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

https://doi.org/10.1016/j.cell.2019.05.007

scharch commented 4 years ago

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

Anyway, I think that having a type field would help with the parsing you are concerned about.

{
    cell:'cell_id',
    type:'b_cell',
    heavy_chain: [ 'sequence_id1' ],
    light_chain: [ 'sequence_id2', 'sequence_id3' ]
}

But probably even better would be something like

{
    cell:'cell_id',
    type:'b_cell',
    chains:[
        { sequence:'sequence_id1', type:'heavy_chain',... },
        { sequence:'sequence_id2', type:'light_chain',... },
        { sequence:'sequence_id3', type:'light_chain',... }
    ]
}
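A minimal Python sketch of how a tool might read the second layout above. The record mirrors the example; the helper name chains_by_type is hypothetical and not part of any AIRR library.

```python
from collections import defaultdict

# Hypothetical Cell record following the "chains" layout sketched above.
cell = {
    "cell": "cell_id",
    "type": "b_cell",
    "chains": [
        {"sequence": "sequence_id1", "type": "heavy_chain"},
        {"sequence": "sequence_id2", "type": "light_chain"},
        {"sequence": "sequence_id3", "type": "light_chain"},
    ],
}


def chains_by_type(cell_record):
    """Group the sequence IDs of one cell by their chain type tag."""
    grouped = defaultdict(list)
    for chain in cell_record["chains"]:
        grouped[chain["type"]].append(chain["sequence"])
    return dict(grouped)


grouped = chains_by_type(cell)
print(grouped)
```

Because every chain carries its own type tag, a tool only checks one field per chain rather than probing the object for heavy_chain, light_chain, alpha_chain, and so on.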

schristley commented 4 years ago

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

I still need to think through the Cell-Clone relationship, but focussing purely on Clone right now, we could still have explicit field names, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA

This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.

javh commented 4 years ago

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use clone_id as the row key and exclude the sequences field. If you need the individual sequence level data, you'd then search the Rearrangement data by clone_id. Then it's just a clone summary table. But, that's without considering Cell.
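A rough Python sketch of the clone summary table described above, assuming Clone records are already loaded as dictionaries. The function name clones_to_tsv and all field values are invented for illustration; only clone_id and sequences come from the discussion.

```python
import csv
import io

# Hypothetical Clone records; the gene calls and counts are made-up values.
clones = [
    {"clone_id": "c1", "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02",
     "clone_count": 3, "sequences": ["s1", "s2", "s3"]},
    {"clone_id": "c2", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02",
     "clone_count": 1, "sequences": ["s4"]},
]


def clones_to_tsv(clone_records, exclude=("sequences",)):
    """Write one row per clone, keyed by clone_id, dropping array fields."""
    fields = [name for name in clone_records[0] if name not in exclude]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(clone_records)
    return buffer.getvalue()


tsv = clones_to_tsv(clones)
print(tsv)
```

Sequence-level data would then be fetched from the Rearrangement table by clone_id, as the comment suggests. This sketch does not address Cell.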

Some sort of type field seems like it might be a solution. Though, you'd still have to do a check of some kind, but it would be a simpler check.

The way Clone is setup right now seems really geared towards IGH/TRB/TRD data only. Hrm.

schristley commented 4 years ago

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

I'm still thinking through this. A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells. Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

scharch commented 4 years ago

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing.

schristley commented 4 years ago

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

An optional extended data structure like you suggested above for providing additional chains.

schristley commented 4 years ago

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

"better" only in a data structure sense. As a Cell belongs to one Clone, it could be represented with a single field clone_id, while a Clone containing many Cells would require an array of cell_ids.
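To illustrate the data structure argument, here is a small Python sketch in which the link lives on the Cell (the N side) and the per-clone cell list is derived on demand. The records and the helper name cells_by_clone are hypothetical.

```python
from collections import defaultdict

# Hypothetical Cell records; only cell_id and clone_id matter here.
cells = [
    {"cell_id": "cell1", "clone_id": "cloneA"},
    {"cell_id": "cell2", "clone_id": "cloneA"},
    {"cell_id": "cell3", "clone_id": "cloneB"},
]


def cells_by_clone(cell_records):
    """Rebuild the clone -> cells listing from the link stored on each Cell."""
    index = defaultdict(list)
    for record in cell_records:
        index[record["clone_id"]].append(record["cell_id"])
    return dict(index)


members = cells_by_clone(cells)
print(members)
```

With this layout the Clone record never needs an array of cell_ids. The cost is one pass (or one indexed query) over the Cell table when the membership list is wanted.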

bcorrie commented 2 years ago

OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression.

We can currently load everything in principle and practice, based on the current AIRR Spec.

The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of one processing run. That is when everything blows up.

I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-)

In the 10X case you get:

A single clone_id has multiple chains. I have seen two and three chains thus far for a single clone_id
Our current Clone object is focused on a single chain only
Pretty well all fields in the Clone object that describe the clone (VDJ calls, junction, alignment, sequences) need to be different for each of the chains in the clone (not just the VDJ calls as discussed above). I count 18 fields based on a quick count.

So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a Manifest to tie them together...

This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-)

bcorrie commented 2 years ago

So something like this
v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA
This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.

I am not sure this would work, given that I think there are on the order of 18 fields that would need different values for multiple chains...

It seems to me that a Clone object should be an array of N CloneChains (where N is small (1-3?) but flexible) with each of the 18 fields that describe the "inferred ancestor of the clone" in the CloneChain object???

I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements

bcorrie commented 2 years ago

Alternatively we could leave the Clone object as is, but treat it as a CloneChain object, and store multiple CloneChain objects with the same clone_id. You would then link multiple chains that are associated with the same Clone through the clone_id.

This is how we are going to load data for now, as this is really the only way to link multiple chains...
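A hedged Python sketch of the interim approach described above: each stored record is treated as one chain, and chains are tied together by a shared clone_id. The records, the locus field usage, and the helper name assemble_clones are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical single-chain Clone records sharing clone_id values.
clone_chains = [
    {"clone_id": "clonotype1", "locus": "TRB"},
    {"clone_id": "clonotype2", "locus": "TRB"},
    {"clone_id": "clonotype1", "locus": "TRA"},
    {"clone_id": "clonotype1", "locus": "TRA"},
    {"clone_id": "clonotype2", "locus": "TRA"},
]


def assemble_clones(chain_records):
    """Collect single-chain records that share a clone_id into one entry."""
    # sorted() is stable, so chains keep their input order within a clone.
    ordered = sorted(chain_records, key=itemgetter("clone_id"))
    return {
        clone_id: list(group)
        for clone_id, group in groupby(ordered, key=itemgetter("clone_id"))
    }


paired = assemble_clones(clone_chains)
for clone_id, chains in paired.items():
    print(clone_id, [chain["locus"] for chain in chains])
```

Nothing in this layout limits a clone to two chains, so the three-chain case seen in 10X output fits without a schema change.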

javh commented 2 years ago

I don't have a great suggestion for this right now, but I think this intersects with how we might want to think about Receptor. There's a couple things going on in the current Clone schema - properties of specific observed sequences (sequences, junction, etc) and properties of the naive ancestor that are common to all the observed members of a clone (v_call, germline_alignment, etc). The latter seems like something we can separate out into an object and use for both Clone and Receptor and then nest under primary_chain and secondary_chain.

TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?

bcorrie commented 2 years ago

TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?

Yes, I have seen some data from 10X that have multiple TRAs (3 chains per clone) - that is where I started down this rabbit hole of trying to figure out how we should curate this. I think the array of CloneChain could be pretty flexible and maybe handle this...

Good point about separating out the info of the naive ancestor from the observed properties...

scharch commented 2 years ago

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements. Then each cell can hold (point to)? an arbitrary number of rearrangements as needed. The inferred_ancestor also becomes a cell. And all this is more biologically "correct," too.

I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements

No, you can't -- or at least you're not guaranteed to be able to. If your Clone of interest is from the original/primary analysis, it might work, but if it's from a secondary analysis or a reanalysis, the cell's clone_id will forever point only to the first one.

javh commented 2 years ago

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.

I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.

scharch commented 2 years ago

I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.

It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?

schristley commented 2 years ago

I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.

The clone_count description has been updated to also mention cells...

It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

scharch commented 2 years ago

The clone_count description has been updated to also mention cells...

OK, I wasn't reading it like that, but you're right. @javh does this way of doing it satisfy you?

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

I guess, but I thought we've been trying to avoid that kind of two-step look up...

schristley commented 2 years ago

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.

I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.

I like it too. Our challenge is how to handle the identifiers. Right now Clone has a sequences field which is all the rearrangement IDs. I have repertoires where there are thousands upon thousands of rearrangement records that make up a clone. Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.

Now I suppose Clone could have a cells field which references the cell IDs, and you might make the argument that there will be fewer cell records... But honestly, I'm seeing single cell experiments that do upwards of 100K cells, so how long will that hold?

It's the same situation with a 1-N relationship between clone and cell, so it makes sense to put the clone_id inside of Cell instead of having a list of cell IDs in clone.

schristley commented 2 years ago

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

I guess, but I thought we've been trying to avoid that kind of two-step look up...

My perception is if you were working in a single cell context, your workflow might be like this:

Query the studies/repertoires of interest.
Query the cells based upon the repertoire IDs.
If clone data is wanted for cells, get using the clone_id in Cell.
If rearrangement data is wanted for cells, query rearrangements using cell_id.
If receptor data is wanted for cells, get using the receptor_id in Cell.

So you will be working through the Cell objects to get to other data.
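The workflow above can be sketched in Python against small in-memory tables. This is only an illustration: the tables stand in for repository queries, the record values are invented, and the function names are hypothetical rather than ADC API calls.

```python
# Hypothetical in-memory stand-ins for repository collections.
cells = [
    {"cell_id": "cell1", "repertoire_id": "rep1", "clone_id": "cloneA"},
    {"cell_id": "cell2", "repertoire_id": "rep1", "clone_id": "cloneB"},
    {"cell_id": "cell3", "repertoire_id": "rep2", "clone_id": "cloneA"},
]
clones = {
    "cloneA": {"clone_id": "cloneA", "clone_count": 2},
    "cloneB": {"clone_id": "cloneB", "clone_count": 1},
}
rearrangements = [
    {"sequence_id": "s1", "cell_id": "cell1"},
    {"sequence_id": "s2", "cell_id": "cell1"},
    {"sequence_id": "s3", "cell_id": "cell2"},
]


def cells_for_repertoires(repertoire_ids):
    """Step 2: select cells belonging to the repertoires of interest."""
    return [c for c in cells if c["repertoire_id"] in repertoire_ids]


def clone_for_cell(cell):
    """Step 3: follow the clone_id stored on the Cell."""
    return clones.get(cell["clone_id"])


def rearrangements_for_cell(cell):
    """Step 4: query rearrangements by cell_id."""
    return [r for r in rearrangements if r["cell_id"] == cell["cell_id"]]


selected = cells_for_repertoires({"rep1"})
for cell in selected:
    seq_ids = [r["sequence_id"] for r in rearrangements_for_cell(cell)]
    print(cell["cell_id"], clone_for_cell(cell)["clone_id"], seq_ids)
```

Every lookup starts from a Cell record, which is the point being made: in a single-cell context the Cell is the hub, and the other objects are reached through identifiers it carries.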

scharch commented 2 years ago

Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.

I get it, but Cells can be members of multiple Clones and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...

scharch commented 2 years ago

My perception is if you were working in a single cell context, your workflow might be like this:

Query the studies/repertoires of interest.

Query the cells based upon the repertoire IDs.

If clone data is wanted for cells, get using the clone_id in Cell.

If rearrangement data is wanted for cells, query rearrangements using cell_id.

If receptor data is wanted for cells, get using the receptor_id in Cell.

So you will be working through the Cell objects to get to other data.

I have to think more about this, but as a general matter you are right that I will know if I'm working in a Cell context or a Rearrangement context. Is that enough, though?

schristley commented 2 years ago

I get it, but Cells can be members of multiple Clones

Hmm, that's challenging, as that implies an N-N relationship, but this isn't the biology, right, as a cell only belongs to one clone? So the multiple Clones come from running different analyses?

and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...

I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone. This doesn't handle all possibilities though, it is essentially turning the N-N relationship back into 1-N, but if we truly need N-N regardless of data processing, then we don't have much choice.

schristley commented 2 years ago

we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC.

I wonder if you mean immutable or if you mean a singleton? Nothing in the ADC is really immutable, any of the records could be updated with additional identifiers if new data processing is performed, which is then loaded into the ADC.

But I think I understand your point, if you pull out Cells from ADC then do a re-analysis of clones, it makes sense that you are generating clone data de novo while the Cell data stays the same...

scharch commented 2 years ago

isn't the biology right as a cell only belongs to one clone. So the multiple Clones come from running different analyses?

Yes, but it also could be a meta analysis combining samples (from the same individual, obviously) that were previously processed separately.

I wonder if you mean immutable or if you mean a singleton?

Yeah, you're right, I meant singleton - that a single biological/physical cell results in (at most) one Cell object.

I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone.

I don't follow, could you please explain more?

schristley commented 2 years ago

isn't the biology right as a cell only belongs to one clone. So the multiple Clones come from running different analyses?

Yes, but it also could be a meta analysis combining samples (from the same individual, obviously) that were previously processed separately.

Would a concrete example be where you had two time points? Maybe the first time point was collected and analyzed, then (say) 1 month later another time point is collected, now you do analysis over both because you expect the same Clone to be present in both time points, though they are different cells?

The reason I ask, and this is a digression, is I've been thinking about this type of time course analysis, and if a tool links clone(s) across time points, how can the tool signal that so the links are maintained when the data is loaded into the ADC?

I wonder if you mean immutable or if you mean a singleton?

Yeah, you're right, I meant singleton - that a single biological/physical cell results in (at most) one Cell object.

I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone.

I don't follow, could you please explain more?

Ok, yeah, so in general, we could take the biological relations, like a cell belongs to one clone, and just implement that directly in the data structures, and the relationships between data objects matches the relationships between their associated biological objects. But as soon as we introduce the idea of multiple data processings, that throws a wrench into the whole design. What happens is many 1-N relationships get turned into N-N relationships, as with the cell/clone relationship.

The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationship into M 1-N relationships, M being the number of different data processings. So how does that work concretely? Well, every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements: imagine processing with IgBlast and Mixcr as two separate data processings; the results can be stored together yet separated by their different data_processing_ids.

So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)

clones: [{clone_id:'123', data_processing_id:'456'}, {clone_id:'abc1', data_processing_id:'567'}]
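A small Python sketch of how the compound identifier might be used, assuming the clones list sits on the Cell record as proposed. The helper names clone_id_for and add_clone_link are hypothetical.

```python
# Hypothetical Cell record carrying (clone_id, data_processing_id) pairs.
cell = {
    "cell_id": "cell1",
    "clones": [
        {"clone_id": "123", "data_processing_id": "456"},
        {"clone_id": "abc1", "data_processing_id": "567"},
    ],
}


def clone_id_for(cell_record, data_processing_id):
    """Return the clone assigned by one data processing, or None."""
    for link in cell_record["clones"]:
        if link["data_processing_id"] == data_processing_id:
            return link["clone_id"]
    return None


def add_clone_link(cell_record, clone_id, data_processing_id):
    """Record a clone assignment from a new analysis, once per processing."""
    if clone_id_for(cell_record, data_processing_id) is None:
        cell_record["clones"].append(
            {"clone_id": clone_id, "data_processing_id": data_processing_id}
        )


print(clone_id_for(cell, "567"))
add_clone_link(cell, "xyz9", "678")  # a later re-analysis adds its own link
print(len(cell["clones"]))
```

Within a single data_processing_id the relationship is 1-N again: each cell resolves to at most one clone.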

scharch commented 2 years ago

Would a concrete example be where you had two time points? Maybe the first time point was collected and analyzed, then (say) 1 month later another time point is collected, now you do analysis over both because you expect the same Clone to be present in both time points, though they are different cells?

The reason I ask, and this is a digression, is I've been thinking about this type of time course analysis, and if a tool links clone(s) across time points, how can the tool signal that so the links are maintained when the data is loaded into the ADC?

Sure, see e.g. these recent papers from the Nussenzweig group at Rockefeller: https://pubmed.ncbi.nlm.nih.gov/33461210/ https://pubmed.ncbi.nlm.nih.gov/34126625/

Or, from our work: https://www.ncbi.nlm.nih.gov/pubmed/24590074/ https://pubmed.ncbi.nlm.nih.gov/26468542/

And there are other similar cases.

I think Bryan Briney is working on a follow-up to this that will include some of the same donors. May or may not be different time points, not sure, but the goal is a little different than the previous examples, anyway.

scharch commented 2 years ago

The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationship into M 1-N relationships, M being the number of different data processings. So how does that work concretely? Well, every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements: imagine processing with IgBlast and Mixcr as two separate data processings; the results can be stored together yet separated by their different data_processing_ids.

So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)
clones: [{clone_id:'123', data_processing_id:'456'}, {clone_id:'abc1', data_processing_id:'567'}]

I see. This makes sense to me, you'd just update the Cell record(s) in the repository to add a new compound identifier to the list. Seems reasonable enough...

schristley commented 1 year ago

@scharch @javh It's been a while since the last discussion burst. Do you think we've enough concrete ideas to adjust the draft objects?

There's going to be a large set of single-cell studies coming down the pipe and going into the ADC, it would be good to implement some of these ideas and see how they work.

bcorrie commented 1 year ago

FYI we have loaded one 10X single cell study into the ADC already (with rearrangements, clones, cells, and GEX), and our clone compromise was to choose one of the chains for clone_id, create a single clone, and store consensus clone data (VDJ+Junction) from one chain. You can find the rearrangement for the other chain using the clone_id in the Rearrangement collection, but due to limits in our clone object we store only a single VDJ/Junction.

When you load the data into a repository you choose which chain you want to focus on.

Far from ideal but seemed like a decent compromise.
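The compromise above amounts to picking one representative chain per clone at load time. A Python sketch follows, with the caveat that the preference order shown is an assumption made here for illustration; the comment only says that the loader chooses which chain to focus on.

```python
# Hypothetical chains observed for one clone_id.
chains = [
    {"sequence_id": "s2", "locus": "IGK"},
    {"sequence_id": "s1", "locus": "IGH"},
]


def pick_representative(chain_records, preferred_loci=("IGH", "TRB", "TRD")):
    """Choose the chain whose consensus data is stored on the single Clone.

    preferred_loci is an assumed, configurable ordering; if none of the
    preferred loci is present, the first chain is used as a fallback.
    """
    for locus in preferred_loci:
        for chain in chain_records:
            if chain["locus"] == locus:
                return chain
    return chain_records[0]


chosen = pick_representative(chains)
print(chosen["sequence_id"], chosen["locus"])
```

The chain that is not chosen remains reachable only through clone_id in the Rearrangement collection, which is the limitation being described.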

bcorrie commented 9 months ago

This one seems like a big one - I think we need to decide as to whether this gets fixed as part of v2.0 or is noted as a weakness/gap in the standard that is not currently addressed.

scharch commented 9 months ago

Yeah this is one of the ones on my personal to-do list...

bcorrie commented 9 months ago

We now have 4 single-cell 10X studies in ADC, and each of the study's Clone data is loaded with a single chain only (as described above), even though the clone in this case is a paired chain clone.

It would be nice if we could fix this (although it means I would need to update a bunch of data) 8-)

schristley commented 9 months ago

15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data. The backlog of studies is still steadily growing. I think this is one of the high priority items that is needed if the ADC is to grow beyond just rearrangement data.

scharch commented 9 months ago

OK I'll try to put a PR together for discussion on the March call...

bcorrie commented 9 months ago

15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data.

Yes, I meant there are 4 10X studies with loaded Clone data - with that data loaded in an "unsatisfactory" way because Clone is not oriented towards paired chains.

scharch commented 9 months ago

If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

schristley commented 9 months ago

If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

@scharch If I understand what you mean, this is already there with cell_id in the rearrangement object. That lets you pull out the rearrangements for a specific Cell.

scharch commented 9 months ago

No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around)?

schristley commented 9 months ago

No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around)?

Sorry, I'm still not understanding. Is the "meaning" of the link to say that those are the "equivalent" chains in two different Cells? If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection? I mean, if two Cells are in the same Clone, the TRB gene should be the same in both Cells. Likewise for the alpha chain. I can see there might be some ambiguity with B cells and SHM.

I guess another way to ask the question is how would you use that link? What problem would it solve for you?
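For what it is worth, the implied connection can be computed in a few lines of Python. The sketch below assumes an exact match on locus, V call, J call, and junction, which is plausible for T cells but, as noted above, would be ambiguous for B cells with SHM. The sequence IDs come from the example in the thread; every gene call and junction is invented for illustration, and the helper names are hypothetical.

```python
# Rearrangements per cell; calls and junctions are made-up example values.
cell1 = [
    {"sequence_id": "TRB123", "locus": "TRB", "v_call": "TRBV1", "j_call": "TRBJ1", "junction_aa": "CASSF"},
    {"sequence_id": "TRA456", "locus": "TRA", "v_call": "TRAV1", "j_call": "TRAJ1", "junction_aa": "CAVNF"},
    {"sequence_id": "TRA789", "locus": "TRA", "v_call": "TRAV1", "j_call": "TRAJ2", "junction_aa": "CAGQF"},
]
cell2 = [
    {"sequence_id": "TRB098", "locus": "TRB", "v_call": "TRBV1", "j_call": "TRBJ1", "junction_aa": "CASSF"},
    {"sequence_id": "TRA765", "locus": "TRA", "v_call": "TRAV1", "j_call": "TRAJ2", "junction_aa": "CAGQF"},
    {"sequence_id": "TRA432", "locus": "TRA", "v_call": "TRAV1", "j_call": "TRAJ1", "junction_aa": "CAVNF"},
]


def chain_key(rearrangement):
    """Identity of a chain within a clone, under the exact-match assumption."""
    return (rearrangement["locus"], rearrangement["v_call"],
            rearrangement["j_call"], rearrangement["junction_aa"])


def match_chains(cell_a, cell_b):
    """Map each sequence_id in cell_a to its counterpart in cell_b, or None."""
    lookup = {chain_key(r): r["sequence_id"] for r in cell_b}
    return {r["sequence_id"]: lookup.get(chain_key(r)) for r in cell_a}


matches = match_chains(cell1, cell2)
print(matches)
```

Both alpha chains here use the same V gene, so the J call and junction are what separate them. Whether that is robust enough to avoid an explicit link in the schema is exactly the open question.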

scharch commented 9 months ago

If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection?

For the researcher looking at the data? Almost certainly. The question is if we need to make it easy to do by code.

how would you use that link? What problem would it solve for you?

Dunno. I was asking if it was something worth designing around when I'm trying to figure out an updated Clone schema. If no one has a use case, then that's my answer :)

schristley commented 9 months ago

@scharch Even though it might not be in our list of requirements, I'll note that Clone is perfectly amenable to a TSV format if only that pesky sequences array is dealt with. There could be considerable benefit and uptake to the Clone spec if toolchains like Immcantation and Repcalc, which are already processing clone TSV files (I may not be completely correct about that), don't need significant retooling to support AIRR. Bonus points in that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change.

scharch commented 8 months ago

if only that pesky sequences array is dealt with

that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change

It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

BUT! I think #769 can solve this, too. There, I am proposing representing inferred naive ancestors as "nonphysical" Rearrangements (or Cells). So if I am understanding correctly what you want, ~~you could just filter for nonphysical==True et voila!~~

Edit: It's probably not that simple. You'd probably have to iterate through Clone objects and extract the naive_ancestor from each. But the point is that it would still be present as a nonphysical rearrangement, and I'd bet it would be relatively straightforward to tweak that a little to help your use case.

schristley commented 8 months ago

It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

Hmm, probably, for T cells at least this is essentially a collapse of (potentially many) rearrangement records into a single record (with a count). Yeah, for B cells, is it a naive ancestral sequence? Or is it a consensus sequence? In either case you are right, it is a computationally inferred sequence (nonphysical might not be the right descriptor) versus being an observed sequence.

That's even assuming I care about the sequence. I'm likely thinking about it wrong, as a smaller, more compact representation of data in the rearrangements (though still potentially large), while you are thinking about it as actual biology.

airr-community / airr-standards

Extend Clone to single-cell context #317