airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Add a "nonphysical" keyword to Rearrangement and Cell #769

Open scharch opened 4 months ago

scharch commented 4 months ago

Closely related to #201, obviously, but I'm actually more thinking about #317 and efforts to simplify the Clone schema. For #201, all Rearrangements/Cells in a Repertoire would be nonphysical, which is why I suggested a Repertoire-level is_simulated keyword.

However, in the Clone space we have inferred intermediates/ancestors, which I guess would either be part of the same Repertoire as the observed Rearrangements/Cells they are based on, or maybe not part of a Repertoire at all.

Currently, we handle this by siloing them into the Clone schema, either directly in Clone (using fields like v_call, germline_alignment, etc) or by converting them into Node objects (which in turn requires Tree to be an object instead of just a field). That's what's making #317 hard, because we've set up Clone and Node to mimic Rearrangements and now we also want them to be able to mimic Cells.

If we instead store the inferred intermediates/ancestors as bona fide but nonphysical Rearrangements/Cells, then Clone can just have a generic array of members and the problem goes away. So crazy it just might work?
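A minimal Python sketch of that idea, purely illustrative: `nonphysical` is the keyword proposed in this issue and is not part of the released schema, and the record contents and helper names are made up. Inferred ancestors sit next to observed Rearrangements, so a Clone only needs a flat list of member IDs.

```python
# Hypothetical records. "nonphysical" is the keyword proposed in this issue,
# not an existing AIRR field; the IDs are invented for illustration.
rearrangements = [
    {"sequence_id": "seq1", "clone_id": "c1", "nonphysical": False},
    {"sequence_id": "seq2", "clone_id": "c1", "nonphysical": False},
    {"sequence_id": "anc1", "clone_id": "c1", "nonphysical": True},  # inferred ancestor
]


def clone_members(records, clone_id):
    """Return the IDs of all members of a clone, observed or inferred."""
    return [r["sequence_id"] for r in records if r["clone_id"] == clone_id]


def observed_only(records):
    """Drop nonphysical records; a missing flag is treated as observed."""
    return [r for r in records if not r.get("nonphysical", False)]


members = clone_members(rearrangements, "c1")
observed_ids = [r["sequence_id"] for r in observed_only(rearrangements)]
```

Any tool that should only see "real" data would have to apply a filter like `observed_only` first, which is the cost of keeping both kinds of record in one place.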

schristley commented 4 months ago

Ah, okay, I understand better what you are saying now. Creating inferred Rearrangements/Cells would be a nice way to re-use the schema, so yes, so crazy it might work! However, it creates the situation where a fake Repertoire needs to be created to hold it all together. But even that is not right: presumably you do have a real Repertoire for the data, but while doing DataProcessing you are creating inferred Rearrangements, and you don't want them to be accidentally included in other computations on the "real" rearrangements. Assigning those inferred Rearrangements to another Repertoire would tend to break the whole chain of processing.

Yes, this is particularly tricky and goes beyond just the idea of supporting "simulated" data sets. I'll ponder on this awhile, but my initial thought is that these inferred things need to be in their own "collections" separate from the other data, yet tied to it using an independent identifier.

scharch commented 4 months ago

My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

schristley commented 4 months ago

> My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.

But you still want it to be connected to a real repertoire with the experimental protocol, right? Because if I'm understanding properly, you are still doing a (say) single-cell experiment, which is described in a Repertoire and which you process into rearrangements/cells, but when you start investigating clones and lineage, you are inferring new sequences?

That's slightly different from a simulated dataset, where essentially everything is "fake".

scharch commented 4 months ago

Yes and yes. So unlike a simulated repertoire, those fields wouldn't be nulled.

schristley commented 4 months ago

There is an "easy" solution but it unfortunately creates significant churn. That is, add an identifier. Just like repertoire_id partitions rearrangements between repertoires, and then data_processing_id at the next level to partition rearrangements within the same repertoire for different data processings, you could add an identifier at a third level that further partitions between real vs inferred. The problem is that implies significant change across the whole tool chain, i.e. the ADC API and analysis tools, which have repertoire_id and data_processing_id baked into their code. All that would need to be rewritten to support a third identifier. So scratch that off the list.
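To make the churn concrete, here is a rough Python sketch. `repertoire_id` and `data_processing_id` are the existing identifiers; `inference_id` is a hypothetical name for the third level, and the records and the `partition` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records. "inference_id" is an assumed name for the third-level
# identifier discussed above; it does not exist in the schema.
records = [
    {"sequence_id": "seq1", "repertoire_id": "r1",
     "data_processing_id": "dp1", "inference_id": None},
    {"sequence_id": "seq2", "repertoire_id": "r1",
     "data_processing_id": "dp1", "inference_id": None},
    {"sequence_id": "anc1", "repertoire_id": "r1",
     "data_processing_id": "dp1", "inference_id": "inf1"},
]


def partition(records, keys):
    """Group sequence IDs by the tuple of values found under `keys`."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in keys)].append(r["sequence_id"])
    return dict(groups)


# What code written against the current two identifiers would compute:
two_level = partition(records, ("repertoire_id", "data_processing_id"))

# What it would need to compute to keep inferred records apart:
three_level = partition(
    records, ("repertoire_id", "data_processing_id", "inference_id")
)
```

With only the two existing keys, `anc1` lands in the same group as the observed sequences, so every consumer that groups this way would need to learn about the third key.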

We want to avoid breaking the existing tool chain, so that implies that the inferred rearrangements/cells need to have a different repertoire_id.

scharch commented 4 months ago

> you could add an identifier at a third level that further partitions between real vs inferred

I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.
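A toy Python illustration of the ambiguity, with made-up IDs and a hypothetical helper: when a clone's observed members come from more than one Repertoire, an inferred ancestor has no single obvious `repertoire_id` to inherit.

```python
# Hypothetical observed members of one clone, drawn from two Repertoires
# that belong to the same RepertoireGroup (IDs invented for illustration).
descendants = [
    {"sequence_id": "seq1", "repertoire_id": "r1"},
    {"sequence_id": "seq2", "repertoire_id": "r2"},
]


def candidate_repertoires(descendants):
    """Return the distinct Repertoires an inferred ancestor could be filed under."""
    return sorted({d["repertoire_id"] for d in descendants})


candidates = candidate_repertoires(descendants)
ambiguous = len(candidates) > 1  # more than one candidate means no unique home
```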

schristley commented 4 months ago

> > you could add an identifier at a third level that further partitions between real vs inferred
>
> I don't think that would work, anyway: Clones will frequently be calculated on RepertoireGroups, so it wouldn't be obvious which Repertoire to put an inferred sequence in even if you could distinguish it by an _id.

Ok, I missed that. So I guess when you say RepertoireGroup, you mean that you have multiple repertoires for a subject, e.g. time course or different tissues or such, and you want to combine them together when doing the clonal inference? Makes sense to me.