Open scharch opened 4 months ago
Ah, okay, I understand better what you are saying now. Creating inferred `Rearrangement`s/`Cell`s would be a nice way to re-use the schema. Yes, so crazy it might work! However, it creates the situation that a fake `Repertoire` needs to be created to hold it all together. But even that is not right: presumably you do have a real `Repertoire` for the data, but while doing `DataProcessing` you are creating inferred `Rearrangement`s, and you don't want them to be accidentally included in other computation on the "real" rearrangements. Assigning those inferred `Rearrangement`s to another `Repertoire` would tend to break the whole chain of processing.
Yes, this is particularly tricky and goes beyond just the idea of supporting "simulated" data sets. I'll ponder on this awhile, but my initial thought is that these inferred things need to be in their own "collections" separate from the other data, yet tied to it using an independent identifier.
My hope is that if we create a way to have a simulated repertoire, it could be relatively easily extended to a "fake" (inferred?) repertoire, as well. But I'm not as optimistic as @javh =P so I'm guessing it'll get pretty hairy.
But you still want it to be connected to a real repertoire with the experimental protocol, right? Because if I'm understanding properly, you are still doing a (say) single-cell experiment, which is described in a `Repertoire`, and that you process into rearrangements/cells, but when you start investigating clones and lineage, you are inferring new sequences?
That's slightly different from a simulated dataset where essentially everything is "fake"
Yes and yes. So unlike a simulated repertoire, those fields wouldn't be nulled.
There is an "easy" solution, but it unfortunately creates significant churn: add an identifier. Just like `repertoire_id` partitions rearrangements between repertoires, and `data_processing_id` at the next level partitions rearrangements within the same repertoire for different data processings, you could add an identifier at a third level that further partitions between real vs inferred. The problem is that this implies significant change across the whole tool chain, i.e. the ADC API and analysis tools, which have `repertoire_id` and `data_processing_id` baked into their code. All of that would need to be rewritten to support a third identifier. So scratch that off the list.
We want to avoid breaking the existing tool chain, so that implies that the inferred rearrangements/cells need to have a different `repertoire_id`.
> you could add an identifier at a third level that further partitions between real vs inferred

I don't think that would work, anyway: `Clone`s will frequently be calculated on `RepertoireGroup`s, so it wouldn't be obvious which `Repertoire` to put an inferred sequence in even if you could distinguish it by an `_id`.
Ok, I missed that. So I guess when you say `RepertoireGroup`, you mean that you have multiple repertoires for a subject, e.g. a time course or different tissues or such, and you want to combine them together when doing the clonal inference? Makes sense to me.
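A small sketch of why the inferred sequence has no obvious home in that setting. All identifiers and structures below are hypothetical, not actual AIRR schema fields:

```python
# Illustrative only: clonal inference over a RepertoireGroup spanning several
# repertoires of one subject (e.g. a time course).
repertoire_group = {
    "repertoire_group_id": "subject1_timecourse",
    "repertoire_ids": ["rep_week0", "rep_week4", "rep_week8"],
}

# Observed members of one clone come from different repertoires:
clone_members = [
    {"sequence_id": "s1", "repertoire_id": "rep_week0"},
    {"sequence_id": "s2", "repertoire_id": "rep_week4"},
]

# An inferred common ancestor is derived from all of these at once, so there
# is no single obvious repertoire_id to assign it.
parent_repertoires = {m["repertoire_id"] for m in clone_members}
ambiguous = len(parent_repertoires) > 1
```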
Closely related to #201, obviously, but I'm actually more thinking about #317 and efforts to simplify the `Clone` schema. For #201, all `Rearrangement`s/`Cell`s in a `Repertoire` would be nonphysical, which is why I suggested a `Repertoire`-level `is_simulated` keyword.

However, in the `Clone` space we have inferred intermediates/ancestors, which I guess would either be part of the same `Repertoire` as the observed `Rearrangement`s/`Cell`s they are based on, or maybe not part of a `Repertoire` at all.

Currently, we handle this by siloing them into the `Clone` schema, either directly in `Clone` (using fields like `v_call`, `germline_alignment`, etc.) or by converting them into `Node` objects (which in turn requires `Tree` to be an object instead of just a field). That's what's making #317 hard, because we've set up `Clone` and `Node` to mimic `Rearrangement`s, and now we also want them to be able to mimic `Cell`s.

If we instead store the inferred intermediates/ancestors as bona fide but nonphysical `Rearrangement`s/`Cell`s, then `Clone` can just have a generic array of members and the problem goes away. So crazy it just might work?