If I could redesign Repertoire and its buddies...

airr-community / airr-standards

AIRR Community Data Standards

https://docs.airr-community.org

Creative Commons Attribution 4.0 International

35 stars 23 forks source link

If I could redesign Repertoire and its buddies... #441

Closed schristley closed 3 years ago

schristley commented 4 years ago

It is said that hindsight is 20/20 and I'm not sure that I see that clearly, but given the expansion of the AIRR Data Model, some flaws are becoming apparent. Here's your chance to pile on!

What did we do right?

Repertoire as the abstract unit of analysis was the correct thing to do because it breaks the "one sample/one file" mentality, which is rigid and limited. It should allow flexibility in combining samples (sequencing data) for different analysis comparisons. It should also allow easy access to metadata, both experimental and computational.

What are the major flaws?

One flaw is that creating additional Repertoires that represent different mixtures of samples (sequencing data) is inherently expensive to do, when it should be easy and inexpensive. Those additional repertoires invariably require duplicating rearrangement data, because a single rearrangement record can only link to a single Repertoire. The answer isn't to change that field into a list...
Another flaw is the Repertoire object is trying to do too much, and thus it compromises to meet all the objectives. It's trying to be a structured record of study metadata. It's also trying to be a structured record of data processing protocols. It should restrict itself to just defining the unit of analysis and leave the other stuff to other objects.

What would I change?

Pull DataProcessing out of Repertoire and make it its own top-level independent object. It was initially put in so that "all study metadata" could be conveniently accessed in a single object. Then it needed to be an array to handle multiple analysis workflows. Now it's got its tendrils into everything...
After pulling out DataProcessing, change sample from an array to a single object, then give that whole thing (study, subject, sample) a new name. Maybe it's simply a SampleProcessing record.
Reorient Repertoire around Clone instead of Rearrangement. Clones really become the primary object for downstream analysis.

What does Repertoire look like then? Not sure exactly, but it would be a light weight composite object that mostly links to other objects. Creating a new repertoire should be easy and does not require updating ids, duplicating data, or otherwise changing the primary data.

For this to work, DataProcessing needs to be redesigned as well. While Repertoire is primary a grouping concept, DataProcessing is a transform concept that takes input and generates output, i.e. takes raw sequences to generate rearrangements, takes rearrangements to generate clones, and so forth. SampleProcessing is also a transform concept, going from biological stuff to a raw sequencing data. Though SampleProcessing is essentially constant because it's already been done, while DataProcessing is a changing construct as analysis proceeds. Finding the right relationships between the three would be key to any new design.

I will leave the above text untouched for historical purposes and will use this section to keep a running design.

What are our requirements?

A Repertoire should contain sets of sequences that represent a sample of the biological repertoire, i.e., all Ig/TCRs of one subject at one time point. The schema should enforce the one subject, one time point to avoid misinterpretation and incorrect curation.
We need an object(s) that relaxes these constraints so that we can represent and perform analyses such as:

a. comparing sets of sequences from one subject sampled at different time points b. comparing sets of sequences from multiple subjects sampled at the same time point c. unrestricted grouping of sets of sequences.

Having DataProcessing placed within Repertoire may no longer be appropriate if we want to better represent processing on other object like Clones, Cells, etc.

scharch commented 4 years ago

Thanks for this @schristley! To help clarify, are you thinking about making DataProcessing into a grouping concept, then (ie something that could function as a type of RepertoireSet)?

schristley commented 4 years ago

Thanks for this @schristley! To help clarify, are you thinking about making DataProcessing into a grouping concept, then (ie something that could function as a type of RepertoireSet)?

That's a good question. I still personally lean toward keeping DataProcessing a monolithic design ( #313 ) that encompasses all of the processing versus each object describing a singular data processing step. This would suggest as you indicate, DataProcessing says "here's a list of Repertoires that I processed". But flipping the (Repertoire/DataProcessing) relationship might not be sufficient, we are still left with the same dilemma.

To illustrate, if I have a Rearrangement record, I want to answer these questions:

What DataProcessing generated me?
What Repertoire am I part of, but if we go along with the idea to not duplicate Rearrangement records then we actually ask, what Repertoires (plural) am I part of?
If Repertoire is no longer is the primary holder of the study metadata, then what SampleProcessing am I from?

This essentially implies we need to have, at least, three identifiers so we can maintain the relationships. This is the crux of the problem.

Long, long time ago (but in this galaxy... ;-), the discussion of linking Repertoire and Rearrangement involved two identifiers, repertoire_id and rearrangement_set_id. The thought was, there may be a need to define sets of rearrangements independent from the set of rearrangements in a repertoire. However, it was decided that a single repertoire_id was probably sufficient, but now that original thought seems to be correct. We need an abstract layer between Repertoire and Rearrangement so that manipulating one doesn't require manipulating the other.

So if a "rearrangement set" isn't defined by a repertoire, what is it defined by? It is certainly related to DataProcessing somehow. In particular, some set of raw sequencing file(s) as input to DataProcessing produced this rearrangement set. It is also certainly related to SampleProcessing somehow, as that's the biological protocol which generated the raw sequencing file(s). However, DataProcessing is monolithic so there are other unrelated raw sequencing files as input, and there are other unrelated rearrangement sets as output. We essentially just described a classic n-to-n relationship.

This isn't too far off from our current Repertoire definition. It has a SampleProcessing array, so that's one side of the n-to-n relationship. But the other side is muddled with DataProcessing being an array, so you need both repertoire_id and data_processing_id within Rearrangement to get the other side of the n-to-n relation.

Getting back to your question, flipping the Repertoire/DataProcessing relationship so that DataProcessing has a list of Repertoires versus Repertoire having a list of DataProcessing as it is now, does "clean up" the n-to-n relationship. Now a Repertoire is only associated with a single DataProcessing, thus data_processing_id can be tossed out of Rearrangement. However, you see what we've done, Repertoire is no longer a "unit of analysis". It has strictly become the definition of a set of rearrangements, output by a single DataProcessing from a set of input SampleProcessing. We need to call this something else so we can use Repertoire for analysis, or we create a new name for the analysis object.

bussec commented 4 years ago

I agree with the general idea of Repertoire being an abstract unit and thinking beyond the scope of a single file. While I also agree that it is meaningful to combine some sets of sequences into one Repertoire, this is however not a universally meaningful operation to perform. In my interpretation, a Repertoire should contain sets of sequences that represent a sample of the biological repertoire, i.e., all Ig/TCRs of one subject at one time point. Therefore, it is perfectly fine to combine sequences derived from various tissues (e.g., spleen, bone marrow, peripheral blood) or different cell subsets (e.g., naive, memory, activated) from a single donor and time point. If, however, we allow mixing of donors and/or across time points we degrade this concept of Repertoire to a mere "bag of sequences", which IMO complicates the handling of this object. For the avoidance of doubt, this does not imply that I am questioning

a. the relevance of comparing sets of sequences from one subject sampled at different time points, b. the relevance of comparing sets of sequences from multiple subjects sampled at the same time point or c. that we should have an entity in our data model that allows unrestricted grouping of sets of sequences.

I am just saying that Repertoire is not the right object for this.

Within these limits, I am supportive of flexibility in combining sequence sets.

Reorient Repertoire around Clone instead of Rearrangement. Clones really become the primary object for downstream analysis.

I probably agree with this, as it seems easier to represent cells in this context. But I will have to think about this in more detail, likely also coming back to #317.

bcorrie commented 4 years ago

I think there are two or possibly three things we use Repertoire for, which is causing part of our problem:

Repertoire representing the most granular level of metadata. No arrays of samples, no arrays of data processing, just a big composite object with all AIRR spec metadata fields that provides all the metadata for a thing of interest (Rearrangement/Clone/Cell). This isn't a grouping at all, it is just a composite object that captures the metadata associated with another entity in the data model. This is essentially the simplest form and is the baseline for what the /repertoire API can (should?) return.
Repertoire in the biological sense as @bussec describes it above, a grouping that "...contain sets of sequences that represent a sample of the biological repertoire, i.e., all Ig/TCRs of one subject at one time point". I would suggest that this is the view from the perspective of the study designer/experimenter.
Repertoire as a grouping for analysis as @schristley describes it above, something that allows "... flexibility in combining samples (sequencing data) for different analysis comparisons. It should also allow easy access to metadata, both experimental and computational." I would suggest that this is the view from the perspective of the bioinformatician who is performing analyses possibly across several studies.

As @schristley says, one flaw "... is the Repertoire object is trying to do too much, and thus it compromises to meet all the objectives."

I suspect we need multiple entities... The challenge is how do we redesign this without blowing everything up! It seems to me like perhaps we need a different object that captures the analyses, as they seem the most flexible and dynamic?

scharch commented 4 years ago

What Repertoire am I part of, but if we go along with the idea to not duplicate Rearrangement records then we actually ask, what Repertoires (plural) am I part of?

To me, part of the point of a "repertoire set" is to eliminate the plural here. Rearrangement is always exactly 1:1 with Repertoire; both are static in the database. Instead of creating a new Repertoire, a secondary/meta/re-analysis creates a new "repertoire set" that points to the Repertoires that are included. This has the advantage of allowing subsetting Repertoires, in addition to supersetting them.

However, you see what we've done, Repertoire is no longer a "unit of analysis". It has strictly become the definition of a set of rearrangements, output by a single DataProcessing from a set of input SampleProcessing. We need to call this something else so we can use Repertoire for analysis, or we create a new name for the analysis object.

Exactly! The "repertoire set" becomes the analysis object, which simultaneously allows for things like analyzing clones across repertoires as discussed in https://github.com/airr-community/airr-standards/pull/251#issuecomment-541242498 or combining old and new data as in https://github.com/airr-community/airr-standards/issues/246#issuecomment-531381149.

I think there are two or possibly three things we use Repertoire for, which is causing part of our problem:

Uses 1 and 2 still seem compatible to me, and I think moving use 3 to a repertoire set provides a different way to clean up DataProcessing. If RepertoireSet is the unit of analysis, then we can say 1 SampleProcessing + 1 DataProcessing = 1 Repertoire while maintaining the "flexibility in combining samples (sequencing data) for different analysis comparisons.". Probably SampleProcessing needs to be tweaked a bit so that technical replicates and similar cases can still make it into the same single Repertoire, but that shouldn't be too hard. And I think such a set up addresses @bussec's concerns, too.

Reorient Repertoire around Clone instead of Rearrangement. Clones really become the primary object for downstream analysis.

I'm not sure this works, at least not without a bigger redefinition of terms. It certainly doesn't seem compatible with

a Repertoire should contain sets of sequences that represent a sample of the biological repertoire, i.e., all Ig/TCRs of one subject at one time point.

since Clones can pretty clearly span time points...

javh commented 4 years ago

I like the idea of separating out DataProcessing into its own top level object. If I understand you correctly @schristley, you're basically saying we should have separate objects for "Methods" and "Results" instead of forcing "Methods" to be nested under "Results"? That makes sense to me. Repertoire and RepertoireSet could be tailored around being top level representations of the data itself, whereas DataProcessing and SampleProcessing would be tailored to data provenance and describing the experimental or analytical methods used to generate the data. They'd be linked, but not nested. Yes?

scharch commented 4 years ago

@javh that's clarifying, thanks. So then maybe Manifest (#426) becomes the composite metadata object for use 1 that @bcorrie described above? My only concern is how that might impact the API endpoints...

schristley commented 4 years ago

In my interpretation, a Repertoire should contain sets of sequences that represent a sample of the biological repertoire, i.e., all Ig/TCRs of one subject at one time point. Therefore, it is perfectly fine to combine sequences derived from various tissues (e.g., spleen, bone marrow, peripheral blood) or different cell subsets (e.g., naive, memory, activated) from a single donor and time point.

This might be something we try to encode more explicitly in the schema. Right now the Repertoire schema does enforce the one subject, but it does not enforce the one time point. How might we do that? The obvious thing is to pull out the time-relevant fields out of the sample array, and put them where there can only be a single instance of the values.

If, however, we allow mixing of donors and/or across time points we degrade this concept of Repertoire to a mere "bag of sequences", which IMO complicates the handling of this object.

There clearly are analyses that need to do this, but I agree that we should make this a different concept, e.g. the RepertoireSet, where these constraints can be relaxed.

schristley commented 4 years ago

Therefore, it is perfectly fine to combine sequences derived from various tissues (e.g., spleen, bone marrow, peripheral blood) or different cell subsets (e.g., naive, memory, activated) from a single donor and time point.

Should we support this in the Repertoire object, which is similar to the current Repertoire, where there can be a completely different sets of sample processing? or should we restrict Repertoire even further, e.g. the only multiplicity allowed is biological/technical replicates, and combinations are handled through a different concept? This could make the Repertoire object simpler if we can come up with a simple way to represent replicates.

It also might have the benefit of eliminating the desire to want to identify and analyze those samples separately instead of the repertoire as whole, which entails another id that needs to incorporated and used to link the object in the data model.

schristley commented 4 years ago

I like the idea of separating out DataProcessing into its own top level object. If I understand you correctly @schristley, you're basically saying we should have separate objects for "Methods" and "Results" instead of forcing "Methods" to be nested under "Results"? That makes sense to me. Repertoire and RepertoireSet could be tailored around being top level representations of the data itself, whereas DataProcessing and SampleProcessing would be tailored to data provenance and describing the experimental or analytical methods used to generate the data. They'd be linked, but not nested. Yes?

That's an interesting analogy. Not sure it completely fits but you have the right idea. Having DataProcessing within Repertoire does imply a nesting. I'm not sure if we actually consider it nested, it just may be because we "denormalized" the relationship, it seems that way. But if DataProcessing is going to be more than just processing a Repertoire but also may be processing RepertoireSet, Clones, Cells, etc., then it feels like having it stuck in Repertoire is limiting. Make it a top level object that links to other things should provide flexibility and be conceptually simpler.

bcorrie commented 4 years ago

Uses 1 and 2 still seem compatible to me, and I think moving use 3 to a repertoire set provides a different way to clean up DataProcessing. If RepertoireSet is the unit of analysis, then we can say 1 SampleProcessing + 1 DataProcessing = 1 Repertoire

As @scharch says, I think the above is an important "thing" to maintain. Whether or not we call the thing on the right a Repertoire or something different remains to be seen. I will call this a Thing One.

So we now have: 1 SampleProcessing + 1 DataProcessing = 1 Thing One

I think for all of the "experimental observations" that we measure (sequence) or derive (from sequences) we should be able to associate back to a single Thing One. That is, each Rearrangement , Cell, and Clone would come from one (and only one) Thing One. Each Thing One would be a composition of a sample processing and data processing, and clearly they would be different for Rearrangement, Clone, and Cell, but they would uniquely describe how each "observation" was produced.

If you wanted to group these experimental observations in different ways, we could have Thing Two that would allow you to group things arbitrarily based on your analysis. Thing One is observational and study driven and Thing Two is analysis driven. Not sure what to call them, but this seems to make sense to me.

So then maybe Manifest (#426) becomes the composite metadata object for use 1 that @bcorrie described above? My only concern is how that might impact the API endpoints...

I think this is also important to consider. For me, conceptually, our current /repertoire endpoint returns /thing_one like entities and these are linked directly to rearrangements as I describe above. Each rearrangement belongs to a single <repertoire, sample processing, data processing> triple. If we have:

1 SampleProcessing + 1 DataProcessing = 1 Repertoire = 1 Thing One

then we have this simple model for our current endpoint that links Rearrangements to Repertoires. So this doesn't significantly change our current API structure.

One question I have is does the same apply for Cell and Clone. This gets a bit more complicated in that a Clone is typically derived computationally from a set of Rearrangements, but I think this could be captured by a more "sophisticated" DataProcessing object. If we could ideally use this same DataProcessing object to compose Thing Twos then maybe we have a solution? I think this is similar to what @javh was suggesting with DataProcessing being more like Methods, so we use them as building blocks to describe and compose both data curation (Thing One) and analysis (Thing Two), with DataProcessing used by both???

For a more detailed technical description of some ideas of the chaos one can get to if you don't carefully design and manage Thing One and Thing Two please refer to Link One and Link Two

schristley commented 4 years ago

It's an interesting exercise to eliminate Repertoire for a moment, and see how to connect objects without every having to have a Repertoire object.

Create a Study object, then create Subject objects, then create SampleProcessing objects. Note how with the current schema, none of these objects link to each other. A Subject doesn't know what Study it is in, likewise for a SampleProcessing, nor the Subject for the sample!

This can be resolved by adding id fields. Give SampleProcessing a subject_id and a study_id, and give Subject a study_id, for example.

Then the last thing to be MiAIRR compliant is a DataProcessing object, which at a minimal describes how the raw sequence data in SampleProcessing gets converted into annotated sequences (set 4 -> set 6). Of course, DataProcessing has the same issue in that it doesn't link to the other object, so id fields need to be added, in particular study_id and sample_processing_id. And viola, no need for Repertoire!

However, what is the link between a Rearrangement and these objects? Pause that question, and we will come back to it...

Furthermore, this structure is limited. There are occasions when multiple sequencing files need to be combined together, e.g. replicates where the sequences are combined and collapsed. Ok, no problem, just make sample_processing_id an array in DataProcessing so it can have multiple files as input.

The other limitation is that it only allows for one DataProcessing. Well, that too is easy to fix, give DataProcessing its own id with a data_processing_id field, now there can be multiple, each with their own id to tell them apart.

But here is an interesting side effect. Take a closer look at DataProcessing now, because it directly links to SampleProcessing objects, there cannot be a single DataProcessing which describes the whole study. You will need to create a separate DataProcessing object for each SampleProcessing.

Ok, well backtrack then, get rid of the sample_processing_id array from DataProcessing, now it can be a single object that describes the whole study.

Now at this point curation/study annotation is done, at least from a MiAIRR perspective. The id fields I mentioned aren't even required. MiAIRR doesn't require id fields to link the object, that's out of its scope, all it cares about is that the information exists. It's up to repositories and etc to manage and link it properly.

Now we come back to question of the link between a Rearrangement and the other objects. It's not so obvious how to link but it looks like you'd need study_id, a sample_processing_id array and a data_processing_id on each Rearrangement object.

This was one of the objectives that Repertoire was to solve. Eliminate as many of those id fields in the objects as possible, and Rearrangement would have a single repertoire_id, and Repertoire was a composite object that contained all the appropriate objects. Analysis tool now had a single link they could follow to get all of the study metadata.

This was all good until the question of multiple data processing came up. Then suddenly the single DataProcessing in Repertoire became an array, and then Rearrangement needed both repertoire_id and data_processing_id. This was probably the first mistake. We should have kept just a single DataProcessing and required that you create a new Repertoire object if you were doing different DataProcessing. That would have prevented the need to add data_processing_id into Rearrangement.

So getting to what @scharch mentioned:

1 SampleProcessing + 1 DataProcessing = 1 Repertoire

This isn't sufficient according to the current schema. You don't know the study nor the subject. This would be more accurate:

1 Study + 1 Subject + N SampleProcessing + 1 DataProcessing = 1 Repertoire

Now if you add id fields like suggested above for SampleProcessing, then you can make this simpler, for example:

N SampleProcessing + 1 DataProcessing = 1 Repertoire

or if you let DataProcessing point to its SampleProcessing, you can get all the way to this

1 DataProcessing = 1 Repertoire

where in fact DataProcessing doesn't look much different Repertoire, except it's in normal form instead of being denormalized.

bcorrie commented 4 years ago

This isn't sufficient according to the current schema. You don't know the study nor the subject. This would be more accurate:

1 Study + 1 Subject + N SampleProcessing + 1 DataProcessing = 1 Repertoire

I think the most basic representation, and essentially what MiAIRR gives us (MiAIRR does not "do hierarchies" 8-), is:

1 Study + 1 Subject + 1 SampleProcessing + 1 DataProcessing = 1 ThingOne (I don't want to call it a Repertoire - cause it isn't)

Essentially you have each MiAIRR field once, and only once, in each ThingOne.

This is not normalized, and duplicates information for sure, but captures everything that is required to describe how a given Rearrangement (or set of Rearrangements) is produced. If you have different entities at any of these levels (e.g. different Subjects) you just have another ThingOne. For a Study that has 4 Subjects, 2 Samples per Subject, and 2 DataProcessings per sample you have 1 4 2 * 2 = 16 ThingOnes. If each has a thingone_id then that can be used to group Rearrangements with sufficient completeness. I think that is the "simplest" design in that you are not enforcing any relationships at all at the cost of duplicating data.

One then has to ask, how does one want to group these things and/or normalize data so you aren't replicating information. If you want to avoid duplication of data and try to normalize things, then absolutely, you need _id fields and relationships.

Note that one can infer relationships in the simple MiAIRR model above if the guidelines in the spec are followed - that is study_id should be unique, and subject_id and sample_id should be unique within a Study.

schristley commented 4 years ago

captures everything that is required to describe how a given Rearrangement (or set of Rearrangements) is produced

except when it isn't everything, in which case you need:

1 Study + 1 Subject + N SampleProcessing + 1 DataProcessing = 1 ThingOne

because DataProcessing may take multiple sequencing files and combine/collapse them. This is required because DataProcessing is part of ThingOne, if you take it out then this is sufficient:

1 Study + 1 Subject + 1 SampleProcessing = 1 ThingOne

Then you just need to describe how ThingOne and DataProcessing are combined to get you rearrangements.

schristley commented 4 years ago

because DataProcessing may take multiple sequencing files and combine/collapse them

This isn't theoretical. There are studies in ADC that already do this, take Anne's study for example. For many of the repertoires (example), blood was taken and processed down to cDNA in a tube, then 2 or 3 aliquots were taken and sequenced separately. The whole purpose was to get greater coverage of that single biological repertoire. The sequencing files from those separate runs were combined as part of the data processing to get the set of rearrangements.

schristley commented 4 years ago

The more I think about it, the more I think Repertoire is okay. We could do a little refinement to make it better. We could also make different decisions on how to denormalize the relations, but they tend to look equivalent to each other...

I think the main decision for us is what to do about DataProcessing. Is it a single, monolith object for a whole study that describes all processing, not just raw data to rearrangements, but clonal analysis and onward? Or is DataProcessing more fine-grained, and describes individual data processing activities, the input and output file/repertoire/repertoire sets/object/etc involved, the tools used, and so on?

schristley commented 3 years ago

The more I think about it, the more I think Repertoire is okay. We could do a little refinement to make it better. We could also make different decisions on how to denormalize the relations, but they tend to look equivalent to each other...

@javh @scharch @bcorrie After additional discussion with @bussec , I'm of the opinion that the Repertoire structure for Study, Subject and Sample doesn't need to be redesigned. In particular, it is appropriate that sample is an array to handle re-sequencing, replicates, multiple tissues, etc.

If there is one thing that we could do, it is to take the time fields from Sample and push them into Repertoire to enforce the "one time point" requirement with the schema. Or we could leave as is, and just document it. I think enforcing the requirement with the schema is a good thing to do though.

DataProcessing is still an issue though, and we should still consider a re-design of its relationship with Repertoire. I'm on board with the idea of Repertoire being a "static" thing, and then using RepertoireSet as the flexible data structure for analysis.

Unless anybody has an alternative design to put forward, I'd like to close this issue as we have open issues to deal with the other things?

bcorrie commented 3 years ago

I think if the DataProcessing change that you are talking about is in the other issue/pull request (#453) then we are agreeing to make no changes at this time - assuming that RepertoireSet is the thing that will group things more dynamically (#445). Is that about right? 8-)

schristley commented 3 years ago

Is that about right? 8-)

Yes, ok, closing.