Closed by allyhawkins 1 year ago
I have gone through all of the projects that we currently have data for and categorized each one by diagnosis, sequencing unit, and technology. Generally, we want to integrate datasets within a project, so we will group samples first by project rather than by diagnosis. We also do not want to integrate samples from different sequencing units (e.g., cell and nucleus), so if a project has both, we will keep those samples separate.
Below is a summary of the projects that we have thus far, grouped by tissue type and then by project. I included a count of the most prevalent disease type because many of our projects consist mostly of one diagnosis with a few other disease types sprinkled in. Some projects also include other 10X kits (technology), either with the same disease type or with different disease types.
Blood:
Brain:
Other:
Based on this, I think we should start by grabbing a group of ~5 samples from each project that share the same diagnosis, seq unit, and technology, so that the only covariate is the patient. By using each of the projects mentioned above, we would have multiple integrated replicates for each tissue type, allowing us to identify any variability in integration across tissues and projects before scaling up. This also lets us focus on a single covariate at first. I think we should start here and then expand to include other covariates as a later step.
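As a rough illustration of that first selection step, here is a minimal sketch in plain Python. It assumes sample metadata is available as a list of dicts; the field names (`sample_id`, `project`, `patient`, `diagnosis`, `seq_unit`, `technology`) and the function name `pick_baseline_samples` are hypothetical and not taken from the actual ScPCA metadata. It also assumes that "only covariate is the patient" means keeping at most one sample per patient within a group.

```python
from collections import defaultdict


def pick_baseline_samples(samples, n=5):
    """For each project, pick up to n samples that share diagnosis,
    seq unit, and technology, keeping one sample per patient.

    `samples` is a list of dicts with hypothetical keys:
    sample_id, project, patient, diagnosis, seq_unit, technology.
    """
    # Group samples by project, then by (diagnosis, seq_unit, technology).
    groups = defaultdict(lambda: defaultdict(list))
    for s in samples:
        key = (s["diagnosis"], s["seq_unit"], s["technology"])
        groups[s["project"]][key].append(s)

    selected = {}
    for project, by_key in groups.items():
        best = []
        for members in by_key.values():
            # Keep the first sample seen for each patient so that
            # patient is the only thing that varies within the group.
            seen_patients = set()
            unique = []
            for s in members:
                if s["patient"] not in seen_patients:
                    seen_patients.add(s["patient"])
                    unique.append(s)
            # Prefer the group with the most distinct patients;
            # ties keep the group encountered first.
            if len(unique) > len(best):
                best = unique
        selected[project] = [s["sample_id"] for s in best[:n]]
    return selected
```

Calling `pick_baseline_samples(samples)` returns a dict mapping each project to a list of at most five sample IDs. Projects with fewer than five matching samples simply return what is available, so those would need to be checked by hand.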
Then, for the applicable projects, we should expand to include a set of samples that have a different diagnosis or a different technology. The last set of samples would come from projects that have both a different diagnosis and a different technology. That means we would have the following groups to compare to the control datasets (focusing on just the first bullet at first):
I believe this covers all scenarios and gives us a sufficient sampling of what we will be working with when integrating the projects as a whole, but perhaps this is overkill and we don't need to test using every project. I'm curious what others think about using all projects to have replicates across the tissue types versus keeping it simpler. Tagging @jaclyn-taroni and @jashapiro for any thoughts, ideas, and feedback.
At this point, I'd say it's good that you've outlined all of these scenarios. It's very possible that this is overkill for the benchmarking phase, but I believe it will depend very much on what we find with the control datasets (#2). That is to say, I don't think we have enough information yet to determine the breadth and depth we'd go into here.
Because this issue is related to benchmarking ScPCA data, and we have already tested integration with a couple of subsets of ScPCA projects, I'm going to close this. I will keep #159 open, which contains a table with the breakdown of diagnosis, technology, etc. per ScPCA project.
Before testing any data integration, we will need to identify the subset of samples from ScPCA that we want to use for testing. ScPCA projects have a variety of covariates that will need to be considered when we merge datasets, and we will also be applying data integration across projects that represent different tissue types. All of these factors should be considered when identifying the subset of datasets to test with.
We will want to have a variety of tissue types represented (as we do with the control datasets) because tissue types vary in their degree of heterogeneity. We probably want to pick from projects that mirror the tissues we are using for controls: blood, brain, and kidney.
We also want to consider the different integration scenarios:
Projects that fall under these scenarios include:
We will want to identify a few samples from each of these groups that represent the covariates we would encounter when integrating the larger group, and use those samples for benchmarking.