Identify ScPCA data to use for exploration

allyhawkins commented 2 years ago

Before testing any data integration, we need to identify the subset of ScPCA samples that we want to use for testing. ScPCA projects have a variety of covariates that we will need to consider when merging datasets, and we will also be trying to apply data integration across projects that represent different tissue types. All of these factors should be considered when identifying the subset of datasets to test with.

We will want a variety of tissue types represented (as we do with the control datasets) because tissue types vary in their degree of heterogeneity. We probably want to pick from projects that mirror the tissues we are using for controls: blood, brain, and kidney.

We also want to consider the different integration scenarios:

integrating from the same diagnosis, same tissue type, same technology, but across patients
integrating from the same diagnosis, but different technologies and patients
integrating from the same tissue type, same technology, but different diagnoses and patients

Projects that fall under these scenarios include:

Murphy (covariate: patient)
Dyer, NB (covariate: patient, tech)
Green, LGG or HGG (covariate: patient, diagnosis)

From each of these groups, we will want to identify a few samples for benchmarking that represent the covariates we would encounter when integrating the larger group.
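As a rough illustration of the scenarios above, a set of candidate samples can be labeled by which covariates actually vary across it. This is only a sketch: the field names (`patient`, `diagnosis`, `technology`) and the records are made up for the example and are not the real ScPCA metadata columns.

```python
def varying_covariates(samples, covariates=("patient", "diagnosis", "technology")):
    """Return the set of covariates that take more than one value across samples."""
    return {c for c in covariates if len({s[c] for s in samples}) > 1}


# Made-up records for illustration only; not real ScPCA metadata.
demo = [
    {"patient": "P1", "diagnosis": "dx_a", "technology": "kit_1"},
    {"patient": "P2", "diagnosis": "dx_a", "technology": "kit_2"},
]

# Same diagnosis, but different technologies and patients (the second scenario).
print(sorted(varying_covariates(demo)))  # ['patient', 'technology']
```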

allyhawkins commented 2 years ago

I have gone through all of the projects that we currently have data for and categorized each of them based on diagnosis, sequencing unit, and technology. Generally, we want to integrate datasets that are within a project, so we will keep samples grouped first by project rather than by diagnosis. We also do not want to integrate samples that are from different sequencing units (e.g. cell and nucleus), so if a project has both, we will keep those samples separate.
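The grouping rule described here (project first, never mixing cell and nucleus) could be sketched as below. The keys `project`, `seq_unit`, and `sample_id`, and all of the values, are hypothetical placeholders rather than the actual metadata schema.

```python
from collections import defaultdict


def group_for_integration(samples):
    """Group sample IDs by (project, sequencing unit) so units are never mixed."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["project"], s["seq_unit"])].append(s["sample_id"])
    return dict(groups)


# Made-up records for illustration only.
demo_samples = [
    {"sample_id": "S1", "project": "proj_x", "seq_unit": "cell"},
    {"sample_id": "S2", "project": "proj_x", "seq_unit": "nucleus"},
    {"sample_id": "S3", "project": "proj_x", "seq_unit": "cell"},
    {"sample_id": "S4", "project": "proj_y", "seq_unit": "cell"},
]

for key, ids in group_for_integration(demo_samples).items():
    print(key, ids)
```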

Below is a summary of the projects that we have thus far, grouped by tissue type and then by project. I specifically included a count of the most prevalent disease type, because I noticed that many of our projects are composed mostly of one diagnosis with a few other disease types sprinkled in. Some projects also include other 10X kits (technologies) to consider, either with the same disease type or with different disease types.

Blood:

Gawad: 26 out of 30 samples are AML and all have the same technology and sequencing unit.
Mullighan: 94 out of 105 samples are B-ALL and share the same technology. The other 11 samples have 3 additional disease types and are a mix of 10X kits.
Teachey: 31 out of 60 samples are non-ETP ALL and all have the same technology and sequencing unit.

Brain:

Green/HGG: 16 of the 23 samples are glioblastoma and all have the same technology and sequencing unit.
Green/LGG: 18 of the 26 samples are pilocytic astrocytoma and all have the same technology and sequencing unit.
Pugh: 12 of the 22 samples are Low grade glioma/astrocytoma with 10Xv2_5 prime. This project has a mix of diagnoses and technologies, but all samples are the same sequencing unit.

Other:

Murphy: All the same disease type, sequencing unit, and technology.
Dyer/RMS: All the same disease type, but different technology and sequencing units.
Dyer/NB: All the same disease type, but different technology and sequencing units.
Collins: All the same disease type, sequencing unit, and technology.
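A per-project summary like the one above ("N out of M samples are diagnosis X") can be tallied with a counter over a metadata table. The sketch below assumes a list of records with hypothetical `project` and `diagnosis` keys; the data are invented and do not reproduce the counts listed above.

```python
from collections import Counter, defaultdict


def summarize_projects(samples):
    """Map each project to (most common diagnosis, its count, total samples)."""
    by_project = defaultdict(list)
    for s in samples:
        by_project[s["project"]].append(s["diagnosis"])
    summary = {}
    for project, diagnoses in by_project.items():
        top_dx, top_n = Counter(diagnoses).most_common(1)[0]
        summary[project] = (top_dx, top_n, len(diagnoses))
    return summary


# Made-up records for illustration only.
demo_metadata = [
    {"project": "proj_x", "diagnosis": "dx_a"},
    {"project": "proj_x", "diagnosis": "dx_a"},
    {"project": "proj_x", "diagnosis": "dx_b"},
    {"project": "proj_y", "diagnosis": "dx_c"},
]

for project, (dx, n, total) in summarize_projects(demo_metadata).items():
    print(f"{project}: {n} out of {total} samples are {dx}")
```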

Based on this, I think we should start by grabbing a group of ~5 samples from each project that share the same diagnosis, seq unit, and technology, so that the only covariate is the patient. By using each of the above-mentioned projects, we would have multiple integrated replicates for each tissue type specified, allowing us to identify any variability in integration across tissues and projects before scaling up. This also allows us to focus on a single covariate at first; other covariates can then be added as a later step.
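One way to sketch this first selection step: within each project, take the most common combination of diagnosis, seq unit, and technology, then keep up to 5 samples from distinct patients. The function name, field names, and records are all hypothetical, and taking the first matching samples is an assumption made for simplicity; a real selection might instead be random or hand-picked.

```python
from collections import Counter, defaultdict


def combo(sample):
    """The covariates that should be held constant within a test group."""
    return (sample["diagnosis"], sample["seq_unit"], sample["technology"])


def pick_patient_only_subset(samples, n=5):
    """Per project, pick up to n samples (one per patient) from the top combo."""
    by_project = defaultdict(list)
    for s in samples:
        by_project[s["project"]].append(s)

    picked = {}
    for project, rows in by_project.items():
        top_combo, _ = Counter(combo(s) for s in rows).most_common(1)[0]
        seen_patients, chosen = set(), []
        for s in rows:
            if combo(s) != top_combo or s["patient"] in seen_patients:
                continue
            seen_patients.add(s["patient"])
            chosen.append(s["sample_id"])
            if len(chosen) == n:
                break
        picked[project] = chosen
    return picked


# Made-up records for illustration only.
demo_rows = [
    {"sample_id": f"S{i}", "project": "proj_x", "patient": f"P{i}",
     "diagnosis": "dx_a", "seq_unit": "cell", "technology": "kit_1"}
    for i in range(1, 8)
] + [
    {"sample_id": "S8", "project": "proj_x", "patient": "P8",
     "diagnosis": "dx_b", "seq_unit": "cell", "technology": "kit_2"},
]

print(pick_patient_only_subset(demo_rows))
```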

Then, for the applicable projects, we should expand to include a set of samples that have a different diagnosis or a different technology, and the last set of samples would come from projects that have both different diagnoses and different technologies. That means we would have the following groups to compare to the control datasets (focusing on just the first group at first):

5 libraries with same diagnosis, seq unit, tech from each project listed above (splitting the Dyer projects based on cell and nucleus)
5 libraries with different diagnoses, but same seq unit/tech from Gawad, Mullighan, Teachey, Green, and Pugh
5 libraries with different tech, but same diagnosis and seq unit from Dyer/NB, Dyer/RMS, Mullighan, Pugh
5 libraries with different tech and different diagnoses from Mullighan and Pugh

I believe this should cover all scenarios and give us enough of a sampling of what we will be working with when integrating the projects as a whole. Perhaps this is overkill, though, and we don't need to test using every project. I'm curious what others think about using all projects to have replicates across the tissue types versus keeping it simpler. Tagging @jaclyn-taroni and @jashapiro for any thoughts, ideas, and feedback.

jaclyn-taroni commented 2 years ago

At this point, I'd say it's good that you've outlined all of these scenarios. It's very possible that this is overkill for the benchmarking phase, but I believe it will depend very much on what we find with the control datasets (#2). That is to say, I don't think we have enough information yet to determine the breadth and depth we'd go into here.

allyhawkins commented 1 year ago

Because this issue is related to benchmarking ScPCA data, and we have already tested integration with a couple of subsets of ScPCA projects, I'm going to close this. I will keep #159 open, which contains a table with the breakdown of diagnosis, technology, etc. per ScPCA project.

AlexsLemonade / sc-data-integration

Identify ScPCA data to use for exploration #3