Closed allyhawkins closed 2 years ago
I've just started looking into this, beginning with counting different cell types across sim1 batches. It seems as though we already have variation in presence/absence. However, every batch does have groups 1 and 2, so that could be varied?
Code -
library(SingleCellExperiment)
# Directory holding one SCE object per simulated batch
sce_dir <- here::here("data", "scib_simulated", "sce")
# Tabulate the cell types present in sim1 batch `i`
count_cells <- function(i) {
  dat <- readRDS(file.path(
    sce_dir,
    paste0("sim1_Batch", i, "_sce.rds")
  ))
  table(colData(dat)$celltype)
}
purrr::map(1:6, count_cells)
Output -
[[1]]
Group1 Group2 Group3 Group4 Group5 Group6
1026 639 440 372 288 143
[[2]]
Group1 Group2 Group3 Group4 Group6 Group7
601 803 487 240 194 97
[[3]]
Group1 Group2 Group4 Group5 Group6
634 638 425 214 209
[[4]]
Group1 Group2 Group3 Group4 Group5
288 387 681 285 288
[[5]]
Group1 Group2 Group3 Group4 Group5 Group6 Group7
706 87 177 175 299 210 107
[[6]]
Group1 Group2 Group3 Group4
433 331 144 49
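To make the presence/absence pattern easier to scan, here is a minimal sketch that turns the tables above into a batch-by-group matrix. The group numbers are copied by hand from the output; `sim1_groups`, `presence`, and `shared_groups` are names made up for this illustration.

```r
# Group numbers observed in each sim1 batch, copied from the tables above.
sim1_groups <- list(
  Batch1 = 1:6,
  Batch2 = c(1, 2, 3, 4, 6, 7),
  Batch3 = c(1, 2, 4, 5, 6),
  Batch4 = 1:5,
  Batch5 = 1:7,
  Batch6 = 1:4
)

all_groups <- sort(unique(unlist(sim1_groups)))

# Logical matrix: rows are batches, columns are groups, TRUE = group present.
presence <- t(vapply(
  sim1_groups,
  function(groups) all_groups %in% groups,
  logical(length(all_groups))
))
colnames(presence) <- paste0("Group", all_groups)

# Groups found in every batch.
shared_groups <- colnames(presence)[colSums(presence) == nrow(presence)]

print(presence)
print(shared_groups)
```

With the counts shown above, `shared_groups` comes out as Group1, Group2, and Group4.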
For sim2, however, all batches have groups 1-4:
> count_cells <- function(i) {
+ dat <- readRDS(file.path(
+ sce_dir,
+ paste0("sim2_Batch", i, "_sce.rds")
+ ))
+ table(colData(dat)$celltype)
+ }
> purrr::map(1:4, count_cells)
[[1]]
Group1 Group2 Group3 Group4
1682 1058 1212 854
[[2]]
Group1 Group2 Group3 Group4
1198 1921 1303 483
[[3]]
Group1 Group2 Group3 Group4
1446 726 1673 964
[[4]]
Group1 Group2 Group3 Group4
481 1447 953 1917
> I've just started looking into this, beginning with counting different cell types across sim1 batches. It seems as though we already have variation in presence/absence. However, every batch does have groups 1 and 2, so that could be varied?
Hmm, thanks for looking into this. I think we need to think about the experimental design a little bit here and what exactly we would be testing. I think we are interested in testing the case where not every library being integrated carries the same set of cell types to use for integration. Right now the simulated data all have at least some portion of their cells from Group 1 and Group 2, so in any of the MNN-based methods (almost all of them except scVI), those cells are going to be used in integrating. But what happens when only some of our libraries share overlap in Group 1 and the rest of our libraries share overlap in Group 2 (or another set of 2 groups), so that there are no two common groups that can be used to assist in integration?
So with that said, I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no two libraries (batches) have the same set of cell types.
> So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.
I think this is a good approach. What do we think of heading towards something like this?
Batch | Cell types retained |
---|---|
1 | 1, 3, 4, 5, 6 |
2 | 2, 3, 4, 6, 7 |
3 | 1, 4, 5, 6 |
4 | 2, 3, 4, 5 |
5 | 5, 6, 7 |
6 | 1, 2, 3 |
Number of batches each cell type appears in:
- Group 1: 3
- Group 2: 3
- Group 3: 4
- Group 4: 4
- Group 5: 4
- Group 6: 4
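As a quick sanity check on a table like this, the sketch below recomputes the per-cell-type counts and tests the two design criteria (no cell type in every batch, no two batches with the same set). The retained sets are typed in from the table; `proposed` and the other variable names are only for illustration.

```r
# Cell types retained per batch, copied from the proposed table.
proposed <- list(
  `1` = c(1, 3, 4, 5, 6),
  `2` = c(2, 3, 4, 6, 7),
  `3` = c(1, 4, 5, 6),
  `4` = c(2, 3, 4, 5),
  `5` = c(5, 6, 7),
  `6` = c(1, 2, 3)
)

# Number of batches each cell type appears in (names are cell type numbers).
batches_per_type <- table(unlist(proposed))

# Criterion 1: no cell type is present in every batch.
no_universal_type <- all(batches_per_type < length(proposed))

# Criterion 2: no two batches retain the same set of cell types.
set_keys <- vapply(
  proposed,
  function(x) paste(sort(x), collapse = ","),
  character(1)
)
all_sets_distinct <- !any(duplicated(set_keys))

print(batches_per_type)
print(c(no_universal_type = no_universal_type,
        all_sets_distinct = all_sets_distinct))
```

Both criteria hold for the table as written, and the counts also cover cell type 7, which appears in 2 batches.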
> So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.
This seems like an interesting hypothetical case, but is it something we really expect to see? I would expect at least some cell types to be shared across any set of samples we would want to integrate.
> > So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.
>
> This seems like an interesting hypothetical case, but is it something we really expect to see? I would expect at least some cell types to be shared across any set of samples we would want to integrate.
How are "no cell type is present in every library and no library has the same set of cell types" and "at least some cell types to be shared across any set of samples we would want to integrate" in opposition?
Is it the "no cell type is present in every batch" that you're objecting to specifically @jashapiro?
Yes, the "no cell type present in every batch" seems unlikely. I guess if we have very pure tumor samples, but even there wouldn't we expect some common blood cells?
I don't think we know with great certainty, and I'll add that the point of the exercise is that it is a hypothetical. So I'd err on the side of "Well, that's weird and bad."
I guess my thought is that if we do this we should have both the "complex but expected" and "complex and breaks assumptions" sets. So sim1 seems like something we should probably keep similar to its current form. Adding an additional sim would make sense, but I wouldn't replace sim1. I might imagine 4 cases:
> I guess my thought is that if we do this we should have both the "complex but expected" and "complex and breaks assumptions" sets.
Sure, agreed.
> - No cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples, i.e., (1,2,3), (2,3,4), (3,4,5), (4,5)
To me, this seems like the most salient difference compared to https://github.com/AlexsLemonade/sc-data-integration/issues/151#issuecomment-1249302141
> no cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples. i.e (1,2,3), (2,3,4), (3,4,5), (4,5)
This is the approach I was trying to describe in my initial comment, where we are removing some of the cell types that are present across all samples, but not necessarily removing the ability to link them. I just think we want to test at least one additional scenario where not every sample contains the exact same cell type that can be used to link them. This is probably more similar to what I expect to see when integrating ScPCA datasets, where some subset of the samples being integrated share one set of cell types while another subset share a different set of cell types. Or in the case of solid tumors where we have mostly tumor cells, we may be dealing with more granular cell states.
I think the approach outlined by Josh sounds good, except that sim2 as we have it is a bit wonky and has some nested batch effects that we would have to deal with, so I'm going to propose we alter it a little and stick with using sim1:
all cell types shared across samples (take a subset of current sim1 so that they all share cell types across samples)
There are two ways to do this that I can see:
I did miss noticing the first time around that cell type 4 is always present!
Keep cell types 1, 2, 3, and 4 but remove batch 3
I like this approach just to keep the variety of cell types, I think still having 5 batches is good.
Here's something I came up with via the precise and accurate strategy of...eyeballing, with a bit of squinting.
# sim1a: All batches share all cell types (1, 2, 4). Batch 3 is removed.
sim1a_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 2, 4),
  2, c(1, 2, 4),
  4, c(1, 2, 4),
  5, c(1, 2, 4),
  6, c(1, 2, 4),
)
# sim1b: Cell types are not shared across batches, but every cell type is in at
# least 2 samples with the ability to "link" all samples
sim1b_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 3, 5),
  2, c(1, 2, 7),
  3, c(4, 5, 6),
  4, c(2, 3, 4),
  5, c(5, 6, 7),
  6, c(1, 2, 3),
)
# sim1c: Cell types are not shared across batches, and some sets of samples are
# disjoint, without cell types that "link" all samples
sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 3, 5, 6),
  2, c(1, 2, 7),
  3, c(1, 4, 6),
  4, c(4, 5),
  5, c(4, 6, 7),
  6, c(2, 3, 4),
)
In sim1c it looks like there are still links among all batches.
> In sim1c it looks like there are still links among all batches
Yeah, I struggled with this scheme. Anyone want to take a stab at it? Eventually I'll probably break down and code it if my brain can't do it...
I think for sim1c you could do something like this:
sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(4, 5, 6),
  2, c(4, 6, 7),
  3, c(5, 6),
  4, c(1, 2, 3),
  5, c(1, 2)
)
> I think for sim1c you could do something like this:
Ah ok, so super disjoint! Can do! Edit - I'm going with this (added in row 6):
# sim1c: Cell types are not shared across batches, and some sets of samples are
# disjoint, without cell types that "link" all samples
sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(4, 5, 6),
  2, c(4, 6, 7),
  3, c(5, 6),
  4, c(1, 2, 3),
  5, c(1, 2),
  6, c(3, 4)
)
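Since eyeballing linkage is error-prone, here is a rough sketch of a coded check: treat batches as nodes, link two batches when they share at least one retained cell type, and count the connected groups. `batch_components()` is a hypothetical helper written for this comment, not part of the repo, and it uses a plain list rather than the tibble.

```r
# Retained cell types per batch, copied from the six-row sim1c scheme above.
sim1c <- list(
  `1` = c(4, 5, 6),
  `2` = c(4, 6, 7),
  `3` = c(5, 6),
  `4` = c(1, 2, 3),
  `5` = c(1, 2),
  `6` = c(3, 4)
)

# Hypothetical helper: label each batch with the connected component it falls
# in, where two batches are linked if they share at least one cell type.
batch_components <- function(retained) {
  n <- length(retained)
  component <- seq_len(n)  # start with every batch in its own component
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i < j && length(intersect(retained[[i]], retained[[j]])) > 0) {
        # Merge the two components by relabelling one as the other.
        component[component == component[j]] <- component[i]
      }
    }
  }
  setNames(component, names(retained))
}

# Number of linked groups with all six batches, and with batch 6 dropped.
n_groups_six <- length(unique(batch_components(sim1c)))
n_groups_five <- length(unique(batch_components(sim1c[1:5])))
print(c(six_rows = n_groups_six, five_rows = n_groups_five))
```

If I'm reading the scheme right, the five-row version splits into two disjoint groups (batches 1-3 and batches 4-5), while the six-row version comes out as a single linked group, because batch 6 (3, 4) shares cell type 3 with batch 4 and cell type 4 with batches 1 and 2. That may be worth a second look if fully disjoint sets are the goal.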
In a discussion with Jackie this morning about some of the integration results, we thought it would be a good idea to take a look at the simulated data in a situation where the data is not quite so perfect. In particular, for the sim1 group where we have 6 batches, each with the same cell types (although different proportions of cell types), we should test what happens when some of those batches don't have all the cell types. This would better mirror the situation when integrating samples that have non-overlapping cell types, similar to what we are seeing with the liver dataset and what we are likely to see with some of the ScPCA datasets.
To do this we can remove a cell type from each of the batches so that there are uneven cell types across the batches in the dataset. We should do this at the stage prior to the integration workflow, when each batch is represented as an individual SCE object. After that, we should have a set of SCE objects, each with different cell types present. Then we can run those SCE objects through the integration workflow and determine whether the same integration methods perform well with data lacking overlapping cell types.
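A minimal sketch of the filtering step, assuming cell type labels look like "Group1", "Group2", and so on (as in the simulated data tables). `cells_to_keep()` is a hypothetical helper, and the toy label vector stands in for `colData(sce)$celltype` so the example runs without any SCE files.

```r
# Hypothetical helper: given per-cell labels such as "Group3" and the group
# numbers to retain, return a logical vector marking the cells to keep.
cells_to_keep <- function(celltype, retain_groups) {
  as.character(celltype) %in% paste0("Group", retain_groups)
}

# Toy labels standing in for colData(sce)$celltype.
toy_celltype <- c("Group1", "Group2", "Group3", "Group1", "Group4")
keep <- cells_to_keep(toy_celltype, retain_groups = c(1, 4))
print(keep)

# With a real SingleCellExperiment object `sce`, the subset would be:
# sce_subset <- sce[, cells_to_keep(colData(sce)$celltype, c(1, 4))]
```

Each filtered object could then be saved and passed to the integration workflow in place of the original batch.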