Test simulated data (sim1) with uneven cell types across batches

allyhawkins commented 2 years ago

In a discussion with Jackie this morning about some of the integration results, we thought it would be a good idea to take a look at the simulated data in a situation where the data is not quite so perfect. In particular, for the sim1 group where we have 6 batches, each with the same cell types (although different proportions of cell types), we should test what happens when some of those batches don't have all the cell types. This would better mirror the situation when integrating samples that have non-overlapping cell types, similar to what we are seeing with the liver dataset and what we are likely to see with some of the ScPCA datasets.

To do this, we can remove a cell type from each batch so that cell types are uneven across the batches in the dataset. We should do this before the integration workflow, at the stage where each batch is represented as an individual SCE object. That gives us a set of SCE objects, each with a different set of cell types present. We can then run those SCE objects through the integration workflow and see whether the same integration methods still perform well on data lacking overlapping cell types.
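As a rough sketch of the filtering step described above, the snippet below subsets a single SCE object to a chosen set of cell types. It assumes the labels live in `colData(sce)$celltype` as strings such as "Group1"; `retain_celltypes()` is a hypothetical helper name, and the toy counts are made up purely so the example runs on its own.

```r
library(SingleCellExperiment)

# Hypothetical helper: keep only the cells whose `celltype` label is in `keep`
retain_celltypes <- function(sce, keep) {
  sce[, colData(sce)$celltype %in% keep]
}

# Toy SCE standing in for one batch (made-up counts: 5 genes, 6 cells)
set.seed(1)
toy_counts <- matrix(
  rpois(30, lambda = 5),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6))
)
toy_sce <- SingleCellExperiment(
  assays = list(counts = toy_counts),
  colData = DataFrame(celltype = rep(c("Group1", "Group2", "Group3"), each = 2))
)

# Drop Group2 from this batch and confirm what remains
filtered_sce <- retain_celltypes(toy_sce, keep = c("Group1", "Group3"))
table(colData(filtered_sce)$celltype)
```

In practice the same helper could be applied to each batch's SCE object after reading it with `readRDS()`, with a different `keep` vector per batch.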

sjspielman commented 2 years ago

I've just started looking into this, beginning with counting different cell types across sim1 batches. It seems as though we already have variation in presence/absence. However, every batch does have groups 1 and 2, so that could be varied?

Code -

library(SingleCellExperiment)
sce_dir <- here::here("data", "scib_simulated", "sce")

count_cells <- function(i) {
  dat <- readRDS(file.path(
    sce_dir, 
    paste0("sim1_Batch", i, "_sce.rds")
  ))
  table(colData(dat)$celltype)
}
purrr::map(1:6, count_cells)

Output -

[[1]]

Group1 Group2 Group3 Group4 Group5 Group6 
  1026    639    440    372    288    143 

[[2]]

Group1 Group2 Group3 Group4 Group6 Group7 
   601    803    487    240    194     97 

[[3]]

Group1 Group2 Group4 Group5 Group6 
   634    638    425    214    209 

[[4]]

Group1 Group2 Group3 Group4 Group5 
   288    387    681    285    288 

[[5]]

Group1 Group2 Group3 Group4 Group5 Group6 Group7 
   706     87    177    175    299    210    107 

[[6]]

Group1 Group2 Group3 Group4 
   433    331    144     49
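To make the presence/absence pattern easier to read, here is a small sketch that turns the output above into a batch-by-group matrix and lists the groups found in every batch. The group memberships are copied by hand from the tables above; the object names (`sim1_celltypes`, `presence`, `shared_groups`) are illustrative only.

```r
# Cell types observed in each sim1 batch, copied from the output above
sim1_celltypes <- list(
  Batch1 = c(1, 2, 3, 4, 5, 6),
  Batch2 = c(1, 2, 3, 4, 6, 7),
  Batch3 = c(1, 2, 4, 5, 6),
  Batch4 = c(1, 2, 3, 4, 5),
  Batch5 = c(1, 2, 3, 4, 5, 6, 7),
  Batch6 = c(1, 2, 3, 4)
)

all_groups <- sort(unique(unlist(sim1_celltypes)))

# Rows are batches, columns are groups; TRUE means the group is present
presence <- t(sapply(sim1_celltypes, function(groups) all_groups %in% groups))
colnames(presence) <- paste0("Group", all_groups)
presence

# Groups that appear in every batch
shared_groups <- colnames(presence)[colSums(presence) == nrow(presence)]
shared_groups
# "Group1" "Group2" "Group4"
```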

sjspielman commented 2 years ago

For sim2, however, all batches have groups 1-4:

> count_cells <- function(i) {
+   dat <- readRDS(file.path(
+     sce_dir, 
+     paste0("sim2_Batch", i, "_sce.rds")
+   ))
+   table(colData(dat)$celltype)
+ }
> purrr::map(1:4, count_cells)
[[1]]

Group1 Group2 Group3 Group4 
  1682   1058   1212    854 

[[2]]

Group1 Group2 Group3 Group4 
  1198   1921   1303    483 

[[3]]

Group1 Group2 Group3 Group4 
  1446    726   1673    964 

[[4]]

Group1 Group2 Group3 Group4 
   481   1447    953   1917

allyhawkins commented 2 years ago

> I've just started looking into this, beginning with counting different cell types across sim1 batches. It seems as though we already have variation in presence/absence. However, every batch does have groups 1 and 2, so that could be varied?

Hmm, thanks for looking into this. I think we need to think about the experimental design a little bit here and what exactly we would be testing. I think we are interested in testing the case where not every library being integrated carries the same set of cell types to use for integration. Right now, every simulated batch has at least some portion of its cells from Group 1 and Group 2, so any of the MNN-based methods (almost all of them except scVI) will use those cells in integrating. But what happens when only some of our libraries share overlap in Group 1 and the rest of our libraries share overlap in Group 2 (or another set of 2 groups), so that there are no two common groups that can be used to assist in integration?

So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.

sjspielman commented 2 years ago

> So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.

I think this is a good approach. What do we think of heading towards something like this?

Batch	Cell types retained
1	1, 3, 4, 5, 6
2	2, 3, 4, 6, 7
3	1, 4, 5, 6
4	2, 3, 4, 5
5	5, 6, 7
6	1, 2, 3

Number of batches each cell type appears in:

Cell type	Number of batches
Group 1	3
Group 2	3
Group 3	4
Group 4	4
Group 5	4
Group 6	4
Group 7	2
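The tally can be double-checked with a few lines of base R. This is only a sketch: the `proposed` list is transcribed from the table above, and the two checks mirror the stated goals (no cell type in every batch, no two batches with the same set).

```r
# Proposed cell types to retain per batch, transcribed from the table above
proposed <- list(
  `1` = c(1, 3, 4, 5, 6),
  `2` = c(2, 3, 4, 6, 7),
  `3` = c(1, 4, 5, 6),
  `4` = c(2, 3, 4, 5),
  `5` = c(5, 6, 7),
  `6` = c(1, 2, 3)
)

# Number of batches each cell type appears in (Groups 1 through 7)
batches_per_group <- table(factor(unlist(proposed), levels = 1:7))
batches_per_group

# Goal 1: no cell type is present in every batch
no_universal_celltype <- all(batches_per_group < length(proposed))

# Goal 2: no two batches retain the same set of cell types
set_labels <- vapply(proposed, function(x) paste(sort(x), collapse = ","), character(1))
all_sets_unique <- anyDuplicated(set_labels) == 0

c(no_universal_celltype = no_universal_celltype, all_sets_unique = all_sets_unique)
```

Counted this way, Group 7 appears in 2 batches (batches 2 and 5).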

jashapiro commented 2 years ago

> So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.

This seems like an interesting hypothetical case, but is it something we really expect to see? I would expect at least some cell types to be shared across any set of samples we would want to integrate.

jaclyn-taroni commented 2 years ago

> > So with that said I think this would still be worth doing if we were to adjust the proportions so that no cell type is present in every library (batch) and no library (batch) has the same set of cell types.
>
> This seems like an interesting hypothetical case, but is it something we really expect to see? I would expect at least some cell types to be shared across any set of samples we would want to integrate.

How are "no cell type is present in every library and no library has the same set of cell types" and "at least some cell types to be shared across any set of samples we would want to integrate" in opposition?

Is it the "no cell type is present in every batch" that you're objecting to specifically @jashapiro?

jashapiro commented 2 years ago

Yes, the "no cell type present in every batch" seems unlikely. I guess if we have very pure tumor samples, but even there wouldn't we expect some common blood cells?

jaclyn-taroni commented 2 years ago

I don't think we know with great certainty, and I'll add that the point of the exercise is that it is a hypothetical. So I'd err on the side of "Well, that's weird and bad."

jashapiro commented 2 years ago

I guess my thought is that if we do this we should have both the "complex but expected" and "complex and breaks assumptions" sets. So sim1 seems like something we should probably keep similar to its current form. Adding an additional sim would make sense, but I wouldn't replace sim1. I might imagine 4 cases:

all cell types shared across samples (current sim2)
some cell types present across all samples (current sim1)
no cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples, e.g., (1,2,3), (2,3,4), (3,4,5), (4,5)
some sets of samples are disjoint, without cell types that "link" all samples, e.g., (1,2,3), (2,3), (4,5,6), (5,6)
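One way to make the four cases concrete is a small classifier over a list of per-sample cell type sets. This is a sketch only: `is_linked()` and `classify_scheme()` are hypothetical helper names, and "linked" is taken to mean that every sample can be reached from the first one by hopping between samples that share at least one cell type.

```r
# Hypothetical helper: can every sample be reached from the first one
# by moving between samples that share at least one cell type?
is_linked <- function(scheme) {
  n <- length(scheme)
  reached <- c(TRUE, rep(FALSE, n - 1))
  repeat {
    reached_types <- unique(unlist(scheme[reached]))
    newly <- !reached &
      vapply(scheme, function(x) any(x %in% reached_types), logical(1))
    if (!any(newly)) break
    reached[newly] <- TRUE
  }
  all(reached)
}

# Hypothetical helper: assign a scheme to one of the four cases
classify_scheme <- function(scheme) {
  all_types <- unique(unlist(scheme))
  n_samples_with_type <- vapply(
    all_types,
    function(ct) sum(vapply(scheme, function(x) ct %in% x, logical(1))),
    numeric(1)
  )
  n_shared <- sum(n_samples_with_type == length(scheme))
  if (n_shared == length(all_types)) {
    "all cell types shared"
  } else if (n_shared > 0) {
    "some cell types shared by all samples"
  } else if (is_linked(scheme)) {
    "no cell type shared by all samples, but linked"
  } else {
    "disjoint"
  }
}

# The two example layouts from the list above
linked_example   <- list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5))
disjoint_example <- list(c(1, 2, 3), c(2, 3), c(4, 5, 6), c(5, 6))

classify_scheme(linked_example)
classify_scheme(disjoint_example)
```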

jaclyn-taroni commented 2 years ago

> I guess my thought is that if we do this we should have both the "complex but expected" and "complex and breaks assumptions" sets.

Sure, agreed.

> No cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples, e.g., (1,2,3), (2,3,4), (3,4,5), (4,5)

To me, this seems like the most salient difference compared to https://github.com/AlexsLemonade/sc-data-integration/issues/151#issuecomment-1249302141

allyhawkins commented 2 years ago

> no cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples, e.g., (1,2,3), (2,3,4), (3,4,5), (4,5)

This is the approach I was trying to describe in my initial comment, where we remove some of the cell types that are present across all samples, but don't necessarily remove the ability to link them. I just think we want to test at least one additional scenario where not every sample contains the exact same cell type that can be used to link them. This is probably more similar to what I expect to see when integrating ScPCA datasets, where some subset of the samples being integrated share one set of cell types while another subset share a different set of cell types. Or, in the case of solid tumors where we have mostly tumor cells, we may be dealing with more granular cell states.

I think the approach outlined by Josh sounds good, except that sim2 as we have it is a bit wonky and has some nested batch effects that we would have to deal with, so I'm going to propose we alter it a little and stick with using sim1:

all cell types shared across samples (take a subset of current sim1 so that they all share cell types across samples)
some cell types present across all samples (current sim1)
no cell types shared across all samples, but every cell type in at least 2 samples with the ability to "link" all samples, e.g., (1,2,3), (2,3,4), (3,4,5), (4,5)
some sets of samples are disjoint, without cell types that "link" all samples, e.g., (1,2,3), (2,3), (4,5,6), (5,6)

sjspielman commented 2 years ago

> all cell types shared across samples (take a subset of current sim1 so that they all share cell types across samples)

There are two ways to do this that I can see:

Keep cell types 1, 2, and 4, and keep all batches
Keep cell types 1, 2, 3, and 4 but remove batch 3

I did miss noticing the first time around that cell type 4 is always present!

allyhawkins commented 2 years ago

> Keep cell types 1, 2, 3, and 4 but remove batch 3

I like this approach, just to keep the variety of cell types. I think still having 5 batches is good.

sjspielman commented 2 years ago

Here's something I came up with via the precise and accurate strategy of...eyeballing, with a bit of squinting.

# sim1a: All batches share all cell types (1, 2, 4). Batch 3 is removed.
sim1a_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 2, 4),
  2, c(1, 2, 4),
  4, c(1, 2, 4),
  5, c(1, 2, 4),
  6, c(1, 2, 4),
) 

# sim1b: No cell type is shared across all batches, but every cell type is in
#  at least 2 samples with the ability to "link" all samples
sim1b_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 3, 5),
  2, c(1, 2, 7),
  3, c(4, 5, 6),
  4, c(2, 3, 4),
  5, c(5, 6, 7),
  6, c(1, 2, 3),
) 

# sim1c: No cell type is shared across all batches, and some sets of samples
#  are disjoint, without cell types that "link" all samples
sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 3, 5, 6),
  2, c(1, 2, 7),
  3, c(1, 4, 6),
  4, c(4, 5),
  5, c(4, 6, 7),
  6, c(2, 3, 4),
)
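Since these schemes were put together by eye, a quick check that each batch actually contains the cell types it is asked to retain may be useful. The sketch below does this for sim1b; `sim1_available` is transcribed from the sim1 counts earlier in the thread, and the names `sim1_available` and `missing_types` are illustrative, not part of the repository.

```r
# Cell types present in each sim1 batch, transcribed from the counts above
sim1_available <- list(
  `1` = 1:6,
  `2` = c(1, 2, 3, 4, 6, 7),
  `3` = c(1, 2, 4, 5, 6),
  `4` = 1:5,
  `5` = 1:7,
  `6` = 1:4
)

sim1b_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(1, 3, 5),
  2, c(1, 2, 7),
  3, c(4, 5, 6),
  4, c(2, 3, 4),
  5, c(5, 6, 7),
  6, c(1, 2, 3)
)

# For each row, list any requested cell types that the batch does not contain
missing_types <- Map(
  function(batch, celltypes) {
    setdiff(celltypes, sim1_available[[as.character(batch)]])
  },
  sim1b_retain_celltypes$batch,
  sim1b_retain_celltypes$celltypes
)

# TRUE when every requested cell type exists in its batch
all_available <- all(lengths(missing_types) == 0)
all_available
```

The same check can be pointed at the other retention tables by swapping in a different tibble.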

jashapiro commented 2 years ago

In sim1c it looks like there are still links among all batches.

sjspielman commented 2 years ago

> In sim1c it looks like there are still links among all batches

Yeah, I struggled with this scheme... Anyone want to take a stab at it? Eventually I'll probably break down and code it if my brain can't do it...

allyhawkins commented 2 years ago

I think for sim1c you could do something like this:

sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(4,5,6),
  2, c(4,6,7),
  3, c(5,6),
  4, c(1,2,3),
  5, c(1,2)
)

sjspielman commented 2 years ago

> I think for sim1c you could do something like this:

Ah ok, so super disjoint! Can do! Edit - I'm going with this (added in row 6):

# sim1c: No cell type is shared across all batches, and some sets of samples
#  are disjoint, without cell types that "link" all samples
sim1c_retain_celltypes <- tibble::tribble(
  ~batch, ~celltypes,
  1, c(4, 5, 6),
  2, c(4, 6, 7),
  3, c(5, 6),
  4, c(1, 2, 3),
  5, c(1, 2), 
  6, c(3, 4)
)
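Whether a scheme is actually disjoint can be checked rather than eyeballed. The sketch below groups batches into linked sets using a batch-by-cell-type incidence matrix; `linked_sets()` is a hypothetical helper, and the scheme is copied from the table above as a plain list. Under this check, the five-row version splits into two sets, while adding row 6 with `c(3, 4)` appears to join them (batch 6 shares cell type 3 with batch 4 and cell type 4 with batches 1 and 2), so that row may be worth revisiting if a disjoint layout is the goal.

```r
# Hypothetical helper: label each batch with the linked set it belongs to
linked_sets <- function(scheme) {
  types <- sort(unique(unlist(scheme)))
  # Batch-by-cell-type incidence matrix (1 = cell type retained in that batch)
  incidence <- t(sapply(scheme, function(x) as.numeric(types %in% x)))
  # Two batches are directly linked when they share at least one cell type
  reach <- (incidence %*% t(incidence) > 0) * 1
  # Extend links through intermediate batches until nothing changes
  repeat {
    updated <- (reach %*% reach > 0) * 1
    if (all(updated == reach)) break
    reach <- updated
  }
  # Batches with identical reachability rows are in the same linked set
  row_labels <- apply(reach, 1, paste, collapse = "")
  match(row_labels, unique(row_labels))
}

# Final sim1c scheme, copied from the table above
sim1c <- list(
  `1` = c(4, 5, 6),
  `2` = c(4, 6, 7),
  `3` = c(5, 6),
  `4` = c(1, 2, 3),
  `5` = c(1, 2),
  `6` = c(3, 4)
)

linked_sets(sim1c[1:5])  # five-row version: two linked sets
linked_sets(sim1c)       # with row 6: a single linked set
```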
