Is it possible to use multithreading when applying splatPopEstimate?

mugpeng commented 2 years ago

Hi,

splatter has been really helpful for generating simulated data, but I still have some problems.

splatPopEstimate does not seem to support multithreading, so it is quite slow when estimating parameters from a large real dataset.

By the way, the reason I am trying to estimate from real data is that the simulation results look odd and unexpected when I consider both group (cell type) and sample (different conditions):

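library(splatter)
library(scater)     # runPCA, runUMAP; also attaches scuttle (logNormCounts)
library(ggplot2)
library(ggforce)    # geom_mark_hull
library(paletteer)  # paletteer_d

vcf <- mockVCF(n.samples = 6)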
gff <- mockGFF()
params.cond2 <- newSplatPopParams(eqtl.n = 0, 
                                 batchCells = 300,
                                 group.prob = c(0.2, 0.8),
                                 similarity.scale = 5,
                                 condition.prob = c(0.5, 0.5),
                                 eqtl.condition.specific = 0,
                                 cde.facLoc = 0.01, 
                                 cde.facScale = 0.01) 
system.time({sim.pop.cond2 <- splatPopSimulate(vcf = vcf, gff = gff, 
                                              params = params.cond2) })

sim.pop.cond2 <- logNormCounts(sim.pop.cond2)
sim.pop.cond2 <- runPCA(sim.pop.cond2, ncomponents = 10)
sim.pop.cond2 <- scater::runUMAP(sim.pop.cond2)

sim.pop.df2 <- colData(sim.pop.cond2)
sim.pop.df2 <- cbind(sim.pop.df2, 
                    reducedDim(sim.pop.cond2, "PCA")[,1:2])
sim.pop.df2 <- as.data.frame(sim.pop.df2)

# PCA
ggplot(sim.pop.df2) +
  geom_point(
    aes(PC1, PC2, color = Sample,
        shape = Group),
    size = 2
  ) +
  scale_color_manual(values = paletteer_d(palette = "RColorBrewer::Set3")) +
  geom_mark_hull(aes(PC1, PC2, fill = Condition),
                 alpha = 0.2) + 
  theme_bw()

To me, it does not look realistic that the group proportions set by group.prob come out identical in every sample:

> table(sim.pop.df2$Sample, sim.pop.df2$Group)

           Group1 Group2
  sample_1     24     96
  sample_2     24     96
  sample_3     24     96
  sample_4     24     96
  sample_5     24     96
  sample_6     24     96

Besides, cells from the same group (cell type) should not differ so much that they separate into multiple clusters.

These results really puzzle me. Thanks. :)

mugpeng commented 2 years ago

By the way, are there any parameters I can set to tell splatPopEstimate where to find the corresponding metadata? For example, the condition information is stored as a column in colData, but how do I point the function to it? Or should I rename the columns?

azodichr commented 2 years ago

Hello @mugpeng, thanks for your interest in using splatter.

Regarding your question about speeding up the parameter estimation step: There is no multithreading option at this time. However, because the purpose of this step is to estimate distribution parameters from empirical data, you can randomly downsample the empirical cells you provide to make this step faster.
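As a rough illustration of the downsampling idea, here is a minimal sketch in base R. The toy count matrix, the helper name `downsampleCells`, and the choice of 300 cells are all made up for the example; only `splatPopEstimate` and its `counts` argument come from splatter, and that call is left commented out so the snippet runs without the package.

```r
# Toy stand-in for a large empirical count matrix (genes x cells).
# Real data would come from your own experiment instead.
set.seed(1)
counts <- matrix(
  rpois(200 * 1000, lambda = 2),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:1000))
)

# Hypothetical helper: keep a random subset of cells (columns).
downsampleCells <- function(counts, n.keep) {
  n.keep <- min(n.keep, ncol(counts))
  keep <- sample(ncol(counts), n.keep)
  counts[, keep, drop = FALSE]
}

counts.sub <- downsampleCells(counts, n.keep = 300)
dim(counts.sub)

# With splatter installed, the smaller matrix would then be passed on:
# params <- splatter::splatPopEstimate(counts = counts.sub)
```

How many cells are enough is a judgment call; comparing the estimated parameters across a few subset sizes is one way to check that the estimates are stable.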

Regarding the question about the degree of variation between groups and the number of cells assigned to each group: You can adjust the relative impact of condition and group to your liking using cde.facLoc/cde.facScale and de.facLoc/de.facScale, respectively. You can also change the proportion of cells assigned to each group using the group.prob parameter.
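For example, a parameter set along these lines could be built as in the sketch below. The numeric values are arbitrary placeholders chosen only to show where each setting goes, not recommended values; the parameter names are the ones mentioned in this thread. The `requireNamespace` guard is only there so the snippet also runs where splatter is not installed.

```r
# Illustrative settings only; the values are arbitrary assumptions.
settings <- list(
  group.prob     = c(0.2, 0.8),  # proportion of cells in each group
  de.facLoc      = 0.1,          # between-group effect: location
  de.facScale    = 0.4,          # between-group effect: scale
  condition.prob = c(0.5, 0.5),  # proportion of samples in each condition
  cde.facLoc     = 0.5,          # between-condition effect: location
  cde.facScale   = 0.5           # between-condition effect: scale
)

# Build the parameter object only if splatter is available.
if (requireNamespace("splatter", quietly = TRUE)) {
  params <- do.call(splatter::newSplatPopParams, settings)
  print(params)
}
```

Keeping the settings in a plain list makes it easy to change one value at a time and re-run the simulation to see how each one affects the PCA.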

Regarding the question about telling splatPopEstimate where to look for things like group and condition in the provided single-cell data: We actually recommend running splatPopEstimate on a subset of your empirical data that includes only cells from one individual, from one group (i.e., cell type), and from one condition. This is because the single-cell parameters you are estimating in this step are used to define the homogeneous population of cells from one individual, from one group, from one condition, so you don't want additional sources of variation being modeled at that stage!
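A minimal sketch of that subsetting step, using base R on toy data, might look like this. The metadata column names (`individual`, `cell_type`, `condition`) and their values are hypothetical, so substitute whatever your own colData uses; the `splatPopEstimate` call is commented out so the snippet runs without splatter.

```r
# Toy counts and per-cell metadata; all names and values are made up.
set.seed(42)
n.cells <- 600
cell.meta <- data.frame(
  individual = sample(c("donor1", "donor2", "donor3"), n.cells, replace = TRUE),
  cell_type  = sample(c("Tcell", "Bcell"), n.cells, replace = TRUE),
  condition  = sample(c("control", "treated"), n.cells, replace = TRUE)
)
counts <- matrix(
  rpois(100 * n.cells, lambda = 2),
  nrow = 100,
  dimnames = list(paste0("gene", 1:100), paste0("cell", 1:n.cells))
)

# Keep only cells from one individual, one cell type, and one condition.
keep <- cell.meta$individual == "donor1" &
  cell.meta$cell_type == "Tcell" &
  cell.meta$condition == "control"
counts.homog <- counts[, keep, drop = FALSE]
ncol(counts.homog)

# With splatter installed, estimation would then use only this subset:
# params <- splatter::splatPopEstimate(counts = counts.homog)
```

If the data are stored in a SingleCellExperiment, the same logical vector can be used to subset its columns, for example `sce[, keep]`.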

Let us know if you have other questions!

mugpeng commented 2 years ago

thanks!

lazappi commented 2 years ago

I am going to close this now but please comment if you have further questions.

Oshlack / splatter

Is it possible to use multithreading when applying splatPopEstimate? #145