Confusion on the parallelization of R CoGAPS

LiuCanidk commented 5 months ago

In the tutorial, the author mentions that to parallelize CoGAPS on a large dataset, one must call the setDistributedParams function, whose key parameter is nSets.

I set nSets to 20 and did not see much improvement in efficiency. Is nSets equivalent to what we are more familiar with, i.e., the number of threads or cores? Are there any other parameters in setDistributedParams that help parallelize R CoGAPS better? And how many threads should I request on a server when I set nSets to 5, 10, or 20?

dimalvovs commented 5 months ago

@LiuCanidk, I assume the tutorial is this one. Please note that distributed also needs to be set, to either "genome-wide" or "single-cell", so that smaller datasets are created for parallel execution by splitting across genes or observations (cells), respectively. You can give any number of threads, but to be fully parallel without wasting resources, one thread per set (i.e., as many threads as nSets) is good.
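A minimal sketch of the setup described above, assuming one worker per set. The helper `run_distributed` and the variable `n_sets` are hypothetical names used only for illustration. The sketch also assumes that the `BPPARAM` argument of `CoGAPS()` is what controls the workers in distributed mode; check this against the documentation of your installed version.

```r
library(CoGAPS)

# Number of subsets the data is split into; one worker per subset is assumed.
n_sets <- 20

params <- CogapsParams(
  nPatterns = 7,
  nIterations = 50000,
  seed = 1234,
  sparseOptimization = TRUE,
  distributed = "genome-wide"  # split across genes; "single-cell" splits across cells
)
params <- setDistributedParams(params, nSets = n_sets)

# Hypothetical helper (not part of CoGAPS): runs distributed CoGAPS with an
# explicit BiocParallel backend so the worker count is visible in the script.
# Assumption: BPPARAM controls the workers used in distributed mode.
run_distributed <- function(expr, params, n_workers) {
  backend <- BiocParallel::MulticoreParam(workers = n_workers)
  CoGAPS(expr, params, BPPARAM = backend, outputFrequency = 1000)
}

# Example call, with `count` being a genes-by-samples expression matrix:
# result <- run_distributed(count, params, n_workers = n_sets)
```

Requesting the same number of cores from the scheduler as `n_sets` matches the "1 thread for 1 nSet" advice; requesting more is unlikely to help if each set is handled by a single worker.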

LiuCanidk commented 5 months ago

@dimalvovs thanks a lot. The distributed parameter seems to default to genome-wide, so I do not need to specify it, right? I am just confused by some tests I ran:

When I run a large single-cell dataset comprising ~20000 genes and 10000+ cells, the output log shows workers being activated, with messages like "worker #X is starting!", which seems to signal that the parallelization is working.
But when I run a bulk dataset comprising 20000+ genes and 23 samples, it behaves just as when I do not set the nSets parameter, i.e., nothing changes in the log. There is no obvious information telling me whether the parallelization is working. Also, the single-cell run takes ~3 days, which keeps me asking myself: did I miss some parameters that the parallelization needs?

dimalvovs commented 5 months ago

@LiuCanidk interesting, could you share your code and more details on how your data is organized? Does 23 samples and 20000 genes mean roughly 20k by 20k dimensionality? 3 days does seem a plausible runtime for a dataset like that; please see the timing estimates in the article.

LiuCanidk commented 5 months ago

@dimalvovs I mean that the single-cell dataset, a matrix of 20000+ genes by 10000+ cells, may take ~3 days to find 7 patterns, not the bulk dataset with 20000+ genes and 23 bulk samples. The bulk dataset takes only about 1h to find 4 patterns with default parameters.

Code:

#####################################
# project: SKCM melanoma imitation
# author: Volcano Liu
# date: 2024-04-02
# step05: NMF --- single cell gene program

rm(list=ls())

# set the working directory
workdir='/work/share/acuwbf4fll/liucan/HND_project/single_cell/00.1imitation_melanoma_cellline/05.NMF/'
setwd(workdir)

# load the data
load(paste0(workdir, 'seurat_invitro_melanoma_final.rda'))
phate=read.csv(paste0(workdir, 'phate_embedding.csv'), row.names = 1)

# add the embedding of phate
library(Seurat)
library(ggplot2)
sce.invitro[['phate']]=CreateDimReducObject(as.matrix(phate), key = 'PHATE', assay = DefaultAssay(sce.invitro))

# create new parameters object
library(CoGAPS)
params <- CogapsParams(nIterations=50000, # 50000 iterations
                       seed=1234, # for consistency across stochastic runs
                       nPatterns=7, # each thread will learn 7 patterns
                       sparseOptimization=TRUE, # optimize for sparse data
                       distributed="genome-wide") # parallelize across sets
# genome-wide refers to distribute among genes
# single-cell refers to distribute among cells
# see https://github.com/FertigLab/CoGAPS/issues/56

# set parallelization
params = setDistributedParams(params, nSets=20)
params

# data input
count=as.matrix(sce.invitro@assays$RNA$data)
cogapsresult <- CoGAPS(count, params, outputFrequency = 1000)
saveRDS(cogapsresult, "cogaps_result.Rds")

data organization:

bulk: expression matrix with gene symbol as rownames, and sample_id as colnames, 27396 genes * 23 samples
single cell: expression matrix with gene symbol as rownames, and cell barcode as colnames, 19361 genes * 11042 cells
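To put the two datasets side by side, here is a conceptual sketch in base R of what splitting the genes into 20 sets means for the size of each sub-problem. The function `partition_indices` is hypothetical and only illustrates the arithmetic; how CoGAPS samples its sets internally may differ.

```r
# Conceptual sketch only: split gene indices into n_sets random subsets,
# mirroring the idea of "genome-wide" distribution. Not CoGAPS's actual code.
partition_indices <- function(n_items, n_sets, seed = 1234) {
  stopifnot(n_sets >= 1, n_items >= n_sets)
  set.seed(seed)
  shuffled <- sample.int(n_items)
  # assign the shuffled indices to sets in round-robin order
  split(shuffled, rep_len(seq_len(n_sets), n_items))
}

bulk_sets <- partition_indices(27396, 20)  # bulk: 27396 genes
sc_sets   <- partition_indices(19361, 20)  # single cell: 19361 genes

# genes per subset
range(lengths(bulk_sets))  # 1369 1370
range(lengths(sc_sets))    # 968 969

# matrix entries handled per subset (genes in the subset x all columns)
bulk_entries <- max(lengths(bulk_sets)) * 23
sc_entries   <- max(lengths(sc_sets)) * 11042
```

Under this view, each bulk subset is a matrix of about 1370 genes by 23 samples, while each single-cell subset still spans all 11042 cells, so the per-worker work differs by orders of magnitude between the two runs.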

dimalvovs commented 4 months ago

3 days for the single cell dataset and 1h for the bulk dataset does look in line with expectations. Closing this question; feel free to reopen if it is not yet answered.

FertigLab / CoGAPS

Confusion on the parallelization of R CoGAPS #90
