FertigLab / CoGAPS

Bayesian MCMC matrix factorization algorithm
https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
BSD 3-Clause "New" or "Revised" License

Confusion on the parallelization of R CoGAPS #90

Closed LiuCanidk closed 4 months ago

LiuCanidk commented 5 months ago

In the tutorial, the author mentions that to parallelize a run on a large dataset, one must call the setDistributedParams function and set its key parameter, nSets.

I set nSets to 20 and did not see much improvement in efficiency. Is the nSets parameter equivalent to what we are more familiar with, i.e., the number of threads or cores? Are there any other parameters in the setDistributedParams function that help parallelize R CoGAPS better? And how many threads should I request on a server when I set nSets to 5, 10, or 20?

dimalvovs commented 5 months ago

@LiuCanidk, I assume the tutorial is this. Please note that distributed also needs to be set, to either "genome-wide" or "single-cell", so that the data are split into smaller datasets for parallel execution across either genes or observations (cells). You can give any number of threads, but to be fully parallel without wasting resources, 1 thread per set is a good choice.
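The advice above could look roughly like the following sketch. It assumes the CoGAPS and BiocParallel packages are installed, that the `BPPARAM` argument of `CoGAPS()` is how the worker pool is supplied, and that one worker per set is wanted. The toy matrix and the names `toy_data` and `n_sets` are illustrative only and do not come from the thread.

```r
# Sketch only; toy_data and n_sets are hypothetical names, not from the thread.
n_sets <- 4

if (requireNamespace("CoGAPS", quietly = TRUE) &&
    requireNamespace("BiocParallel", quietly = TRUE)) {
  library(CoGAPS)

  # Small non-negative toy matrix: 200 genes (rows) x 30 samples (columns).
  set.seed(1)
  toy_data <- matrix(rexp(200 * 30), nrow = 200,
                     dimnames = list(paste0("gene", 1:200),
                                     paste0("sample", 1:30)))

  params <- CogapsParams(nPatterns = 3, nIterations = 1000, seed = 1234,
                         distributed = "genome-wide")  # split across genes
  params <- setDistributedParams(params, nSets = n_sets)

  # Assumption: one BiocParallel worker per set ("1 thread per set").
  # MulticoreParam relies on forking; on Windows SnowParam may be needed instead.
  result <- CoGAPS(toy_data, params,
                   BPPARAM = BiocParallel::MulticoreParam(workers = n_sets))
}
```

Requesting more workers than sets would presumably leave the extra workers idle, which is why the thread suggests matching the two numbers.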

LiuCanidk commented 5 months ago

@dimalvovs thanks a lot, but the distributed parameter seems to default to genome-wide, so do I not need to specify it? I am just confused by some tests I ran:

dimalvovs commented 5 months ago

@LiuCanidk interesting, could you share your code and also share more details on how your data is organized? 23 samples and 20000 genes means roughly 20k by 20k dimensionality? 3 days does seem a plausible runtime for a dataset like that; please see the timing estimates in the article.

LiuCanidk commented 5 months ago

@dimalvovs I mean that a single-cell dataset, a matrix with 20000+ genes and 10000+ cells, may take ~3 days to find 7 patterns, not the bulk dataset with 20000+ genes and 23 bulk samples. The bulk dataset takes only about 1h to find 4 patterns with default parameters.

Code:

    #####################################
    # project: SKCM melanoma imitation
    # author: Volcano Liu
    # date: 2024-04-02
    # step05: NMF --- single cell gene program

    rm(list=ls())

    # set the working directory
    workdir='/work/share/acuwbf4fll/liucan/HND_project/single_cell/00.1imitation_melanoma_cellline/05.NMF/'
    setwd(workdir)

    # load the data
    load(paste0(workdir, 'seurat_invitro_melanoma_final.rda'))
    phate=read.csv(paste0(workdir, 'phate_embedding.csv'), row.names = 1)

    # add the embedding of phate
    library(Seurat)
    library(ggplot2)
    sce.invitro[['phate']]=CreateDimReducObject(as.matrix(phate), key = 'PHATE', assay = DefaultAssay(sce.invitro))

    # create new parameters object
    library(CoGAPS)
    params <- CogapsParams(nIterations=50000, # 50000 iterations
                           seed=1234, # for consistency across stochastic runs
                           nPatterns=7, # each set will learn 7 patterns
                           sparseOptimization=TRUE, # optimize for sparse data
                           distributed="genome-wide") # parallelize across sets
    # genome-wide refers to distribute among genes
    # single-cell refers to distribute among cells
    # see https://github.com/FertigLab/CoGAPS/issues/56

    # set parallelization
    params = setDistributedParams(params, nSets=20)
    params

    # data input
    count=as.matrix(sce.invitro@assays$RNA$data)
    cogapsresult <- CoGAPS(count, params, outputFrequency = 1000)
    saveRDS(cogapsresult, "cogaps_result.Rds")
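To make the difference between the two distributed modes concrete, here is a minimal base-R sketch of the splitting idea only. It is a conceptual illustration, not CoGAPS source code. The matrix sizes, the helper `random_sets`, and the equal-size random assignment are assumptions chosen for the example.

```r
# Conceptual illustration only; this is not how CoGAPS is implemented internally.
set.seed(1234)
n_genes <- 2000   # hypothetical sizes, smaller than the dataset in the thread
n_cells <- 1000
n_sets  <- 20     # plays the role of nSets

# Toy genes-by-cells count matrix.
toy <- matrix(rpois(n_genes * n_cells, lambda = 1), nrow = n_genes)

# Randomly assign indices 1..n to n_sets groups of near-equal size.
random_sets <- function(n, n_sets) {
  split(seq_len(n), sample(rep_len(seq_len(n_sets), n)))
}

# "genome-wide": each subset keeps all cells but only some genes (rows).
gene_subsets <- lapply(random_sets(n_genes, n_sets),
                       function(i) toy[i, , drop = FALSE])

# "single-cell": each subset keeps all genes but only some cells (columns).
cell_subsets <- lapply(random_sets(n_cells, n_sets),
                       function(j) toy[, j, drop = FALSE])

dim(gene_subsets[[1]])  # 100 genes x 1000 cells
dim(cell_subsets[[1]])  # 2000 genes x 50 cells
```

Each subset is an independent job, so the number of subsets, not the size of the full matrix, sets the upper bound on how many workers can be kept busy at once.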

data organization:

dimalvovs commented 4 months ago

3 days for the single-cell dataset and 1h for the bulk dataset do look in line with expectations. Closing this question; feel free to reopen if it is not yet answered.