Closed LiuCanidk closed 7 months ago
@LiuCanidk, I assume the tutorial is this. Please note that distributed also needs to be set, either to "genome-wide" or to "single-cell", to create smaller datasets for parallel execution by splitting over genes or over observations (cells), respectively. You can give any number of threads, but to be largely parallel without wasting resources, 1 thread per set (nSets) is a good choice.
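A minimal sketch of the "1 thread per set" advice, in case it helps. It assumes that distributed runs are dispatched through BiocParallel and that CoGAPS() accepts a BPPARAM backend; please check both against the documentation of your installed CoGAPS version. The value of n_sets and the choice of "single-cell" are only illustrative.

```r
library(CoGAPS)
library(BiocParallel)

n_sets <- 20  # illustrative value, taken from the question in this thread

params <- CogapsParams(nPatterns = 7,
                       nIterations = 50000,
                       seed = 1234,
                       sparseOptimization = TRUE,
                       distributed = "single-cell")  # or "genome-wide"
params <- setDistributedParams(params, nSets = n_sets)

# Assumption: one BiocParallel worker per set, so no set waits for a free
# worker and no worker sits idle. MulticoreParam relies on forking, so it
# is not available on Windows.
bp <- MulticoreParam(workers = n_sets)

# 'count' would be the genes x cells matrix; left commented out because
# the run itself can take days on a large dataset.
# result <- CoGAPS(count, params, BPPARAM = bp)
```

On a cluster this would mean requesting at least n_sets cores for the job; asking for more should not speed up the per-set runs under this setup.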
@dimalvovs thanks a lot. The distributed parameter seems to default to genome-wide, so do I not need to specify it at all? I am just confused by some tests I ran:
@LiuCanidk interesting, could you share your code and some more details on how your data is organized? Does 23 samples and 20000 genes mean roughly 20k by 20k dimensionality? 3 days does seem a plausible runtime for a dataset like that; please see the timing estimates in the article.
@dimalvovs I mean that the single-cell dataset, a matrix with 20000+ genes and 10000+ cells, may take ~3 days to find 7 patterns, not the bulk dataset with 20000+ genes and 23 bulk samples. The bulk dataset took only ~1h to find 4 patterns with default parameters.
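To put the two datasets side by side, here is a small base-R calculation using only the numbers quoted in this thread. The dimensions are lower bounds ("20000+", "10000+"), and the two runs also differ in the number of patterns, so this is a rough comparison of problem size rather than a scaling law.

```r
# Dimensions quoted in the thread (lower bounds).
n_genes <- 20000
n_cells <- 10000  # single-cell dataset
n_bulk  <- 23     # bulk dataset

entries_sc   <- n_genes * n_cells  # 2e8 matrix entries
entries_bulk <- n_genes * n_bulk   # 460000 matrix entries

# The single-cell matrix is roughly 435 times larger than the bulk matrix.
size_ratio <- entries_sc / entries_bulk

# Reported runtimes: ~3 days (7 patterns) vs ~1 hour (4 patterns).
runtime_ratio <- (3 * 24) / 1  # 72

cat("size ratio:", round(size_ratio), "\n")
cat("runtime ratio:", runtime_ratio, "\n")
```

The reported runtime grows far less than the matrix size does, so a multi-day run for the single-cell matrix is not surprising on size alone.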
Code:
#####################################
# project: SKCM melanoma imitation
# author: Volcano Liu
# date: 2024-04-02
# step05: NMF --- single cell gene program
rm(list=ls())

# set the working directory
workdir='/work/share/acuwbf4fll/liucan/HND_project/single_cell/00.1imitation_melanoma_cellline/05.NMF/'
setwd(workdir)

# load the data
load(paste0(workdir, 'seurat_invitro_melanoma_final.rda'))
phate=read.csv(paste0(workdir, 'phate_embedding.csv'), row.names = 1)

# add the embedding of phate
library(Seurat)
library(ggplot2)
sce.invitro[['phate']]=CreateDimReducObject(as.matrix(phate), key = 'PHATE', assay = DefaultAssay(sce.invitro))

# create new parameters object
library(CoGAPS)
params <- CogapsParams(nIterations=50000, # 50000 iterations
                       seed=1234, # for consistency across stochastic runs
                       nPatterns=7, # each thread will learn 7 patterns
                       sparseOptimization=TRUE, # optimize for sparse data
                       distributed="genome-wide") # parallelize across sets
# genome-wide refers to distribute among genes
# single-cell refers to distribute among cells
# see https://github.com/FertigLab/CoGAPS/issues/56

# set parallelization
params = setDistributedParams(params, nSets=20)
params

# data input
count=as.matrix(sce.invitro@assays$RNA$data)
cogapsresult <- CoGAPS(count, params, outputFrequency = 1000)
saveRDS(cogapsresult, "cogaps_result.Rds")
data organization:
3 days for the single-cell dataset and 1h for the bulk dataset are in line with expectations. Closing this question; feel free to reopen if it is not yet answered.
In the tutorial, the author mentions that to achieve parallelization for a large dataset, one must call the setDistributedParams function and set its key parameter, nSets.
I set nSets to 20 and did not see much improvement in efficiency. Is the nSets parameter equivalent to what we are more familiar with, i.e., the number of threads or cores? Are there any other parameters in the setDistributedParams function that help achieve better parallelization with R CoGAPS? And how many threads should I request on a server when I set nSets to 5, 10, or 20?
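For the last question, a small base-R sketch of what the "1 thread for 1 nSet" advice given in this thread implies for nSets of 5, 10, and 20. It assumes a genome-wide split of about 20000 genes into equally sized sets; the random assignment at the end only illustrates the idea of partitioning and is not CoGAPS's own subsetting code. As far as I know, setDistributedParams also accepts cut, minNS, and maxNS, but those control how patterns are matched across sets rather than how many workers are used.

```r
n_genes <- 20000               # approximate gene count from this thread
n_sets_options <- c(5, 10, 20)

plan <- data.frame(
  nSets        = n_sets_options,
  rows_per_set = n_genes / n_sets_options,  # assumes an even split
  workers      = n_sets_options             # one thread per set
)
print(plan)

# Illustration only: randomly assign each gene to one of 20 equally sized sets.
set.seed(1234)
set_id <- sample(rep(seq_len(20), length.out = n_genes))
set_sizes <- table(set_id)
```

Under these assumptions, each set with nSets = 20 holds about 1000 genes and still sees every cell, so each worker's problem shrinks along the gene dimension only.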