cole-trapnell-lab / monocle-release

281 stars 116 forks source link

differentialGeneTest taking too long #49

Closed sudhirthakurela closed 6 years ago

sudhirthakurela commented 7 years ago

hi there,

I am using Mocole and have around 700 cells and ~ 13000 expressed genes and 13 clusters after filtering out and running DDRTree. While running differentialGeneTest it is taking too long. I am using 10 cores. Moreover, I am also not sure if it is doing anything or stuck somewhere without throwing an error. Is there a way to speed up the process? Or do you think ~13000 genes is too high?

ST

TaoDFang commented 7 years ago

me toooooooooooo. everything right before clustering. but then from differentialGeneTest , things are going to crazy. I even dint know where I can see the result since it was always running

Xiaojieqiu commented 7 years ago

Hi there,

Could you please provide at least some code which we can take a look? It seems like both of you are using DDRTree to get clusters which we don't recommend to do. But I am not sure what is going and cannot provide any suggestions without more details

sudhirthakurela commented 7 years ago

Here is the set of code that I am using:

Step1: select genes.

scDNMTKO_Monocle <- detectGenes(scDNMTKO_Monocle, min_expr = 20) fData(scDNMTKO_Monocle)$use_for_ordering <- fData(scDNMTKO_Monocle)$num_cells_expressed > 0.05 * ncol(scDNMTKO_Monocle)

perform a PCA

plot_pc_variance_explained(scDNMTKO_Monocle, return_all = F)

Reduce Dimension

scDNMTKO_Monocle <- reduceDimension(scDNMTKO_Monocle, max_components = 2, norm_method = 'log', reduction_method = 'tSNE', verbose = T)

Cluster cell using densityPeaks

scDNMTKO_Monocle <- clusterCells(scDNMTKO_Monocle, verbose = F) plot_cell_clusters(scDNMTKO_Monocle, color_by = 'as.factor(Cluster)') plot_cell_clusters(scDNMTKO_Monocle, color_by = 'as.factor(Day)')

decide the threshold for defining the cell clusters

plot_rho_delta(scDNMTKO_Monocle, rho_threshold = 2, delta_threshold = 4 )

Re-run clustering

scDNMTKO_Monocle <- clusterCells(scDNMTKO_Monocle,rho_threshold = 5,delta_threshold = 4,skip_rho_sigma = T,verbose = F) plot_cell_clusters(scDNMTKO_Monocle, color_by = 'as.factor(Cluster)')

plot_cell_clusters(scDNMTKO_Monocle, color_by = 'as.factor(Day)') clustering_DEG_genes <- differentialGeneTest(scDNMTKO_Monocle[expressed_genes,],fullModelFormulaStr = '~Cluster', cores = 10)

Thanks,

ST

TaoDFang commented 7 years ago

Sorry Xiaojie, sorry for unclear question. Here is my core code in R, HSMM_gene_annotation is a matrix whose rows are gene names and columns ares cell names , HSMM_sample_sheet is data.frame whose row are cells with one column "Hours",HSMM_sample_sheet is data.frame whose row are genes with one column "gene short name", and if you like, maybe I can also send you my workplace, it it too bug to upload to giuhub:

pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet) fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation) HSMM <- newCellDataSet(HSMM_expr_matrix, phenoData = pd, featureData = fd, lowerDetectionLimit = 0.5, expressionFamily = gaussianff())

rpc_matrix <- relative2abs(HSMM, method = "num_genes") HSMM <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"), phenoData = pd, featureData = fd, lowerDetectionLimit = 0.5, expressionFamily = negbinomial.size()) HSMM <- estimateSizeFactors(HSMM) HSMM <- estimateDispersions(HSMM)

HSMM <- detectGenes(HSMM, min_expr = 0.1) expressed_genes <- row.names(subset(fData(HSMM), num_cells_expressed >= 10))

disp_table <- dispersionTable(HSMM) unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) HSMM <- setOrderingFilter(HSMM, unsup_clustering_genes$gene_id) plot_ordering_genes(HSMM) plot_pc_variance_explained(HSMM, return_all = F)

HSMM <- reduceDimension(HSMM, max_components = 2, num_dim = 6, reduction_method = 'tSNE', verbose = T) HSMM <- clusterCells(HSMM, num_clusters = 4) plot_cell_clusters(HSMM)

HSMM_myo <- estimateDispersions(HSMM) diff_test_res <- differentialGeneTest(HSMM_myo[expressed_genes,], fullModelFormulaStr = "~Hours") ordering_genes <- row.names (subset(diff_test_res, qval < 0.01))

HSMM_myo <- setOrderingFilter(HSMM_myo, ordering_genes) plot_ordering_genes(HSMM_myo)

HSMM_myo <- reduceDimension(HSMM_myo, max_components = 2, method = 'DDRTree') HSMM_myo <- orderCells(HSMM_myo) plot_cell_trajectory(HSMM_myo, color_by = "CellType")

TaoDFang commented 7 years ago

Hoi xiaojie, Maybe you can ignore my problem. without changing anything, it works now

Xiaojieqiu commented 7 years ago

Hi ST,

Thanks for your code. The scripts look pretty good to me. Here are a few comments I have:

  1. scDNMTKO_Monocle <- detectGenes(scDNMTKO_Monocle, min_expr = 20) min_expr was set way too high here. min_expr means that the minimal expression which we regarded as expressed. Normally it is 1 for UMI counts or 0.1 for FPKM or other relative expression measure
  2. scDNMTKO_Monocle <- reduceDimension(scDNMTKO_Monocle, max_components = 2, norm_method = 'log', reduction_method = 'tSNE', verbose = T) You can pass a num_dim argument. to this function where the top number of pca components will be used for tSNE dimension reduction. The value for num_dim can be identified when you look at the scree plot from your previous plot_pc_variance_explained function 3.differentialGeneTest(scDNMTKO_Monocle[expressed_genes,],fullModelFormulaStr = '~Cluster', cores = 10) It seems to me you used expressed_genes to subset your cds. Please first test on a small set of genes with the following code

differentialGeneTest(scDNMTKO_Monocle[1:100,],fullModelFormulaStr = '~Cluster', cores = 10)

or you can just use a single core: differentialGeneTest(scDNMTKO_Monocle[1:100,],fullModelFormulaStr = '~Cluster', cores = 1) Let me know how long does those code take to run

sudhirthakurela commented 7 years ago

Hi,

Just to get a better idea about min_exp:

I have data from SmartSeq2 and supplying raw read counts while creating seurat object.

nbt=CreateSeuratObject(scDNMTKORNA,project="scDNMT1KO",min.cells = 30,names.field = 1, names.delim = "",is.expr=20)

Using 20 as cutoff results in ~10000 genes passing the criteria.

Regarding your 3 point, I had meanwhile ran it with just 200 genes and it was done within 5 minutes. I could also run it using all expressed genes but that was not possible in an interactive R session but it worked if I submit it as job to the cluster (took about an hour with 30 cores)

Thanks for suggestions. Will let you know how step 2 goes.

yicheinchang commented 7 years ago

Run into the same issue. However, I received the following warning messages

Warning in inherits(x, "factor"): closing unused connection 8 (<-xxxxx.com:11936)
Warning in inherits(x, "factor"): closing unused connection 7 (<-xxxxx.com:11936)
Warning in inherits(x, "factor"): closing unused connection 6 (<-xxxxx.com:11936)
Warning in inherits(x, "factor"): closing unused connection 5 (<-xxxxx.com:11936)

(Note: xxxxx.com is the host name of my machine)

Not quite familiar with the parallel computing in R. It looks like that there is a disconnection between the master and the workers.

In addition, I have one related question. @Xiaojieqiu recommended not using DDRTree to get clusters. Is DDRTree still the preferred method for reduceDimension?

Xiaojieqiu commented 7 years ago

@yicheinchang reduceDimension now performs two tasks. The first one is to run RGE for trajectory reconstruction. By default, it uses DDRTree. The second one is clustering. The default approach for clustering is to use tSNE for dimension reduction. Hope this helps