SimoneTiberi / distinct

distinct: a method for differential analyses via hierarchical permutation tests
11 stars 2 forks source link

Runs Indefinitely #8

Closed bshim181 closed 4 months ago

bshim181 commented 6 months ago

Hello,

Thank you for the development of distinct tool. I have been trying out both muscat and distinct for pseudobulk analysis. I am running into an issue while running distinct where it runs indefinitely without termination. My dataset has around 84000 cells and it is taking more than 3 days to run without completion message. I am using parameter ncores=16 and was wondering about the expected completion time.

SimoneTiberi commented 6 months ago

Hi Bohoon, 3 days is indeed a very long time! In our benchmarks, smaller datasets of 3,600 cells were analyzed in 3-4 minutes. Parallelization of tasks is done at the cell-type level; so, distinct uses at most n_cell_types cores (our default, if n_cores is left unspecified). How many cell types do you have?

Time heavily depends on the number of significant results: if you have a lot of differential genes, time could increase a lot. That's the only physiological potential cause I can think of. Alternatively, something is wrong with the input data, or with out internal functions.

If you want to share your scripts and data, I can have a deeper look.

Have a good one, Simone

bshim181 commented 6 months ago

So currently, we have 6 cell types and I have tried n_cores = 6 as well which also ran for more than three days. Is there a possibility where its detecting no differentially expressed genes and the program wont terminate? Since in our case, we are looking at potential gene level differences within different states of Tregs(activated vs exhausted), there might not exist significant level of gene profile differences.

Here is the code block which I am using to run distinct! Thank you!

merged.sce <- as.SingleCellExperiment(merged,assay="RNA")

merged.sce <- computeLibraryFactors(merged.sce) merged.sce <- logNormCounts(merged.sce) cpm(merged.sce) <- calculateCPM(merged.sce,assay.type="counts")

(merged.sce <- prepSCE(merged.sce, kid = "seurat_clusters", gid = "Condition", sid = "Sample", drop=T))

colData(merged.sce)

Batch = factor(c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B")) Sex = factor(c("Male", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female"))

merged.samples <- merged.sce@metadata$experiment_info$sample_id merged.group <- merged.sce@metadata$experiment_info$group_id

merged.design <- model.matrix(~merged.group + Batch + Sex)

merged.design <- model.matrix(~merged.group)

rownames(merged.design) <- merged.samples

set.seed(61217)

merged.res <- distinct_test(x = merged.sce, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", design = merged.design, P_4 = 20000, column_to_test = 2, min_non_zero_cells = 50, n_cores = 16)

merged.res <- log2_FC(res = merged.res, x = merged.sce, name_assays_expression = "cpm", name_group = "group_id", name_cluster = "cluster_id")

SimoneTiberi commented 5 months ago

Ok, what I meant is that using n_cores = n_cells (6) will speed-up calculations, but using more will not make a difference.

Everything seems ok; if you want, you can also account for "batch" and sex" using the commented design.

> Is there a possibility where its detecting no differentially expressed genes and the program wont terminate? Actually, the more differential genes there are, the longer distinct will take.

I see you are using P_4 = 20,000: this actually increases the permutations used for significant genes, and hence the runtime. I suggest you use the default values (P_4 = 10,000), which is more than enough to obtain good estimates of the p-values.

In addition, just to try and see if everything runs, you can set P_3 = 1,000 and P_4 = 1,000. This will give less accurate estimates of small p-values, but it will run much faster.

If you want to share the data, I can try the package there.

Take care, Simone

bshim181 commented 5 months ago

Hello,

I have tried with P_2 = 100, P_3=100 and P_4 = 100 which took about one hour and half.

Screenshot 2024-06-10 at 11 56 12 AM

One thing that I noticed which was slightly abnormal was that when I looked at top_results for cluster 2, I saw that all genes had p-val of 0.009 (all equal p-value). I know that since permutations used is very low, these p-values should not be taken as granted but is this what you would normally see?

SimoneTiberi commented 5 months ago

Hi Bohoon, ok, this explains the huge computational cost: all p-values are significant, which means that they will use up to P_4 permutations (10 k by default), which increases a lot the computational cost.

Something is probably going wrong with this cluster. Do you notice the same behaviour with the other clusters too?

An easy solution may be to run the analysis without cluster 2. Alternatively, I can run the package on your data.

Simone

bshim181 commented 5 months ago

I am not 100% sure on the sharing permissibility since it is a patient dataset. One thing to consider is that this dataset is supposed to be all sorted for Tregs and therefore, the clusters are capturing potential variation in cell states among Tregs. I am assuming there should be very minute levels of differences in terms of expression profile.

When I ran muscat on this same dataset, it did not pick up any differentially expressed genes among all clusters. Is there a possibility that since differences are so miniscule, its considering all genes as differentially expressed (all-p values become significant).

SimoneTiberi commented 5 months ago

muscat detects DGE (differences in means), while distinct uses a non-parametric approach and also detects differences that do not involve the mean (e.g., difference in variance). So it can be expected that distinct detects more differences than muscat, but the kind of pattern you observe is extreme.

One thing I just noticed is that top_results ranks results from most to least significant; so when doing "head" it is normal that you have all significant results printed.

Check how many significant results there are:

a = top_results( ...)
mean(a$p_val < 0.01)
hist(a$p_val)
bshim181 commented 5 months ago

I ran distinct with P_4 = 1000 as you suggested and it did finish within one day. Interestingly, when i ran top_results for all clusters, there were some clusters that did not have any differentially expressed genes where other clusters had 7,000 to 10,000 differentially expressed genes detected.

These are the histogram of p values for clusters 0 and 1 where it demonstrated noticeably large number of differentially expressed genes. Are these numbers out of ordinary or would they eventually get filtered out after increasing P_3 and P_4 values? (I will give it around five days to run to check whether they successfully finish.)

000056 000054

SimoneTiberi commented 5 months ago

Hi Bohoon, there are definitely A LOT of significant genes...this is certainly an unusual histogram of p-values (all p < 0.03).

Increasing P_3 and P_4 will not make a difference: this enables a more accurate estimate of small p-values, but we're talking about a small difference; this will not push p-values to 1.

Hard to say what is causing this. If can make the data anonymous I can look at it.

Take care, Simone