HelenaLC / CATALYST

Cytometry dATa anALYsis Tools

Clustering and resulting UMAPs keep changing in every re-run of the analysis #409

Closed nbahti closed 1 month ago

nbahti commented 1 month ago

Hello,

Thanks a lot for a wonderful tool.

I have a problem reproducing the same clusters and their locations on the UMAPs each time I rerun my code with the "cluster" and "runDR" commands, although I use the same seed for the randomization involved. I don't know whether this is expected. For example, you can see below the UMAPs from two runs of the code on separate days. For the selected meta18 clusters, the UMAP structure is similar, but the cluster assignment of some cells seems to change every time I run the "cluster" command (with the same seed = 1). Accordingly, plotExprHeatmap also gives differing heatmaps (the differences are slight, but crucial for us). Can you help with this issue?

image

HelenaLC commented 1 month ago

Yes, so... both UMAP and the clustering implemented here are stochastic. You can set the seed for random number generation in R via set.seed(123) (always, independent of any software package). Unfortunately, there is an internal set.seed() call in the clustering method itself (over which I have no control, and it is generally not recommended that functions do this without the user's explicit knowledge...), which is why we also pass a seed to it in cluster(). We explain this in our workflow here; I am also copy-pasting the corresponding section for completeness:

FlowSOM output can be sensitive to random starts [15]. To make results reproducible, we first set a seed for random number generation prior to calling cluster(), and, secondly, specify a seed argument inside cluster(). Unfortunately, this is necessary as the ConsensusClusterPlus() function internally calls set.seed() and will, when not provided with a seed, overwrite our seed using the current system time (set.seed(as.numeric(Sys.time()))).
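To illustrate why both seeds are needed, here is a minimal base-R sketch. `reseeding_step()` is a hypothetical stand-in written for this example, not the actual ConsensusClusterPlus() code; it only mimics the behavior described above, where a function reseeds from the system time unless it is given a seed.

```r
# Hypothetical stand-in for a function that reseeds the RNG internally.
# When `seed` is NULL it falls back to the system time, so any seed set by
# the caller beforehand is overwritten.
reseeding_step <- function(n, seed = NULL) {
  if (is.null(seed)) seed <- as.numeric(Sys.time())
  set.seed(seed)
  sample.int(1000L, n)
}

# Outer seed only: the internal time-based reseed decides the draw,
# so this result can differ from one run to the next.
set.seed(1234)
run1 <- reseeding_step(5)

# Outer seed plus explicit seed argument: repeated calls agree.
set.seed(1234)
run2 <- reseeding_step(5, seed = 1234)
set.seed(1234)
run3 <- reseeding_step(5, seed = 1234)
identical(run2, run3)  # TRUE
```

In CATALYST terms, this corresponds to calling set.seed() before cluster() and also passing the seed argument to cluster(), as the workflow excerpt describes.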

nbahti commented 1 month ago

Thank you for your quick response.

Yes, I understand, and I followed the instructions from your workflow. I always call set.seed() outside, regardless of the cluster command's internal workings, but the results are still not the same when re-running.

SamGG commented 1 month ago

If the result is not robust when changing the seed, I would not bet on it. If you trust such a result, the next step is to check if the cluster of interest makes sense using bi-parametric plots.

HelenaLC commented 1 month ago

I agree with @SamGG's point that drastic result changes with different seeds would indicate non-robustness. However, I am still confused as to why results would change if you're fixing the seed. Sorry to ask - but are you sure you're doing this in the corresponding code chunks, and there is no caching or similar? Never seen such significant changes with the same seed. UMAPs are the same as far as I can tell, so it's gotta be the clustering step.

nbahti commented 1 month ago

Here is my code snippet:

image

Inspired by this, I ran some tests, and the clustering (and cell assignments) does not seem to change between runs:

image image

HelenaLC commented 1 month ago

Alright, that is good news and as it should be! It is still not clear how the clusterings displayed in the two UMAPs you sent originally came about. Any clue as to what might have been done differently there? You also mentioned that plotExprHeatmap gives slightly different results, but this wouldn't be the case if the clusterings are all.equal() as tested above.
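For anyone who wants to repeat this kind of check, here is a minimal base-R sketch that compares the cluster labels from two runs. `same_partition()` is a hypothetical helper written for this example, not part of CATALYST. Unlike a plain all.equal(), it also treats two runs as matching when the cells are grouped identically but the cluster labels are merely renamed.

```r
# Hypothetical helper: TRUE if two label vectors define the same grouping
# of cells, even if the cluster labels themselves are permuted or renamed.
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)  # cross-tabulate labels from run A against run B
  # Same partition: every cluster in one run maps to exactly one in the other.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Toy label vectors standing in for the per-cell cluster IDs of two runs.
run_a <- c(1, 1, 2, 3, 3)
run_b <- c("x", "x", "y", "z", "z")  # same grouping, different names
run_c <- c(1, 2, 2, 3, 3)            # second cell assigned differently

same_partition(run_a, run_b)  # TRUE
same_partition(run_a, run_c)  # FALSE
```

With CATALYST, the two vectors would presumably be the per-cell cluster IDs extracted after each run (for example via cluster_ids() for the meta18 clustering); that step is an assumption here and is not exercised by the sketch.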