JinmiaoChenLab / cytofkit

cytofkit: an integrated flow/mass cytometry data analysis pipeline
http://jinmiaochenlab.github.io/cytofkit/

Is phenograph deterministic? #32

Open lconde-ucl opened 6 years ago

lconde-ucl commented 6 years ago

Hi,

Apologies in advance if this is a silly question, but is phenograph deterministic? That is, when applied repeatedly to the same cytof/flow dataset with the same value of k, is it expected to always produce exactly the same clustering output? This is what we observe when trying to assess phenograph reproducibility on our data, but my colleague and I are discussing whether this is because phenograph is very robust or because it is inherently deterministic (in which case the only way to test reproducibility is by subsampling or by using different values of k).

Thanks, Lucia

MattMyint commented 6 years ago

No worries, I had to do some quick testing myself.

To answer your question: yes, it is deterministic. Note, though, that tSNE output can vary with the seed used, so phenograph clustering can be affected by using different tSNE seeds (if the tSNE coordinates are used as the input for phenograph clustering).
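For anyone who wants to repeat this kind of quick test, here is a minimal sketch in Python of how repeated runs could be compared. It is not cytofkit code: `cluster_fn` is a hypothetical stand-in for whatever clustering call you use, and the helper names are made up for illustration. Clusterings are compared as partitions, so two runs that only differ in how the clusters are numbered still count as identical.

```python
from typing import Callable, Hashable, List, Sequence


def canonical_labels(labels: Sequence[Hashable]) -> List[int]:
    """Renumber clusters by order of first appearance, so label names do not matter."""
    mapping = {}
    result = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        result.append(mapping[label])
    return result


def same_partition(a: Sequence[Hashable], b: Sequence[Hashable]) -> bool:
    """True if both label vectors group the events in exactly the same way."""
    return len(a) == len(b) and canonical_labels(a) == canonical_labels(b)


def is_repeatable(cluster_fn: Callable, data, n_runs: int = 5) -> bool:
    """Run the (hypothetical) clustering function n_runs times on the same data
    and report whether every run produced the same partition as the first."""
    first = cluster_fn(data)
    return all(same_partition(first, cluster_fn(data)) for _ in range(n_runs - 1))


if __name__ == "__main__":
    # Toy stand-in for a clustering call: split values by sign.
    toy_cluster = lambda values: ["pos" if v > 0 else "neg" for v in values]
    print(is_repeatable(toy_cluster, [-2.0, 1.5, 3.0, -0.1]))
```

A passing check only shows that the runs agree; on its own it cannot tell a truly deterministic algorithm apart from a randomized one that always starts from the same seed.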

SamGG commented 6 years ago

Hi Matt, that question interests me too. Could you clarify whether this phenograph implementation returns the same result because the seed is initialised with fixed values at some point, OR because the algorithm is deterministic per se? Additionally, would randomising the order of the events in the input files change the results? Sorry if those questions sound obvious to you, but the answers would help. Best.

lconde-ucl commented 6 years ago

Hi SamGG,

I think I can answer the second question: yes. Phenograph works on the original expression data, so randomizing the events of the input files would change the Phenograph results. (To generate random subsets of events with cytofkit, use random values for 'sampleSeed' in the 'cytof_exprsMerge' step.) Note, however, that the authors of Phenograph claim in their paper that it is very robust to subsampling, with reproducibility close to 90% when tested on random subsets and for different values of k. So even with random subsets you might find very similar clustering results.
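To put a number on "very similar" when comparing two runs, one option is the adjusted Rand index (ARI). The sketch below is a plain-Python illustration and rests on assumptions: ARI is my choice of agreement measure, not necessarily the one used in the Phenograph paper, and the two label vectors must refer to the same events in the same order (for random subsets, that means restricting both runs to the events they share).

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two clusterings of the same events.

    1.0 means identical partitions; values near 0 mean agreement no better
    than expected by chance.
    """
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two events: nothing to disagree on

    # Pairs of events placed together in both clusterings, and in each one alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything in one cluster
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    run_1 = [1, 1, 1, 2, 2, 2]
    run_2 = [5, 5, 5, 9, 9, 9]  # same grouping, different cluster numbers
    print(adjusted_rand_index(run_1, run_2))
```

Because ARI ignores the cluster numbering, it can be used directly on the outputs of two runs without matching cluster IDs first.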

As for your first question, I'm also looking forward to hearing what Matt thinks. But after trying to find the answer myself, and if I have to guess, I believe that there is no "Phenograph seed" implemented in cytofkit. As far as I know, there are only 3 seeds: one for expression merging/downsampling (sampleSeed), one for tSNE and one for flowSOM.

In the cytofkit implementation of Phenograph, however, I can see that 2 main functions are called: nn2 to find the nearest neighbours, and cluster_louvain to partition the graph and compute modularity. I'm not familiar at all with how the louvain algorithm works, but from what I've read it is not deterministic, so I am guessing that cluster_louvain, or some downstream function called by it, might be using an internal seed to ensure reproducibility. So again, I'm just guessing here, but I believe it's not that cytofkit per se sets a seed in its Phenograph implementation, but that there is an internal seed somewhere in the louvain step which makes it deterministic. Unless this seed is exposed, Phenograph will behave deterministically for the same events and value of k. But I'm looking forward to hearing what Matt or others think!
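To illustrate the general mechanism I am guessing at (and only that), here is a toy Python sketch of a randomized graph clustering step whose output repeats exactly once its random number generator starts from a fixed seed. This is a simple label propagation, NOT the louvain algorithm and not igraph or cytofkit code; the function name, the `seed` parameter and the example graph are all made up for illustration, and nothing here says how cluster_louvain actually handles randomness.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def toy_label_propagation(adjacency: Dict[int, List[int]],
                          seed: Optional[int] = None,
                          n_sweeps: int = 10) -> List[int]:
    """Toy randomized clustering: nodes are visited in random order and adopt
    the most common label among their neighbours, with random tie-breaks."""
    rng = random.Random(seed)  # seed=None draws the seed from system entropy
    labels = {node: node for node in adjacency}  # every node starts alone
    nodes = sorted(adjacency)
    for _ in range(n_sweeps):
        rng.shuffle(nodes)  # random visiting order
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                continue
            counts = Counter(labels[nb] for nb in neighbours)
            best = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == best)
            labels[node] = rng.choice(candidates)  # random tie-break
    return [labels[node] for node in sorted(adjacency)]


if __name__ == "__main__":
    # Hypothetical graph: two triangles joined by a single edge (2-3).
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    # A fixed "internal" seed: every call returns exactly the same labels.
    print(toy_label_propagation(graph, seed=42) == toy_label_propagation(graph, seed=42))
    # No fixed seed: the result is allowed to differ from call to call.
    print(toy_label_propagation(graph), toy_label_propagation(graph))
```

If a library hard-codes such a seed and never exposes it, a user running it repeatedly on the same input would see identical output every time, which would be indistinguishable from a deterministic algorithm in that test.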

Best, Lucia

SamGG commented 6 years ago

Many thanks Lucia!

MattMyint commented 6 years ago

Hi @SamGG,

@lconde-ucl's comment actually contains a good deal of what I've found.

I'm not too familiar with what's under the hood of Phenograph yet, but when I inspected the phenograph implementation, there was no apparent option for setting a seed.

Looking into the igraph code (but not too in-depth), I believe there may be an internal seed