On stability of null clusters & cell cycle awareness

BiotechPedro commented 9 months ago

Hi Dongyuan :D

Congrats on this clever solution to the double dipping issue!

I've read that ClusterDE is intended for '1vs1' differential expression rather than '1vsALL', which makes total sense to me. However, when the same analysis pipeline is run on the null dataset to create two artificial clusters, those clusters will be sensitive to randomness. In other words, since the two clusters are artificial, cells at the edges could belong to cluster A with one seed and to cluster B with another seed. Have you thought about implementing some kind of cluster stability measure for creating stable null clusters? Would it make sense to repeat the process a few times and then discard the "volatile" cells to create purer clusters?
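To make the "volatile cells" idea concrete, here is a minimal sketch in plain Python. It assumes the clustering has already been repeated with different seeds and that each run yields a 0/1 label per cell; the function names, the toy labels, and the 0.8 threshold are illustrative assumptions, not part of ClusterDE.

```python
from collections import Counter


def align_to_reference(labels, reference):
    """Flip a two-cluster labeling (0/1) if flipping agrees better with the reference.

    Cluster IDs are arbitrary across seeds, so runs must be aligned before comparing.
    """
    agree = sum(a == b for a, b in zip(labels, reference))
    if agree < len(labels) - agree:
        return [1 - a for a in labels]
    return list(labels)


def cell_stability(runs):
    """For each cell, the fraction of runs that assign it to its majority cluster."""
    reference = runs[0]
    aligned = [align_to_reference(run, reference) for run in runs]
    stability = []
    for i in range(len(reference)):
        votes = Counter(run[i] for run in aligned)
        stability.append(votes.most_common(1)[0][1] / len(aligned))
    return stability


def volatile_cells(runs, threshold=0.8):
    """Indices of cells whose stability falls below the (hypothetical) threshold."""
    return [i for i, s in enumerate(cell_stability(runs)) if s < threshold]


# Toy labels for 6 cells over 5 seeds (made-up data for illustration only).
runs = [
    [0, 0, 0, 1, 1, 1],  # seed 1
    [1, 1, 1, 0, 0, 0],  # seed 2: same partition, cluster IDs swapped
    [0, 0, 1, 1, 1, 1],  # seed 3: cell 2 switches cluster
    [0, 0, 0, 1, 1, 1],  # seed 4
    [0, 0, 1, 1, 1, 1],  # seed 5: cell 2 switches again
]
print(volatile_cells(runs))  # cell 2 is the only one below the threshold
```

Whether discarding such cells is desirable for the null data is a separate question; the sketch only shows how they could be identified.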

Another question I have is how to deal with the cell cycle effect. If I am comparing two original clusters obtained after regressing out the effect of the cell cycle, should I also regress out that effect when running the pipeline on the synthetic null data?

Best regards,

Pedro

SONGDONGYUAN1994 commented 9 months ago

Hi Pedro, Thank you very much for your interest in our work! Your two questions are crucial. I will provide my thoughts.

1) The randomness of the clusters in the null data. This is a good question. We are aware of the randomness in ClusterDE, although one of our supplementary figures shows that the DE genes are relatively robust. I do not think the clusters in the null data should be fixed, or even highly stable: randomness in clustering is itself a source of double dipping, so the clusters in the null data should reflect this randomness rather than stability. However, I do agree that we could aggregate the results from several synthetic null datasets to get more stable DE results (although this is not as easy as I thought; if you get any good results, I would greatly appreciate hearing about them).
2) Removal of cell cycle effects. It depends on what kind of variation you would like to keep in your synthetic null. For the cell cycle, I would suggest generating the null data from the data with the cell cycle effect already removed, since you are pretty sure that you do not want this variation in your final clusters. Unless you are worried that removing the cell cycle effect will have unexpected effects on the downstream clustering, you can generate the null data conditional on such preprocessing.
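One way to read "generate the null data from the cell-cycle-removed data" is to drop cell cycle genes before anything else, so that both the original analysis and the null generation see the same filtered matrix (this is the reading confirmed later in the thread). A minimal sketch, assuming a genes-by-cells count matrix stored as a list of rows; the helper name `drop_genes`, the short marker list, and the toy counts are illustrative assumptions, not ClusterDE functions or a complete cell cycle gene set.

```python
def drop_genes(gene_names, counts, genes_to_drop):
    """Remove rows of a genes-by-cells count matrix for the given genes.

    gene_names: list of gene symbols, one per row of `counts`.
    counts: list of rows, each row holding one gene's counts across cells.
    """
    drop = set(genes_to_drop)
    kept = [(g, row) for g, row in zip(gene_names, counts) if g not in drop]
    return [g for g, _ in kept], [row for _, row in kept]


# A few well-known cell cycle markers, used here only as an example list.
cell_cycle_genes = ["MKI67", "TOP2A", "PCNA", "MCM5"]

# Toy data: 4 genes x 3 cells (made-up counts).
gene_names = ["GAPDH", "MKI67", "CD3E", "TOP2A"]
counts = [
    [10, 12, 9],
    [0, 5, 1],
    [3, 0, 4],
    [1, 7, 0],
]

filtered_names, filtered_counts = drop_genes(gene_names, counts, cell_cycle_genes)
print(filtered_names)  # the filtered matrix is what both pipelines would start from
```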

Best, Dongyuan

BiotechPedro commented 9 months ago

Thank you, Dongyuan!

1) I agree that we want randomness rather than stability. Still, aggregating several synthetic null datasets to get more stable DE results would be an improvement. What approaches have you considered?

2) So, you would remove the cell cycle-related genes rather than regressing them out in both the original and the synthetic data, right?

Best,

Pedro

SONGDONGYUAN1994 commented 9 months ago

Hi Pedro,

1) Two ideas: (a) Clipper (the FDR control method) can actually take multiple null scores, but I did not find that it helped in ClusterDE. (b) Just run ClusterDE a few times and use the "average" or "majority vote" DE genes.
2) Yes, I would remove the cell cycle-related genes from the very beginning rather than regressing them out in both the original and the synthetic data.
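The "majority vote" idea could look roughly like this. The sketch assumes each ClusterDE run (with a different synthetic null) has already produced a set of DE gene names; the function name, the `min_fraction` parameter, and the gene names are hypothetical. It makes no claim about the FDR of the aggregated list.

```python
from collections import Counter


def majority_vote(de_gene_sets, min_fraction=0.5):
    """Keep genes called DE in more than `min_fraction` of the runs."""
    n_runs = len(de_gene_sets)
    # Count each gene at most once per run.
    counts = Counter(gene for genes in de_gene_sets for gene in set(genes))
    return sorted(g for g, c in counts.items() if c / n_runs > min_fraction)


# Made-up DE gene sets from three hypothetical runs.
runs = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneD"},
]
print(majority_vote(runs))  # geneA (3/3) and geneB (2/3) pass; geneC and geneD (1/3) do not
```

Raising `min_fraction` makes the vote stricter; at values close to 1 it approaches the intersection of all runs.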

Best, Dongyuan

BiotechPedro commented 9 months ago

I suppose that ClusterDE has a low frequency of type I errors, so the results are almost equal when adjusting by either one or multiple null scores, right?

Thank you very much for all your insights, and congrats again on the method!

Best,

Pedro

SONGDONGYUAN1994 / ClusterDE

On stability of null clusters & cell cycle awareness #3