ShobiStassen / PARC


Tips for replicating the performance on the lung cell clustering #11

Closed tom-b closed 4 years ago

tom-b commented 4 years ago

Hi,

I'm working to replicate the performance results from the paper on the 1.1 million cells from the lung data, but I'm getting very poor performance in the call to leidenalg. I stuck a few timing statements into the code to find the bottleneck and am seeing runtimes well into the two-hour range for leidenalg on this dataset.

I am testing with the snippet from the main page:

import pandas as pd
import parc

X = pd.read_csv('./LungData.txt', header=None).values.astype('float')
y = list(pd.read_csv('./LungData_annotations.txt', header=None)[0])  # list of cell-type annotations

# run PARC on 1.1M and 70K cells
parc1 = parc.PARC(X, true_label=y)
parc1.run_PARC()  # run the clustering
parc_labels = parc1.labels

I also tried with a simple normalization of the data with:

from sklearn import preprocessing

normalized_X = preprocessing.normalize(X)

and see the same runtimes.

I am using GCP instances - c2-standard-16 (16 vCPUs, 64 GB memory) with what reports as 3.2 GHz CPUs on CentOS - but the almost order-of-magnitude difference makes me think I have missed some configuration or parameter tuning.

My results show:

input data has shape 1113369 (samples) x 26 (features)
time elapsed make_knn_struct 28.5 seconds
commencing global pruning
commencing community detection
partition type MVP
time elapsed leiden 6713.5 seconds
list of cluster labels and populations
time elapsed run_subPARC 6834.2 seconds

Based on the partition type, this seems to point at:

partition = leidenalg.find_partition(G_sim, leidenalg.ModularityVertexPartition, weights='weight', n_iterations=self.n_iter_leiden, seed=self.random_seed)

as unexpectedly long-running.
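
For reference, the timing statements I added were nothing fancier than wall-clock checks around the suspect call (the find_partition arguments are copied from parc.py; the time import and the print are mine, purely illustrative):

import time

t0 = time.time()
partition = leidenalg.find_partition(G_sim, leidenalg.ModularityVertexPartition, weights='weight', n_iterations=self.n_iter_leiden, seed=self.random_seed)
print('time elapsed leiden %.1f seconds' % (time.time() - t0))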

I've also noticed that the call to leidenalg pegs a single CPU at 100% for the whole duration - I'm a little unsure whether leidenalg should (like HNSW) be able to use multiple CPUs?

Any suggestions? Thanks.

ShobiStassen commented 4 years ago

hi there,

Thank you so much for bringing this to my attention. Fortunately there is a very simple fix: install the previous version of leidenalg. It seems that leidenalg updated their pip release a month ago, and for some reason (which I will now investigate) it is drastically slower. Luckily you can just uninstall your current leidenalg and install version 0.7.0:

pip uninstall leidenalg
pip install leidenalg==0.7.0

Please let me know if this solves the issue for you!
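
(To double-check that the downgrade took effect, a standard pip command like the one below will report the installed version.)

pip show leidenalg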

Tips:

1) For very large datasets I suggest changing the jac_weighted_edges parameter to False, as this also offers some speed-up (though the main issue is the leidenalg version).

2) If you are interested in visualizing very large (>1M) datasets, I recently added a feature to PARC which allows you to apply UMAP to the HNSW graph. This offers huge memory savings as well as moderate time savings. You can check this out in the Jupyter NB (which is still a work-in-progress, but should give you an idea). A sketch of both tips follows below.
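
A minimal sketch combining both tips (the jac_weighted_edges parameter also appears in my mouse-neuron example below; the knngraph_full / run_umap_hnsw calls follow the notebook, which is still in flux, so treat the exact names as provisional):

import parc

# Tip 1: skip Jaccard edge weights for a speed-up on very large datasets
p1 = parc.PARC(X, true_label=y, jac_weighted_edges=False)
p1.run_PARC()

# Tip 2: run UMAP directly on the HNSW k-NN graph that PARC already built,
# instead of recomputing neighbors inside UMAP (saves memory on >1M cells)
graph = p1.knngraph_full()
X_umap = p1.run_umap_hnsw(X, graph)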

Here is a rather verbose output, but you can see that the runtimes (using leidenalg 0.7.0) are in line with the reported numbers:

current time is: Tue Jun 2 11:48:29 2020
input to parc is (1113369, 24)
unweighted edges - this offers slight speed up over weighted
jac pruning level set at: median
time elapsed knn 53.78 sec (K=30)
number of k-nn is 30, too big factor is 0.4, small pop is 10
shapes of neigh and dist array (1113369, 30) (1113369, 30)
keep all is True (automatically determined)
percentage of edges KEPT 50%
time to prune jaccard: 6 secs
time to remake and simplify igraph: 16 secs
call leiden: unweighted graph. 5 iterations Q= 0.8757695136417326
time elapsed on leiden call 161
There are 1668 clusters, proceed to handle singletons/small clusters
number of small clusters before initial small_pop handling 1649
small pop exist True
small and big round took 4 seconds
final labels allocation {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}
pop of big list is of length and populations 19 [170518, 133796, 99979, 97756, 94627, 91372, 71878, 64127, 62969, 59007, 54055, 52696, 31536, 13179, 10891, 2517, 1967, 435, 64]
PARC total time elapsed 297.79 seconds

ShobiStassen commented 4 years ago

I'm also posting the output of PARC on the 1.3 million mouse neuron dataset (using the first 20 PCs). It completed in 13 minutes.

import time
import parc

print('PARC started at:', time.ctime())
p1 = parc.PARC(data_mat, keep_all_local_dist=True, jac_weighted_edges=False)
p1.run_PARC()
print('PARC ends', time.ctime())

I should mention that this was run on a machine with 126 GB RAM and 8x Intel Xeon W-2123 CPUs @ 3.60 GHz.

PARC started at: Tue Jun 2 12:17:03 2020
input data has shape 1300000 (samples) x 20 (features)
knn struct was not available, so making one
commencing global pruning
commencing community detection
partition type MVP
list of cluster labels and populations 40 [(0, 100645), (1, 99736), (2, 84765), (3, 81285), (4, 74936), (5, 68916), (6, 66356), (7, 64582), (8, 64537), (9, 60377), (10, 60447), (11, 39137), (12, 38387), (13, 38327), (14, 34935), (15, 27971), (16, 27151), (17, 26221), (18, 24848), (19, 24662), (20, 22815), (21, 20918), (22, 18319), (23, 17540), (24, 15605), (25, 15123), (26, 12712), (27, 11732), (28, 11674), (29, 10528), (30, 8432), (31, 8145), (32, 5069), (33, 4077), (34, 3254), (35, 2437), (36, 1753), (37, 1020), (38, 480), (39, 146)]
time elapsed 763.6 seconds
PARC ends Tue Jun 2 12:29:46 2020

tom-b commented 4 years ago

Hi,

Thanks! Switching to leidenalg 0.7.0 definitely fixed the running time, and I was able to replicate the results from the paper. Now I can work on better understanding the pruning parts of PARC!

I see your open issue with the leidenalg developers - thanks for reaching out to them. It looks like there is already some good insight into which changes in their codebase may have impacted performance.