ShobiStassen / PARC


np.reshape error if too many edges are pruned #5

Closed ezunder closed 4 years ago

ezunder commented 4 years ago

Fantastic clustering tool! I'm very happy to be able to use this; it's amazing how fast and well it works. Some of my datasets produce the following error, though:

```
>>> p = parc.PARC(data)
>>> p.run_PARC()
input data has shape 600 (samples) x 12 (features)
commencing local pruning based on minowski metric at 2 s.dev above mean
commencing global pruning
commencing community detection
0.01430511474609375
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    p.run_PARC()
  File "C:\Users\ezund\Anaconda3\lib\site-packages\parc\_parc.py", line 431, in run_PARC
    self.run_subPARC()
  File "C:\Users\ezund\Anaconda3\lib\site-packages\parc\_parc.py", line 241, in run_subPARC
    PARC_labels_leiden = np.reshape(PARC_labels_leiden, (n_elements, 1))
  File "C:\Users\ezund\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 292, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "C:\Users\ezund\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 56, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 599 into shape (600,1)
```

This was straightforward to track down from the traceback. The problem occurs when pruning removes every edge attached to some vertices: because the G_sim graph is constructed from the edge list, those isolated vertices are never added to the graph, so the resulting label array is shorter than the number of input samples. I've fixed this by explicitly adding all the vertices during graph construction in _parc.py (see the sketch below), and I don't think this will cause any problems elsewhere in the code. Perhaps I should also adjust parameters to reduce the level of pruning, but making the code fail-safe here still seems like a good idea. I've made a fork and pull request for this small fix, and I'm happy to provide an example dataset that reproduces the np.reshape error if that would be useful.
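For illustration, here is a minimal sketch of the failure mode and the fix, assuming the graph is built with python-igraph; the vertex count and edge list below are placeholders, not the actual code from _parc.py:

```python
import igraph as ig

n_elements = 600  # number of input samples
# Hypothetical pruned edge list: vertices 3..599 have lost all their edges.
edge_list = [(0, 1), (1, 2)]

# Constructing the graph from the edge list alone infers the vertex count
# from the largest vertex id that appears, so fully pruned vertices vanish
# and the downstream label array comes up short of n_elements.
G_bad = ig.Graph(edges=edge_list)
print(G_bad.vcount())  # 3, not 600

# Fix: declare every vertex explicitly, so isolated vertices survive.
G_sim = ig.Graph(n=n_elements, edges=edge_list)
print(G_sim.vcount())  # 600
```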

Eli

ShobiStassen commented 4 years ago

Hi Eli, thanks for your PR; it has been accepted and merged.

Regarding your question on the level of pruning: we have mostly tested on datasets of 1e4 cells or more. A value of 0.15-0.25 for the parameter "jac_std_global" is typically a safe but still significant level of global pruning (discarding roughly 50-60% of edges). At jac_std_global = 0.1 you are pruning away about 70-80% of the edges in the entire graph, depending on the dataset, but even at that level I find the clusters are reasonable. This all assumes you keep the local pruning parameter "dist_std_local" fairly high (a value of 1 or more), or skip the local-pruning stage entirely to save a few minutes of runtime on extremely large datasets. I will try to upload some analysis of the effects of tuning the pruning parameters.
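For concreteness, a minimal sketch of setting these parameters, assuming the PARC constructor accepts the keyword names used above; the placeholder data is hypothetical:

```python
import numpy as np
import parc

# Stand-in for the 600 (samples) x 12 (features) dataset from the report above.
data = np.random.rand(600, 12)

p = parc.PARC(
    data,
    jac_std_global=0.15,  # global pruning: 0.15-0.25 discards roughly 50-60% of edges
    dist_std_local=1,     # local pruning: keeping this at 1 or more is the regime
                          # the advice above assumes
)
p.run_PARC()
```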