FelSiq / DBCV

Efficient implementation in Python of Density-Based Clustering Validation (DBCV) metric, fully compatible with the original MATLAB implementation.
MIT License

Minimum Cluster Size Required #7

Closed cjuracek-tess closed 1 week ago

cjuracek-tess commented 3 weeks ago

Hello,

Is there a minimum cluster size required? On my dataset, this crashes for a minimum cluster size of 2 or 3, but works for sizes of 5+.

FelSiq commented 3 weeks ago

Hi,

Could you share the exception log from the crash?

cjuracek-tess commented 3 weeks ago

Sure. This is with a minimum cluster size = 3 and X.shape == (6342, 384). I get the following runtime warnings:

/usr/local/lib/python3.10/site-packages/dbcv/core.py:70: RuntimeWarning: overflow encountered in power
  core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:75: RuntimeWarning: divide by zero encountered in power
  np.power(core_dists, -1.0 / d, out=core_dists)

(the two warnings above repeat several more times)

Before the following ValueError is raised:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/lib/python3.10/site-packages/dbcv/core.py", line 106, in fn_density_sparseness
    dsc = float(internal_edge_weights.max())
  File "/root/.local/lib/python3.10/site-packages/numpy/core/_methods.py", line 41, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[24], line 7
      4 X = np.row_stack(samples_df_unique['embedding'])
      5 labels = samples_df_unique['cluster'].to_numpy()
----> 7 dbcv(X=X, y=labels)

File /usr/local/lib/python3.10/site-packages/dbcv/core.py:282, in dbcv(X, y, metric, noise_id, check_duplicates, n_processes, enable_dynamic_precision, bits_of_precision, use_original_mst_implementation)
    273 fn_density_sparseness_ = functools.partial(
    274     fn_density_sparseness,
    275     d=d,
    276     enable_dynamic_precision=enable_dynamic_precision,
    277     use_original_mst_implementation=use_original_mst_implementation,
    278 )
    280 args = [(cls_ind, get_subarray(dists, inds_a=cls_ind)) for cls_ind in cls_inds]
--> 282 for cls_id, (dsc, internal_core_dists, internal_node_inds) in enumerate(ppool.starmap(fn_density_sparseness_, args)):
    283     internal_objects_per_cls[cls_id] = internal_node_inds
    284     internal_core_dists_per_cls[cls_id] = internal_core_dists

File /usr/local/lib/python3.10/multiprocessing/pool.py:375, in Pool.starmap(self, func, iterable, chunksize)
    369 def starmap(self, func, iterable, chunksize=None):
    370     '''
    371     Like `map()` method but the elements of the `iterable` are expected to
    372     be iterables as well and will be unpacked as arguments. Hence
    373     `func` and (a, b) becomes func(a, b).
    374     '''
--> 375     return self._map_async(func, iterable, starmapstar, chunksize).get()

File /usr/local/lib/python3.10/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

ValueError: zero-size array to reduction operation maximum which has no identity
cjuracek-tess commented 3 weeks ago

Update: this seems to be related to the dimensionality of the data.

Do you know if dbcv is appropriate for high-dimensional data such as this?
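For context (my own reading of the warnings, not something confirmed by the maintainer): `core.py` raises pairwise distances to the power `-d`, and with `d = 384` even moderately small distances exceed the float64 range (max ~1.8e308). A minimal sketch of that failure mode:

```python
import numpy as np

# With d = 384 dimensions, dist**(-d) overflows float64 as soon as
# dist drops below roughly 0.158 (i.e. 10**(-308/384)).
d = 384
small, moderate = np.array([0.1]), np.array([0.5])

with np.errstate(over="ignore"):
    print(np.power(small, -d))     # [inf] -> the overflow seen in the warnings
    print(np.power(moderate, -d))  # finite: 0.5**-384 == 2**384, about 3.9e115
```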

cjuracek-tess commented 3 weeks ago

Additionally, reducing the dimension does not solve the problem for a minimum cluster size of 2.
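One hypothesis for why size-2 clusters fail regardless of dimensionality: a 2-point cluster's minimum spanning tree is a single edge whose two endpoints are both leaves, and DBCV's density sparseness reduces over edges between internal (non-leaf) nodes, which is an empty set here. A sketch of that situation (variable names are illustrative, not from `core.py`):

```python
import numpy as np

# In a 2-node MST every node has degree 1, so there are no internal
# (degree >= 2) nodes and therefore no internal edges.
degrees = np.array([1, 1])            # node degrees in a 2-point cluster's MST
internal_nodes = np.flatnonzero(degrees > 1)
internal_edge_weights = np.array([])  # edges between internal nodes: none

print(internal_nodes.size)            # 0
# internal_edge_weights.max()         # would raise the zero-size reduction error
```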

FelSiq commented 3 weeks ago

Thanks!

From my understanding, this issue arises when there are no internal edges in the cluster graph.

I'll take a look at how the MATLAB implementation handles such cases as soon as possible.
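For what it's worth, a guard along these lines (hypothetical, not necessarily how the MATLAB implementation handles it) would avoid the zero-size reduction by falling back to a default value:

```python
import numpy as np

def safe_density_sparseness(internal_edge_weights: np.ndarray) -> float:
    # max() over an empty array has no identity element and raises a
    # ValueError, so return 0.0 when the cluster MST has no internal edges.
    if internal_edge_weights.size == 0:
        return 0.0
    return float(internal_edge_weights.max())

print(safe_density_sparseness(np.array([])))          # 0.0
print(safe_density_sparseness(np.array([0.3, 0.7])))  # 0.7
```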

cjuracek-tess commented 3 weeks ago

I appreciate the prompt responses!

FelSiq commented 1 week ago

Hi, @cjuracek-tess.

I updated the main branch with a bugfix for this edge case, following the approach from the original MATLAB implementation.

Please update your package version.

If the problem persists or if you find any other issue, please let me know.