Closed cjuracek-tess closed 1 week ago
Hi,
Could you show the Exception log that shows after the crash?
Sure. This is with a minimum cluster size of 3 and X.shape == (6342, 384). I get the following runtime warnings:
/usr/local/lib/python3.10/site-packages/dbcv/core.py:70: RuntimeWarning: overflow encountered in power
core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:75: RuntimeWarning: divide by zero encountered in power
np.power(core_dists, -1.0 / d, out=core_dists)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:70: RuntimeWarning: overflow encountered in power
core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:75: RuntimeWarning: divide by zero encountered in power
np.power(core_dists, -1.0 / d, out=core_dists)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:75: RuntimeWarning: divide by zero encountered in power
np.power(core_dists, -1.0 / d, out=core_dists)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:70: RuntimeWarning: overflow encountered in power
core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:70: RuntimeWarning: overflow encountered in power
core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1)
/usr/local/lib/python3.10/site-packages/dbcv/core.py:75: RuntimeWarning: divide by zero encountered in power
np.power(core_dists, -1.0 / d, out=core_dists)
Before the following ValueError is raised:
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/usr/local/lib/python3.10/site-packages/dbcv/core.py", line 106, in fn_density_sparseness
dsc = float(internal_edge_weights.max())
File "/root/.local/lib/python3.10/site-packages/numpy/core/_methods.py", line 41, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[24], line 7
4 X = np.row_stack(samples_df_unique['embedding'])
5 labels = samples_df_unique['cluster'].to_numpy()
----> 7 dbcv(X=X, y=labels)
File /usr/local/lib/python3.10/site-packages/dbcv/core.py:282, in dbcv(X, y, metric, noise_id, check_duplicates, n_processes, enable_dynamic_precision, bits_of_precision, use_original_mst_implementation)
273 fn_density_sparseness_ = functools.partial(
274 fn_density_sparseness,
275 d=d,
276 enable_dynamic_precision=enable_dynamic_precision,
277 use_original_mst_implementation=use_original_mst_implementation,
278 )
280 args = [(cls_ind, get_subarray(dists, inds_a=cls_ind)) for cls_ind in cls_inds]
--> 282 for cls_id, (dsc, internal_core_dists, internal_node_inds) in enumerate(ppool.starmap(fn_density_sparseness_, args)):
283 internal_objects_per_cls[cls_id] = internal_node_inds
284 internal_core_dists_per_cls[cls_id] = internal_core_dists
File /usr/local/lib/python3.10/multiprocessing/pool.py:375, in Pool.starmap(self, func, iterable, chunksize)
369 def starmap(self, func, iterable, chunksize=None):
370 '''
371 Like `map()` method but the elements of the `iterable` are expected to
372 be iterables as well and will be unpacked as arguments. Hence
373 `func` and (a, b) becomes func(a, b).
374 '''
--> 375 return self._map_async(func, iterable, starmapstar, chunksize).get()
File /usr/local/lib/python3.10/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
772 return self._value
773 else:
--> 774 raise self._value
ValueError: zero-size array to reduction operation maximum which has no identity
Update: This seems to be related to the dimensionality of the data:
d=10 completes with no warnings.
RuntimeWarnings begin to appear at around d=100.
d=330 still yields RuntimeWarnings, but no errors are raised.
Do you know if dbcv is appropriate for high-dimensional data such as this?
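For reference, the two np.power lines quoted in the warnings are enough to reproduce this float64 behaviour in isolation. The following is a minimal sketch, not the library's code: the function names, the `(points, k)` layout of `dists`, and the division by `k` are assumptions made for illustration, and the log-space variant is a swapped-in alternative that assumes strictly positive distances.

```python
import numpy as np

def core_dists_direct(dists, d):
    # Same two np.power steps as in the quoted warnings. Here `dists` holds
    # each point's distances to its k neighbours (hypothetical layout).
    k = dists.shape[-1]
    # Warnings are silenced only so the returned values are easy to inspect.
    with np.errstate(over="ignore", divide="ignore"):
        core = np.power(dists, -d).sum(axis=-1, keepdims=True) / k
        np.power(core, -1.0 / d, out=core)
    return core

def core_dists_logspace(dists, d):
    # Same quantity computed through logarithms (log-sum-exp), so the huge
    # intermediate powers are never materialised. Requires dists > 0.
    k = dists.shape[-1]
    a = -d * np.log(dists)                      # log of each dists**-d term
    m = a.max(axis=-1, keepdims=True)           # shift for numerical stability
    log_mean = m + np.log(np.exp(a - m).sum(axis=-1, keepdims=True)) - np.log(k)
    return np.exp(-log_mean / d)

near = np.full((1, 5), 0.1)   # distances below 1
far = np.full((1, 5), 10.0)   # distances above 1

print(core_dists_direct(near, 10))    # small d: finite result
print(core_dists_direct(near, 384))   # 0.1**-384 overflows, result collapses to 0
print(core_dists_direct(far, 384))    # 10**-384 underflows to 0, result becomes inf
print(core_dists_logspace(near, 384), core_dists_logspace(far, 384))
```

With distances below 1 the first power overflows, and with distances above 1 it underflows to zero so the second power divides by zero, which would be consistent with the two kinds of warning above.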
Additionally, reducing the dimension does not solve the problem for a minimum cluster size of 2.
Thanks!
From my understanding, this issue arises when there are no internal edges in the cluster's graph, so the array of internal edge weights is empty.
I'll take a look at how the MATLAB implementation handles such cases as soon as possible.
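The failure mode itself is easy to reproduce outside dbcv: calling `.max()` on an empty NumPy array raises the same ValueError as in the traceback. Below is a minimal sketch of a guard. The helper name and the `fallback` value are hypothetical and are not necessarily what either the MATLAB implementation or the package does.

```python
import numpy as np

def max_internal_edge_weight(internal_edge_weights, fallback=0.0):
    # Hypothetical helper: return the largest internal edge weight, or
    # `fallback` (an assumed placeholder value) when there are no edges.
    weights = np.asarray(internal_edge_weights, dtype=float)
    if weights.size == 0:
        return float(fallback)
    return float(weights.max())

# An unguarded reduction over an empty array raises ValueError.
try:
    np.array([]).max()
    raised = False
except ValueError:
    raised = True

print(raised)
print(max_internal_edge_weight([]))
print(max_internal_edge_weight([0.2, 0.7]))
```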
I appreciate the prompt responses!
Hi, @cjuracek-tess.
I updated the main branch with a bugfix for this edge case, following the approach from the original MATLAB implementation.
Please update your package version.
If the problem persists or if you find any other issue, please let me know.
Hello,
Is there a minimum required cluster size? On my dataset, dbcv crashes with a minimum cluster size of 2 or 3, but works for sizes of 5 and above.
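To narrow this down, it can help to list which clusters fall below a given size before calling dbcv. A minimal sketch follows. The helper name is hypothetical, `noise_id=-1` is an assumed noise label (the dbcv signature has a `noise_id` parameter, but its default is not shown here), and the threshold of 5 comes only from the observation above, not from any documented requirement.

```python
import numpy as np

def small_clusters(labels, min_size, noise_id=-1):
    # Hypothetical helper: map cluster id -> size for every non-noise
    # cluster with fewer than `min_size` members.
    labels = np.asarray(labels)
    ids, counts = np.unique(labels[labels != noise_id], return_counts=True)
    return {int(i): int(c) for i, c in zip(ids, counts) if c < min_size}

example_labels = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2, -1]
print(small_clusters(example_labels, min_size=5))
```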