alashkov83 opened this issue on Sep 14, 2018:

In my program your DBCV code returns nan in some cases, although sklearn's Calinski-Harabasz and Silhouette indices work well with the same data (3-dimensional, about 200-1000 points).

Can you provide the data?
I understood. For data where the number of labels is 1 (one cluster without noise), DBCV returns nan, while sklearn's Calinski-Harabasz and Silhouette indices raise ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
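A minimal repro sketch of that difference, assuming the DBCV(X, labels) entry point from this repo's README and scikit-learn's metrics:

    import numpy as np
    from sklearn.metrics import silhouette_score
    from DBCV import DBCV

    X = np.random.rand(200, 3)         # 3-dimensional data, as in the report above
    labels = np.zeros(200, dtype=int)  # a single cluster, no noise

    print(DBCV(X, labels))             # reported above to return nan
    silhouette_score(X, labels)        # raises ValueError: Number of labels is 1 ...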
It seems many people misunderstand this. I will update when I have time.
At first I thought the problem was only with the labels. After I added this check:

    if len(set(labels)) < 2 or len(set(labels)) > len(labels) - 1:
        raise ValueError("Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)")

some of the nan values were replaced by exceptions, but not all! Some nan remained. In the terminal I got:
test_of_nan.py:53: RuntimeWarning: divide by zero encountered in double_scalars
core_dist = (numerator / (n_neighbors)) ** (-1 / n_features)
test_of_nan.py:198: RuntimeWarning: invalid value encountered in double_scalars
cluster_validity = numerator / denominator
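A minimal sketch of how that first warning can arise (my reconstruction: duplicate points give zero distances, so the filtered vector is empty and the empty sum is 0.0):

    import numpy as np

    n_features, n_neighbors = 3, 5
    distance_vector = np.array([])  # all distances were 0 and were filtered out
    numerator = ((1 / distance_vector) ** n_features).sum()  # sum over empty array -> 0.0
    core_dist = (numerator / n_neighbors) ** (-1 / n_features)  # 0.0 ** negative -> RuntimeWarning, inf
    print(core_dist)  # inf; an inf propagating into numerator / denominator then yields nan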
The data for verification and the verification script can be found in the attachment.
I changed the code in this function:
def _core_dist(point, neighbors):
    """
    Computes the core distance of a point.
    Core distance is the inverse density of an object.

    Args:
        point (np.array): array of dimensions (n_features,)
            point to compute core distance of
        neighbors (np.ndarray): array of dimensions (n_neighbors, n_features)
            array of all other points in object class

    Returns:
        core_dist (float): inverse density of point
    """
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[1]
    distance_vector = cdist(point.reshape(1, -1), neighbors)
    print(distance_vector)  # debug output
    distance_vector = distance_vector[distance_vector != 0]  # this was the problem: in some cases it returns []
    if len(distance_vector) != 0:
        numerator = ((1 / distance_vector) ** n_features).sum()
        core_dist = (numerator / n_neighbors) ** (-1 / n_features)
    else:
        core_dist = 0.0
    return core_dist
But I'm not sure this code is correct! core_dist = 0 implies density = inf.
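One alternative sketch (my suggestion, not a confirmed fix): instead of implying infinite density with core_dist = 0.0, signal the degenerate case explicitly inside _core_dist:

    if len(distance_vector) == 0:
        # every neighbor coincides with the point, so the inverse density is undefined
        return np.nan

A caller could then decide whether to treat such clusters as invalid or skip them.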
I have faced the same issue after scaling my features using StandardScaler() before computing the DBCV score. The problem was that the range of values in distance_vector = distance_vector[distance_vector != 0] was too large. Consequently, when computing numerator = ((1 / distance_vector) ** n_features).sum(), the value was too small and was rounded to 0.0 by numpy. I managed to solve this by converting the distance_vector variable to np.float128 first.
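A sketch of that workaround inside the _core_dist function shown above (note that np.float128 is platform-dependent; numpy does not provide it on Windows builds):

    distance_vector = distance_vector[distance_vector != 0].astype(np.float128)
    numerator = ((1 / distance_vector) ** n_features).sum()  # summed in extended precision
    core_dist = float((numerator / n_neighbors) ** (-1.0 / n_features))  # back to Python float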
Not sure if this implementation is still being maintained, but I have faced nan as well. It is due to cluster sizes of 2, which yield a division by zero (n_neighbors - 1 == 0).
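A minimal repro sketch for that case, again assuming the DBCV(X, labels) entry point (cluster 0 has exactly 2 points):

    import numpy as np
    from DBCV import DBCV

    X = np.array([[0.0, 0.0], [1.0, 1.0],                     # cluster 0: exactly 2 points
                  [10.0, 10.0], [11.0, 11.0], [10.5, 10.5]])  # cluster 1: 3 points
    labels = np.array([0, 0, 1, 1, 1])
    print(DBCV(X, labels))  # reported to yield nan via the n_neighbors - 1 == 0 division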
I'm not sure what a correct fix should look like. 2 ideas: