christopherjenness / DBCV

Python implementation of Density-Based Clustering Validation
MIT License

nan in result #8

Open alashkov83 opened 6 years ago

alashkov83 commented 6 years ago

In my program, your DBCV code returns NaN in some cases (sklearn's Calinski-Harabasz and Silhouette indexes work fine with the same data: 3-dimensional, about 200-1000 points).

christopherjenness commented 6 years ago

Can you provide the data?


alashkov83 commented 6 years ago

I understand now. For data where the number of labels is 1 (one cluster, no noise), DBCV returns NaN, whereas sklearn's Calinski-Harabasz and Silhouette indexes raise ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
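
A minimal sketch of the kind of guard sklearn applies could look like this; the helper name `check_number_of_labels` is hypothetical, chosen to mirror sklearn's error message:

```python
import numpy as np

def check_number_of_labels(labels, n_samples):
    # Hypothetical guard: raise instead of letting a degenerate
    # clustering (a single cluster) propagate NaN through the score.
    n_labels = len(set(labels))
    if not 1 < n_labels < n_samples:
        raise ValueError(
            "Number of labels is %d. Valid values are 2 to n_samples - 1 "
            "(inclusive)" % n_labels
        )

labels = np.zeros(10, dtype=int)  # one cluster, no noise
try:
    check_number_of_labels(labels, len(labels))
except ValueError as err:
    print("refused:", err)
```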

christopherjenness commented 6 years ago

It seems many people misunderstand this. I will update when I have time.


alashkov83 commented 6 years ago

I thought the problem was only with the labels. After I added this check:

if len(set(labels)) < 2 or len(set(labels)) > len(labels) - 1:
    raise ValueError("Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)")

some of the NaN values were replaced by exceptions, but not all! Some NaN remained. In the terminal I got:

test_of_nan.py:53: RuntimeWarning: divide by zero encountered in double_scalars
  core_dist = (numerator / (n_neighbors)) ** (-1 / n_features)
test_of_nan.py:198: RuntimeWarning: invalid value encountered in double_scalars
  cluster_validity = numerator / denominator
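
For what it's worth, the first warning can be reproduced in isolation: if every neighbor coincides with the point, filtering out zero distances leaves an empty vector, the numerator sums to 0.0, and 0.0 raised to a negative power is a division by zero. A minimal sketch, assuming that is the path taken:

```python
import numpy as np

n_features = 3
# All neighbors are duplicates of the point, so every distance is 0
# and filtering out the zeros leaves an empty vector.
distance_vector = np.array([0.0, 0.0, 0.0])
distance_vector = distance_vector[distance_vector != 0]

numerator = ((1 / distance_vector) ** n_features).sum()  # sum over [] -> 0.0
with np.errstate(divide="ignore"):
    core_dist = (numerator / 3) ** (-1 / n_features)     # 0.0 ** negative -> inf

print(numerator, core_dist)
```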

The data for verification and the verification script can be found in the attachment.

nan_error.zip

alashkov83 commented 6 years ago

I changed the code in this function:

import numpy as np
from scipy.spatial.distance import cdist

def _core_dist(point, neighbors):
    """
    Computes the core distance of a point.
    Core distance is the inverse density of an object.
    Args:
        point (np.array): array of dimensions (n_features,)
            point to compute core distance of
        neighbors (np.ndarray): array of dimensions (n_neighbors, n_features):
            array of all other points in object class
    Returns: core_dist (float)
        inverse density of point
    """
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[0]  # axis 0 indexes the neighbors

    distance_vector = cdist(point.reshape(1, -1), neighbors)
    distance_vector = distance_vector[distance_vector != 0]  # this was the problem: in some cases it returns []
    if len(distance_vector) != 0:
        numerator = ((1 / distance_vector) ** n_features).sum()
        core_dist = (numerator / n_neighbors) ** (-1 / n_features)
    else:
        core_dist = 0.0
    return core_dist

But I'm not sure this code is correct! core_dist = 0 implies density = inf.

joachimpoutaraud commented 1 year ago

I have faced the same issue after scaling my features with StandardScaler() before computing the DBCV score. The problem was that the range of values in distance_vector = distance_vector[distance_vector != 0] was too large. Consequently, when computing numerator = ((1 / distance_vector) ** n_features).sum(), the value was too small and was rounded to 0.0 by NumPy. I managed to solve this by converting the distance_vector variable to np.float128() first.
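
To illustrate the underflow (and since np.float128 is platform-dependent, e.g. unavailable in Windows builds of NumPy), here is a sketch of the failure and of a log-space workaround. The distances and dimensionality below are made up for illustration:

```python
import numpy as np

n_features = 50                       # hypothetical high dimensionality
distance_vector = np.full(10, 2e7)    # large scaled distances (assumed)

# float64 underflow: (1 / 2e7) ** 50 is about 1e-365, which rounds to 0.0
numerator = ((1.0 / distance_vector) ** n_features).sum()
print(numerator)  # 0.0

# Platform-independent alternative: do the same computation in log space.
log_terms = -n_features * np.log(distance_vector)
log_numerator = np.logaddexp.reduce(log_terms)
core_dist = np.exp((-1.0 / n_features)
                   * (log_numerator - np.log(len(distance_vector))))
print(core_dist)  # ~2e7: with all distances equal to d, core_dist is d
```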

cjuracek-tess commented 1 week ago

Not sure if this implementation is still being maintained, but I have faced NaN as well. This is due to cluster sizes of 2, yielding a division by zero (n_neighbors - 1 == 0).
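
Assuming the core-distance denominator is n_neighbors - 1 as described, a 2-point cluster reduces to this sketch (the numerator value is arbitrary, for illustration only):

```python
import numpy as np

n_features = 3
n_neighbors = 1               # the only other point in a 2-point cluster
numerator = np.float64(4.0)   # arbitrary positive value for illustration

with np.errstate(divide="ignore"):
    # 4.0 / 0 -> inf, and inf ** (negative power) collapses to 0.0;
    # a zero core distance then produces inf/NaN further downstream.
    core_dist = (numerator / (n_neighbors - 1)) ** (-1 / n_features)

print(core_dist)  # 0.0
```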

I'm not sure what a correct fix should look like. Two ideas: