FelSiq / DBCV

Efficient implementation in Python of Density-Based Clustering Validation (DBCV) metric, fully compatible with the original MATLAB implementation.
MIT License

Duplicated samples have been found in X #2

Closed · MilanDeruelle closed this issue 8 months ago

MilanDeruelle commented 8 months ago

Hey, and thanks for this package. Since the other one available doesn't seem to follow the reference implementation and is no longer maintained, it's nice to see something being actively worked on.

I'm getting the duplicate error "Duplicated samples have been found in X." when running my code, even though the data has previously been de-duplicated using pandas.

I saw that the init function allows me to disable the duplicate check; however, since it's not documented, I'm wondering what the implications of doing so are.

When running with the duplicate check off, I do get a result, but I also get the warning: "overflow encountered in power core_dists = np.power(dists, -d).sum(axis=-1, keepdims=True) / (n - 1 + 1e-12)". Is this related, or something to worry about?

Thanks again and all the best!

FelSiq commented 8 months ago

Hi @MilanDeruelle

The duplicate checking is implemented as follows:

import numpy as np
import numpy.typing as npt
import sklearn.neighbors


def _check_duplicated_samples(X: npt.NDArray[np.float64], threshold: float = 1e-9):
    if X.shape[0] <= 1:
        return

    # Distance from each sample to its nearest other sample.
    nn = sklearn.neighbors.NearestNeighbors(n_neighbors=1)
    nn.fit(X)
    dists, _ = nn.kneighbors(return_distance=True)

    if np.any(dists < threshold):
        raise ValueError("Duplicated samples have been found in X.")

That means that samples whose Euclidean distance to their nearest neighbor is below 1e-9 are considered duplicates. Without looking at your code or data, it is impossible to tell whether this explains the issue, but I hope it provides some insight. I'll update the dbcv docstring to make this check clear to the user.
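If you want to see where the near-duplicates are coming from before disabling the check, you can run the same nearest-neighbor criterion on your own data. A quick sketch using scikit-learn directly (the helper name below is just for illustration and is not part of the package):

import numpy as np
import sklearn.neighbors


def count_near_duplicates(X, threshold: float = 1e-9) -> int:
    # Distance from each sample to its nearest *other* sample
    # (kneighbors with no query argument excludes each point itself).
    nn = sklearn.neighbors.NearestNeighbors(n_neighbors=1)
    nn.fit(X)
    dists, _ = nn.kneighbors(return_distance=True)
    return int(np.sum(dists < threshold))


rng = np.random.RandomState(0)
X = rng.randn(50, 4)
X_dup = np.vstack((X, X[:3]))  # append three exact copies

print(count_near_duplicates(X))      # 0
print(count_near_duplicates(X_dup))  # 6: each duplicated pair is flagged from both sides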

Regarding the overflow, I suggest you try activating dynamic floating-point precision when computing DBCV on your data, as follows:

dbcv.dbcv(..., enable_dynamic_precision=True, bits_of_precision=512)

This will slow down the execution a bit, but it should prevent the overflow that occurs with regular 64-bit precision. Computing the density function in high dimensions is a somewhat "fragile" operation, hence the support for large numbers of precision bits.
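To give a sense of why 64-bit floats break down here (a standalone illustration, not code from the package): a pairwise distance below 1 raised to the power -d grows without bound as the dimension d increases, and eventually exceeds the largest representable float64.

import numpy as np

# Largest value a 64-bit float can represent: ~1.8e308.
print(np.finfo(np.float64).max)

# 0.5 ** -2000 == 2 ** 2000 ~ 1e602, which is not representable in float64:
print(np.power(np.float64(0.5), -2000))  # inf, with an overflow RuntimeWarning

The dynamic-precision path computes these intermediate quantities with more bits, which is why it avoids the overflow at the cost of some speed.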

If you have any questions or another issue, please feel free to ask me.

MilanDeruelle commented 8 months ago

Hey, and thanks for the quick response. Do you have any insight into how computing with duplicated samples or too little precision might influence the result? According to the documentation, dynamic precision may come with a significant performance impact, and the calculation already takes around 2 minutes on my smallest dataset, which is about the limit of practicality for hyperparameter tuning.

FelSiq commented 8 months ago

Disclaimer: the tests presented below are quick assessments using artificially generated data. Conclusions may not hold for general distributions.

Testing the impact of duplicated examples on the metric score:

import dbcv
import numpy as np
import sklearn.datasets

X, y = sklearn.datasets.make_moons(n_samples=300, noise=0.10, random_state=1782, shuffle=True)

res = dbcv.dbcv(X, y)
print(res) # 0.9471829482518501

# Inserting 10 duplicated samples
X = np.vstack((X, X[-10:]))
y = np.hstack((y, y[-10:]))

res = dbcv.dbcv(X, y, check_duplicates=False)
print(res) # 0.9362759272491825

My conclusion is that the presence of duplicated samples does interfere with the result, but the effect may not be catastrophic.

Regarding precision, I believe the crucial factor is the embedding dimension. Since you're getting an overflow warning, I suspect you're working with high-dimensional data.

Given the following code:

import numpy as np
import dbcv

rng = np.random.RandomState(182)

X = rng.randn(3000, 5000)
y = rng.randint(0, 3, size=X.shape[0])
y[rng.random(size=y.size) <= 0.025] = -1  # mark roughly 2.5% of the points as noise

# Translate the center of mass of each class
X[y == 1] += np.full((1, X.shape[1]), fill_value=1)
X[y == 2] += np.full((1, X.shape[1]), fill_value=-2)

# Adding some random noise
X += 0.20 * rng.randn(X.shape[0], X.shape[1])

res = dbcv.dbcv(X, y, enable_dynamic_precision=False)
print(res) # 0.40635519185058705

res = dbcv.dbcv(X, y, enable_dynamic_precision=True)
print(res) # 0.4072063934947467

I would argue that the effects are somewhat similar to allowing duplicated data points: both act as sources of noise in the results, but they don't necessarily invalidate the metric estimation.

In your position, I would resample my data a few times, evaluate the DBCV metric in each sample, and assess its variability in the context of the application. If the results look acceptable, I would proceed further with the analysis.
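Something along these lines would do it (a rough sketch; the toy data below just stands in for your embeddings and labels, and subsampling without replacement avoids reintroducing duplicates):

import numpy as np
import sklearn.datasets

import dbcv

# Toy data standing in for your embeddings and cluster labels.
X, y = sklearn.datasets.make_moons(n_samples=300, noise=0.10, random_state=1782)

rng = np.random.RandomState(0)
scores = []

for _ in range(5):
    # Draw an 80% subsample without replacement, so no duplicates are introduced.
    idx = rng.choice(X.shape[0], size=int(0.8 * X.shape[0]), replace=False)
    scores.append(dbcv.dbcv(X[idx], y[idx]))

print(np.mean(scores), np.std(scores))  # mean and spread of the metric across subsamples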

Projecting the data into a smaller subspace before the clustering analysis is also a common solution to the limitations of density estimation in high dimensions.
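For example, with PCA (just one of several options; the number of components here is an arbitrary choice for illustration):

import numpy as np
import sklearn.decomposition

rng = np.random.RandomState(182)
X = rng.randn(1000, 5000)  # high-dimensional embeddings

# Project onto the first 50 principal components before clustering / computing DBCV.
X_reduced = sklearn.decomposition.PCA(n_components=50).fit_transform(X)
print(X_reduced.shape)  # (1000, 50)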

MilanDeruelle commented 8 months ago

Thanks a lot for the help!

FelSiq commented 8 months ago

No worries.

If you find something else, please let me know.

FelSiq commented 8 months ago

I'm closing this issue for now, since there's nothing to be done.