eren-ck / st_dbscan

ST-DBSCAN: Simple and effective tool for spatial-temporal clustering
MIT License
131 stars, 25 forks

Another distance metric #8

Closed neonntt closed 2 years ago

neonntt commented 2 years ago

Hi! Thanks so much for this implementation. I would like some guidance on using a distance metric other than the default Euclidean. My data has multiple features, and I wanted to try another metric such as Mahalanobis. Would the call look like this?

st_dbscan = ST_DBSCAN(eps1 = 0.4, eps2 = 5, min_samples = 5, metric = 'mahalanobis')

I tried the above but got a "Singular matrix" error. However, when I checked the correlations, they seemed fine.

Also, if I want to give each feature a different weight when calculating the distance, how should I go about it? I would be grateful if you could help.

Thanks

eren-ck commented 2 years ago

Hello neonntt, I can't reproduce the issue. Changing the metric in the provided demo notebook works for me, for example when the fourth cell of the notebook is changed to the following snippet:

st_dbscan = ST_DBSCAN(eps1 = 0.05, eps2 = 10, min_samples = 5, metric = 'mahalanobis')

Regarding the second question, do you mean that you want to apply a weighted Euclidean distance?

Cheers, Eren
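
A "Singular matrix" error with metric = 'mahalanobis' usually means that the covariance matrix estimated from the data cannot be inverted. Pairwise correlations do not always reveal this: a constant column, or a feature that is a linear combination of several others, is enough. The following is a minimal sketch of how this could be checked, assuming a hypothetical array X with one exactly redundant column. The pseudo-inverse workaround is only one option and is not something ST_DBSCAN does itself.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Hypothetical data: the third feature is the sum of the first two,
# so the 3x3 covariance matrix only has rank 2.
base = rng.normal(size=(50, 2))
X = np.column_stack([base, base[:, 0] + base[:, 1]])

cov = np.cov(X, rowvar=False)
rank = np.linalg.matrix_rank(cov, tol=1e-10)
is_singular = rank < cov.shape[0]
print(f"covariance rank {rank} of {cov.shape[0]}, singular: {is_singular}")

# One possible workaround: pass a pseudo-inverse explicitly via VI
# instead of letting pdist invert the covariance matrix itself.
VI = np.linalg.pinv(cov, rcond=1e-10, hermitian=True)
dist = pdist(X, metric="mahalanobis", VI=VI)
```

Dropping the redundant or constant column before clustering is the simpler alternative when such a column can be identified.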

neonntt commented 2 years ago

Eren, thank you so much for your reply. I will try changing the metric as you suggested. It must be some issue with my data.

Regarding the second question, I would like to try a weighted distance with both the Euclidean and the Mahalanobis metric. Let's say we have another parameter in the data, for example speed, and we would like it to carry a higher weight than the other features when calculating the distance. Can you please explain how this could be implemented? Thanks again.
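
For the Mahalanobis half of this question, note that rescaling feature columns has no effect when the covariance is re-estimated from the rescaled data, because the metric already normalizes each direction by its variance. The sketch below illustrates this and shows one possible way to introduce weights by building the VI matrix by hand. The array X, the weights w and the weighted form diag(sqrt(w)) @ inv(cov) @ diag(sqrt(w)) are illustrative assumptions, not part of st_dbscan.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))      # hypothetical features, e.g. x, y, speed
w = np.array([1.0, 1.0, 4.0])     # hypothetical weights, speed counts most

# Scaling columns and letting pdist estimate the covariance changes nothing.
plain = pdist(X, metric="mahalanobis")
scaled = pdist(X * np.sqrt(w), metric="mahalanobis")
print(np.allclose(plain, scaled))

# One option: estimate the inverse covariance once, then weight it explicitly.
VI = np.linalg.inv(np.cov(X, rowvar=False))
D = np.diag(np.sqrt(w))
weighted = pdist(X, metric="mahalanobis", VI=D @ VI @ D)
```

Passing the weighted VI is equivalent to scaling the columns by sqrt(w) while keeping the inverse covariance of the unscaled data fixed.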

eren-ck commented 2 years ago

Sure, you can adapt the fit method. Something like the following should work:

    def fit(self, X):
        """
        Apply the ST-DBSCAN algorithm.

        Parameters
        ----------
        X : 2D numpy array
            The first column should be the time attribute as float.
            The remaining columns are treated as spatial coordinates.
            The structure should look like this:
            [[time_step1, x, y], [time_step2, x, y], ...]
            For example, a 2D dataset:
            array([[0, 0.45, 0.43],
                   [0, 0.54, 0.34], ...])

        Returns
        -------
        self
        """
        # check if input is correct
        X = check_array(X)

        if not self.eps1 > 0.0 or not self.eps2 > 0.0 or not self.min_samples > 0.0:
            raise ValueError('eps1, eps2, minPts must be positive')

        n, m = X.shape

        # Compute the condensed pairwise distance vectors for the 'time' attribute and the spatial attributes
        time_dist = pdist(X[:, 0].reshape(n, 1), metric=self.metric)

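        # --------
        # Changed lines start here.
        # One weight per spatial column in X[:, 1:] (four columns in this example).
        weights = np.array([0.5, 1, 0.2, 0.3])
        # 'wminkowski' was removed in SciPy 1.8; 'minkowski' with w= replaces it.
        # 'minkowski' applies w inside the sum, sum(w * |u - v| ** p), while
        # 'wminkowski' scaled the differences by w first, so pass weights ** 2
        # to reproduce the old behaviour.
        euc_dist = pdist(X[:, 1:], 'minkowski', p=2, w=weights)
        # The rest of the method is unchanged.
        # --------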

        # filter the euc_dist matrix using the time_dist
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

        db = DBSCAN(eps=self.eps1,
                    min_samples=self.min_samples,
                    metric='precomputed')
        db.fit(squareform(dist))

        self.labels = db.labels_

        return self

Cheers, Eren
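
A weighted Euclidean distance can also be obtained without editing fit. Multiplying each column by the square root of its weight and using the plain Euclidean metric gives the same distances as SciPy's 'minkowski' metric with p=2 and w=weights. (The removed 'wminkowski' metric multiplied the differences by w directly, so there the columns would be scaled by w itself.) A minimal sketch with hypothetical data and weights:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
spatial = rng.normal(size=(30, 4))         # hypothetical spatial/feature columns
weights = np.array([0.5, 1.0, 0.2, 0.3])   # one weight per column

# Weighted distance computed directly by SciPy:
# sqrt(sum(w_i * (u_i - v_i) ** 2))
direct = pdist(spatial, metric="minkowski", p=2, w=weights)

# Same distance via pre-scaling, which needs no change to the clustering code.
via_scaling = pdist(spatial * np.sqrt(weights), metric="euclidean")

print(np.allclose(direct, via_scaling))
```

With pre-scaling, the scaled columns could be stacked behind the untouched time column and passed to an unmodified ST_DBSCAN. eps1 then has to be chosen with the scaled distances in mind.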

neonntt commented 2 years ago

Thanks a ton, Eren. I will try it and reach out if I need more help. Regards

eren-ck commented 2 years ago

Sure, just reopen the issue in that case.

Cheers, Eren