alitouka / spark_dbscan

DBSCAN clustering algorithm on top of Apache Spark
Apache License 2.0

Keep unique id of a point #12

Open lucaventurini opened 8 years ago

lucaventurini commented 8 years ago

Let's say we want to cluster some objects on a subset of their features. We then transform these objects into Points, where that subset becomes the coordinates of the Points. We also want to keep track of the remaining features, which are not used by the algorithm. How do we proceed?

A possible solution is to keep track of each point by means of a unique identifier. In the current source code I see such a field, but even if I force it to a value, it is silently rewritten at some point during the pre-partitioning, and at the end of the algorithm all the identifiers are different, so no join with the initial dataset is possible.

I think this is a critical issue, if confirmed. Joining the result of a clustering with some metadata is the most useful, if not the only, postprocessing step for making something of the results of DBSCAN (or any clustering algorithm).

mgaido91 commented 8 years ago

At the moment there is no way to attach metadata to the points; the only option is to perform a join at the end using the coordinates (see the sketch below)...

If you want to attach metadata, you need to add a dedicated field to the Point class and refactor the code accordingly, since Points are re-created several times during the algorithm...
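
A rough sketch of that coordinate-based join could look like the following. The `Record` case class, its fields, and the epsilon/minPts values are placeholders; it also assumes the model exposes the labelled points as `allPoints`, that `Point` carries `coordinates` and `clusterId`, that `Point` can be constructed from a coordinate array, and that coordinate vectors are unique in the dataset:

```scala
import org.apache.spark.rdd.RDD
import org.alitouka.spark.dbscan._
import org.alitouka.spark.dbscan.spatial.Point

// Hypothetical record type: a numeric feature subset used for clustering
// plus arbitrary metadata we want to keep.
case class Record(features: Array[Double], metadata: String)

def clusterWithMetadata(records: RDD[Record]): RDD[(String, ClusterId)] = {
  // Cluster on the feature subset only (assumes Point accepts a coordinate array).
  val points = records.map(r => new Point(r.features))

  val settings = new DbscanSettings().withEpsilon(25).withNumberOfPoints(30)
  val model = Dbscan.train(points, settings)

  // Key both sides by the coordinate vector; this only works if the
  // coordinate vectors are unique in the dataset.
  val clusterIdByCoords: RDD[(Vector[Double], ClusterId)] =
    model.allPoints.map(p => (p.coordinates.toVector, p.clusterId))
  val metadataByCoords: RDD[(Vector[Double], String)] =
    records.map(r => (r.features.toVector, r.metadata))

  // The expensive full join discussed above.
  metadataByCoords.join(clusterIdByCoords).values
}
```

Note that if two objects share the same coordinates, this join would mix up their metadata, which is exactly why a stable id carried through the algorithm would be preferable.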

lucaventurini commented 8 years ago

I see your point, but a full join is an expensive operation that could be avoided if only the id were kept. So you confirm that the id is reassigned during the run on purpose?

mgaido91 commented 8 years ago

Yes, the pointId you see there is for internal processing only; no metadata storage is supported so far.