derrickburns / generalized-kmeans-clustering

Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
https://generalized-kmeans-clustering.massivedatascience.com/
Apache License 2.0
299 stars, 50 forks

Rationale behind WeightedVector #74

Open tmnd1991 opened 9 years ago

tmnd1991 commented 9 years ago

I don't understand the rationale behind WeightedVector. As far as I can tell, WeightedVector applies the same weight to every element of the vector. For example, if I have

val v  = Vectors.dense(1.0, 0.5, 3.0)
val wv = WeightedVector(v, 0.5)

wv will be treated as Vectors.dense(0.5, 0.25, 1.5) for the purposes of clustering, right?

Now, let's say I'm extracting 2 features from my data: one feature is represented by a single vector element and the other is represented by 20 vector elements. I want both features to carry the same weight in clustering, so I should weight the first element as 1 and each of the other 20 as 1/20, right? This is the kind of functionality I expected from WeightedVector. I don't see the point of WeightedVector as it is now, but that is probably due to my lack of experience with clustering and data mining in general.
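
For illustration, the per-element weighting described above could be sketched roughly as follows. This is plain Scala on arrays and is not part of the library: the object PerElementWeights and its members are hypothetical names introduced only for this example, and the 1 + 20 element layout is taken from the scenario in the question.

```scala
// Hypothetical helper, NOT part of generalized-kmeans-clustering:
// scales each vector element by its own weight before clustering.
object PerElementWeights {

  // Element-wise product of values and weights.
  // Both arrays must have the same length.
  def scale(values: Array[Double], weights: Array[Double]): Array[Double] = {
    require(values.length == weights.length,
      "values and weights must have the same length")
    values.zip(weights).map { case (x, w) => x * w }
  }

  // Weight 1 for the single-element feature, followed by
  // 1/20 for each of the 20 elements of the second feature.
  val weights: Array[Double] = Array(1.0) ++ Array.fill(20)(1.0 / 20)
}
```

The scaled array could then be wrapped with Vectors.dense(scaled) before being handed to the clusterer. One caveat on the choice of weights: under squared Euclidean distance, scaling an element by w multiplies its contribution to the distance by w squared, so a scale of 1/sqrt(20) rather than 1/20 may be closer to "equal weight per feature", depending on the divergence in use.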