derrickburns / generalized-kmeans-clustering

Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
https://generalized-kmeans-clustering.massivedatascience.com/
Apache License 2.0

stackoverflow #73

Open bkersbergen opened 9 years ago

bkersbergen commented 9 years ago

Hi. When I call predict on each single Vector of an RDD[Vector] with a trained model, a StackOverflowError is thrown. Calling predict on the whole RDD[Vector] at once works fine.

println("clustering single vectors fails")
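val singleVector = mymatrix.map { point =>
  try {
    // Predict the cluster for one vector at a time.
    val prediction = kModel.predict(point)
    (point.toString, prediction)
  } catch {
    // StackOverflowError is a subclass of Error, so it is caught here.
    case e: Error => println("unable to predict a single vector")
  }
}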
println(s"singleVector.count():${singleVector.count()}")

println("clustering using multiple vectors, this runs OK")
val predictions = kModel.predict(mymatrix)
val multipleVector = predictions.zip(mymatrix).map(point => (point._2.toString, point._1))
println(s"multipleVector.count():${multipleVector.count()}")

I've put my code with data as an example here: https://github.com/bkersbergen/massive-kmeans-overflow.

2015/06/18 11:10:03:300 [ERROR] [Executor task launch worker-5] org.apache.spark.Logging$class.logError:96 - Exception in task 0.0 in stage 63.0 (TID 31500)
java.lang.StackOverflowError
    at com.massivedatascience.divergence.SquaredEuclideanDistanceDivergence$.convexHomogeneous(BregmanDivergence.scala:144)
    at com.massivedatascience.clusterer.NonSmoothedPointCenterFactory$class.toPoint(BregmanPointOps.scala:209)
    at com.massivedatascience.clusterer.SquaredEuclideanPointOps$.toPoint(BregmanPointOps.scala:260)
    at com.massivedatascience.clusterer.KMeansPredictor$class.predictWeighted(KMeansModel.scala:66)
    at com.massivedatascience.clusterer.KMeansModel.predictWeighted(KMeansModel.scala:99)

The same code works with the MLlib KMeans implementation, but switching to massive-kmeans produces the StackOverflowError shown above. (You can switch between the MLlib and massivedatascience import statements in the Scala file to see the difference.)

mvplove123 commented 7 years ago

Did you solve this problem?

mvplove123 commented 7 years ago

I have solved this problem. You need to increase the thread stack size on the executors, like this: --conf spark.executor.extraJavaOptions=-Xss100m. It works.
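
For anyone who prefers to set this in code instead of on the spark-submit command line, here is a minimal sketch of the same setting applied through SparkConf. The object name and app name are made up for illustration, and 100m is simply the value reported to work in this thread, not a tuned recommendation. It assumes the master URL is supplied by spark-submit and that the property is set before the SparkContext is created. In local mode there is no separate executor JVM, so this setting would likely have no effect there.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, for illustration only.
object StackSizeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kmeans-stack-size-example") // hypothetical app name
      // Same property as the --conf flag in this thread:
      // pass -Xss100m to each executor JVM to enlarge its thread stacks.
      .set("spark.executor.extraJavaOptions", "-Xss100m")

    // The property has to be in the conf before the context is created,
    // because executors are launched with the options known at that point.
    val sc = new SparkContext(conf)
    try {
      // Print the effective value so it can be checked in the driver log.
      println(sc.getConf.get("spark.executor.extraJavaOptions"))
    } finally {
      sc.stop()
    }
  }
}
```

Note that setting spark.executor.extraJavaOptions this way replaces any value already configured for that property, so existing JVM options would need to be included in the same string.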