Word2Vec benchmark

WeichenXu123 commented 6 years ago

What's the PR

Word2Vec benchmark added.

Discussion

Should we improve how the test text is generated, for example via an LDA model?
Should we add a benchmark for findSynonyms?

WeichenXu123 commented 6 years ago

cc @jkbradley @MrBago

MrBago commented 6 years ago

I tried to run this locally and got the following error:

[info] Execution 'com.databricks.spark.sql.perf.mllib.feature.Word2Vec' failed: requirement failed: Column text must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.:
[info] scala.Predef$.require(Predef.scala:224)
[info] org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
[info] org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:117)
[info] org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:127)
[info] org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:193)
[info] org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
[info] org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:176)
[info] org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:127)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable$$anonfun$3.apply(MLPipelineStageBenchmarkable.scala:63)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable$$anonfun$3.apply(MLPipelineStageBenchmarkable.scala:60)
[info] com.databricks.spark.sql.perf.Benchmarkable$class.measureTime(Benchmarkable.scala:118)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.measureTime(MLPipelineStageBenchmarkable.scala:13)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.doBenchmark(MLPipelineStageBenchmarkable.scala:60)
[info] com.databricks.spark.sql.perf.Benchmarkable$class.benchmark(Benchmarkable.scala:50)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.benchmark(MLPipelineStageBenchmarkable.scala:13)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23$$anonfun$25.apply(Benchmark.scala:394)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23$$anonfun$25.apply(Benchmark.scala:394)
[info] scala.util.Try$.apply(Try.scala:192)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23.apply(Benchmark.scala:393)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23.apply(Benchmark.scala:376)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info] scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9.apply(Benchmark.scala:376)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9.apply(Benchmark.scala:368)
[info] scala.collection.immutable.List.map(List.scala:273)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21.apply(Benchmark.scala:368)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21.apply(Benchmark.scala:367)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.immutable.Range.foreach(Range.scala:160)
[info] scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info] scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply$mcV$sp(Benchmark.scala:367)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply(Benchmark.scala:329)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply(Benchmark.scala:329)
[info] scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info] scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info] scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
[info] scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[info] scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[info] scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[info] scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[info]

I think we can just split the "text" column into an array of words to give Word2Vec what it expects.
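A minimal sketch of that idea, not the change made in this PR: it assumes the generated data has a single string column named "text" and splits it on whitespace with the built-in split function, so the column becomes an array of strings before Word2Vec.fit is called. The object name, the helper splitTextColumn, the sample rows, and the parameter values are all hypothetical and only serve the illustration.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Hypothetical illustration object; not part of spark-sql-perf.
object Word2VecInputSketch {

  // Replace the string column "text" with an array-of-strings column of the
  // same name, which is the input type Word2Vec's schema check accepts.
  def splitTextColumn(df: DataFrame): DataFrame =
    df.withColumn("text", split(col("text"), "\\s+"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("word2vec-input-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the benchmark's generated data: one string per row.
    val raw = Seq(
      "spark ml word2vec benchmark",
      "split the text column into words"
    ).toDF("text")

    val tokenized = splitTextColumn(raw)

    // Small, arbitrary settings so the sketch trains quickly on tiny data.
    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("features")
      .setVectorSize(8)
      .setMinCount(0)
      .fit(tokenized)

    model.transform(tokenized).show(truncate = false)
    spark.stop()
  }
}
```

With the column already tokenized, the schema check that raised the "requirement failed" error above should pass. Using Spark's Tokenizer as an extra stage would be another option, at the cost of including tokenization in whatever the benchmark times.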

WeichenXu123 commented 6 years ago

Why did the CI pass? Maybe we should make the CI fail when a test fails.

MrBago commented 6 years ago

I don't think the CI ran the benchmarks. From what I can tell, the CI only runs ./build/sbt test. We might want to have the CI run the benchmarks and ensure none of them fail.

WeichenXu123 commented 6 years ago

Gentle ping @MrBago

MrBago commented 6 years ago

Sorry for the delay, Weichen. LGTM!

jkbradley commented 6 years ago

Thanks @WeichenXu123 and @MrBago !

databricks / spark-sql-perf

Word2Vec benchmark #127
