Closed WeichenXu123 closed 6 years ago
cc @jkbradley @MrBago
I tried to run this locally and got the following error:
[info] Execution 'com.databricks.spark.sql.perf.mllib.feature.Word2Vec' failed: requirement failed: Column text must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.:
[info] scala.Predef$.require(Predef.scala:224)
[info] org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
[info] org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:117)
[info] org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:127)
[info] org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:193)
[info] org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
[info] org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:176)
[info] org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:127)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable$$anonfun$3.apply(MLPipelineStageBenchmarkable.scala:63)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable$$anonfun$3.apply(MLPipelineStageBenchmarkable.scala:60)
[info] com.databricks.spark.sql.perf.Benchmarkable$class.measureTime(Benchmarkable.scala:118)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.measureTime(MLPipelineStageBenchmarkable.scala:13)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.doBenchmark(MLPipelineStageBenchmarkable.scala:60)
[info] com.databricks.spark.sql.perf.Benchmarkable$class.benchmark(Benchmarkable.scala:50)
[info] com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable.benchmark(MLPipelineStageBenchmarkable.scala:13)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23$$anonfun$25.apply(Benchmark.scala:394)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23$$anonfun$25.apply(Benchmark.scala:394)
[info] scala.util.Try$.apply(Try.scala:192)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23.apply(Benchmark.scala:393)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9$$anonfun$23.apply(Benchmark.scala:376)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info] scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9.apply(Benchmark.scala:376)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21$$anonfun$apply$9.apply(Benchmark.scala:368)
[info] scala.collection.immutable.List.map(List.scala:273)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21.apply(Benchmark.scala:368)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2$$anonfun$21.apply(Benchmark.scala:367)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info] scala.collection.immutable.Range.foreach(Range.scala:160)
[info] scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info] scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply$mcV$sp(Benchmark.scala:367)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply(Benchmark.scala:329)
[info] com.databricks.spark.sql.perf.Benchmark$ExperimentStatus$$anonfun$2.apply(Benchmark.scala:329)
[info] scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info] scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info] scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
[info] scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[info] scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[info] scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[info] scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[info]
I think we can just split the "text" column into an array of words to give Word2Vec
what it expects.
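A minimal sketch of that fix, assuming the generated data has a single string column named "text" (as the error message suggests). The object name, the `tokenize` helper, the "words" and "features" column names, and the sample rows are hypothetical and not taken from the PR; whitespace splitting is just one possible tokenization.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Hypothetical sketch: not the code from this PR.
object Word2VecInputSketch {

  // Turn the StringType "text" column into an ArrayType(StringType) "words"
  // column by splitting on runs of whitespace.
  def tokenize(df: DataFrame): DataFrame =
    df.withColumn("words", split(col("text"), "\\s+"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("word2vec-input-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the benchmark's generated data: one string per row.
    val docs = Seq(
      "spark ml word2vec benchmark",
      "word2vec expects arrays of words"
    ).toDF("text")

    val tokenized = tokenize(docs)

    val word2Vec = new Word2Vec()
      .setInputCol("words")      // array column, which passes the schema check
      .setOutputCol("features")
      .setVectorSize(8)
      .setMinCount(1)            // keep every word in this tiny sample

    val model = word2Vec.fit(tokenized)
    model.transform(tokenized).show(truncate = false)

    spark.stop()
  }
}
```

Pointing `setInputCol` at the raw "text" column instead reproduces the `requirement failed` error above, since `Word2Vec` only accepts an array-of-strings input column.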
Why did the CI pass? Maybe we should make the CI fail when a benchmark fails.
I don't think the CI ran the benchmarks; from what I can tell, the CI only runs ./build/sbt test. We might want to have the CI run the benchmarks and ensure none of them fail.
Gentle ping @MrBago
Sorry for the delay Weichen, lgtm!
Thanks @WeichenXu123 and @MrBago !
What's the PR
Word2Vec benchmark added.
Discussion
Do we need to improve how the test text is generated? For example, by generating it from an LDA model?
Do we need to add a benchmark for findSynonyms?