databricks / spark-sql-perf

Apache License 2.0

Add benchmark for LinearSVC/OneHotEncoder/VectorSlicer/VectorAssembler/StringIndexer/Tokenizer #112

Closed by WeichenXu123 7 years ago

WeichenXu123 commented 7 years ago

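Add benchmarks for LinearSVC, OneHotEncoder, VectorSlicer, VectorAssembler, StringIndexer, and Tokenizer.

The FPGrowth benchmark is in progress and will be added soon.

Part of the code comes from https://github.com/smurching/spark-sql-perf/pull/1, and this PR addresses feedback from @smurching.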

WeichenXu123 commented 7 years ago

cc @smurching @jkbradley

WeichenXu123 commented 7 years ago

@smurching I have updated most places according to your feedback and added a DocGenerator for Tokenizer. One issue is still open: turning an MLParam into a DataFrame column is broken. I will fix it tomorrow. Thanks for the kind review! Please also help review the FPGrowth PR #113, thanks!

WeichenXu123 commented 7 years ago

Updated the code against the new MLParam implementation. Thanks! cc @smurching @jkbradley

jkbradley commented 7 years ago

One more comment, copying Sid's comment from above: Could you also add params for your new tests to src/main/scala/configs/mllib-small.yaml?

jkbradley commented 7 years ago

LGTM except for the one comment above. Since this is blocking some other tasks, I'll merge it. Can you please send a tiny PR after this to fix the remaining comment? Thanks!