databricks / spark-sql-perf

Apache License 2.0

Improvement ideas - more options #86

Open a-roberts opened 8 years ago

a-roberts commented 8 years ago

Hi, I'd like to contribute the changes below and am looking for second opinions on whether this is the direction we want to take with this benchmark.

I suggest adding the following options, and I'm happy to work on this:

-u for a URI, so we can store our data in a database such as DB2 instead of the local file system. (My concern here is that some database configurations would require passing the username and password to the benchmark, and we wouldn't want those to end up in someone's .bash_history.)

-m for a custom master URL, so we can use a cluster instead of just local[*]

-wsc to quickly enable WholeStageCodegen

-q to run only selected queries, e.g. -q 1 2 4 90

We have a variety of WholeStageCodegen changes that we'd like to contribute to Spark, and we think this benchmark is the best fit for our goals. With the above changes we can quickly execute individual queries, use databases, run across multiple machines (or just not ours), and quickly see the benefits or drawbacks of WholeStageCodegen.

a-roberts commented 8 years ago

Actually, I don't think we need the WholeStageCodegen setting, as it's on by default. I would also like to be able to read and write data on the local file system at a location we specify.

a-roberts commented 8 years ago

We wouldn't need the -q option (run only certain queries by name) either, because the already available -f option has the same effect. So I'll just implement the -u and -m options.
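
To make the remaining proposal concrete, here is a minimal sketch of how -u and -m might be parsed. It is plain Scala and does not use the benchmark's existing option parser, which isn't shown in this thread. The names BenchmarkOptions, BENCH_DB_USER and BENCH_DB_PASSWORD are hypothetical; reading the credentials from the environment is just one possible way to keep them off the command line and out of .bash_history.

```scala
// Hypothetical sketch: the flag names follow the proposal in this thread;
// the class name and environment variable names are illustrative only.
final case class BenchmarkOptions(
    master: String = "local[*]",     // -m: Spark master URL
    dataUri: Option[String] = None,  // -u: where the benchmark data lives
    user: Option[String] = None,     // read from the environment, not argv
    password: Option[String] = None) // read from the environment, not argv

object BenchmarkOptions {
  // Returns Right(options) on success, or Left(message) for a bad argument list.
  def parse(
      args: List[String],
      env: Map[String, String] = sys.env): Either[String, BenchmarkOptions] = {

    @annotation.tailrec
    def loop(rest: List[String], acc: BenchmarkOptions): Either[String, BenchmarkOptions] =
      rest match {
        case Nil                   => Right(acc)
        case "-m" :: value :: tail => loop(tail, acc.copy(master = value))
        case "-u" :: value :: tail => loop(tail, acc.copy(dataUri = Some(value)))
        case flag :: Nil if flag == "-m" || flag == "-u" =>
          Left(s"Missing value for $flag")
        case other :: _            => Left(s"Unknown option: $other")
      }

    // Credentials never appear on the command line in this sketch.
    val initial = BenchmarkOptions(
      user = env.get("BENCH_DB_USER"),
      password = env.get("BENCH_DB_PASSWORD"))
    loop(args, initial)
  }
}
```

With this sketch, running with no arguments keeps the local[*] default, and a call such as BenchmarkOptions.parse(List("-m", "spark://host:7077", "-u", "/data/tpcds")) would select a cluster master and a custom data location (both values are made up for illustration).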