cmu-db / noisepage

Self-Driving Database Management System from Carnegie Mellon University
https://noise.page
MIT License

Exec Engine Microbenchmarks #566

Open apavlo opened 5 years ago

apavlo commented 5 years ago

@pmenon wrote some microbenchmarks to evaluate the performance of the new LLVM engine:

https://github.com/cmu-db/terrier/blob/master/sample_tpl/

This code does not use the Google Benchmark (GBenchmark) library, so our nightly runs do not execute any code that measures the execution engine. The purpose of this task is therefore to port this code over to our benchmark directory.

The first execution engine microbenchmark should simply measure how fast we can execute a sequential scan. It should go through the full LLVM engine rather than accessing the SQLTable directly (the existing benchmarks do the latter). See the RunFile function for how to do this programmatically in C++.

You can use vec-filter.tpl as the target query.

  1. Create a new file seqscan_benchmark.cpp in terrier/benchmark/execution/. This new benchmark should follow the same setup as LargeTransactionBenchmark.

  2. You will want to adapt the table generation code from the TableGenerator utility. Check with @mbutrovich about what utility code we already have to generate sample data for a table. Otherwise we need to figure out where we want to put TableGenerator.

  3. Modify the nightly script to include your new benchmark code. Search for large_transaction in the file to find the two arrays that you need to modify (or the single array after #564 goes in).

eppingere commented 4 years ago

To fix this issue, port over either Q0 or Q1 from the TPC-H benchmark on my fork: https://github.com/eppingere/tpl/blob/master/benchmark/sql/tpch_benchmark.cpp#L141 Q0 has an unfulfillable predicate, while Q1 uses the predicate from TPC-H Q1, which is fulfillable (I'm not sure what percentage of tuples satisfy it).

The data to load is located at /home/pmenon/tools/TPC-H/data/sf-10. It might be worth generating the data some other way, or putting it in its own repository that is cloned recursively as a sub-repo. Additionally, make sure the query runs in the appropriate execution mode (Compile or Interpret); it might be worth having a different benchmark for each. All fun design decisions.