cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.03k stars 623 forks source link

Sequential scan with value_parameter defaults to executor #1377

Open pervazea opened 6 years ago

pervazea commented 6 years ago

Found during performance testing. We configured YCSB to use a 100% read workload, removing the primary key from the schema, to force Peloton to use codegen sequential scan rather than executor index scan. Discovered that due to use of value parameter (in a prepared statement), the system uses the executor.

Modifying the code to allow the value parameter with codegen, Peloton runs fine, but YCSB crashes. Using the executor, things run fine.

The problem is consistently reproducible. After 5 reads, where JDBC adaptor thinks the next read is delivering binary data and fails. Previous reads of the same row are non-binary. Small portion of debug output below, from the JDBC adapter.

colcount = 0 colidx 1, isb false colcount = 1 colidx 2, isb false colcount = 2 colidx 3, isb false colcount = 3 colidx 4, isb false colcount = 4 colidx 5, isb false colcount = 5 colidx 6, isb false colcount = 6 colidx 7, isb false colcount = 7 colidx 8, isb false colcount = 8 colidx 9, isb false colcount = 9 colidx 10, isb false key = 0 rowcount = 1 colcount = 0 colidx 1, isb true idx 0

The data returned up from codegen consistently looks ok. Would like to examine what we are pushing out the wire, further up the stack.

pervazea commented 6 years ago

Examining the interaction between Peloton and oltpbench using wireshark. The output is hard to read since there is no jdbc protocol decoder. What it looks like:

Select calls from 1-5, pass back column information. Select call 6 has different interaction. The return information from codegen has the same MD5 hash as before, which implies we are sending back the right information.

JDBC documentation refers to:

_prepareThreshold = int

Determine the number of PreparedStatement executions required before switching over to use server side prepared statements. The default is five, meaning start using server side prepared statements on the fifth execution of the same PreparedStatement object. More information on server side prepared statements is available in the section called “Server Prepared Statements”._

Wrote a junit test to reproduce the issue (so we are not reliant on oltpbench YCSB). Changing the prepareThreshold does flip behavior from a failing test to a passing test.

pervazea commented 6 years ago

For future reference: https://jdbc.postgresql.org/documentation/head/server-prepare.html https://www.postgresql.org/docs/current/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY

pervazea commented 6 years ago

For future reference: https://jdbc.postgresql.org/documentation/head/server-prepare.html https://www.postgresql.org/docs/current/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY

pervazea commented 6 years ago

result_format is a vector containing information about the expected encoding of query parameters. On the way in, ReadParamValue in postgres_protocol_parser.cpp handles incoming data.

Outgoing, it is done in executor/logical_tile.cpp, in

std::vector<std::vector<std::string>> LogicalTile::GetAllValuesAsStrings(
    const std::vector<int> &result_format, bool use_to_string_null)

Since codegen (quite understandably) does not use GetAllValuesAsStrings... it fails to do the necessary conversions when jdbc switches to server side prepared statements with binary data format.

IMO, having the data conversion down in the executor is incorrect. The "network" layer (I use that term loosely), should handle all the reading, writing and format conversions. The internal layers should know nothing about the wire format.