GoogleCloudPlatform / pontem

Open source tools for Google Cloud Storage and Databases.
Apache License 2.0
63 stars 15 forks source link

treat query files as entire scripts or multiline queries #246

Open vicenteg opened 4 years ago

vicenteg commented 4 years ago

Fixes #244

Read all lines in the query file, and join them to create a single query or script.

codecov[bot] commented 4 years ago

Codecov Report

Merging #246 into dev will not change coverage. The diff coverage is 0.00%.

@@            Coverage Diff            @@
##                dev     #246   +/-   ##
=========================================
  Coverage     83.01%   83.01%           
  Complexity      203      203           
=========================================
  Files            34       34           
  Lines          1101     1101           
  Branches         66       66           
=========================================
  Hits            914      914           
  Misses          178      178           
  Partials          9        9           
Impacted Files Coverage Δ Complexity Δ
...m/google/cloud/pontem/config/WorkloadSettings.java 44.23% <0.00%> (ø) 10.00 <0.00> (ø)
ldanielmadariaga commented 4 years ago

Query files by design expect to contain 1 query per line as multiple queries (forming a workload) will usually be included in one file, the proposed change defeats the purpose of the query file.

If you'd like to implement something like this you'd need to have a parsing scheme that identifies whether a line represents a complete query or an unfinished query and adjust parsing accordingly.

vicenteg commented 4 years ago

Thanks for looking at this.

Query files by design expect to contain 1 query per line as multiple queries (forming a workload) will usually be included in one file, the proposed change defeats the purpose of the query file.

Here's an example of how I've been configuring workloads:

concurrencyLevel: 5
isRatioBasedBenchmark: true
benchmarkRatios: [0.2, 0.5, 1.0, 2.0]
outputFileFolder: ./
workloads:
- name: my_workload
  projectId: my_project
  queryFiles:
    - queries/query1.sql
    - queries/query2.sql
    - queries/query3.sql

Is this not equivalent to having a single query file with multiple queries?

One query per line makes complex queries and scripts much harder to read, and it requires pre-processing of the query to remove newlines. While this is not necessarily hard to do, it does represent friction.

If you'd like to implement something like this you'd need to have a parsing scheme that identifies whether a line represents a complete query or an unfinished query and adjust parsing accordingly.

Parsing SQL is non-trivial and BigQuery already parses SQL. I'd suggest that a query file's contents should be passed to BigQuery unmodified, so that BigQuery can do the parsing. This is especially important since people may want to test scripts, which are very naturally multiple line files and should be interpreted by BigQuery and not the tool submitting the script.

Diverging from the precedent set by bq is a bit unexpected as well. The bq tool does not parse the query but instead lets BigQuery handle that. For example, a query file like the one below works with bq but would fail with the current implementation of the workload tester:

$ cat <<EOF > /tmp/simplequeries.sql
> SELECT *
> FROM (SELECT 1);
> SELECT 2;
> EOF
$ bq query < /tmp/simplequeries.sql
Waiting on bqjob_r7d03abef82f68ffd_0000017233128fb5_1 ... (1s) Current status: DONE   
SELECT *
FROM (SELECT 1); -- at [1:1]
+-----+
| f0_ |
+-----+
|   1 |
+-----+
SELECT 2; -- at [3:1]
+-----+
| f0_ |
+-----+
|   2 |
+-----+

Thanks for the consideration!

--vince