facebookincubator / velox

A C++ vectorized database acceleration library aimed to optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.47k stars 1.14k forks source link

Use Velox Parquet Writer in ParquetTpchTest #6495

Closed majetideepak closed 10 months ago

majetideepak commented 1 year ago

Description

The current implementation uses the DuckDB writer. We must use the Parquet writer in Velox.

duanmeng commented 1 year ago

For this enhancement, we need to abandon the DuckDB's COPY SQL to import parquet files in ParquetTpchTestBase::saveTpchTablesAsParquet, and instead use the following steps to import,

  1. Use DuckDB's SELECT SQL to get rows(vector<MaterializedRow>) from the generated table by DuckDB
  2. Make RowVector using data (vector<variant> actually) in step 1
  3. Register and make a ParquetWriter (along with WriteFileSink?)
  4. Append and flush the data.

For further optimization, we could use a streaming way, fetch data from the cursor, and flush iteratively. @majetideepak Is my understanding correct?

majetideepak commented 1 year ago

@duanmeng You are right for the most part. For 1, 2, we could also use the TpchConnector in Velox. For 3, 4 we can use the TableWriter operator, simplifying the writing part. I tried this, but unfortunately, the values don't match with DuckDB's TPC-H connector. May be we need to use 1, 2 as you suggested. I am investigating further. Are you interested in pursuing this? I can open a PR with what I have.

4 of extra rows:
    "A" | "F" | 3804.559999999975 | 571775162.3000023 | 543203163.4208972 | 565079034.6405019 | 0.25575154611454526 | 38436.08243479446 | 0.05008133906963965 | 14876
    "N" | "F" | 89.71000000000005 | 13732828.869999986 | 13097715.565900002 | 13643081.422205001 | 0.25778735632183924 | 39462.15192528732 | 0.04775862068965505 | 348
    "N" | "O" | 7428.020000000037 | 1113714965.5200021 | 1058304630.7269992 | 1100785555.7079387 | 0.2545498783455 | 38165.75735992605 | 0.04993111956408442 | 29181
    "R" | "F" | 3814.489999999985 | 569873613.650002 | 541588201.3579006 | 563481104.5000294 | 0.2559716816534683 | 38241.41817541283 | 0.049827539927524055 | 14902

4 of missing rows:
    "A" | "F" | 380456 | 532348211.65 | 505822441.4861 | 526165934.000839 | 25.575154611454693 | 35785.70930693735 | 0.05008133906964238 | 14876
    "N" | "F" | 8971 | 12384801.37 | 11798257.208 | 12282485.056933 | 25.778735632183906 | 35588.50968390804 | 0.047758620689655175 | 348
    "N" | "O" | 742802 | 1041502841.45 | 989737518.6346 | 1029418531.52335 | 25.45498783454988 | 35691.129209074395 | 0.04993111956409993 | 29181
    "R" | "F" | 381449 | 534594445.35 | 507996454.4067 | 528524219.358903 | 25.597168165346933 | 35874.00653268018 | 0.049827539927526504 | 14902
duanmeng commented 1 year ago

@duanmeng You are right for the most part. For 1, 2, we could also use the TpchConnector in Velox. For 3, 4 we can use the TableWriter operator, simplifying the writing part. I tried this, but unfortunately, the values don't match with DuckDB's TPC-H connector. May be we need to use 1, 2 as you suggested. I am investigating further. Are you interested in pursuing this? I can open a PR with what I have.

@majetideepak Yes, I am interested in :), This could let me have the chance to learn more about the I/O part of Velox according to your guidance.

majetideepak commented 1 year ago

@duanmeng here is the PR https://github.com/facebookincubator/velox/pull/6519/files

duanmeng commented 1 year ago

@majetideepak I set a breakpoint in saveTpchTablesAsParquet and copied both the tpch data generated by TpchConnector (https://github.com/facebookincubator/velox/pull/6519, by dbgen) as tpch-connector-data and generated by current main branch code (by duckdb) as duckdb-gen-data. I picked up two tables (lineitem and order) related to failure cases to diff and found they generate different data in some columns although their row counts are the same. Maybe we need to use the data generated by duckdb on the fly to keep consistency?

$ parquet-tools show tpch-connector-data/lineitem/test_cursor\ 12_0_1_dfe1d4aa-1788-4a4d-acf2-3e7bd15f7170 | sort | head -5
...
|            1 |        2132 |        4633 |              4 |         0.28 |          28955.6  |         0.09 |    0.06 | N              | O              | 1996-04-21   | 1996-03-30     | 1996-05-16      | NONE              | AIR          | s cajole busily above t                     |
|            1 |       15635 |         638 |              6 |         0.32 |          49620.2  |         0.07 |    0.02 | N              | O              | 1996-01-30   | 1996-02-07     | 1996-02-03      | DELIVER IN PERSON | MAIL         | rouches. special                            |
|            1 |       24027 |        1534 |              5 |         0.24 |          22824.5  |         0.1  |    0.04 | N              | O              | 1996-03-30   | 1996-03-14     | 1996-04-01      | NONE              | FOB          | the regular, regular pa                     |

$ parquet-tools show duckdb-gen-data/lineitem/file.parquet | sort | head -5
...
|            1 |          22 |          48 |              4 |         28 |        25816.6  |       0.09 |  0.06 | N              | O              | 1996-04-21 | 1996-03-30     | 1996-05-16      | NONE              | AIR          | lites. fluffily even de                     |
|            1 |         157 |          10 |              6 |         32 |        33828.8  |       0.07 |  0.02 | N              | O              | 1996-01-30 | 1996-02-07     | 1996-02-03      | DELIVER IN PERSON | MAIL         | arefully slyly ex                           |
|            1 |         241 |          23 |              5 |         24 |        27389.8  |       0.1  |  0.04 | N              | O              | 1996-03-30 | 1996-03-14     | 1996-04-01      | NONE              | FOB          | pending foxes. slyly re

$ parquet-tools show tpch-connector-data/orders/test_cursor\ 10_0_1_144b280b-62de-4f5d-838c-111db66c625f | sort | head -5
...
|            1 |       36901 | O               |      173665    | 1996-01-02    | 5-LOW             | Clerk#000000951 |                0 | ly express platelets. deposits acc                                             |
|            2 |       78002 | O               |       46929.2  | 1996-12-01    | 1-URGENT          | Clerk#000000880 |                0 | ve the furiously fluffy dependencies. carefully regular                        |
|            3 |      123314 | F               |      193846    | 1993-10-14    | 5-LOW             | Clerk#000000955 |                0 | after the asymptotes. instructions cajole after the foxes. carefully unu       |

$ parquet-tools show duckdb-gen-data/orders/file.parquet | sort | head -5
..
|            1 |         370 | O               |      172799    | 1996-01-02    | 5-LOW             | Clerk#000000951 |                0 | nstructions sleep furiously among                                              |
|            2 |         781 | O               |       38426.1  | 1996-12-01    | 1-URGENT          | Clerk#000000880 |                0 | foxes. pending accounts at the pending, silent asymptot                        |
|            3 |        1234 | F               |      205654    | 1993-10-14    | 5-LOW             | Clerk#000000955 |                0 | sly final accounts boost. carefully regular ideas cajole carefully. depos      |
duanmeng commented 1 year ago

And for Q3 it failed with the following logs, Does it mean that DuckDB generates its own version tpch data?

Expected keys: 267010.5894 | 9200, actual: 13956.3645 | 9095
Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
majetideepak commented 1 year ago

Does it mean that DuckDB generates its own version tpch data?

@duanmeng Likely yes. let's use the data generated from the DuckDB's TPC-H extension to be consistent.

duanmeng commented 10 months ago

Close this issue as https://github.com/facebookincubator/velox/pull/6719 merged.