Closed majetideepak closed 10 months ago
For this enhancement, we need to abandon the DuckDB's COPY
SQL to import parquet files in ParquetTpchTestBase::saveTpchTablesAsParquet
, and instead use the following steps to import,
SELECT
SQL to get rows(vector<MaterializedRow>
) from the generated table by DuckDBvector<variant>
actually) in step 1WriteFileSink
?)For further optimization, we could use a streaming way, fetch data from the cursor, and flush iteratively. @majetideepak Is my understanding correct?
@duanmeng You are right for the most part. For 1, 2, we could also use the TpchConnector in Velox. For 3, 4 we can use the TableWriter operator, simplifying the writing part. I tried this, but unfortunately, the values don't match with DuckDB's TPC-H connector. May be we need to use 1, 2 as you suggested. I am investigating further. Are you interested in pursuing this? I can open a PR with what I have.
4 of extra rows:
"A" | "F" | 3804.559999999975 | 571775162.3000023 | 543203163.4208972 | 565079034.6405019 | 0.25575154611454526 | 38436.08243479446 | 0.05008133906963965 | 14876
"N" | "F" | 89.71000000000005 | 13732828.869999986 | 13097715.565900002 | 13643081.422205001 | 0.25778735632183924 | 39462.15192528732 | 0.04775862068965505 | 348
"N" | "O" | 7428.020000000037 | 1113714965.5200021 | 1058304630.7269992 | 1100785555.7079387 | 0.2545498783455 | 38165.75735992605 | 0.04993111956408442 | 29181
"R" | "F" | 3814.489999999985 | 569873613.650002 | 541588201.3579006 | 563481104.5000294 | 0.2559716816534683 | 38241.41817541283 | 0.049827539927524055 | 14902
4 of missing rows:
"A" | "F" | 380456 | 532348211.65 | 505822441.4861 | 526165934.000839 | 25.575154611454693 | 35785.70930693735 | 0.05008133906964238 | 14876
"N" | "F" | 8971 | 12384801.37 | 11798257.208 | 12282485.056933 | 25.778735632183906 | 35588.50968390804 | 0.047758620689655175 | 348
"N" | "O" | 742802 | 1041502841.45 | 989737518.6346 | 1029418531.52335 | 25.45498783454988 | 35691.129209074395 | 0.04993111956409993 | 29181
"R" | "F" | 381449 | 534594445.35 | 507996454.4067 | 528524219.358903 | 25.597168165346933 | 35874.00653268018 | 0.049827539927526504 | 14902
@duanmeng You are right for the most part. For 1, 2, we could also use the TpchConnector in Velox. For 3, 4 we can use the TableWriter operator, simplifying the writing part. I tried this, but unfortunately, the values don't match with DuckDB's TPC-H connector. May be we need to use 1, 2 as you suggested. I am investigating further. Are you interested in pursuing this? I can open a PR with what I have.
@majetideepak Yes, I am interested in :), This could let me have the chance to learn more about the I/O part of Velox according to your guidance.
@duanmeng here is the PR https://github.com/facebookincubator/velox/pull/6519/files
@majetideepak I set a breakpoint in saveTpchTablesAsParquet and copied both the tpch data generated by TpchConnector (https://github.com/facebookincubator/velox/pull/6519, by dbgen) as tpch-connector-data and generated by current main branch code (by duckdb) as duckdb-gen-data. I picked up two tables (lineitem and order) related to failure cases to diff and found they generate different data in some columns although their row counts are the same. Maybe we need to use the data generated by duckdb on the fly to keep consistency?
$ parquet-tools show tpch-connector-data/lineitem/test_cursor\ 12_0_1_dfe1d4aa-1788-4a4d-acf2-3e7bd15f7170 | sort | head -5
...
| 1 | 2132 | 4633 | 4 | 0.28 | 28955.6 | 0.09 | 0.06 | N | O | 1996-04-21 | 1996-03-30 | 1996-05-16 | NONE | AIR | s cajole busily above t |
| 1 | 15635 | 638 | 6 | 0.32 | 49620.2 | 0.07 | 0.02 | N | O | 1996-01-30 | 1996-02-07 | 1996-02-03 | DELIVER IN PERSON | MAIL | rouches. special |
| 1 | 24027 | 1534 | 5 | 0.24 | 22824.5 | 0.1 | 0.04 | N | O | 1996-03-30 | 1996-03-14 | 1996-04-01 | NONE | FOB | the regular, regular pa |
$ parquet-tools show duckdb-gen-data/lineitem/file.parquet | sort | head -5
...
| 1 | 22 | 48 | 4 | 28 | 25816.6 | 0.09 | 0.06 | N | O | 1996-04-21 | 1996-03-30 | 1996-05-16 | NONE | AIR | lites. fluffily even de |
| 1 | 157 | 10 | 6 | 32 | 33828.8 | 0.07 | 0.02 | N | O | 1996-01-30 | 1996-02-07 | 1996-02-03 | DELIVER IN PERSON | MAIL | arefully slyly ex |
| 1 | 241 | 23 | 5 | 24 | 27389.8 | 0.1 | 0.04 | N | O | 1996-03-30 | 1996-03-14 | 1996-04-01 | NONE | FOB | pending foxes. slyly re
$ parquet-tools show tpch-connector-data/orders/test_cursor\ 10_0_1_144b280b-62de-4f5d-838c-111db66c625f | sort | head -5
...
| 1 | 36901 | O | 173665 | 1996-01-02 | 5-LOW | Clerk#000000951 | 0 | ly express platelets. deposits acc |
| 2 | 78002 | O | 46929.2 | 1996-12-01 | 1-URGENT | Clerk#000000880 | 0 | ve the furiously fluffy dependencies. carefully regular |
| 3 | 123314 | F | 193846 | 1993-10-14 | 5-LOW | Clerk#000000955 | 0 | after the asymptotes. instructions cajole after the foxes. carefully unu |
$ parquet-tools show duckdb-gen-data/orders/file.parquet | sort | head -5
..
| 1 | 370 | O | 172799 | 1996-01-02 | 5-LOW | Clerk#000000951 | 0 | nstructions sleep furiously among |
| 2 | 781 | O | 38426.1 | 1996-12-01 | 1-URGENT | Clerk#000000880 | 0 | foxes. pending accounts at the pending, silent asymptot |
| 3 | 1234 | F | 205654 | 1993-10-14 | 5-LOW | Clerk#000000955 | 0 | sly final accounts boost. carefully regular ideas cajole carefully. depos |
And for Q3 it failed with the following logs, Does it mean that DuckDB generates its own version tpch data?
Expected keys: 267010.5894 | 9200, actual: 13956.3645 | 9095
Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
Does it mean that DuckDB generates its own version tpch data?
@duanmeng Likely yes. let's use the data generated from the DuckDB's TPC-H extension to be consistent.
Close this issue as https://github.com/facebookincubator/velox/pull/6719 merged.
Description
The current implementation uses the DuckDB writer. We must use the Parquet writer in Velox.