apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[C++] Add ability to generate TPC-H data that matches the reference data exactly #31510

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When checking these against the known-good answers for scale factor 1, all of the queries are off (two are close enough that they get past arrowbench's pretty loose validation).

Looking at the TPC-H tools, the validation for dbgen is:

b. Base Data Validation The base data set is produced using cmd_base_sf where is the scale factor to be generated. The resulting files will be produced in the current working directory. The generated files will be of the form .tbl., where will is the name of one of the tables in the TPCH schema, and

identifies a particular data generation step. The file set produced by genbaserefdata.sh should match the .tbl. files found in the reference data set for the same scale factor.

And the data that this generator is producing does not conform to that. Even if I sort the data by columns that they appear to be in the dbgen (or duckdb) produced data, the data we get from our TPC-H generator does not match.

We might want a mode where we produce random TPC-H like data. But for benchmarking we need a way to produce actual TPC-H compliant data out of the box (we can deal with rows in a shuffled order if we need to, but the content of the data must be the same.

Maybe taking a look at DuckDB's implementation of the random seeds might help with a way to accomplish this?

Here's an example of generating data with duckdb (first ten lines of lineitems — and it's the same each time I generate it):


> print(out, width = 500)
   l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate l_receiptdate    l_shipinstruct l_shipmode                           l_comment
1           1    155190      7706            1         17        21168.23       0.04  0.02            N            O 1996-03-13   1996-02-12    1996-03-22 DELIVER IN PERSON      TRUCK             egular courts above the
2           1     67310      7311            2         36        45983.16       0.09  0.06            N            O 1996-04-12   1996-02-28    1996-04-20  TAKE BACK RETURN       MAIL  ly final dependencies: slyly bold 
3           1     63700      3701            3          8        13309.60       0.10  0.02            N            O 1996-01-29   1996-03-05    1996-01-31  TAKE BACK RETURN    REG AIR       riously. regular, express dep
4           1      2132      4633            4         28        28955.64       0.09  0.06            N            O 1996-04-21   1996-03-30    1996-05-16              NONE        AIR             lites. fluffily even de
5           1     24027      1534            5         24        22824.48       0.10  0.04            N            O 1996-03-30   1996-03-14    1996-04-01              NONE        FOB             pending foxes. slyly re
6           1     15635       638            6         32        49620.16       0.07  0.02            N            O 1996-01-30   1996-02-07    1996-02-03 DELIVER IN PERSON       MAIL                   arefully slyly ex
7           2    106170      1191            1         38        44694.46       0.00  0.05            N            O 1997-01-28   1997-01-14    1997-02-02  TAKE BACK RETURN       RAIL     ven requests. deposits breach a
8           3      4297      1798            1         45        54058.05       0.06  0.00            R            F 1994-02-02   1994-01-04    1994-02-23              NONE        AIR ongside of the furiously brave acco
9           3     19036      6540            2         49        46796.47       0.10  0.00            R            F 1993-11-09   1993-12-20    1993-11-24  TAKE BACK RETURN       RAIL               unusual accounts. eve
10          3    128449      3474            3         27        39890.88       0.06  0.07            A            F 1994-01-16   1993-11-22    1994-01-23 DELIVER IN PERSON       SHIP                    nal foxes wake. 

And the first ten lines of lineitems I generated with the official dbgen:


head lineitem.tbl
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |

And the first ten lines of the validation file form the TPC-H tools:


head lineitem.tbl.1
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |

Note, you can generate this tpc-h data from https://github.com/apache/arrow/pull/12769 with the following. This shuffling is needed because we can only write datasets from execnodes (without materializing into memory entirely), so we move the files around as if they were single file writes:


path <- tpch_dbgen_write(1, "some/path")

from_dataset_to_parquet <- function(path, scale_factor) {
  ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)
  # we can only deal with single parquet files in each partition this way
  if (!all(grepl("data-0.parquet$", ds_files))) {
    stop("At least one partition has more than one file")
  }

  ds_files_to <- gsub(
    "/data-0.parquet", 
    paste0("_", format(scale_factor, scientific = FALSE), ".parquet"), 
    ds_files
  )
  file.rename(ds_files, ds_files_to)

  # cleanup empty folders, this might be a bit aggressive
  folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
  unlink(folders_to_remove, recursive = TRUE)
}

from_dataset_to_parquet(path, 1)

Reporter: Jonathan Keane / @jonkeane

Note: This issue was originally created as ARROW-16100. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: I believe at this point we have tackled the segmentation faults as part of the R bindings PR. Given that we are probably not going to proceed with exposing this feature via R until the randomness issue is resolved we will want to port those fixes into their own PR. I'll create a new issue for this. We can retitle this issue to focus on the randomness issue (since we have a lot of good content in the description).

Regarding the randomness issue, @jonkeane @save-buffer and I had a brief chat. This shouldn't interfere with our ability to do internal profiling but we likely prevents using the generated data for benchmarking (where we want to make sure we are more consistent and deterministic). We currently generate data and random numbers in a column-major fashion. The official generators generate data and random numbers in a row-major fashion. In addition, we generate data across multiple threads which could cause issues as well.

We could write a separate row-major generator (hopefully sharing some utilities) or we could try and use the generator as is and instead generate a table of random numbers for each batch and then transpose it and use the transposed random numbers. Or we could generate a large batch of random numbers and "seek" around the batch. There may be other complications as well to ensure compatibility between the two generators. This will probably be a significant chunk of work.

asfimport commented 2 years ago

Jonathan Keane / @jonkeane: Agreed. Either of those approaches sound like they might be fruitful! It's a bit hidden above, but https://github.com/duckdb/duckdb/blob/master/extension/tpch/dbgen/dbgen.cpp#L25-L74 might also be helpful and https://github.com/duckdb/duckdb/blob/31db2bada7548a700d26f61665828dd127abd2a8/extension/tpch/dbgen/dbgen.cpp#L533-L540 is where it looks like it's set|unset.

Also note that the answers for queries that include anything besides aggregations of numerics will have randomness with random data (so we can't quite count on the answers being close enough in the limit with enough samples)


> library(arrowbench)
> library(dplyr, warn.conflicts=FALSE)
> options(arrow.skip_nul = TRUE)
> 
> query_id <- 2
> 
> input_funcs <- get_input_func(
+   engine = "arrow",
+   scale_factor = 1,
+   query_id = query_id,
+   format = "parquet"
+ )
> 
> result <- get_query_func(query_id, "arrow")(input_funcs)
> 
> ans <- tpch_answer(1, query_id = query_id)
> 
> waldo::compare(result, ans)
old vs new
             s_acctbal             s_name         n_name p_partkey         p_mfgr                                s_address         s_phone                                                                                            s_comment
- old[1, ]     9973.06 Supplie#r000009288 UNITED KINGDOM     79287 Manufacturer#4 RiRNXmGRQdw1lbm5hcjV94                   33-645-115-8290 tions snooze permanently final sentiments --daring, close pinto beans should have to eat            
+ new[1, ]     9938.53 Supplier#000005359 UNITED KINGDOM    185358 Manufacturer#4 bgxj2K0w1kJvxYl5mhCfou,W                 33-429-790-6131 uriously regular requests hag                                                                       
- old[2, ]     9955.58 Supplie#r000008234 GERMANY            88233 Manufacturer#5 yg83WC5I YIhQPeTMQmXa                    17-560-538-8734 s promise; silent pinto beans may use? furious theodolites wake ruthlessly duri                     
+ new[2, ]     9937.84 Supplier#000005969 ROMANIA           108438 Manufacturer#1 rdnmd9c8EG1EIAYY3LPVa4yUNx6OwyVaQ        29-520-692-3537 efully express instructions. regular requests against the slyly fin                                 
- old[3, ]     9946.47 Supplie#r000001975 RUSSIA            161974 Manufacturer#1 wdsf5tQIrJKBNmu8I  eY pmAv akzRDSydc     32-937-928-5861 the sometimes permanent frets must haggle som                                                       
+ new[3, ]     9936.22 Supplier#000005250 UNITED KINGDOM       249 Manufacturer#4 qX AB0vP8mJEWeBuY9jri                    33-320-228-2957 etect about the furiously final accounts. slyly ironic pinto beans sleep inside the furiously       
- old[4, ]     9938.94 Supplie#r000006635 FRANCE             41626 Manufacturer#5 kIxC63UH H0et,bSwBA6O81PJ                16-750-637-3467 e through the idly daring warhorses --permanent patterns serve; escapades over th                   
+ new[4, ]     9923.77 Supplier#000002324 GERMANY            29821 Manufacturer#4 uXcnR7tv87dG                             17-779-299-1839 ackages boost blithely. blithely regular deposits c                                                 
- old[5, ]     9916.93 Supplie#r000004383 ROMANIA            84382 Manufacturer#5 tkEkQY F4J4JDu2WRYDk,owFCP7Qh1eOgHWSk9   29-283-925-2487 telets could poach busy waters! bl                                                                  
+ new[5, ]     9871.22 Supplier#000006373 GERMANY            43868 Manufacturer#5 iSLO35z7Ae                               17-813-485-8637 etect blithely bold asymptotes. fluffily ironic platelets wake furiously; blit                      
- old[6, ]     9884.91 Supplie#r000009799 RUSSIA             24794 Manufacturer#5 YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As        32-406-413-9944  the always furious pains                                                                           
+ new[6, ]     9870.78 Supplier#000001286 GERMANY            81285 Manufacturer#2 3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj    17-516-924-4574  regular accounts. furiously unusual courts above the fi                                            
- old[7, ]     9884.91 Supplie#r000009799 RUSSIA             34792 Manufacturer#2 YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As        32-406-413-9944  the always furious pains                                                                           
+ new[7, ]     9870.78 Supplier#000001286 GERMANY           181285 Manufacturer#4 3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj    17-516-924-4574  regular accounts. furiously unusual courts above the fi                                            
- old[8, ]     9836.42 Supplie#r000001571 UNITED KINGDOM     39067 Manufacturer#3 8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6       33-788-833-6417  dogged orbits beyond the escapades shall                                                           
+ new[8, ]     9852.52 Supplier#000008973 RUSSIA             18972 Manufacturer#2 zVfUT3Np22kUC05tYWHBotaR                 32-188-594-7038 rns wake final foxes. carefully unusual depende                                                     
- old[9, ]     9836.42 Supplie#r000001571 UNITED KINGDOM    134031 Manufacturer#2 8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6       33-788-833-6417  dogged orbits beyond the escapades shall                                                           
+ new[9, ]     9847.83 Supplier#000008097 RUSSIA            130557 Manufacturer#2 veMRTQBmUResNvfD3                        32-375-640-3593  the special excuses. silent sentiments serve carefully final ac                                    
- old[10, ]    9781.88 Supplie#r000005361 ROMANIA             2860 Manufacturer#4 lZYHncM3vjsXc3PPqaFlTO                   29-345-620-9471 players --daring, careful forges past the enticing, thin sentiments try to kindle stealthily fr     
+ new[10, ]    9847.57 Supplier#000006345 FRANCE             86344 Manufacturer#1 68yX tGXAkVRSxUGNSjJdptw 8O878xaFnaoQK   16-886-766-7945 ges. slyly regular requests are. ruthless, express excuses cajole blithely across the unu           
and 90 more ...

     old$s_acctbal | new$s_acctbal                
 [1] 9973.1        - 9938.5        [1]            
 [2] 9955.6        - 9937.8        [2]            
 [3] 9946.5        - 9936.2        [3]            
 [4] 9938.9        - 9923.8        [4]            
 [5] 9916.9        - 9871.2        [5]            
 [6] 9884.9        - 9870.8        [6]            
 [7] 9884.9        - 9870.8        [7]            
 [8] 9836.4        - 9852.5        [8]            
 [9] 9836.4        - 9847.8        [9]            
[10] 9781.9        - 9847.6        [10]           
 ... ...             ...           and 90 more ...

     old$s_name           | new$s_name                          
 [1] "Supplie#r000009288" - "Supplier#000005359" [1]            
 [2] "Supplie#r000008234" - "Supplier#000005969" [2]            
 [3] "Supplie#r000001975" - "Supplier#000005250" [3]            
 [4] "Supplie#r000006635" - "Supplier#000002324" [4]            
 [5] "Supplie#r000004383" - "Supplier#000006373" [5]            
 [6] "Supplie#r000009799" - "Supplier#000001286" [6]            
 [7] "Supplie#r000009799" - "Supplier#000001286" [7]            
 [8] "Supplie#r000001571" - "Supplier#000008973" [8]            
 [9] "Supplie#r000001571" - "Supplier#000008097" [9]            
[10] "Supplie#r000005361" - "Supplier#000006345" [10]           
 ... ...                    ...                  and 90 more ...

     old$n_name       | new$n_name                      
 [1] "UNITED KINGDOM" | "UNITED KINGDOM" [1]            
 [2] "GERMANY"        - "ROMANIA"        [2]            
 [3] "RUSSIA"         - "UNITED KINGDOM" [3]            
 [4] "FRANCE"         - "GERMANY"        [4]            
 [5] "ROMANIA"        - "GERMANY"        [5]            
 [6] "RUSSIA"         - "GERMANY"        [6]            
 [7] "RUSSIA"         - "GERMANY"        [7]            
 [8] "UNITED KINGDOM" - "RUSSIA"         [8]            
 [9] "UNITED KINGDOM" - "RUSSIA"         [9]            
[10] "ROMANIA"        - "FRANCE"         [10]           
 ... ...                ...              and 90 more ...

     old$p_partkey | new$p_partkey                
 [1] 79287         - 185358        [1]            
 [2] 88233         - 108438        [2]            
 [3] 161974        - 249           [3]            
 [4] 41626         - 29821         [4]            
 [5] 84382         - 43868         [5]            
 [6] 24794         - 81285         [6]            
 [7] 34792         - 181285        [7]            
 [8] 39067         - 18972         [8]            
 [9] 134031        - 130557        [9]            
[10] 2860          - 86344         [10]           
 ... ...             ...           and 90 more ...

     old$p_mfgr       | new$p_mfgr                      
 [1] "Manufacturer#4" | "Manufacturer#4" [1]            
 [2] "Manufacturer#5" - "Manufacturer#1" [2]            
 [3] "Manufacturer#1" - "Manufacturer#4" [3]            
 [4] "Manufacturer#5" - "Manufacturer#4" [4]            
 [5] "Manufacturer#5" | "Manufacturer#5" [5]            
 [6] "Manufacturer#5" - "Manufacturer#2" [6]            
 [7] "Manufacturer#2" - "Manufacturer#4" [7]            
 [8] "Manufacturer#3" - "Manufacturer#2" [8]            
 [9] "Manufacturer#2" | "Manufacturer#2" [9]            
[10] "Manufacturer#4" - "Manufacturer#1" [10]           
 ... ...                ...              and 90 more ...

old$s_address vs new$s_address
- "RiRNXmGRQdw1lbm5hcjV94"
+ "bgxj2K0w1kJvxYl5mhCfou,W"
- "yg83WC5I YIhQPeTMQmXa"
+ "rdnmd9c8EG1EIAYY3LPVa4yUNx6OwyVaQ"
- "wdsf5tQIrJKBNmu8I  eY pmAv akzRDSydc"
+ "qX AB0vP8mJEWeBuY9jri"
- "kIxC63UH H0et,bSwBA6O81PJ"
+ "uXcnR7tv87dG"
- "tkEkQY F4J4JDu2WRYDk,owFCP7Qh1eOgHWSk9"
+ "iSLO35z7Ae"
- "YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As"
+ "3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj"
- "YvXpPGXYAg2TJLZJDW3LtRqbv97RzF6As"
+ "3gq0mZLHI5OTM6 tBYmLTHZaulCYnlECzQ7nj"
- "8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 "
+ "zVfUT3Np22kUC05tYWHBotaR"
- "8lAIPuLFKWVfDQO LXavjzxLkVNF0sw8a6 "
+ "veMRTQBmUResNvfD3"
- "lZYHncM3vjsXc3PPqaFlTO"
+ "68yX tGXAkVRSxUGNSjJdptw 8O878xaFnaoQK"
and 90 more ...

     old$s_phone       | new$s_phone                      
 [1] "33-645-115-8290" - "33-429-790-6131" [1]            
 [2] "17-560-538-8734" - "29-520-692-3537" [2]            
 [3] "32-937-928-5861" - "33-320-228-2957" [3]            
 [4] "16-750-637-3467" - "17-779-299-1839" [4]            
 [5] "29-283-925-2487" - "17-813-485-8637" [5]            
 [6] "32-406-413-9944" - "17-516-924-4574" [6]            
 [7] "32-406-413-9944" - "17-516-924-4574" [7]            
 [8] "33-788-833-6417" - "32-188-594-7038" [8]            
 [9] "33-788-833-6417" - "32-375-640-3593" [9]            
[10] "29-345-620-9471" - "16-886-766-7945" [10]           
 ... ...                 ...               and 90 more ...

old$s_comment vs new$s_comment
- "tions snooze permanently final sentiments --daring, close pinto beans should have to eat"
+ "uriously regular requests hag"
- "s promise; silent pinto beans may use? furious theodolites wake ruthlessly duri"
+ "efully express instructions. regular requests against the slyly fin"
- "the sometimes permanent frets must haggle som"
+ "etect about the furiously final accounts. slyly ironic pinto beans sleep inside the furiously"
- "e through the idly daring warhorses --permanent patterns serve; escapades over th"
+ "ackages boost blithely. blithely regular deposits c"
- "telets could poach busy waters! bl"
+ "etect blithely bold asymptotes. fluffily ironic platelets wake furiously; blit"
- " the always furious pains "
+ " regular accounts. furiously unusual courts above the fi"
- " the always furious pains "
+ " regular accounts. furiously unusual courts above the fi"
- " dogged orbits beyond the escapades shall "
+ "rns wake final foxes. carefully unusual depende"
- " dogged orbits beyond the escapades shall "
+ " the special excuses. silent sentiments serve carefully final ac"
- "players --daring, careful forges past the enticing, thin sentiments try to kindle stealthily fr"
+ "ges. slyly regular requests are. ruthless, express excuses cajole blithely across the unu"
and 90 more ...