MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Data from tpcch looks bad in any UI #1099

Closed quodlibetor closed 4 years ago

quodlibetor commented 4 years ago

~Things are looking fine in metabase so I think that we're doing something weird possibly in our text serialization?~

EDIT: sometimes things look fine in metabase, and sometimes they look bad, suggesting that it is probably not our serialization.

mz> select * from mysql_tpcch_customer limit 5;
+--------+----------+----------+---------------+------------+------------+-------------------+----------------+--------------------+-----------+-----------+------------------+------------+------------+----------------+--------------+-------------+-----------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------
| c_id   | c_d_id   | c_w_id   | c_first       | c_middle   | c_last     | c_street_1        | c_street_2     | c_city             | c_state   | c_zip     | c_phone          | c_since    | c_credit   | c_credit_lim   | c_discount   | c_balance   | c_ytd_payment   | c_payment_cnt   | c_delivery_cnt   | c_data
|--------+----------+----------+---------------+------------+------------+-------------------+----------------+--------------------+-----------+-----------+------------------+------------+------------+----------------+--------------+-------------+-----------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------
| 256    | 1        | 1        | L7LSAJy5glPRM | OE         | ABLEESEESE | LrUg16BfgDPaF     | YW?JrLNUXDJVG2 | O1JQjaZZbfDpZV9u   | ka        | 634111111 | 0822824740563059 | 2019-11-21 | GC         | 50000.00       | 0.3512       | -10.00      | 10.00           | 1               | 0                | fzEbGadRWzYNtt9B@TxTnJ1?IyzBEhxGjuBQLKgFm1P?8bP1s33PFBJNmDjGR5BqmHx0BL?ONXuLOWoQYpQmRnIVFlQumKHhKKkt5fL4sCTRQ8?iZcaN4i7y8XpG@2
| 256    | 1        | 1        | L7LSAJy5glPRM | OE         | ABLEESEESE | LrUg16BfgDPaF     | YW?JrLNUXDJVG2 | O1JQjaZZbfDpZV9u   | ka        | 634111111 | 0822824740563059 | 2019-11-21 | GC         | 50000.00       | 0.3512       | -10.00      | 10.00           | 1               | 0                | fzEbGadRWzYNtt9B@TxTnJ1?IyzBEhxGjuBQLKgFm1P?8bP1s33PFBJNmDjGR5BqmHx0BL?ONXuLOWoQYpQmRnIVFlQumKHhKKkt5fL4sCTRQ8?iZcaN4i7y8XpG@2
| 256    | 1        | 1        | L7LSAJy5glPRM | OE         | ABLEESEESE | LrUg16BfgDPaF     | YW?JrLNUXDJVG2 | O1JQjaZZbfDpZV9u   | ka        | 634111111 | 0822824740563059 | 2019-11-21 | GC         | 50000.00       | 0.3512       | -10.00      | 10.00           | 1               | 0                | fzEbGadRWzYNtt9B@TxTnJ1?IyzBEhxGjuBQLKgFm1P?8bP1s33PFBJNmDjGR5BqmHx0BL?ONXuLOWoQYpQmRnIVFlQumKHhKKkt5fL4sCTRQ8?iZcaN4i7y8XpG@2
| 256    | 2        | 1        | joLAJidVn7    | OE         | ABLEESEESE | saw2f0vcFZKRzNo02 | qGSQG1w2B5ca9  | QwJSNmXSy6wRL4dbJJ | 6Q        | 135611111 | 9717241426978444 | 2019-11-21 | GC         | 50000.00       | 0.4654       | -10.00      | 10.00           | 1               | 0                | OE3q6LlL82F45PqYVdZ?rMCOPpzeChN6ZvdnT5hiMNVpx3afYs?@AvA4N49WC9NeSXKL90izOmwl8tgh5uzvEltUlVfCVSnieihArqqC8ZQ1VFpTXSQgq8@8bnOKmI
| 256    | 2        | 1        | joLAJidVn7    | OE         | ABLEESEESE | saw2f0vcFZKRzNo02 | qGSQG1w2B5ca9  | QwJSNmXSy6wRL4dbJJ | 6Q        | 135611111 | 9717241426978444 | 2019-11-21 | GC         | 50000.00       | 0.4654       | -10.00      | 10.00           | 1               | 0                | OE3q6LlL82F45PqYVdZ?rMCOPpzeChN6ZvdnT5hiMNVpx3afYs?@AvA4N49WC9NeSXKL90izOmwl8tgh5uzvEltUlVfCVSnieihArqqC8ZQ1VFpTXSQgq8@8bnOKmI
+--------+----------+----------+---------------+------------+------------+-------------------+----------------+--------------------+-----------+-----------+------------------+------------+------------+----------------+--------------+-------------+-----------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------
SELECT
benesch commented 4 years ago

What's broken? The fact that there are duplicates?

quodlibetor commented 4 years ago

All the strings used to be real, I think? Certainly the states look like e.g. NY, CA, NV. I also think things like c_first used to be a random name like brandon or nikhil... I think (cc @umanwizard ?) Zip codes also used to actually be actual numbers, and comments were random sentences.

quodlibetor commented 4 years ago

This is also happening in metabase, my hypothesis is that we're never using the text encoding, or always doing something weird with it.

benesch commented 4 years ago

Why do you suspect the text encoding and not, say, the data generator?

jamii commented 4 years ago
[nix-shell:/nix/store/6srg02sf20nvnbzmhnmm3x35g956r3xl-tpch-dbgen-output/share]$ head customer.tbl 
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|

Some of this certainly looks like real data. Not sure what chbenchmark adds on top of that though.

jamii commented 4 years ago
    std::string cState = DataSource::randomAlphanumeric62(2);

    customerStream << cId << csvDelim;                        // C_ID
    customerStream << dId << csvDelim;                        // C_D_ID
    customerStream << wId << csvDelim;                        // C_W_ID
    DataSource::addAlphanumeric64(8, 16, customerStream, true);  // C_FIRST
    customerStream << "OE" << csvDelim;                       // C_MIDDLE
    customerStream << cLast << csvDelim;                      // C_LAST
    DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_STREET_1
    DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_STREET_2
    DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_CITY
    customerStream << cState << csvDelim;                     // C_STATE
    DataSource::addWDCZip(customerStream, true);                 // C_ZIP
    DataSource::addNumeric(16, customerStream, true);            // C_PHONE
    customerStream << customerTime
                   << csvDelim; // C_SINCE - date/time given by the os when
                                // the CUSTOMER table was populated
    customerStream << (DataSource::randomTrue(0.1) ? "BC" : "GC")
                   << csvDelim;                            // C_CREDIT
    customerStream << "50000.00" << csvDelim;              // C_CREDIT_LIM
    DataSource::addDouble(0.0, 0.5, 4, customerStream, true); // C_DISCOUNT
    customerStream << "-10.00" << csvDelim;                // C_BALANCE
    customerStream << "10.00" << csvDelim;                 // C_YTD_PAYMENT
    customerStream << "1" << csvDelim;                     // C_PAYMENT_CNT
    customerStream << "0" << csvDelim;                     // C_DELIVERY_CNT
    DataSource::addAlphanumeric64(300, 500, customerStream, true); // C_DATA
    customerStream << (int) (cState.c_str())[0];                // C_N_NATIONKEY
    customerStream << std::endl;
jamii commented 4 years ago

Maybe what's happening here is @quodlibetor previously saw the sensible data from tpch-dbgen, and then later it was overwritten by junk from chbenchmark.

quodlibetor commented 4 years ago

Why do you suspect the text encoding

Because that's what we've been changing a lot as part of the metabase work, but yeah if chbench has been generating entirely random strings then it's probably there.

quodlibetor commented 4 years ago

And yup, I have historically seen completely valid data.

Additionally, if chbench is generating complete noise then it will ruin things for metabase, which uses heuristics to figure out what queries it should run on the tables for "insight". cc @umanwizard

umanwizard commented 4 years ago

Yes, the data strings generated by chbenchmark are expected to be random.

Can you elaborate on the "heuristics". What distribution of data should we be generating?

quodlibetor commented 4 years ago

States should be from the set of state names/abbreviations, number-like things should be numbers (preferably in reasonable ranges) zip codes should be 5-digit numbers, countries should be real, etc.

umanwizard commented 4 years ago

Having a bunch of baked-in assumptions like that in the visualization tool seems totally backwards to me, is that actually important to people? What if people use it outside the US, does "Départment: Mayenne, Seine-Saint-Denis, ..." work as seamlessly as "State: AZ, NY, ..." ? Does the software handle alternate abbreviations like "Mass." vs. "MA" correctly? What is a "real" country. Are North and South Korea separate countries? What about China and Taiwan? Just a few random examples. Relying on this sort of thing in what's supposed to be a general-purpose tool strikes me as very brittle and specific.

cuongdo commented 4 years ago

Superseded by the issues in #1152