Closed quodlibetor closed 4 years ago
What's broken? The fact that there are duplicates?
All the strings used to be real, I think? Certainly the states look like e.g. NY
, CA
, NV
. I also think things like c_first
used to be a random name like brandon
or nikhil
... I think (cc @umanwizard ?) Zip codes also used to actually be actual numbers, and comments were random sentences.
This is also happening in metabase, my hypothesis is that we're never using the text encoding, or always doing something weird with it.
Why do you suspect the text encoding and not, say, the data generator?
[nix-shell:/nix/store/6srg02sf20nvnbzmhnmm3x35g956r3xl-tpch-dbgen-output/share]$ head customer.tbl
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|
Some of this certainly looks like real data. Not sure what chbenchmark adds on top of that though.
std::string cState = DataSource::randomAlphanumeric62(2);
customerStream << cId << csvDelim; // C_ID
customerStream << dId << csvDelim; // C_D_ID
customerStream << wId << csvDelim; // C_W_ID
DataSource::addAlphanumeric64(8, 16, customerStream, true); // C_FIRST
customerStream << "OE" << csvDelim; // C_MIDDLE
customerStream << cLast << csvDelim; // C_LAST
DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_STREET_1
DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_STREET_2
DataSource::addAlphanumeric64(10, 20, customerStream, true); // C_CITY
customerStream << cState << csvDelim; // C_STATE
DataSource::addWDCZip(customerStream, true); // C_ZIP
DataSource::addNumeric(16, customerStream, true); // C_PHONE
customerStream << customerTime
<< csvDelim; // C_SINCE - date/time given by the os when
// the CUSTOMER table was populated
customerStream << (DataSource::randomTrue(0.1) ? "BC" : "GC")
<< csvDelim; // C_CREDIT
customerStream << "50000.00" << csvDelim; // C_CREDIT_LIM
DataSource::addDouble(0.0, 0.5, 4, customerStream, true); // C_DISCOUNT
customerStream << "-10.00" << csvDelim; // C_BALANCE
customerStream << "10.00" << csvDelim; // C_YTD_PAYMENT
customerStream << "1" << csvDelim; // C_PAYMENT_CNT
customerStream << "0" << csvDelim; // C_DELIVERY_CNT
DataSource::addAlphanumeric64(300, 500, customerStream, true); // C_DATA
customerStream << (int) (cState.c_str())[0]; // C_N_NATIONKEY
customerStream << std::endl;
Maybe what's happening here is @quodlibetor previously saw the sensible data from tpch-dbgen, and then later it was overwritten by junk from chbenchmark.
Why do you suspect the text encoding
Because that's what we've been changing a lot as part of the metabase work, but yeah if chbench has been generating entirely random strings then it's probably there.
And yup, I have historically seen completely valid data.
Additionally, if chbench is generating complete noise then it will ruin things for metabase, which uses heuristics to figure out what queries it should run on the tables for "insight". cc @umanwizard
Yes, the data strings generated by chbenchmark are expected to be random.
Can you elaborate on the "heuristics". What distribution of data should we be generating?
States should be from the set of state names/abbreviations, number-like things should be numbers (preferably in reasonable ranges) zip codes should be 5-digit numbers, countries should be real, etc.
Having a bunch of baked-in assumptions like that in the visualization tool seems totally backwards to me, is that actually important to people? What if people use it outside the US, does "Départment: Mayenne, Seine-Saint-Denis, ..." work as seamlessly as "State: AZ, NY, ..." ? Does the software handle alternate abbreviations like "Mass." vs. "MA" correctly? What is a "real" country. Are North and South Korea separate countries? What about China and Taiwan? Just a few random examples. Relying on this sort of thing in what's supposed to be a general-purpose tool strikes me as very brittle and specific.
Superseded by the issues in #1152
~Things are looking fine in metabase so I think that we're doing something weird possibly in our text serialization?~
EDIT: sometimes things look fine in metabase, and sometimes they look bad, suggesting that it is probably not our serialization.