Consider the creation of the following Kafka source, done similarly to the instructions for running TPC-H in Materialize:
CREATE CONNECTION kafka FOR KAFKA BROKER 'localhost:9092';
CREATE SOURCE nation_src
FROM KAFKA CONNECTION kafka TOPIC 'nation'
FORMAT CSV
WITH 5 COLUMNS
DELIMITED BY '|';
CREATE VIEW nation AS
SELECT
column1::integer as n_nationkey,
column2 as n_name,
column3::integer as n_regionkey,
column4 as n_comment
FROM
nation_src
;
Now, let's assume that you have a file nation.tbl with the following contents, which can be obtained by running dbgen at scale factor 1:
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d
5|ETHIOPIA|0|ven packages wake quickly. regu
6|FRANCE|3|refully final requests. regular, ironi
7|GERMANY|3|l platelets. regular accounts x-ray: unusual, regular acco
8|INDIA|2|ss excuses cajole slyly across the packages. deposits print aroun
9|INDONESIA|2| slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull
10|IRAN|4|efully alongside of the slyly final dependencies.
11|IRAQ|4|nic deposits boost atop the quickly final requests? quickly regula
12|JAPAN|2|ously. final, express gifts cajole a
13|JORDAN|4|ic deposits are blithely about the carefully regular pa
14|KENYA|0| pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t
15|MOROCCO|0|rns. blithely bold courts among the closely regular packages use furiously bold platelets?
16|MOZAMBIQUE|0|s. ironic, unusual asymptotes wake blithely r
17|PERU|1|platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
18|CHINA|2|c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos
19|ROMANIA|3|ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account
20|SAUDI ARABIA|4|ts. silent requests haggle. closely express packages sleep across the blithely
21|VIETNAM|2|hely enticingly express accounts. even, final
22|RUSSIA|3| requests against the platelets use never according to the quickly regular pint
23|UNITED KINGDOM|3|eans boost carefully special requests. accounts are. carefull
24|UNITED STATES|1|y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be
The file can be loaded to a Kafka topic nation by executing the following command, where TPCH_DATADIR is an environment variable pointing to the directory containing nation.tbl:
Note that the file and view contain only four columns, but the source definition is created by declaring five columns.
This issue is a request to improve the error message to clearly signal to the user that the source definition and the data are not aligned. Hopefully, we should be able to detect record boundaries when reading data from the source, so our CSV parsing could identify the mismatch between the number of columns expected and the number of columns given in a record (namely, the first one). Above, the number of columns is overestimated to 40 instead of 4. In addition, the error is detected at record number 10, but it is already present in the first record.
What version of Materialize are you using?
v0.26.1-dev (14bb2ff6b)
How did you install Materialize?
Built from source
What is the issue?
Consider the creation of the following Kafka source, done similarly to the instructions for running TPC-H in Materialize:
Now, let's assume that you have a file
nation.tbl
with the following contents, which can be obtained by running dbgen at scale factor 1:The file can be loaded to a Kafka topic
nation
by executing the following command, whereTPCH_DATADIR
is an environment variable pointing to the directory containingnation.tbl
:kcat -b localhost:9092 -t "nation" -T -P -p -1 -l "$TPCH_DATADIR/nation.tbl" > /dev/null
We can then try to read the data in Materialize:
Note that the file and view contain only four columns, but the source definition is created by declaring five columns.
This issue is a request to improve the error message to clearly signal to the user that the source definition and the data are not aligned. Hopefully, we should be able to detect record boundaries when reading data from the source, so our CSV parsing could identify the mismatch between the number of columns expected and the number of columns given in a record (namely, the first one). Above, the number of columns is overestimated to 40 instead of 4. In addition, the error is detected at record number 10, but it is already present in the first record.
Relevant log output
No response