MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Improve error message in CSV parsing from Kafka source #14518

Open vmarcos opened 2 years ago

vmarcos commented 2 years ago

What version of Materialize are you using?

v0.26.1-dev (14bb2ff6b)

How did you install Materialize?

Built from source

What is the issue?

Consider the creation of the following Kafka source, done similarly to the instructions for running TPC-H in Materialize:

CREATE CONNECTION kafka FOR KAFKA BROKER 'localhost:9092';

CREATE SOURCE nation_src
FROM KAFKA CONNECTION kafka TOPIC 'nation'
FORMAT CSV
WITH 5 COLUMNS
DELIMITED BY '|';

CREATE VIEW nation AS
SELECT
    column1::integer as n_nationkey,
    column2 as n_name,
    column3::integer as n_regionkey,
    column4 as n_comment
FROM
    nation_src
;

Now, let's assume that you have a file nation.tbl with the following contents, which can be obtained by running dbgen at scale factor 1:

0|ALGERIA|0| haggle. carefully final deposits detect slyly agai
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special 
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d
5|ETHIOPIA|0|ven packages wake quickly. regu
6|FRANCE|3|refully final requests. regular, ironi
7|GERMANY|3|l platelets. regular accounts x-ray: unusual, regular acco
8|INDIA|2|ss excuses cajole slyly across the packages. deposits print aroun
9|INDONESIA|2| slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull
10|IRAN|4|efully alongside of the slyly final dependencies. 
11|IRAQ|4|nic deposits boost atop the quickly final requests? quickly regula
12|JAPAN|2|ously. final, express gifts cajole a
13|JORDAN|4|ic deposits are blithely about the carefully regular pa
14|KENYA|0| pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t
15|MOROCCO|0|rns. blithely bold courts among the closely regular packages use furiously bold platelets?
16|MOZAMBIQUE|0|s. ironic, unusual asymptotes wake blithely r
17|PERU|1|platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
18|CHINA|2|c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos
19|ROMANIA|3|ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account
20|SAUDI ARABIA|4|ts. silent requests haggle. closely express packages sleep across the blithely
21|VIETNAM|2|hely enticingly express accounts. even, final 
22|RUSSIA|3| requests against the platelets use never according to the quickly regular pint
23|UNITED KINGDOM|3|eans boost carefully special requests. accounts are. carefull
24|UNITED STATES|1|y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be

The file can be loaded to a Kafka topic nation by executing the following command, where TPCH_DATADIR is an environment variable pointing to the directory containing nation.tbl:

kcat -b localhost:9092 -t "nation" -T -P -p -1 -l "$TPCH_DATADIR/nation.tbl" > /dev/null

We can then try to read the data in Materialize:

materialize=> SELECT * FROM nation;
ERROR:  Decode error: Text: CSV error at record number 10: expected 5 columns, got 40. (original bytes: [39, 7c, 49, 4e, 44, 4f, 4e, 45, 53, 49, 41, 7c, 32, 7c, 20, 73, 6c, 79, 6c, 79, 20, 65, 78, 70, 72, 65, 73, 73, 20, 61, 73, 79, 6d, 70, 74, 6f, 74, 65, 73, 2e, 20, 72, 65, 67, 75, 6c, 61, 72, 20, 64, 65, 70, 6f, 73, 69, 74, 73, 20, 68, 61, 67, 67, 6c, 65, 20, 73, 6c, 79, 6c, 79, 2e, 20, 63, 61, 72, 65, 66, 75, 6c, 6c, 79, 20, 69, 72, 6f, 6e, 69, 63, 20, 68, 6f, 63, 6b, 65, 79, 20, 70, 6c, 61, 79, 65, 72, 73, 20, 73, 6c, 65, 65, 70, 20, 62, 6c, 69, 74, 68, 65, 6c, 79, 2e, 20, 63, 61, 72, 65, 66, 75, 6c, 6c])

Note that the file and view contain only four columns, but the source definition is created by declaring five columns.

This issue is a request to improve the error message to clearly signal to the user that the source definition and the data are not aligned. Hopefully, we should be able to detect record boundaries when reading data from the source, so our CSV parsing could identify the mismatch between the number of columns expected and the number of columns given in a record (namely, the first one). Above, the number of columns is overestimated to 40 instead of 4. In addition, the error is detected at record number 10, but it is already present in the first record.

Relevant log output

No response

nmeagan11 commented 2 years ago

Should this fall under https://github.com/MaterializeInc/materialize/issues/12864?