Breaking change: Standardized schemas returned by /query endpoint

confluentinc / ksql

The database purpose-built for stream processing applications.

Other

128 stars 1.04k forks source link

The websocket /query endpoint returns the schema of the result in the first row. The RESTful response from /query does not include a schema at all. (The CLI gets around this by first querying for the schema).

The schema returned by the websocket is in the form:

[
{"name":"ROWTIME","schema":{"type":"BIGINT","fields":null,"memberSchema":null}},
{"name":"ROWKEY","schema":{"type":"STRING","fields":null,"memberSchema":null}},
{"name":"VIEWTIME","schema":{"type":"BIGINT","fields":null,"memberSchema":null}},
{"name":"PAGEID","schema":{"type":"STRING","fields":null,"memberSchema":null}},
{"name":"USERID","schema":{"type":"ARRAY","fields":null,"memberSchema":"STRING"}}
]

i.e. its a JSON document describing a Connect style schema.

What we should return is the SQL schema! e.g. something like the following:

per-column sql schema:

[
{"name":"ROWTIME","schema": "BIGINT"},
{"name":"ROWKEY","schema":"STRING"},
{"name":"VIEWTIME","schema":"BIGINT"},
{"name":"PAGEID","schema":"STRING"},
{"name":"USERID","schema":"ARRAY<STRING>"}
]

AND we should look to provide the schema for RESTful queries too. Or just a straight up SQL line:

ROWTIME BIGINT, ROWKEY STRING, VIEWTIME BIGINT, PAGEID STRING, USERID ARRAY<STRING>

Thinking on this more... if we take the example of a schema that contains a STRUCT column, e.g.

ID INT KEY, THING STRUCT<F0 INT, F1 STRING>

Personally, I really don't like us transmissitting the struct column using the old format of the following, (with null fields removed):

[
{"name":"ID","schema":{"type":"INT"},
{"name":"THING","schema":{"type":"STRUCT","fields":[{"name":"F0", "schema":{"type": "INT"}, {"name":"F1", "schema":{"type": "STRING"}]}},
]

We have a succinct was of expression the SQL types.... it's SQL!

The other option on splitting it per-column breaks down with complex column types as the client receiving the response still needs to be able to parse the F0 INT, F1 STRING bit. if they can parse that, then they can parse the whole string.

Hence, my preference is for just a sql schema.

If guess we could support a query parameter to switch to the more verbose

[
{"name":"ID","schema":{"type":"INT"},
{"name":"THING","schema":{"type":"STRUCT","fields":[{"name":"F0", "schema":{"type": "INT"}, {"name":"F1", "schema":{"type": "STRING"}]}},
]

confluentinc / ksql

Breaking change: Standardized schemas returned by /query endpoint #3814