Make it easier to change the key of a created table

rmoff commented 5 years ago

related: #2314

Consider two tables, computer and employee. We want to join them to determine which computer each employee uses (NB table-table joins are 1:1, not 1:N). Having joined them, we want to key it on MAC_ADDR for the purposes of joining to a stream of network data.

The table join is simple:

CREATE TABLE FOO AS 
SELECT c.*, e.* 
FROM computer c 
       INNER JOIN employee e 
      ON c.empkey = e.key

To rekey this currently requires:

-- Register the underlying topic as a stream
CREATE STREAM FOO_STREAM WITH (KAFKA_TOPIC='FOO', VALUE_FORMAT='AVRO');
-- Rekey
CREATE STREAM FOO_STREAM_BY_MACADDR AS SELECT * FROM FOO_STREAM PARTITION BY MAC_ADDR;
-- Declare table on rekeyed data
CREATE TABLE FOO_BY_MACADDR WITH (KAFKA_TOPIC='FOO_STREAM_BY_MACADDR', VALUE_FORMAT='AVRO');

More elegant / less complicated options would include:

Declare the key of the target table, KSQL would automagically rekey in the background:

CREATE TABLE FOO_BY_MACADDR WITH (KEY='MACADDR') AS 
  SELECT c.*, e.* 
  FROM computer c 
        INNER JOIN employee e 
        ON c.empkey = e.key

Explicitly rekey the data as part of the SELECT:

CREATE TABLE FOO_BY_MACADDR AS 
  SELECT c.*, e.* 
  FROM computer c 
        INNER JOIN employee e 
        ON c.empkey = e.key
  PARTITION BY MACADDR

Support rekeying one table to another (only supported for streams):

CREATE TABLE FOO_BY_MACADDR AS 
  SELECT * FROM FOO 
  PARTITION BY MACADDR

peoplemerge commented 5 years ago

Note: according to #2811 (mine) this doesn't seem to work with a composite rowkey.

peoplemerge commented 5 years ago

@rmoff I've had some trouble reproducing these workarounds in automated tests. Any examples of this?

For reference: https://github.com/peoplemerge/ksql/commit/202c0234c360c38581c7f018cde9a2ed41fccc12

mjsax commented 3 years ago

We are adding FK-joins right now. Will those resolve this issue?

In general, plain re-keying of a table is semantically not a well define operation. Assume the following input table:

ID (Primary Key) | A
-----------------+-----
 1               | 100
 2               | 100

If we would allow PARTITION BY a, what should the result be?

 A (PK) | ID          A (PK) | ID
 -------+----         -------+----
  100   | 1            100   | 2

The result is not well defined and thus the query is invalid.

Thoughts?

confluentinc / ksql

Make it easier to change the key of a created table #2356