confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
102 stars 1.04k forks source link

ksqlDB should support N-Way joins on the same table multiple times #6484

Open vrudenskyi opened 3 years ago

vrudenskyi commented 3 years ago

Describe the bug N-Way Join for the same table does not work

To Reproduce Steps to reproduce the behavior, include:

  1. The version of KSQL: Server v0.12.0
  2. Sample source data: n/a
  3. Any SQL statements you ran
    CREATE STREAM joined AS 
    SELECT * 
    FROM A
    LEFT JOIN B b1 ON A.b1_id = b1.id
    LEFT JOIN B b2 ON A.b2_id = b2.id;

    Expected behavior Stream created

Actual behaviour

CLI output/Error messages: Can not join 'B' to 'B': self joins are not yet supported.

Additional context n/a

vcrfxia commented 3 years ago

Hi @vrudenskyi , the behavior you're requesting is not possible in ksqlDB today as Kafka Streams (the underlying runtime) does not support it, and workarounds would require duplicating data in wasteful ways. Here are some related issues:

(cc @mjsax in case something has changed since the filing of those issues)

I'm going to update this issue from bug to enhancement accordingly.

agavra commented 3 years ago

@vcrfxia - I think the one thing that's interesting is that this isn't technically a self-join since there is no loop (though the limitation in KS probably still applies here):

  A
  |
  |--B
  |
  AB
  |
  |--B
  |
  ABB

We might want to change the error message to "n-way joins do not support reading the same topic multiple times"

mjsax commented 3 years ago

As the statement is CSAS it seems that A is a stream and B is a table. For this case, if should be possible to actually compile the query? It's not necessary to read B twice.

KTable b = builder.table("bTopic");
KStream A = builder.stream("aTopic");
a.selectKey(b1_id).join(b).select(b2_id).join(b);

I guess our check if a stream or table is referenced twice is rather "primitive" (I guess count based?).

vcrfxia commented 3 years ago

We might want to change the error message to "n-way joins do not support reading the same topic multiple times"

Sure, I've opened a PR to clarify the error message: https://github.com/confluentinc/ksql/pull/6527

rmoff commented 3 years ago

(To help others find this issue, the latest (0.14) version of the message is N-way joins do not support multiple occurrences of the same source)


I also tried duplicating the table definition to a new object against the same source topic, but ksqlDB is too clever for me:

Invalid topology: Topic TIPLOC_FLAT_KEYED has already been registered by another source.

I'd like to +1 making this possible - it's a common scenario in analytics to use the same lookup table multiple times, for example:

From the sounds of it this is actually possible today in Kafka Streams, but ksqlDB does not permit it - so hopefully it's a simple fix? 👼 /cc @MichaelDrogalis @derekjn

dttouchdata commented 2 years ago

We are hitting this issue as @rmoff mentioned in our reporting environment. Are there any alternatives or update on this issue? Thanks

Ktl-XV commented 2 years ago

Hello @vcrfxia the upstream issue you referenced Kafka Streams is now Resolved. Can the fix also be implemented in ksqldb?

vcrfxia commented 2 years ago

@vpapavas @jnh5y you started looking into https://github.com/confluentinc/ksql/issues/2030 right? Is this ticket (for n-way joins) also in scope for what you're looking at?

vpapavas commented 2 years ago

Hey @vcrfxia! N-way joins are not really in scope, nor are table-table self-joins. We might though get table-table self-joins as a side bonus if they are easy to support :)

killermoon commented 1 year ago

Any updates on this ? Or any workaround for this as we are investigating usage of another tool due to this reason

mjsax commented 1 year ago

Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins

But ksqlDB does not yet leverage it.

killermoon commented 1 year ago

Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins

But ksqlDB does not yet leverage it.

Fine but we only use KsqlDb in order to map, structure and deliver our data trough KSQL.