confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
128 stars 1.04k forks source link

ksqlDB will drop left-hand records in a LEFT JOIN if the join key is null #2657

Open yosefrat opened 5 years ago

yosefrat commented 5 years ago

For a Stream-to-Table join this statement in the documentation (and associated behavior in KSQL) contradicts the ANSI definition of a LEFT JOIN: "Input records for the stream with a NULL key or a NULL value are ignored and don’t trigger the join." - https://docs.confluent.io/current/ksql/docs/developer-guide/join-streams-and-tables.html For an INNER JOIN this concept and result applies in an expected fashion: if there is a null pointer in the stream object then the join would not be triggered and returned since there is no intersection between the two objects. However, if we LEFT JOIN a stream to a table where the stream value has a NULL the trigger is not run and therefore we are missing records from the LEFT referent. Since this is a documented behavior I assume that was intentional but doesn't it contradict the expected ANSI SQL-like behavior?

mjsax commented 5 years ago

KSQL does not follow ANSI SQL semantics, but defines its own semantics.

The semantics are inherited from Kafka Streams, and the underlying reasoning is the following: for some operations, null-key or null-value implies that the operation is not well defined (note that Kafka Streams has a key-value-pair model, not a tuple model like KSQL). To simplify the reasoning about operator semantics, all stateful operators (ie, aggregations, joins) in Kafka Streams treat null-key or null-value as invalid input and drop the record.

I still see your point, and it might be desirable to change some semantics. I just try to give some back ground information. I am also not sure atm, if KSQL could work around the current Kafka Streams behavior or if a change to Kafka Streams would be required for this case.

Hope this help.

yosefrat commented 5 years ago

It does indeed make sense and help, and I think since the behavior is well documented for KSQL you've done due diligence on the code. I just want to be sure to point out that using the term "LEFT JOIN" to describe a join that drops records from the left portion of the join contradicts the typical behavior of a standard SQL LEFT JOIN, and may be misleading. Thank you for your explanation :-)

mjsax commented 3 years ago

Apache Kafka ticket to change the behavior: https://issues.apache.org/jira/browse/KAFKA-12317

rmoff commented 3 years ago

@mjsax how is it best we track this for ksqlDB? Should we have this ticket open as an ER to change the behaviour in ksqlDB once Kafka Streams supports it? Or can we close it and it will happen automagically and inherit the behaviour from KS?

In the meantime we should document clearly the behaviour as a caveat on https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/ - @JimGalasyn can you do that from here or should i log a DOCS jira?

JimGalasyn commented 3 years ago

HI @rmoff , please open a DOCS ticket against me with details. Thanks!

mjsax commented 3 years ago

In general, ksqlDB should just inherit the fix from Kafka Streams -- but it might be worth anyway to add a test to ksqlDB to verify it. I guess we can use this ticket to ensure that we add the corresponding test.

burkematthew commented 9 months ago

From what I've just been observing\testing, this is also an issue with TABLE-LEFT OUTER JOIN-TABLE in ksqlDB. Records were missing (when those records had NULL foreign keys). I then changed my join to look like this:

FROM primary_key_table
LEFT OUTER JOIN foreign_key_table COALESCE(primary_key_table.foreign_key_id, 0) = foreign_key_table.id

and the records I expected to see started populating.

mjsax commented 9 months ago

That's correct. However, for table-table joi, it's harder to fix... I don't think that Kafka Streams would add a fix for this. Maybe it's possible to fix inside ksqlDB though.

Btw: For stream-table joins, the Kafka Streams ticket was resolved with 3.7 release recently, and ksqlDB should pickup this fix in a future release.

burkematthew commented 9 months ago

Is this limitation documented somewhere? I know I didn't see anything around this for ksqlDB. Would've saved me a bit of a headache in the meantime. It'd be great if ksqlDB could address this too.

Thanks for the update though on stream-table joins!

mjsax commented 9 months ago

Oh sorry. I miss-read your message... For fk-joins, the same fix (https://cwiki.apache.org/confluence/display/KAFKA/KIP-962%3A+Relax+non-null+key+requirement+in+Kafka+Streams) for KS as mentioned above will also apply to fk-joins...

I first (incorrectly) though, you are taking about input records null-keys, which are not allowed because PK always are a NOT NULL constraint.

It's seems, the current limitation is not documented...