Open yosefrat opened 5 years ago
KSQL does not follow ANSI SQL semantics, but defines its own semantics.
The semantics are inherited from Kafka Streams, and the underlying reasoning is the following: for some operations, null-key or null-value implies that the operation is not well defined (note that Kafka Streams has a key-value-pair model, not a tuple model like KSQL). To simplify the reasoning about operator semantics, all stateful operators (ie, aggregations, joins) in Kafka Streams treat null-key or null-value as invalid input and drop the record.
I still see your point, and it might be desirable to change some semantics. I just try to give some back ground information. I am also not sure atm, if KSQL could work around the current Kafka Streams behavior or if a change to Kafka Streams would be required for this case.
Hope this help.
It does indeed make sense and help, and I think since the behavior is well documented for KSQL you've done due diligence on the code. I just want to be sure to point out that using the term "LEFT JOIN" to describe a join that drops records from the left portion of the join contradicts the typical behavior of a standard SQL LEFT JOIN, and may be misleading. Thank you for your explanation :-)
Apache Kafka ticket to change the behavior: https://issues.apache.org/jira/browse/KAFKA-12317
@mjsax how is it best we track this for ksqlDB? Should we have this ticket open as an ER to change the behaviour in ksqlDB once Kafka Streams supports it? Or can we close it and it will happen automagically and inherit the behaviour from KS?
In the meantime we should document clearly the behaviour as a caveat on https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/ - @JimGalasyn can you do that from here or should i log a DOCS jira?
HI @rmoff , please open a DOCS ticket against me with details. Thanks!
In general, ksqlDB should just inherit the fix from Kafka Streams -- but it might be worth anyway to add a test to ksqlDB to verify it. I guess we can use this ticket to ensure that we add the corresponding test.
From what I've just been observing\testing, this is also an issue with TABLE-LEFT OUTER JOIN-TABLE in ksqlDB. Records were missing (when those records had NULL foreign keys). I then changed my join to look like this:
FROM primary_key_table
LEFT OUTER JOIN foreign_key_table COALESCE(primary_key_table.foreign_key_id, 0) = foreign_key_table.id
and the records I expected to see started populating.
That's correct. However, for table-table joi, it's harder to fix... I don't think that Kafka Streams would add a fix for this. Maybe it's possible to fix inside ksqlDB though.
Btw: For stream-table joins, the Kafka Streams ticket was resolved with 3.7 release recently, and ksqlDB should pickup this fix in a future release.
Is this limitation documented somewhere? I know I didn't see anything around this for ksqlDB. Would've saved me a bit of a headache in the meantime. It'd be great if ksqlDB could address this too.
Thanks for the update though on stream-table joins!
Oh sorry. I miss-read your message... For fk-joins, the same fix (https://cwiki.apache.org/confluence/display/KAFKA/KIP-962%3A+Relax+non-null+key+requirement+in+Kafka+Streams) for KS as mentioned above will also apply to fk-joins...
I first (incorrectly) though, you are taking about input records null-keys, which are not allowed because PK always are a NOT NULL
constraint.
It's seems, the current limitation is not documented...
For a Stream-to-Table join this statement in the documentation (and associated behavior in KSQL) contradicts the ANSI definition of a LEFT JOIN: "Input records for the stream with a NULL key or a NULL value are ignored and don’t trigger the join." - https://docs.confluent.io/current/ksql/docs/developer-guide/join-streams-and-tables.html For an INNER JOIN this concept and result applies in an expected fashion: if there is a null pointer in the stream object then the join would not be triggered and returned since there is no intersection between the two objects. However, if we LEFT JOIN a stream to a table where the stream value has a NULL the trigger is not run and therefore we are missing records from the LEFT referent. Since this is a documented behavior I assume that was intentional but doesn't it contradict the expected ANSI SQL-like behavior?