Closed: uday1409 closed this issue 1 year ago.
I have raised a PR for this, please review. I have tested the modified jar locally and it is able to track the lineage for the job in question.
@uday1409 It would be nice to have a unit or integration test for this, or at least a short example of a Spark query that is failing, so we could create a test ourselves. Thank you.
Sure, I will check if I can replicate it with an example, as I cannot share the actual code here. I will post an update here.
@wajda @cerveada The agent fails for the query below; this should be a pretty common scenario, I believe.
The issue lies in this query: "SELECT * FROM NonNullableTest where id in (select id from NonNullableTest_1)"
// Create and populate the source tables
spark.sql("CREATE TABLE NonNullableTest (id INT NOT NULL, name STRING, age INT)")
spark.sql("INSERT INTO NonNullableTest VALUES (1, 'John', 25), (2, 'Jane', 30), (3, 'Bob', NULL)")
spark.sql("CREATE TABLE NonNullableTest_1 (id INT, city STRING, country STRING)")
spark.sql("INSERT INTO NonNullableTest_1 VALUES (null, 'New York', 'USA'), (2, 'London', 'UK'), (3, 'Paris', 'France')")

// CTAS with an IN subquery: the left-hand side (NonNullableTest.id) is NOT NULL,
// while the right-hand side (NonNullableTest_1.id) can yield NULL
spark.sql(
  """CREATE TABLE NonNullable_final AS
     SELECT * FROM NonNullableTest where id in (select id from NonNullableTest_1)
  """).explain(true)
Lineage is not tracked for jobs that involve an IN clause whose left-hand side is NOT NULL while the right-hand side (the subquery) yields NULL values. Under SQL three-valued logic such a predicate can evaluate to NULL, so the expression is nullable even though its left-hand side is not.
This was not a problem earlier; the nullability bug was recently identified in Spark and the fix was backported to certain Spark versions, which is what causes the Spline parser to fail.
More details: Spark PR https://github.com/apache/spark/pull/41094
Spark issue: https://issues.apache.org/jira/browse/SPARK-43413
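To illustrate why the predicate has to be nullable even though the left-hand side column is NOT NULL, here is a minimal sketch using an inline value list instead of a subquery; the three-valued logic is the same:

// When the probe value matches nothing and the list contains a NULL,
// the IN predicate evaluates to NULL rather than false
spark.sql("SELECT 1 IN (2, 3, NULL) AS in_result").show()
// in_result is NULL, not false, so the expression's nullability cannot be hard-coded to false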
Spline throws an error after the above change in Spark. The code linked below was added recently; by default, nullable returns false there, hence this case needs to be handled specifically in Spline.
https://github.com/apache/spark/commit/2e56821830019765bf8530e0e6a8a5abd6125293
https://github.com/apache/spark/blob/2e56821830019765bf8530e0e6a8a5abd6125293/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
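For reference, a rough idea of how the nullability could be derived defensively on the agent side. This is only a sketch against the Catalyst classes involved (InSubquery / ListQuery), not the actual fix from the PR, and the helper name is made up:

import scala.util.Try
import org.apache.spark.sql.catalyst.expressions.{Expression, InSubquery, ListQuery}

// Hypothetical helper: compute an expression's nullability without relying on
// ListQuery.nullable, whose behaviour changed with SPARK-43413
def safeNullable(e: Expression): Boolean = e match {
  case ls: ListQuery =>
    // A ListQuery cannot be evaluated on its own; derive nullability from its plan output
    ls.plan.output.exists(_.nullable)
  case in: InSubquery =>
    // An IN subquery is nullable if either side of the predicate can produce NULL
    in.values.exists(safeNullable) || safeNullable(in.query)
  case other =>
    // Fall back to the expression's own nullability, defaulting to true if it throws
    Try(other.nullable).getOrElse(true)
}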