Closed bithw1 closed 5 years ago
Hi @bithw1
I'll try to take a look today or tomorrow.
Best regards, Bartosz.
@bithw1
The range method returns a DataFrame with a single column named id. That is why the engine looks for it in your query declaration:
* Creates a [[Dataset]] with a single `LongType` column named `id`, containing elements
* in a range from `start` to `end` (exclusive) with step value 1.
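In other words, spark.range(10, 40) produces the same ids as Scala's half-open `10L until 40L`: the end is exclusive and the step is 1. A quick Spark-free sketch of those semantics:

```scala
// Models the ids produced by spark.range(10, 40): a half-open
// range [10, 40) with step 1, i.e. 30 values from 10 to 39.
val ids: Seq[Long] = (10L until 40L).toSeq
```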
If you rewrite your query like this it should work:
val spark = SparkSession.builder().master("local").appName("SparkTest").enableHiveSupport().getOrCreate()
spark.experimental.extraOptimizations = Seq(RangeIntersectRule)
spark.range(10, 40).createOrReplaceTempView("t1")
spark.range(20, 50).createOrReplaceTempView("t2")
val df = spark.sql("select t1.id from t1 join t2 on t1.id = t2.id")
df.explain(true)
df.show(truncate = false)
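For what it's worth, the inner join of t1 and t2 on id is just the set intersection of the half-open ranges [10, 40) and [20, 50), i.e. the ids 20 through 39. A plain-Scala sanity check (no Spark) of what the optimized plan should return:

```scala
// The two ranges registered as t1 and t2 above.
val t1 = (10L until 40L).toSet
val t2 = (20L until 50L).toSet

// An inner equi-join on id over these ranges is their set intersection,
// which is itself the half-open range [20, 40).
val joined: Seq[Long] = (t1 intersect t2).toSeq.sorted
```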
Best regards, Bartosz.
Thanks @bartosz25, but when I rewrite the query to use the id column, the rule doesn't take effect; you can see that the physical plan still uses BroadcastHashJoin. When using id, I tried to modify the rule's apply method, but I still can't make it work; it still throws:
Caused by: java.lang.RuntimeException: Couldn't find id#0L in [id#2L]
object RangeIntersectRule extends Rule[LogicalPlan] {
override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
case Join(Range(start1, end1, 1, Some(1), output1, false), Range(start2, end2, 1, Some(1), output2, false), Inner, _) => {
val start = start1 max start2
val end = end1 min end2
if (start1 > end2 || end1 < start2) Range(0, 0, 1, Some(1), output1, false)
else Range(start, end, 1, Some(1), output1, false)
}
}
}
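As an aside, the replacement the rule performs is plain interval arithmetic on half-open ranges: the intersection of [start1, end1) and [start2, end2) is [max(start1, start2), min(end1, end2)), and it is empty when the ranges don't overlap. A Spark-free sketch of that arithmetic, where the (0, 0) result mirrors the rule's empty Range(0, 0, ...):

```scala
// Intersects two half-open ranges [start1, end1) and [start2, end2),
// returning (0, 0) for an empty result, mirroring RangeIntersectRule.
def intersect(start1: Long, end1: Long, start2: Long, end2: Long): (Long, Long) = {
  val start = start1 max start2
  val end = end1 min end2
  if (start >= end) (0L, 0L) else (start, end)
}
```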
Hi @bartosz25
I wrapped Range with Project as in the following code. It works, but I have no idea why Range has to be wrapped with Project. Could you please take a look? Thank you.
object RangeIntersectRule extends Rule[LogicalPlan] {
override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
case Join(Range(start1, end1, 1, Some(1), output1, false), Range(start2, end2, 1, Some(1), output2, false), Inner, _) => {
val start = start1 max start2
val end = end1 min end2
if (start1 > end2 || end1 < start2) Project(output1, Range(0, 0, 1, Some(1), output1, false))
// wrap Range with Project
else Project(output1, Range(start, end, 1, Some(1), output1, false))
}
}
}
Hi @bithw1
Sorry, I missed your message last week. I'll add the topic of extra optimizations to my backlog and try to answer your question here when I write about it.
Best regards, Bartosz.
Sure, thank you @bartosz25 !
Hi @bithw1
Today I started on the topic of custom optimizations. Since the topic is quite new to me, I will go slowly, starting from the basics, and try to cover more advanced concepts at the end. The first post is here: https://www.waitingforcode.com/apache-spark-sql/introduction-custom-optimization-apache-spark-sql/read
Best regards, Bartosz.
That's great, thanks @bartosz25. Looking forward to reading and learning from your posts :-)
Hi @bartosz25,
I have a question that I would like you to take a look at, thank you.
I want to use the Spark SQL optimizer to optimize a join of two ranges, computing the intersection of the two ranges directly so that the join can be avoided.
The rule takes effect, but it throws the exception below; it looks like I haven't implemented the apply method correctly.
The plan is:
The exception is: