Background

On December 23rd a colleague reported a problem with one of his jobs: a job over a fairly modest amount of data was producing an enormous number of Spark tasks, somewhere in the hundreds of thousands. Yet when he queried those few source tables individually, each query finished quickly and produced few tasks.
Investigation

Step 1: the basic execution plan

We saw that the final Physical Plan had turned into a Cartesian product, which neatly explains the task explosion. The next job was to find a reasonable explanation for how it got there. The join key was still present in the Analyzed Plan, but the Optimized Plan had already dropped it, so the damage had to be done during the Optimize phase. Going through the user's SQL, we found that his join key was a constant he had constructed himself:

--every join key is week
select concat('20191216','_', '20191222') as week,live_id ....

So our intuition was that this constant was triggering some counterproductive optimization.
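As an aside, the three plans can be inspected directly from a Dataset, without digging through logs. A minimal sketch, assuming a SparkSession named spark and that the user's tables are registered as temp views (the query string is a stand-in):

val df = spark.sql(
  "select tb_a.week, cnt, anchor_num from tb_a inner join tb_b on tb_a.week = tb_b.week")

println(df.queryExecution.analyzed)      // Analyzed Plan: the join key is still here
println(df.queryExecution.optimizedPlan) // Optimized Plan: the join key is gone
println(df.queryExecution.executedPlan)  // Physical Plan: CartesianProduct shows up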
Step 2: paring the user's case down

select concat('20191216','_', '20191222') as week,count(distinct presideid) as anchor_num
from live.report_entertainment_hall_preside_detail
where dt between '20191216' and '20191222' group by week
as tb_a;

select concat('20191216','_', '20191222') as week,count(1) cnt from dwd.live_room_chat_play_daily_log_di
where dt between '20191216' and '20191222' and user_id<1000000000 group by week as tb_b;

select tb_a.week ,cnt,anchor_num from tb_a inner join tb_b on tb_a.week = tb_b.week;

But wait: the explain output contained no CartesianProduct. A constant join key under an inner join is therefore not, by itself, enough to get the key optimized away; we had to look more carefully at the possible causes.
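Rather than eyeballing explain output, the presence of a Cartesian product can also be checked mechanically. A small sketch, again assuming the temp views above exist (CartesianProductExec is the physical node Spark plans for such a join):

import org.apache.spark.sql.execution.joins.CartesianProductExec

val physical = spark.sql(
  "select tb_a.week, cnt, anchor_num from tb_a inner join tb_b on tb_a.week = tb_b.week"
).queryExecution.executedPlan

// Non-empty means the planner chose a Cartesian product somewhere in the plan.
val hasCartesian = physical.collect { case c: CartesianProductExec => c }.nonEmpty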
Step 3: analyzing the Trace log

So I raised the log level of the engine I use for debugging, enabled Trace logging for org.apache.spark.sql, and captured one trace log with the original job and another with my simplified version.
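For reference, one way to flip that logger at runtime, assuming the log4j 1.x that Spark 2.x bundles (a log4j.properties entry achieves the same thing):

import org.apache.log4j.{Level, Logger}

// Trace output for the whole SQL stack is extremely verbose, so only do this
// on a throwaway debugging instance.
Logger.getLogger("org.apache.spark.sql").setLevel(Level.TRACE)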
Reading the original job's trace, the question we carried with us was: at what point does the join key on Join Inner get dropped? That led us to the first plan fragment worth analyzing: after PushPredicateThroughJoin, our Join Inner, (week#50 = 20191216_20191222) had become Filter (week#50 = 20191216_20191222).

A static read of the source shows roughly which code fired and turned the Join into a Filter (the closing lines, cut off in the original excerpt, are restored here from the same rule):

case j @ Join(left, right, joinType, joinCondition) =>
  val (leftJoinConditions, rightJoinConditions, commonJoinCondition) =
    split(joinCondition.map(splitConjunctivePredicates).getOrElse(Nil), left, right)

  joinType match {
    case _: InnerLike | LeftSemi =>
      // push down the single side only join filter for both sides sub queries
      val newLeft = leftJoinConditions.
        reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
      val newRight = rightJoinConditions.
        reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
      val newJoinCond = commonJoinCondition.reduceLeftOption(And)

      Join(newLeft, newRight, joinType, newJoinCond)

Nothing is wrong with this step in itself: the condition (week#50 = 20191216_20191222) references only the left side, so it is legitimately pushed down as a Filter, leaving the inner join with no condition at all. So we turned our attention one step earlier.
The log entry just before this one is ConstantFolding, which does some constant evaluation and also looks normal. But the rule one step further back is where it gets interesting: FoldablePropagation. Looking at its before-and-after plans, the condition on input is Join Inner, (week#50 = week#64), and once the rule has run it has become Join Inner, (week#50 = concat(20191216, _, 20191222)). In other words, this is the step that substituted week#64 away. We found the rule's source code; by its logic, because this is an inner join there are no mis-derived attributes to protect, so it goes ahead and swaps in our foldable expression:

// Join derives the output attributes from its child while they are actually not the
// same attributes. For example, the output of outer join is not always picked from its
// children, but can also be null. We should exclude these miss-derived attributes when
// propagating the foldable expressions.
// TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
// of outer join.
case j @ Join(left, right, joinType, _) if foldableMap.nonEmpty =>
  val newJoin = j.transformExpressions(replaceFoldable)
  val missDerivedAttrsSet: AttributeSet = AttributeSet(joinType match {
    case _: InnerLike | LeftExistence(_) => Nil
    case LeftOuter => right.output
    case RightOuter => left.output
    case FullOuter => left.output ++ right.output
  })
  foldableMap = AttributeMap(foldableMap.baseMap.values.filterNot {
    case (attr, _) => missDerivedAttrsSet.contains(attr)
  }.toSeq)
  newJoin
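The chain of FoldablePropagation, ConstantFolding and PushPredicateThroughJoin is easy to watch in isolation. A minimal sketch of the pattern with made-up data; treat it as illustrative rather than a faithful replay of the user's job:

// Both inputs expose the same foldable expression under the alias week,
// mirroring the user's query shape once CollapseProject has run.
val a = spark.range(3).selectExpr("concat('20191216','_','20191222') as week", "id")
val b = spark.range(3).selectExpr("concat('20191216','_','20191222') as week", "id as uid")

// FoldablePropagation rewrites both sides of the condition to constants,
// ConstantFolding evaluates the comparison, and the join is left without a
// condition. On Spark 2.x, spark.sql.crossJoin.enabled=true may be needed for
// the resulting plan to be allowed at all.
a.join(b, a("week") === b("week")).explain(true)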
Step 4: fixing the job

We did not want the attribute substituted away, and for the user's case there was an easy out: since week only ever takes a single value, an inner join and a left join are semantically equivalent here. So we had the user switch to a left join, and the job ran through to a result. But we weren't done: we still needed to know exactly which conditions steer a query into this case.
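The rewrite is a one-word change. A sketch against the simplified case, assuming tb_a and tb_c are registered as temp views:

// Same query as before with the inner join replaced by a left join. Because
// week only ever holds '20191216_20191222', the result set is unchanged, but
// the optimized plan keeps a real join instead of a Cartesian product.
spark.sql(
  """select tb_a.week, cnt, anchor_num
    |from tb_a left join tb_c on tb_a.week = tb_c.week""".stripMargin
).explain(true)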
Step 5: pinning down the trigger

We went back through the trace log of the simplified control case and found no FoldablePropagation in it at all: it jumped straight from ColumnPruning to ConstantFolding. Compared with the problematic production job, two steps were missing: CollapseProject and FoldablePropagation. Optimizer rules fire because earlier rules set the stage for them, so we needed to find the upstream cause, and here it is clearly CollapseProject:

/**
 * Combines two adjacent [[Project]] operators into one and perform alias substitution,
 * merging the expressions into one single expression.
 */
object CollapseProject extends Rule[LogicalPlan]
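What CollapseProject does is easy to see on its own. A tiny sketch with invented columns:

// Two stacked selectExpr calls create two adjacent Project nodes.
val df = spark.range(1)
  .selectExpr("concat('20191216','_','20191222') as week", "id")
  .selectExpr("week", "id + 1 as next_id")

// In the optimized plan, CollapseProject has merged them into a single
// Project, substituting the week alias into the outer expression list.
df.explain(true)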
This rule merges adjacent Projects, so we compared our minimal case against the user's once more and adjusted our SQL slightly:

select concat('20191216','_', '20191222') as week,count(distinct presideid) as anchor_num
from live.report_entertainment_hall_preside_detail
where dt between '20191216' and '20191222' group by week
as tb_a;

select concat('20191216','_', '20191222') as week,user_id from dwd.live_room_chat_play_daily_log_di
where dt between '20191216' and '20191222' and user_id<1000000000 as tb_b;

-- note tb_c: it is what triggers CollapseProject
select count(1) cnt,week from tb_b group by week as tb_c;

select tb_a.week, cnt,anchor_num from tb_a inner join tb_c on tb_a.week = tb_c.week;
This time explain shows the optimization firing: the plan has become a Cartesian product. And once we change the inner join to a left join, the Cartesian product disappears from the plan.
Final conclusion