cjuexuan / mynote


Analysis of a Spark 2.4 problem: an `Inner join` gets "optimized" into a `CartesianProduct` #66

Open cjuexuan opened 4 years ago

cjuexuan commented 4 years ago

Background

On December 23rd a colleague reported a problem with one of his jobs: a job over a fairly modest amount of data was producing a huge number of Spark tasks.

There were roughly several hundred thousand tasks, yet when he queried the same tables individually everything ran quickly and the task count stayed small.

Investigation

Step 1: the basic execution plan

We saw that the final Physical Plan had turned into a Cartesian product, which clearly explains the explosion in task count. The next step was to find a reasonable explanation. The join key was still present in the Analyzed Plan, but the Optimized Plan had already dropped it, so whatever happened must have taken effect during the optimization phase. Looking at the user's SQL, we found that his join key was a constant he had constructed himself:

-- the join key on both sides is week
select concat('20191216','_', '20191222') as week,live_id ....

So our gut feeling was that this constant was triggering some counterproductive optimization.

Step 2: minimizing the user's case

select concat('20191216','_', '20191222') as week,count(distinct presideid) as anchor_num
from live.report_entertainment_hall_preside_detail 
where  dt between '20191216' and '20191222' group by week
as tb_a;

select concat('20191216','_', '20191222') as week,count(1)  cnt from dwd.live_room_chat_play_daily_log_di
where dt between '20191216' and '20191222' and user_id<1000000000 group by week as tb_b;
select tb_a.week ,cnt,anchor_num from tb_a inner join tb_b on tb_a.week = tb_b.week;
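
For anyone reproducing this in a plain spark-shell, the same minimal case can be written with temp views (a sketch: the trailing `as tb_a` / `as tb_b` above is the shorthand our engine uses to name an intermediate result, and the table names are assumed to exist in your metastore):

// Register the two aggregated intermediate results as temp views, then explain the join.
spark.sql("""
  select concat('20191216','_','20191222') as week, count(distinct presideid) as anchor_num
  from live.report_entertainment_hall_preside_detail
  where dt between '20191216' and '20191222' group by week
""").createOrReplaceTempView("tb_a")

spark.sql("""
  select concat('20191216','_','20191222') as week, count(1) cnt
  from dwd.live_room_chat_play_daily_log_di
  where dt between '20191216' and '20191222' and user_id < 1000000000 group by week
""").createOrReplaceTempView("tb_b")

spark.sql("select tb_a.week, cnt, anchor_num from tb_a inner join tb_b on tb_a.week = tb_b.week")
  .explain(true)  // prints the parsed, analyzed, optimized and physical plans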

The explain output is:

== Parsed Logical Plan ==
'Project ['tb_a.week, 'cnt, 'anchor_num]
+- 'Join Inner, ('tb_a.week = 'tb_b.week)
   :- 'UnresolvedRelation `tb_a`
   +- 'UnresolvedRelation `tb_b`

== Analyzed Logical Plan ==
week: string, cnt: bigint, anchor_num: bigint
Project [week#212540, cnt#212555L, anchor_num#212541L]
+- Join Inner, (week#212540 = week#212554)
   :- SubqueryAlias `tb_a`
   :  +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#212540, count(distinct presideid#212544L) AS anchor_num#212541L]
   :     +- Filter ((dt#212550 >= 20191216) && (dt#212550 <= 20191222))
   :        +- SubqueryAlias `live`.`report_entertainment_hall_preside_detail`
   :           +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#212542L, title#212543, presideid#212544L, nickname#212545, star_time#212546, end_time#212547, duration#212548, times_order#212549], [dt#212550]
   +- SubqueryAlias `tb_b`
      +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#212554, count(1) AS cnt#212555L]
         +- Filter (((dt#212577 >= 20191216) && (dt#212577 <= 20191222)) && (user_id#212563L < cast(1000000000 as bigint)))
            +- SubqueryAlias `dwd`.`live_room_chat_play_daily_log_di`
               +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#212557L, live_id#212558L, room_id#212559L, anchor_id#212560L, device_id#212561, client_name#212562, user_id#212563L, play_duration_secs#212564L, app_name#212565, app_topic_name#212566, create_timestamp#212567, ip#212568, series_id#212569, operation_name#212570, text#212571, send_timestamp#212572, insert_timestamp#212573, app_id#212574, status_code#212575, play_source#212576], [dt#212577]

== Optimized Logical Plan ==
Project [week#212540, cnt#212555L, anchor_num#212541L]
+- Join Inner, (week#212540 = week#212554)
   :- Aggregate [0], [20191216_20191222 AS week#212540, count(distinct presideid#212544L) AS anchor_num#212541L]
   :  +- Project [presideid#212544L]
   :     +- Filter ((isnotnull(dt#212550) && (dt#212550 >= 20191216)) && (dt#212550 <= 20191222))
   :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#212542L, title#212543, presideid#212544L, nickname#212545, star_time#212546, end_time#212547, duration#212548, times_order#212549], [dt#212550]
   +- Aggregate [0], [20191216_20191222 AS week#212554, count(1) AS cnt#212555L]
      +- Project
         +- Filter ((((isnotnull(user_id#212563L) && isnotnull(dt#212577)) && (dt#212577 >= 20191216)) && (dt#212577 <= 20191222)) && (user_id#212563L < 1000000000))
            +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#212557L, live_id#212558L, room_id#212559L, anchor_id#212560L, device_id#212561, client_name#212562, user_id#212563L, play_duration_secs#212564L, app_name#212565, app_topic_name#212566, create_timestamp#212567, ip#212568, series_id#212569, operation_name#212570, text#212571, send_timestamp#212572, insert_timestamp#212573, app_id#212574, status_code#212575, play_source#212576], [dt#212577]

== Physical Plan ==
*(8) Project [week#212540, cnt#212555L, anchor_num#212541L]
+- *(8) SortMergeJoin [week#212540], [week#212554], Inner
   :- *(4) Sort [week#212540 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(week#212540, 300)
   :     +- *(3) HashAggregate(keys=[0#212583], functions=[count(distinct presideid#212544L)], output=[week#212540, anchor_num#212541L])
   :        +- Exchange hashpartitioning(0#212583, 300)
   :           +- *(2) HashAggregate(keys=[0#212583], functions=[partial_count(distinct presideid#212544L)], output=[0#212583, count#212586L])
   :              +- *(2) HashAggregate(keys=[0#212583, presideid#212544L], functions=[], output=[0#212583, presideid#212544L])
   :                 +- Exchange hashpartitioning(0#212583, presideid#212544L, 300)
   :                    +- *(1) HashAggregate(keys=[0 AS 0#212583, presideid#212544L], functions=[], output=[0#212583, presideid#212544L])
   :                       +- Scan hive live.report_entertainment_hall_preside_detail [presideid#212544L], HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#212542L, title#212543, presideid#212544L, nickname#212545, star_time#212546, end_time#212547, duration#212548, times_order#212549], [dt#212550], [isnotnull(dt#212550), (dt#212550 >= 20191216), (dt#212550 <= 20191222)]
   +- *(7) Sort [week#212554 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(week#212554, 300)
         +- *(6) HashAggregate(keys=[0#212588], functions=[count(1)], output=[week#212554, cnt#212555L])
            +- Exchange hashpartitioning(0#212588, 300)
               +- *(5) HashAggregate(keys=[0 AS 0#212588], functions=[partial_count(1)], output=[0#212588, count#212590L])
                  +- *(5) Project
                     +- *(5) Filter (isnotnull(user_id#212563L) && (user_id#212563L < 1000000000))
                        +- Scan hive dwd.live_room_chat_play_daily_log_di [user_id#212563L], HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#212557L, live_id#212558L, room_id#212559L, anchor_id#212560L, device_id#212561, client_name#212562, user_id#212563L, play_duration_secs#212564L, app_name#212565, app_topic_name#212566, create_timestamp#212567, ip#212568, series_id#212569, operation_name#212570, text#212571, send_timestamp#212572, insert_timestamp#212573, app_id#212574, status_code#212575, play_source#212576], [dt#212577], [isnotnull(dt#212577), (dt#212577 >= 20191216), (dt#212577 <= 20191222)]

Wait, no CartesianProduct showed up. So a constant join key under an inner join is not, by itself, enough to trigger the problem; we needed to look more carefully at the possible causes.

Step 3: analyzing the TRACE logs

So I changed the log level of the engine we use for debugging, enabled TRACE logging for org.apache.spark.sql, and produced one TRACE log for the original job and one for my simplified version.
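
Raising the level amounts to something like the following (a sketch, not our exact config: Spark 2.4 bundles log4j 1.x, and the same effect can be had by adding `log4j.logger.org.apache.spark.sql=TRACE` to log4j.properties):

// Programmatically raise the logger that emits the rule-application traces to TRACE
// (assumes the log4j 1.x that ships with Spark 2.4).
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.sql").setLevel(Level.TRACE)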

When reading the original job's log, we traced it with one question in mind: at what point does the join key on the Join Inner get dropped? That led us to the first plan fragment worth analyzing.

12-24 17:44:27 379 TRACE (org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2:62) - 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
 Project [week#50, anchor_num#51L, user_amount#104L AS week_user#147L, consume_numbers#132L AS week_consume_numbers#148L, xizuan_consume#133, arppu#134]                                                                                                                                                                                                                                                                                                                        Project [week#50, anchor_num#51L, user_amount#104L AS week_user#147L, consume_numbers#132L AS week_consume_numbers#148L, xizuan_consume#133, arppu#134]
 +- Join Inner                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +- Join Inner
!   :- Join Inner, (week#50 = 20191216_20191222)                                                                                                                                                                                                                                                                                                                                                                                                                                   :- Join Inner
!   :  :- Aggregate [0], [20191216_20191222 AS week#50, count(distinct presideid#54L) AS anchor_num#51L]                                                                                                                                                                                                                                                                                                                                                                           :  :- Filter (week#50 = 20191216_20191222)
!   :  :  +- Project [presideid#54L]                                                                                                                                                                                                                                                                                                                                                                                                                                               :  :  +- Aggregate [0], [20191216_20191222 AS week#50, count(distinct presideid#54L) AS anchor_num#51L]
!   :  :     +- Join Inner, (roomid#52L = id#0L)                                                                                                                                                                                                                                                                                                                                                                                                                                   :  :     +- Project [presideid#54L]
!   :  :        :- Project [roomid#52L, presideid#54L]                                                                                                                                                                                                                                                                                                                                                                                                                             :  :        +- Join Inner, (roomid#52L = id#0L)
!   :  :        :  +- Filter ((dt#60 >= 20191216) && (dt#60 <= 20191222))                                                                                                                                                                                                                                                                                                                                                                                                          :  :           :- Project [roomid#52L, presideid#54L]
!   :  :        :     +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#52L, title#53, presideid#54L, nickname#55, star_time#56, end_time#57, duration#58, times_order#59], [dt#60]                                                                                                                                                                                                     :  :           :  +- Filter ((dt#60 >= 20191216) && (dt#60 <= 20191222))
!   :  :        +- Repartition 3, false                                                                                                                                                                                                                                                                                                                                                                                                                                            :  :           :     +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#52L, title#53, presideid#54L, nickname#55, star_time#56, end_time#57, duration#58, times_order#59], [dt#60]
!   :  :           +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                              :  :           +- Repartition 3, false
!   :  :              +- Filter (mode#15 = 2)                                                                                                                                                                                                                                                                                                                                                                                                                                      :  :              +- Project [id#0L]
!   :  :                 +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]                                                                                                                                                                                                  :  :                 +- Filter (mode#15 = 2)
!   :  +- Aggregate [20191216_20191222], [count(distinct user_id#71L) AS user_amount#104L]                                                                                                                                                                                                                                                                                                                                                                                         :  :                    +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]
!   :     +- Project [user_id#71L]                                                                                                                                                                                                                                                                                                                                                                                                                                                 :  +- Aggregate [20191216_20191222], [count(distinct user_id#71L) AS user_amount#104L]
!   :        +- Join Inner, (room_id#67L = id#0L)                                                                                                                                                                                                                                                                                                                                                                                                                                  :     +- Project [user_id#71L]
!   :           :- Project [room_id#67L, user_id#71L]                                                                                                                                                                                                                                                                                                                                                                                                                              :        +- Join Inner, (room_id#67L = id#0L)
!   :           :  +- Filter (((dt#85 >= 20191216) && (dt#85 <= 20191222)) && (user_id#71L < 1000000000))                                                                                                                                                                                                                                                                                                                                                                          :           :- Project [room_id#67L, user_id#71L]
!   :           :     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#65L, live_id#66L, room_id#67L, anchor_id#68L, device_id#69, client_name#70, user_id#71L, play_duration_secs#72L, app_name#73, app_topic_name#74, create_timestamp#75, ip#76, series_id#77, operation_name#78, text#79, send_timestamp#80, insert_timestamp#81, app_id#82, status_code#83, play_source#84], [dt#85]      :           :  +- Filter (((dt#85 >= 20191216) && (dt#85 <= 20191222)) && (user_id#71L < 1000000000))
!   :           +- Repartition 3, false                                                                                                                                                                                                                                                                                                                                                                                                                                            :           :     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#65L, live_id#66L, room_id#67L, anchor_id#68L, device_id#69, client_name#70, user_id#71L, play_duration_secs#72L, app_name#73, app_topic_name#74, create_timestamp#75, ip#76, series_id#77, operation_name#78, text#79, send_timestamp#80, insert_timestamp#81, app_id#82, status_code#83, play_source#84], [dt#85]
!   :              +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                              :           +- Repartition 3, false
!   :                 +- Filter (mode#15 = 2)                                                                                                                                                                                                                                                                                                                                                                                                                                      :              +- Project [id#0L]
!   :                    +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]                                                                                                                                                                                                  :                 +- Filter (mode#15 = 2)
!   +- Aggregate [20191216_20191222], [count(distinct user_id#115L) AS consume_numbers#132L, (sum(xizuan#121) + sum(cast(boxprice#114 as double))) AS xizuan_consume#133, (((sum(xizuan#121) + sum(cast(boxprice#114 as double))) / 10.0) / cast(count(distinct user_id#115L) as double)) AS arppu#134]                                                                                                                                                                            :                    +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]
!      +- Project [user_id#115L, xizuan#121, CASE WHEN isnull(get_json_object(context#119, $.openGiftPrice)) THEN 0 ELSE get_json_object(context#119, $.openGiftPrice) END AS boxprice#114]                                                                                                                                                                                                                                                                                        +- Aggregate [20191216_20191222], [count(distinct user_id#115L) AS consume_numbers#132L, (sum(xizuan#121) + sum(cast(boxprice#114 as double))) AS xizuan_consume#133, (((sum(xizuan#121) + sum(cast(boxprice#114 as double))) / 10.0) / cast(count(distinct user_id#115L) as double)) AS arppu#134]
!         +- Filter ((((dt#124 >= 20191216) && (dt#124 <= 20191222)) && (business_type_id#116 = 609)) && (user_id#115L < 1000000000))                                                                                                                                                                                                                                                                                                                                                 +- Project [user_id#115L, xizuan#121, CASE WHEN isnull(get_json_object(context#119, $.openGiftPrice)) THEN 0 ELSE get_json_object(context#119, $.openGiftPrice) END AS boxprice#114]
!            +- HiveTableRelation `dws`.`live_business_consume_detail_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [user_id#115L, business_type_id#116, unified_order_no#117, channel_type_id#118, context#119, device_id#120, xizuan#121, insert_timestamp#122, payee_user_id#123L], [dt#124]                                                                                                                                                                      +- Filter ((((dt#124 >= 20191216) && (dt#124 <= 20191222)) && (business_type_id#116 = 609)) && (user_id#115L < 1000000000))
! 

After PushPredicateThroughJoin, our Join Inner, (week#50 = 20191216_20191222) became Filter (week#50 = 20191216_20191222).

Looking at the rule's source code statically:

    case j @ Join(left, right, joinType, joinCondition) =>
      val (leftJoinConditions, rightJoinConditions, commonJoinCondition) =
        split(joinCondition.map(splitConjunctivePredicates).getOrElse(Nil), left, right)

      joinType match {
        case _: InnerLike | LeftSemi =>
          // push down the single side only join filter for both sides sub queries
          val newLeft = leftJoinConditions.
            reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)

This is roughly the code that kicked in and turned the Join condition into a Filter; the split itself looks perfectly reasonable (see the sketch below), so we shifted our attention one step earlier in the log.
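
A simplified, self-contained model of that classification (plain Scala, not the real Catalyst types) shows why our condition disappears from the join: a conjunct whose attributes all come from one side of the join is treated as a single-side filter and pushed below the join.

case class Attr(name: String)
case class Predicate(references: Set[Attr])

// Mirror of the split above: left-only conditions, right-only conditions, and the rest.
def splitConditions(conjuncts: Seq[Predicate], leftOut: Set[Attr], rightOut: Set[Attr])
    : (Seq[Predicate], Seq[Predicate], Seq[Predicate]) = {
  val (leftOnly, rest) = conjuncts.partition(_.references.subsetOf(leftOut))
  val (rightOnly, common) = rest.partition(_.references.subsetOf(rightOut))
  (leftOnly, rightOnly, common)
}

// (week#50 = '20191216_20191222') references only week#50, which belongs to the left child,
// so it is classified as left-only, becomes a Filter on the left side,
// and the Join Inner is left without any condition.
val week50 = Attr("week#50")
println(splitConditions(Seq(Predicate(Set(week50))), Set(week50), Set(Attr("week#111"))))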

The log entry just before this one is ConstantFolding, which evaluates some constant expressions; that also looks normal:

   :  +- Join Inner, (week#50 = concat(20191216, _, 20191222))  ->  Join Inner, (week#50 = 20191216_20191222)     

But the rule before that is where things get interesting: one step further up is FoldablePropagation.

Let's look at its plan transformation:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [week#50, anchor_num#51L, user_amount#104L AS week_user#147L, consume_numbers#132L AS week_consume_numbers#148L, xizuan_consume#133, arppu#134]                                                                                                                                                                                                                                                                                                                           Project [week#50, anchor_num#51L, user_amount#104L AS week_user#147L, consume_numbers#132L AS week_consume_numbers#148L, xizuan_consume#133, arppu#134]
!+- Join Inner, (week#50 = week#111)                                                                                                                                                                                                                                                                                                                                                                                                                                               +- Join Inner, (week#50 = concat(20191216, _, 20191222))
    :- Project [week#50, anchor_num#51L, user_amount#104L]                                                                                                                                                                                                                                                                                                                                                                                                                            :- Project [week#50, anchor_num#51L, user_amount#104L]
!   :  +- Join Inner, (week#50 = week#64)                                                                                                                                                                                                                                                                                                                                                                                                                                             :  +- Join Inner, (week#50 = concat(20191216, _, 20191222))
    :     :- Aggregate [0], [concat(20191216, _, 20191222) AS week#50, count(distinct presideid#54L) AS anchor_num#51L]                                                                                                                                                                                                                                                                                                                                                               :     :- Aggregate [0], [concat(20191216, _, 20191222) AS week#50, count(distinct presideid#54L) AS anchor_num#51L]
    :     :  +- Project [presideid#54L]                                                                                                                                                                                                                                                                                                                                                                                                                                               :     :  +- Project [presideid#54L]
    :     :     +- Join Inner, (roomid#52L = id#0L)                                                                                                                                                                                                                                                                                                                                                                                                                                   :     :     +- Join Inner, (roomid#52L = id#0L)
    :     :        :- Project [roomid#52L, presideid#54L]                                                                                                                                                                                                                                                                                                                                                                                                                             :     :        :- Project [roomid#52L, presideid#54L]
    :     :        :  +- Filter ((dt#60 >= 20191216) && (dt#60 <= 20191222))                                                                                                                                                                                                                                                                                                                                                                                                          :     :        :  +- Filter ((dt#60 >= 20191216) && (dt#60 <= 20191222))
    :     :        :     +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#52L, title#53, presideid#54L, nickname#55, star_time#56, end_time#57, duration#58, times_order#59], [dt#60]                                                                                                                                                                                                     :     :        :     +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#52L, title#53, presideid#54L, nickname#55, star_time#56, end_time#57, duration#58, times_order#59], [dt#60]
    :     :        +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                                 :     :        +- Project [id#0L]
    :     :           +- Repartition 3, false                                                                                                                                                                                                                                                                                                                                                                                                                                         :     :           +- Repartition 3, false
    :     :              +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                           :     :              +- Project [id#0L]
    :     :                 +- Filter (mode#15 = 2)                                                                                                                                                                                                                                                                                                                                                                                                                                   :     :                 +- Filter (mode#15 = 2)
    :     :                    +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]                                                                                                                                                                                               :     :                    +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]
!   :     +- Aggregate [week#64], [week#64, count(distinct user_id#71L) AS user_amount#104L]                                                                                                                                                                                                                                                                                                                                                                                          :     +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#64, count(distinct user_id#71L) AS user_amount#104L]
!   :        +- Project [week#64, user_id#71L]                                                                                                                                                                                                                                                                                                                                                                                                                                        :        +- Project [concat(20191216, _, 20191222) AS week#64, user_id#71L]
    :           +- Join Inner, (room_id#67L = id#0L)                                                                                                                                                                                                                                                                                                                                                                                                                                  :           +- Join Inner, (room_id#67L = id#0L)
    :              :- Project [concat(20191216, _, 20191222) AS week#64, room_id#67L, user_id#71L]                                                                                                                                                                                                                                                                                                                                                                                    :              :- Project [concat(20191216, _, 20191222) AS week#64, room_id#67L, user_id#71L]
    :              :  +- Filter (((dt#85 >= 20191216) && (dt#85 <= 20191222)) && (user_id#71L < cast(1000000000 as bigint)))                                                                                                                                                                                                                                                                                                                                                          :              :  +- Filter (((dt#85 >= 20191216) && (dt#85 <= 20191222)) && (user_id#71L < cast(1000000000 as bigint)))
    :              :     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#65L, live_id#66L, room_id#67L, anchor_id#68L, device_id#69, client_name#70, user_id#71L, play_duration_secs#72L, app_name#73, app_topic_name#74, create_timestamp#75, ip#76, series_id#77, operation_name#78, text#79, send_timestamp#80, insert_timestamp#81, app_id#82, status_code#83, play_source#84], [dt#85]      :              :     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#65L, live_id#66L, room_id#67L, anchor_id#68L, device_id#69, client_name#70, user_id#71L, play_duration_secs#72L, app_name#73, app_topic_name#74, create_timestamp#75, ip#76, series_id#77, operation_name#78, text#79, send_timestamp#80, insert_timestamp#81, app_id#82, status_code#83, play_source#84], [dt#85]
    :              +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                                 :              +- Project [id#0L]
    :                 +- Repartition 3, false                                                                                                                                                                                                                                                                                                                                                                                                                                         :                 +- Repartition 3, false
    :                    +- Project [id#0L]                                                                                                                                                                                                                                                                                                                                                                                                                                           :                    +- Project [id#0L]
    :                       +- Filter (mode#15 = 2)                                                                                                                                                                                                                                                                                                                                                                                                                                   :                       +- Filter (mode#15 = 2)
    :                          +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]                                                                                                                                                                                               :                          +- Relation[id#0L,uid#1L,chat_id#2L,title#3,cover_url#4,rule_info#5,fm_id#6L,is_hidden#7,created_at#8,updated_at#9,approve_status#10,weight#11,category_id#12,channel#13,hosting_uid#14L,mode#15] JDBCRelation(doom.tb_entertain_room) [numPartitions=1]
!   +- Aggregate [week#111], [week#111, count(distinct user_id#115L) AS consume_numbers#132L, (sum(xizuan#121) + sum(cast(boxprice#114 as double))) AS xizuan_consume#133, (((sum(xizuan#121) + sum(cast(boxprice#114 as double))) / cast(10 as double)) / cast(count(distinct user_id#115L) as double)) AS arppu#134]                                                                                                                                                                +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#111, count(distinct user_id#115L) AS consume_numbers#132L, (sum(xizuan#121) + sum(cast(boxprice#114 as double))) AS xizuan_consume#133, (((sum(xizuan#121) + sum(cast(boxprice#114 as double))) / cast(10 as double)) / cast(count(distinct user_id#115L) as double)) AS arppu#134]
       +- Project [concat(20191216, _, 20191222) AS week#111, user_id#115L, xizuan#121, CASE WHEN isnull(get_json_object(context#119, $.openGiftPrice)) THEN cast(0 as string) ELSE get_json_object(context#119, $.openGiftPrice) END AS boxprice#114]                                                                                                                                                                                                                                   +- Project [concat(20191216, _, 20191222) AS week#111, user_id#115L, xizuan#121, CASE WHEN isnull(get_json_object(context#119, $.openGiftPrice)) THEN cast(0 as string) ELSE get_json_object(context#119, $.openGiftPrice) END AS boxprice#114]
          +- Filter ((((dt#124 >= 20191216) && (dt#124 <= 20191222)) && (business_type_id#116 = 609)) && (user_id#115L < cast(1000000000 as bigint)))                                                                                                                                                                                                                                                                                                                                       +- Filter ((((dt#124 >= 20191216) && (dt#124 <= 20191222)) && (business_type_id#116 = 609)) && (user_id#115L < cast(1000000000 as bigint)))
             +- HiveTableRelation `dws`.`live_business_consume_detail_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [user_id#115L, business_type_id#116, unified_order_no#117, channel_type_id#118, context#119, device_id#120, xizuan#121, insert_timestamp#122, payee_user_id#123L], [dt#124]                                                                                                                                                                            +- HiveTableRelation `dws`.`live_business_consume_detail_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [user_id#115L, business_type_id#116, unified_order_no#117, channel_type_id#118, context#119, device_id#120, xizuan#121, insert_timestamp#122, payee_user_id#123L], [dt#124]

On input the condition was Join Inner, (week#50 = week#64); after the rule ran it became Join Inner, (week#50 = concat(20191216, _, 20191222)). In other words, this is the step where week#64 gets replaced.

We found the rule's source code. The logic is: because this is an inner join, there are no "miss-derived" attributes to exclude, so our foldable attribute ends up being substituted.

        // Join derives the output attributes from its child while they are actually not the
        // same attributes. For example, the output of outer join is not always picked from its
        // children, but can also be null. We should exclude these miss-derived attributes when
        // propagating the foldable expressions.
        // TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
        // of outer join.
        case j @ Join(left, right, joinType, _) if foldableMap.nonEmpty =>
          val newJoin = j.transformExpressions(replaceFoldable)
          val missDerivedAttrsSet: AttributeSet = AttributeSet(joinType match {
            case _: InnerLike | LeftExistence(_) => Nil
            case LeftOuter => right.output
            case RightOuter => left.output
            case FullOuter => left.output ++ right.output
          })
          foldableMap = AttributeMap(foldableMap.baseMap.values.filterNot {
            case (attr, _) => missDerivedAttrsSet.contains(attr)
          }.toSeq)
          newJoin
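
A toy model (plain Scala, hypothetical names) of what this does to our join condition: foldableMap records that week#64 and week#111 are mere aliases of a constant expression, and because the join type is inner nothing is excluded from the map, so the references in the condition get rewritten to the constant expression.

// foldableMap as observed in the trace above (note that week#50 is not in it).
val foldableMap = Map(
  "week#64"  -> "concat(20191216, _, 20191222)",
  "week#111" -> "concat(20191216, _, 20191222)")

// Replace every known foldable attribute reference with its defining constant expression.
def replaceFoldable(condition: String): String =
  foldableMap.foldLeft(condition) { case (c, (attr, const)) => c.replace(attr, const) }

// Join Inner, (week#50 = week#64)  ==>  Join Inner, (week#50 = concat(20191216, _, 20191222))
println(replaceFoldable("(week#50 = week#64)"))

From here the chain we already traced plays out: ConstantFolding collapses the concat into the literal, PushPredicateThroughJoin pushes the now left-only predicate below the join, and the condition-less inner join ends up planned as a CartesianProduct.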

Step 4: fixing the job

Since we did not want the key to be substituted, and since in the user's case week only ever takes a single fixed value, an inner join and a left join are essentially equivalent here, so we had the user switch to a left join and the job ran to completion successfully. But we were not done yet: we still needed to understand exactly which situations hit this case.
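
Concretely the fix is a one-word change to the final statement; expressed against the hypothetical temp views from the minimal case above, it looks like this:

// With week holding a single constant value, the left join returns the same rows the
// inner join would have; with this rewrite the user's job completed normally.
spark.sql("select tb_a.week, cnt, anchor_num from tb_a left join tb_b on tb_a.week = tb_b.week")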

Step 5: pinning down the problem

We went back through the TRACE log of the minimized control case and found no FoldablePropagation at all:

12-25 09:17:28 277 TRACE (org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2:62) - 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
 Project [week#1124, cnt#1139L, anchor_num#1125L]                                                                                                                                                                                                                                                                                                                                                                                                                                                             Project [week#1124, cnt#1139L, anchor_num#1125L]
 +- Join Inner, (week#1124 = week#1138)                                                                                                                                                                                                                                                                                                                                                                                                                                                                       +- Join Inner, (week#1124 = week#1138)
    :- Aggregate [0], [concat(20191216, _, 20191222) AS week#1124, count(distinct presideid#1128L) AS anchor_num#1125L]                                                                                                                                                                                                                                                                                                                                                                                          :- Aggregate [0], [concat(20191216, _, 20191222) AS week#1124, count(distinct presideid#1128L) AS anchor_num#1125L]
!   :  +- Filter ((dt#1134 >= 20191216) && (dt#1134 <= 20191222))                                                                                                                                                                                                                                                                                                                                                                                                                                                :  +- Project [presideid#1128L]
!   :     +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#1126L, title#1127, presideid#1128L, nickname#1129, star_time#1130, end_time#1131, duration#1132, times_order#1133], [dt#1134]                                                                                                                                                                                                                             :     +- Filter ((dt#1134 >= 20191216) && (dt#1134 <= 20191222))
!   +- Aggregate [0], [concat(20191216, _, 20191222) AS week#1138, count(1) AS cnt#1139L]                                                                                                                                                                                                                                                                                                                                                                                                                        :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#1126L, title#1127, presideid#1128L, nickname#1129, star_time#1130, end_time#1131, duration#1132, times_order#1133], [dt#1134]
!      +- Filter (((dt#1161 >= 20191216) && (dt#1161 <= 20191222)) && (user_id#1147L < cast(1000000000 as bigint)))                                                                                                                                                                                                                                                                                                                                                                                              +- Aggregate [0], [concat(20191216, _, 20191222) AS week#1138, count(1) AS cnt#1139L]
!         +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#1141L, live_id#1142L, room_id#1143L, anchor_id#1144L, device_id#1145, client_name#1146, user_id#1147L, play_duration_secs#1148L, app_name#1149, app_topic_name#1150, create_timestamp#1151, ip#1152, series_id#1153, operation_name#1154, text#1155, send_timestamp#1156, insert_timestamp#1157, app_id#1158, status_code#1159, play_source#1160], [dt#1161]         +- Project
!                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +- Filter (((dt#1161 >= 20191216) && (dt#1161 <= 20191222)) && (user_id#1147L < cast(1000000000 as bigint)))
!                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#1141L, live_id#1142L, room_id#1143L, anchor_id#1144L, device_id#1145, client_name#1146, user_id#1147L, play_duration_secs#1148L, app_name#1149, app_topic_name#1150, create_timestamp#1151, ip#1152, series_id#1153, operation_name#1154, text#1155, send_timestamp#1156, insert_timestamp#1157, app_id#1158, status_code#1159, play_source#1160], [dt#1161]

12-25 09:17:28 284 TRACE (org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2:62) - 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [week#1124, cnt#1139L, anchor_num#1125L]                                                                                                                                                                                                                                                                                                                                                                                                                                                                Project [week#1124, cnt#1139L, anchor_num#1125L]
 +- Join Inner, (week#1124 = week#1138)                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +- Join Inner, (week#1124 = week#1138)
!   :- Aggregate [0], [concat(20191216, _, 20191222) AS week#1124, count(distinct presideid#1128L) AS anchor_num#1125L]                                                                                                                                                                                                                                                                                                                                                                                             :- Aggregate [0], [20191216_20191222 AS week#1124, count(distinct presideid#1128L) AS anchor_num#1125L]
    :  +- Project [presideid#1128L]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 :  +- Project [presideid#1128L]
    :     +- Filter ((dt#1134 >= 20191216) && (dt#1134 <= 20191222))                                                                                                                                                                                                                                                                                                                                                                                                                                                :     +- Filter ((dt#1134 >= 20191216) && (dt#1134 <= 20191222))
    :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#1126L, title#1127, presideid#1128L, nickname#1129, star_time#1130, end_time#1131, duration#1132, times_order#1133], [dt#1134]                                                                                                                                                                                                                             :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#1126L, title#1127, presideid#1128L, nickname#1129, star_time#1130, end_time#1131, duration#1132, times_order#1133], [dt#1134]
!   +- Aggregate [0], [concat(20191216, _, 20191222) AS week#1138, count(1) AS cnt#1139L]                                                                                                                                                                                                                                                                                                                                                                                                                           +- Aggregate [0], [20191216_20191222 AS week#1138, count(1) AS cnt#1139L]
       +- Project                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +- Project
!         +- Filter (((dt#1161 >= 20191216) && (dt#1161 <= 20191222)) && (user_id#1147L < cast(1000000000 as bigint)))                                                                                                                                                                                                                                                                                                                                                                                                    +- Filter (((dt#1161 >= 20191216) && (dt#1161 <= 20191222)) && (user_id#1147L < 1000000000))
             +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#1141L, live_id#1142L, room_id#1143L, anchor_id#1144L, device_id#1145, client_name#1146, user_id#1147L, play_duration_secs#1148L, app_name#1149, app_topic_name#1150, create_timestamp#1151, ip#1152, series_id#1153, operation_name#1154, text#1155, send_timestamp#1156, insert_timestamp#1157, app_id#1158, status_code#1159, play_source#1160], [dt#1161]               +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#1141L, live_id#1142L, room_id#1143L, anchor_id#1144L, device_id#1145, client_name#1146, user_id#1147L, play_duration_secs#1148L, app_name#1149, app_topic_name#1150, create_timestamp#1151, ip#1152, series_id#1153, operation_name#1154, text#1155, send_timestamp#1156, insert_timestamp#1157, app_id#1158, status_code#1159, play_source#1160], [dt#1161]

ColumnPruning goes straight to ConstantFolding. Compared with the problematic production job, two rules are missing in between: CollapseProject and FoldablePropagation.

We know every optimization has its preconditions, so we needed to find the earlier cause, and here it is clearly CollapseProject.

/**
 * Combines two adjacent [[Project]] operators into one and perform alias substitution,
 * merging the expressions into one single expression.
 */
object CollapseProject extends Rule[LogicalPlan] 
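
A toy illustration (plain Scala, hypothetical types, not Catalyst's) of what that doc comment describes: an upper Project that merely references week is merged with the lower Project that defines it, substituting the alias with its defining expression, so the foldable concat expression ends up sitting directly in the surviving Project. Presumably this is what sets FoldablePropagation up with a foldable alias to propagate in the user's plan.

// output name -> defining expression; child is the optional lower Project.
case class Proj(exprs: Map[String, String], child: Option[Proj] = None)

// Merge an upper Project with the Project directly below it, substituting its aliases.
def collapse(p: Proj): Proj = p.child match {
  case Some(lower) =>
    val substituted = p.exprs.map { case (name, expr) =>
      name -> lower.exprs.foldLeft(expr) { case (e, (alias, definition)) => e.replace(alias, definition) }
    }
    Proj(substituted, lower.child)
  case None => p
}

// Project [week] over Project [concat('20191216','_','20191222') AS week]
// collapses into a single Project [concat('20191216','_','20191222') AS week].
val lower = Proj(Map("week" -> "concat('20191216','_','20191222')"))
val upper = Proj(Map("week" -> "week"), Some(lower))
println(collapse(upper))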

This rule is mainly for merging Project operators, so we compared our case with the user's once more and slightly reworked our SQL:

select concat('20191216','_', '20191222') as week,count(distinct presideid) as anchor_num
from live.report_entertainment_hall_preside_detail 
where  dt between '20191216' and '20191222' group by week
as tb_a;
select concat('20191216','_', '20191222') as week,user_id from dwd.live_room_chat_play_daily_log_di
where dt between '20191216' and '20191222' and user_id<1000000000 as tb_b;
-- note: tb_c is what triggers CollapseProject
select count(1) cnt,week  from tb_b group by week as tb_c;
select  tb_a.week, cnt,anchor_num from tb_a inner join tb_c on tb_a.week = tb_c.week ;

explain:

== Parsed Logical Plan ==
'Project ['tb_a.week, 'cnt, 'anchor_num]
+- 'Join Inner, ('tb_a.week = 'tb_c.week)
   :- 'UnresolvedRelation `tb_a`
   +- 'UnresolvedRelation `tb_c`

== Analyzed Logical Plan ==
week: string, cnt: bigint, anchor_num: bigint
Project [week#319875, cnt#319913L, anchor_num#319876L]
+- Join Inner, (week#319875 = week#319889)
   :- SubqueryAlias `tb_a`
   :  +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#319875, count(distinct presideid#319879L) AS anchor_num#319876L]
   :     +- Filter ((dt#319885 >= 20191216) && (dt#319885 <= 20191222))
   :        +- SubqueryAlias `live`.`report_entertainment_hall_preside_detail`
   :           +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#319877L, title#319878, presideid#319879L, nickname#319880, star_time#319881, end_time#319882, duration#319883, times_order#319884], [dt#319885]
   +- SubqueryAlias `tb_c`
      +- Aggregate [week#319889], [count(1) AS cnt#319913L, week#319889]
         +- SubqueryAlias `tb_b`
            +- Project [concat(20191216, _, 20191222) AS week#319889, user_id#319896L]
               +- Filter (((dt#319910 >= 20191216) && (dt#319910 <= 20191222)) && (user_id#319896L < cast(1000000000 as bigint)))
                  +- SubqueryAlias `dwd`.`live_room_chat_play_daily_log_di`
                     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#319890L, live_id#319891L, room_id#319892L, anchor_id#319893L, device_id#319894, client_name#319895, user_id#319896L, play_duration_secs#319897L, app_name#319898, app_topic_name#319899, create_timestamp#319900, ip#319901, series_id#319902, operation_name#319903, text#319904, send_timestamp#319905, insert_timestamp#319906, app_id#319907, status_code#319908, play_source#319909], [dt#319910]

== Optimized Logical Plan ==
Project [week#319875, cnt#319913L, anchor_num#319876L]
+- Join Inner
   :- Aggregate [0], [20191216_20191222 AS week#319875, count(distinct presideid#319879L) AS anchor_num#319876L]
   :  +- Project [presideid#319879L]
   :     +- Filter ((isnotnull(dt#319885) && (dt#319885 >= 20191216)) && (dt#319885 <= 20191222))
   :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#319877L, title#319878, presideid#319879L, nickname#319880, star_time#319881, end_time#319882, duration#319883, times_order#319884], [dt#319885]
   +- Aggregate [20191216_20191222], [count(1) AS cnt#319913L]
      +- Project
         +- Filter ((((isnotnull(dt#319910) && isnotnull(user_id#319896L)) && (dt#319910 >= 20191216)) && (dt#319910 <= 20191222)) && (user_id#319896L < 1000000000))
            +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#319890L, live_id#319891L, room_id#319892L, anchor_id#319893L, device_id#319894, client_name#319895, user_id#319896L, play_duration_secs#319897L, app_name#319898, app_topic_name#319899, create_timestamp#319900, ip#319901, series_id#319902, operation_name#319903, text#319904, send_timestamp#319905, insert_timestamp#319906, app_id#319907, status_code#319908, play_source#319909], [dt#319910]

== Physical Plan ==
*(6) Project [week#319875, cnt#319913L, anchor_num#319876L]
+- CartesianProduct
   :- *(3) HashAggregate(keys=[0#319920], functions=[count(distinct presideid#319879L)], output=[week#319875, anchor_num#319876L])
   :  +- Exchange hashpartitioning(0#319920, 300)
   :     +- *(2) HashAggregate(keys=[0#319920], functions=[partial_count(distinct presideid#319879L)], output=[0#319920, count#319923L])
   :        +- *(2) HashAggregate(keys=[0#319920, presideid#319879L], functions=[], output=[0#319920, presideid#319879L])
   :           +- Exchange hashpartitioning(0#319920, presideid#319879L, 300)
   :              +- *(1) HashAggregate(keys=[0 AS 0#319920, presideid#319879L], functions=[], output=[0#319920, presideid#319879L])
   :                 +- Scan hive live.report_entertainment_hall_preside_detail [presideid#319879L], HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#319877L, title#319878, presideid#319879L, nickname#319880, star_time#319881, end_time#319882, duration#319883, times_order#319884], [dt#319885], [isnotnull(dt#319885), (dt#319885 >= 20191216), (dt#319885 <= 20191222)]
   +- *(5) HashAggregate(keys=[20191216_20191222#319925], functions=[count(1)], output=[cnt#319913L])
      +- Exchange hashpartitioning(20191216_20191222#319925, 300)
         +- *(4) HashAggregate(keys=[20191216_20191222 AS 20191216_20191222#319925], functions=[partial_count(1)], output=[20191216_20191222#319925, count#319927L])
            +- *(4) Project
               +- *(4) Filter (isnotnull(user_id#319896L) && (user_id#319896L < 1000000000))
                  +- Scan hive dwd.live_room_chat_play_daily_log_di [user_id#319896L], HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#319890L, live_id#319891L, room_id#319892L, anchor_id#319893L, device_id#319894, client_name#319895, user_id#319896L, play_duration_secs#319897L, app_name#319898, app_topic_name#319899, create_timestamp#319900, ip#319901, series_id#319902, operation_name#319903, text#319904, send_timestamp#319905, insert_timestamp#319906, app_id#319907, status_code#319908, play_source#319909], [dt#319910], [isnotnull(dt#319910), (dt#319910 >= 20191216), (dt#319910 <= 20191222)]

This successfully triggers the rewrite and the join becomes a cartesian product: once the constant join key is folded away, the optimized plan no longer carries the equi-join condition (note the bare `Join Inner` above), so the planner has no join keys left and falls back to CartesianProduct.
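One side note before the left-join variant: this plan can only actually run because implicit cartesian products are allowed on the cluster (presumably spark.sql.crossJoin.enabled=true); with the Spark 2.4 default the same query fails fast instead. A sketch, where `innerJoinSql` is just a placeholder name for the inner-join query above:

```scala
// With the default spark.sql.crossJoin.enabled=false, CheckCartesianProducts
// rejects the optimized plan with "Detected implicit cartesian product for INNER
// join ..." instead of quietly launching hundreds of thousands of tasks.
spark.conf.set("spark.sql.crossJoin.enabled", "false")
// spark.sql(innerJoinSql).explain(true)  // innerJoinSql: the inner-join SQL above (placeholder)
```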

When we change the inner join to a left join, the plan becomes

== Parsed Logical Plan ==
'Project ['tb_a.week, 'cnt, 'anchor_num]
+- 'Join LeftOuter, ('tb_a.week = 'tb_c.week)
   :- 'UnresolvedRelation `tb_a`
   +- 'UnresolvedRelation `tb_c`

== Analyzed Logical Plan ==
week: string, cnt: bigint, anchor_num: bigint
Project [week#35159, cnt#35197L, anchor_num#35160L]
+- Join LeftOuter, (week#35159 = week#35173)
   :- SubqueryAlias `tb_a`
   :  +- Aggregate [concat(20191216, _, 20191222)], [concat(20191216, _, 20191222) AS week#35159, count(distinct presideid#35163L) AS anchor_num#35160L]
   :     +- Filter ((dt#35169 >= 20191216) && (dt#35169 <= 20191222))
   :        +- SubqueryAlias `live`.`report_entertainment_hall_preside_detail`
   :           +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#35161L, title#35162, presideid#35163L, nickname#35164, star_time#35165, end_time#35166, duration#35167, times_order#35168], [dt#35169]
   +- SubqueryAlias `tb_c`
      +- Aggregate [week#35173], [count(1) AS cnt#35197L, week#35173]
         +- SubqueryAlias `tb_b`
            +- Project [concat(20191216, _, 20191222) AS week#35173, user_id#35180L]
               +- Filter (((dt#35194 >= 20191216) && (dt#35194 <= 20191222)) && (user_id#35180L < cast(1000000000 as bigint)))
                  +- SubqueryAlias `dwd`.`live_room_chat_play_daily_log_di`
                     +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#35174L, live_id#35175L, room_id#35176L, anchor_id#35177L, device_id#35178, client_name#35179, user_id#35180L, play_duration_secs#35181L, app_name#35182, app_topic_name#35183, create_timestamp#35184, ip#35185, series_id#35186, operation_name#35187, text#35188, send_timestamp#35189, insert_timestamp#35190, app_id#35191, status_code#35192, play_source#35193], [dt#35194]

== Optimized Logical Plan ==
Project [week#35159, cnt#35197L, anchor_num#35160L]
+- Join LeftOuter, (week#35159 = 20191216_20191222)
   :- Aggregate [0], [20191216_20191222 AS week#35159, count(distinct presideid#35163L) AS anchor_num#35160L]
   :  +- Project [presideid#35163L]
   :     +- Filter ((isnotnull(dt#35169) && (dt#35169 >= 20191216)) && (dt#35169 <= 20191222))
   :        +- HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#35161L, title#35162, presideid#35163L, nickname#35164, star_time#35165, end_time#35166, duration#35167, times_order#35168], [dt#35169]
   +- Aggregate [20191216_20191222], [count(1) AS cnt#35197L]
      +- Project
         +- Filter ((((isnotnull(user_id#35180L) && isnotnull(dt#35194)) && (dt#35194 >= 20191216)) && (dt#35194 <= 20191222)) && (user_id#35180L < 1000000000))
            +- HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#35174L, live_id#35175L, room_id#35176L, anchor_id#35177L, device_id#35178, client_name#35179, user_id#35180L, play_duration_secs#35181L, app_name#35182, app_topic_name#35183, create_timestamp#35184, ip#35185, series_id#35186, operation_name#35187, text#35188, send_timestamp#35189, insert_timestamp#35190, app_id#35191, status_code#35192, play_source#35193], [dt#35194]

== Physical Plan ==
*(6) Project [week#35159, cnt#35197L, anchor_num#35160L]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, (week#35159 = 20191216_20191222)
   :- *(3) HashAggregate(keys=[0#35204], functions=[count(distinct presideid#35163L)], output=[week#35159, anchor_num#35160L])
   :  +- Exchange hashpartitioning(0#35204, 300)
   :     +- *(2) HashAggregate(keys=[0#35204], functions=[partial_count(distinct presideid#35163L)], output=[0#35204, count#35207L])
   :        +- *(2) HashAggregate(keys=[0#35204, presideid#35163L], functions=[], output=[0#35204, presideid#35163L])
   :           +- Exchange hashpartitioning(0#35204, presideid#35163L, 300)
   :              +- *(1) HashAggregate(keys=[0 AS 0#35204, presideid#35163L], functions=[], output=[0#35204, presideid#35163L])
   :                 +- Scan hive live.report_entertainment_hall_preside_detail [presideid#35163L], HiveTableRelation `live`.`report_entertainment_hall_preside_detail`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [roomid#35161L, title#35162, presideid#35163L, nickname#35164, star_time#35165, end_time#35166, duration#35167, times_order#35168], [dt#35169], [isnotnull(dt#35169), (dt#35169 >= 20191216), (dt#35169 <= 20191222)]
   +- BroadcastExchange IdentityBroadcastMode
      +- *(5) HashAggregate(keys=[20191216_20191222#35209], functions=[count(1)], output=[cnt#35197L])
         +- Exchange hashpartitioning(20191216_20191222#35209, 300)
            +- *(4) HashAggregate(keys=[20191216_20191222 AS 20191216_20191222#35209], functions=[partial_count(1)], output=[20191216_20191222#35209, count#35211L])
               +- *(4) Project
                  +- *(4) Filter (isnotnull(user_id#35180L) && (user_id#35180L < 1000000000))
                     +- Scan hive dwd.live_room_chat_play_daily_log_di [user_id#35180L], HiveTableRelation `dwd`.`live_room_chat_play_daily_log_di`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [chat_id#35174L, live_id#35175L, room_id#35176L, anchor_id#35177L, device_id#35178, client_name#35179, user_id#35180L, play_duration_secs#35181L, app_name#35182, app_topic_name#35183, create_timestamp#35184, ip#35185, series_id#35186, operation_name#35187, text#35188, send_timestamp#35189, insert_timestamp#35190, app_id#35191, status_code#35192, play_source#35193], [dt#35194], [isnotnull(dt#35194), (dt#35194 >= 20191216), (dt#35194 <= 20191222)]

Final conclusions

  1. If the join key is a constant, a left join avoids the rewrite: as the plans above show, with LeftOuter the folded join condition is kept, so the planner picks BroadcastNestedLoopJoin instead of a CartesianProduct (see the sketch after this list).
  2. Alternatively, drop the intermediate result sets and write the aggregation SQL directly; when the constant key and the GROUP BY sit in the same SELECT, CollapseProject never fires and the inner join is not rewritten either.
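A minimal sketch of workaround 1, assuming tb_a and tb_b are registered as temp views exactly as in the reproduction above (the view registration and the `workaroundSql` name are illustrative, not part of the original case):

```scala
// Same query as the problematic one, but with a LEFT JOIN so the (folded) join
// condition survives optimization.
// Assumes tb_a / tb_b were registered, e.g. via df.createOrReplaceTempView("tb_a").
val workaroundSql =
  """select tb_a.week, cnt, anchor_num
    |from tb_a left join tb_b on tb_a.week = tb_b.week""".stripMargin

spark.sql(workaroundSql).explain(true)
// Expected: the physical plan shows BroadcastNestedLoopJoin (as in the left-join
// plan above), not CartesianProduct, so the task count stays reasonable.
```

Workaround 2 corresponds to how tb_a itself is built above: the constant key and the GROUP BY live in one SELECT, so there is no intermediate Project for CollapseProject to merge.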