cindygl opened this issue 6 years ago
When HiveQL runs, Hive translates the SQL statement into MapReduce jobs. To see how a query will actually be executed, use the EXPLAIN keyword: it prints the HiveQL execution plan, one of the most important tools for Hive tuning. Syntax:
```
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION] query
```
When explained, the query is translated into a sequence of stages, and the output consists of three parts: the abstract syntax tree of the query (shown with EXTENDED), the dependencies between the stages (STAGE DEPENDENCIES), and the description of each stage (STAGE PLANS).
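As a quick sketch of what the optional keywords add (availability varies by Hive version, so treat this as a rough guide; VECTORIZATION and LOCKS, for example, only exist in newer releases):

```sql
-- Plain EXPLAIN: stage dependencies + stage plans only.
explain select * from emp;

-- EXTENDED: adds the abstract syntax tree plus file/partition/SerDe level detail.
explain extended select * from emp;

-- DEPENDENCY: emits JSON listing the tables and partitions the query reads.
explain dependency select * from emp;

-- AUTHORIZATION: shows the query's inputs/outputs and the privileges it needs.
explain authorization select * from emp;
```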
Examples:

1. Simple single-table query

For `explain select * from emp;`:
STAGE DEPENDENCIES: the plan for this HiveQL statement (shown below) contains only one stage, a root stage that depends on no other stage ("Stage-0 is a root stage").
STAGE PLANS: Stage-0 is a fetch operation. TableScan: reads the table named in FROM; its description carries the table alias, row count, data size, and so on. Select Operator: projects the columns, listing the column names and types as well as the output column names and sizes.
The output:

```
hive> explain select * from emp;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 2 Data size: 700 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), name (type: string), job (type: string), mgr (type: int), hiredate (type: string), salary (type: double), comm (type: double), depno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 2 Data size: 700 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.068 seconds, Fetched: 17 row(s)
hive>
```
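Note that this plan contains no MapReduce stage at all: a bare `SELECT *` is served by a fetch task. As a hedged sketch, this behavior is governed by `hive.fetch.task.conversion` (defaults vary by version; `more` is the usual modern default):

```sql
-- With 'more', simple SELECT / filter / LIMIT queries bypass MapReduce entirely.
set hive.fetch.task.conversion=more;
explain select * from emp;   -- plan shows only a Fetch Operator

-- With 'none', even this query would compile to a MapReduce job.
set hive.fetch.task.conversion=none;
```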
SQL on Hadoop join strategies:
- With a shuffle: common join => shuffle join => reduce join (three names for the same thing).
- Without a shuffle: mapjoin => broadcast join (the Spark term).

Generally the shuffle-free variant outperforms the shuffled one, provided a large table is joined with a small table and the small table does not exceed a fixed size threshold.
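That "fixed size" is configurable. A sketch of the relevant settings (the values shown are the common defaults, but check your Hive version):

```sql
-- Let Hive automatically rewrite a common join into a mapjoin when it can.
set hive.auto.convert.join=true;

-- A table below this size (in bytes) qualifies as the "small" side;
-- 25 MB is the usual default.
set hive.mapjoin.smalltable.filesize=25000000;
```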
2. join (common join)

shuffle: rows carrying the same key are routed to the same reduce task, so the join itself happens in the reduce phase.
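To see a common-join plan on a setup where automatic mapjoin conversion is enabled, the rewrite can be switched off first; a minimal sketch:

```sql
-- Turn off automatic mapjoin conversion so the shuffle-based plan is produced.
set hive.auto.convert.join=false;

explain select a.empno, a.name, b.dname
from emp a join dept b on a.depno = b.deptno;
-- The plan then shows both tables feeding Reduce Output Operators (the shuffle
-- on the join key) and a Join Operator inside the Reduce Operator Tree.
```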
3. join (mapjoin)

For `explain extended select a.empno,a.name,b.dname from emp a join dept b on a.depno=b.deptno;` (full output below), STAGE DEPENDENCIES splits execution into 3 stages and states the dependencies between them: Stage-4 is a root stage; Stage-3 depends on Stage-4; Stage-0 depends on Stage-3.
STAGE PLANS, Stage-4: Map Reduce Local Work: Stage-4 is a local MapReduce task. TableScan: reads table b from the FROM clause; the description again carries row counts and sizes. Filter Operator: an implicit WHERE filter, because the deptno fields on both sides of the ON condition must be non-null; rows and sizes are described as well. HashTable Sink Operator: builds an in-memory hash table, keyed on the join key, from the filtered rows of the small table, for use by the map-side join in the next stage.
Stage-3: Map Reduce: Stage-3 is a MapReduce job. TableScan: reads table a, describing its row count and size. Filter Operator: the same implicit non-null filter on the ON keys. Map Join Operator: the map join itself, stating who joins whom. outputColumnNames: the names and number of output columns. Select Operator: projects the columns, with field names, field types, and the output types and sizes. File Output Operator: writes the result to a temporary file; its description covers the compression setting, output file format, number of files, and so on. table: the input/output format and SerDe of that temporary output. The annotated Stage-3 plan:
```
Stage: Stage-3
  Map Reduce                          # Stage-3 is a MapReduce job
    Map Operator Tree:
        TableScan                     # reads the FROM table; row count and size
          alias: a
          Statistics: Num rows: 6 Data size: 700 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Filter Operator             # implicit WHERE filter: the ON keys must be non-null; rows/size again
            isSamplingPred: false
            predicate: depno is not null (type: boolean)
            Statistics: Num rows: 3 Data size: 350 Basic stats: COMPLETE Column stats: NONE
            Map Join Operator         # the map join: who joins whom
              condition map:
                   Inner Join 0 to 1
              keys:
                0 depno (type: int)
                1 deptno (type: int)
              outputColumnNames: _col0, _col1, _col12    # output column names and count
              Position of Big Table: 0
              Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
              Select Operator         # column projection: field names/types, output types and sizes
                expressions: _col0 (type: int), _col1 (type: string), _col12 (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                File Output Operator  # writes the result to a temp file; compression, file format, file count
                  compressed: false
                  GlobalTableId: 0
                  directory: hdfs://192.168.1.8:9000/tmp/hive/hadoop/56f4fb91-6cd0-44b1-89b5-231feb3ecbd6/hive_2018-07-29_07-03-40_556_6872713255475843225-1/-mr-10000/.hive-staging_hive_2018-07-29_07-03-40_556_6872713255475843225-1/-ext-10001
                  NumFilesPerFileSink: 1
                  Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                  Stats Publishing Key Prefix: hdfs://192.168.1.8:9000/tmp/hive/hadoop/56f4fb91-6cd0-44b1-89b5-231feb3ecbd6/hive_2018-07-29_07-03-40_556_6872713255475843225-1/-mr-10000/.hive-staging_hive_2018-07-29_07-03-40_556_6872713255475843225-1/-ext-10001/
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      properties:
                        columns _col0,_col1,_col2
                        columns.types int:string:string
                        escape.delim \
                        hive.serialization.extend.additional.nesting.levels true
                        serialization.format 1
                        serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  TotalFiles: 1
                  GatherStats: false
                  MultiFileSpray: false
    Local Work:
      Map Reduce Local Work
    Path -> Alias:
      hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/emp [a]
    Path -> Partition:
      ... (the dept/emp partition and table metadata here is identical to the
           Path -> Partition section of the full output below)
    Truncated Path -> Alias:
      /hive.db/emp [a]
```
The complete `explain extended` output:

```
hive> explain extended select a.empno,a.name,b.dname from emp a join dept b on a.depno=b.deptno;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               emp
            a
         TOK_TABREF
            TOK_TABNAME
               dept
            b
         =
            .
               TOK_TABLE_OR_COL
                  a
               depno
            .
               TOK_TABLE_OR_COL
                  b
               deptno
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  a
               empno
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  a
               name
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  b
               dname

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        b
          TableScan
            alias: b
            Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 depno (type: int)
                  1 deptno (type: int)
                Position of Big Table: 0

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 6 Data size: 700 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: depno is not null (type: boolean)
              Statistics: Num rows: 3 Data size: 350 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 depno (type: int)
                  1 deptno (type: int)
                outputColumnNames: _col0, _col1, _col12
                Position of Big Table: 0
                Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col12 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    directory: hdfs://192.168.1.8:9000/tmp/hive/hadoop/56f4fb91-6cd0-44b1-89b5-231feb3ecbd6/hive_2018-07-29_07-03-40_556_6872713255475843225-1/-mr-10000/.hive-staging_hive_2018-07-29_07-03-40_556_6872713255475843225-1/-ext-10001
                    NumFilesPerFileSink: 1
                    Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                    Stats Publishing Key Prefix: hdfs://192.168.1.8:9000/tmp/hive/hadoop/56f4fb91-6cd0-44b1-89b5-231feb3ecbd6/hive_2018-07-29_07-03-40_556_6872713255475843225-1/-mr-10000/.hive-staging_hive_2018-07-29_07-03-40_556_6872713255475843225-1/-ext-10001/
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        properties:
                          columns _col0,_col1,_col2
                          columns.types int:string:string
                          escape.delim \
                          hive.serialization.extend.additional.nesting.levels true
                          serialization.format 1
                          serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    TotalFiles: 1
                    GatherStats: false
                    MultiFileSpray: false
      Local Work:
        Map Reduce Local Work
      Path -> Alias:
        hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/emp [a]
      Path -> Partition:
        hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/dept
          Partition
            base file name: dept
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns deptno,dname,loc
              columns.comments
              columns.types int:string:string
              field.delim
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/dept
              name hive.dept
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct dept { i32 deptno, string dname, string loc}
              serialization.format
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 79
              transient_lastDdlTime 1532815859
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns deptno,dname,loc
                columns.comments
                columns.types int:string:string
                field.delim
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/dept
                name hive.dept
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct dept { i32 deptno, string dname, string loc}
                serialization.format
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 79
                transient_lastDdlTime 1532815859
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: hive.dept
            name: hive.dept
        hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/emp
          Partition
            base file name: emp
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns empno,name,job,mgr,hiredate,salary,comm,depno
              columns.comments
              columns.types int:string:string:int:string:double:double:int
              field.delim
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/emp
              name hive.emp
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct emp { i32 empno, string name, string job, i32 mgr, string hiredate, double salary, double comm, i32 depno}
              serialization.format
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 700
              transient_lastDdlTime 1532795621
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns empno,name,job,mgr,hiredate,salary,comm,depno
                columns.comments
                columns.types int:string:string:int:string:double:double:int
                field.delim
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://192.168.1.8:9000/user/hive/warehouse/hive.db/emp
                name hive.emp
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct emp { i32 empno, string name, string job, i32 mgr, string hiredate, double salary, double comm, i32 depno}
                serialization.format
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 700
                transient_lastDdlTime 1532795621
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: hive.emp
            name: hive.emp
      Truncated Path -> Alias:
        /hive.db/emp [a]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.084 seconds, Fetched: 231 row(s)
hive>
```
Concretely, the execution proceeds as follows.

Stage-4, the local map-side work: scan the small table b, drop rows whose deptno is null, and build an in-memory hash table from what remains (the HashTable Sink Operator); Hive runs this locally and hands the hash table to the mappers of the next stage.

Stage-3: scan the big table a, apply the same non-null filter on depno, and probe the hash table inside each mapper (the Map Join Operator); matched rows are projected by the Select Operator and written to a temporary file by the File Output Operator. There is no shuffle and no reduce phase. Stage-0 then simply fetches the result from that temporary file.
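When automatic conversion is unavailable or switched off, a mapjoin can also be requested explicitly with a hint; a hedged sketch (older Hive versions honor the hint by default, while newer ones may ignore it unless `hive.ignore.mapjoin.hint` is set to false):

```sql
-- Ask Hive to treat dept (alias b) as the small, hashed side of the join.
set hive.ignore.mapjoin.hint=false;

select /*+ MAPJOIN(b) */ a.empno, a.name, b.dname
from emp a join dept b on a.depno = b.deptno;
```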
Summary: EXPLAIN exposes how Hive turns a query into a sequence of stages. A simple single-table SELECT compiles to a single fetch stage with no MapReduce job at all. A common join shuffles both tables on the join key and performs the join in the reduce phase. A mapjoin builds a hash table from the small table in a local stage and performs the join in the map phase of the big table's scan, avoiding the shuffle entirely, which is why it is usually the faster choice when one side of the join is small enough.
Reference: https://acadgild.com/blog/join-optimization-in-apache-hive