Teradata - Githubissues

Explaining the EXPLAIN – Part 1 Joe Ramon Agenda  What is the EXPLAIN facility?  Where does the EXPLAIN output come FROM?  What does the Optimizer need to build a plan?  What does the EXPLAIN terminology mean?  What can be learned by reading the EXPLAIN text?  What can be done to influence the Optimizer?  Summary What is EXPLAIN? • The EXPLAIN facility provides an "English" translation of the plan the SQL Optimizer develops to service a request. • May be used on any SQL statement, except EXPLAIN itself. • Look for key words AND phrases • Execution time AND row count estimates depend on:

Are statistics collected? • Actual execution time depends on: Is DBS processing other requests? Is channel or network busy? How is EXPLAIN Text Generated? STATISTICS SQL REQUEST DD Dbase AccessRights RoleGrants (V2R5) TVM TVFields Indexes SYNTAXER RESOLVER SECURITY OPTIMIZER GENERATOR DISPATCHER AMP DD Cache EXPLAIN APPLY Information Known to Optimizer • Number of nodes in system • Number and type of CPU’s per node • Number of configured AMP Vprocs • Disk array configuration • Interconnect configuration • Amount and configuration of memory All are taken into account when calculating query cost. • Columns with indexes • Rows in the table • Rows per block • Values per column • Rows per value • Row length Additional Information Required by the Optimizer Optimizer – Random AMP Samples • Statistics collected by a random AMP sample apply in these cases … – row counts for the table are needed and statistics are not collected on PI. – indexed columns are used in the query and statistics are not collected. – With Teradata 12.0, statistics have been collected, but are considered stale. • By default, Teradata chooses an AMP for random AMP (or dynamic) data sampling. • Enhancement starting with Teradata 6.0. – When statistics are not available, the Optimizer can obtain random samples from more than one AMP when generating row counts for a query plan. – Random AMP sampling is controlled via a DBS Control parameter. Random AMP Sampling • For a table row count estimate, read one cylinder on 1 AMP. – Calculate the approximate number of rows in the table: • For NUSI estimates, read one cylinder from the NUSI subtable. – Uses a similar technique by counting the number of NUSI values in the cylinder. The table row count is divided by the extrapolated NUSI row count to get a rows/NUSI value. • Any skewed component in the sample skews the demographics. • For non-indexed columns without statistics, the optimizer uses fixed formulas to estimate the number of rows. For example, – Assumes 10% for one column in an equality condition – Assumes 7.5% for two columns, each in an equality condition, and ANDed together • Cost-based Optimizer - looks for lowest cost plan • Does not store plan - dynamically regenerates • As data demographics change, so may plan • Will only assign cost to steps for which there are choices • Assigns confidence factors on row estimates • Mature, large-table, decision-support optimization Optimizer Facts EXPLAIN Example EXPLAIN SELECT Last_Name, First_Name, Dept_Name, Job_Desc FROM Employee E INNER JOIN Department D ON E.Dept_Number = D.Dept_Number INNER JOIN Job J ON E.Job_code = J.Job_code ORDER BY 3, 1, 2; EXPLAIN Example (cont.) 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.E. 2) Next, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.J. 3) We lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.D. 4) We lock TFACT.E for read, we lock TFACT.J for read, and we lock TFACT.D for read. 5) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (726,754 bytes). The estimated time for this step is 0.02 seconds. 2) We do an all-AMPs RETRIEVE step from TFACT.J by way of an all-rows scan with no residual conditions into Spool 3 (all_amps), which is duplicated on all AMPs. Then we do a SORT to order Spool 3 by the hash code of (TFACT.J.Job_Code). The size of Spool 3 is estimated with high confidence to be 12,166 rows (450,142 bytes). The estimated time for this step is 0.01 seconds. 6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan with a condition of ("NOT (TFACT.E.JobCode IS NULL)"). Spool 2 and TFACT.E are joined using a single partition hash join, with a join : : 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.14 seconds. ….”Pseudo Table” Locks….  Prevents two users from getting conflicting locks with all- AMP requests.  All-AMP lock requests are handled as follows: − PE determines Table ID hash for an AMP to manage the all-AMP lock request. − Put pseudo lock on the table − Acquire lock on all AMPs EXPLAIN Terminology AMP PE PE AMP AMP AMP Determine Table ID hash First request Second request ….” Pseudo Table” Locks…. EXPLAIN Terminology (cont.) EXPLAIN Terminology (cont.) Most EXPLAIN text is easy to understand. The following additional definitions may help:  ... (Last Use) … A spool file is no longer needed and will be released when this step completes.  ... with no residual conditions … All applicable conditions have been applied to the rows.  ... END TRANSACTION … Transaction locks are released, and changes are committed.  ... by way of the sort key in spool field1 (dbname.tablename.colname) … Field1 is created to allow a tag sort. Teradata 12.0 includes the column name used for the sort. EXPLAIN Terminology (cont.) 5) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (726,754 bytes). The estimated time for this step is 0.02 seconds. 2) We do an all-AMPs RETRIEVE step from TFACT.J by way of an all-rows scan with no residual conditions into Spool 3 (all_amps), which is duplicated on all AMPs. Then we do a SORT to order Spool 3 by the hash code of (TFACT.J.Job_Code). The size of Spool 3 is estimated with high confidence to be 12,166 rows (450,142 bytes). The estimated time for this step is 0.01 seconds. 6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan with a condition of ("NOT (TFACT.E.JobCode IS NULL)"). Spool 2 and TFACT.E are joined using a single partition hash join, with a join with a join condition of ("TFACT.E.Dept_Number = Dept_Number"). The result goes into Spool 4 (all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 4 by the hash code of (TFACT.E.Job_Code). The size of Spool 4 is estimated with low confidence to be 26,000 rows (1,690,000 bytes). The estimated time for this step is 0.04 seconds. 7) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of a RowHash match scan, which is joined to Spool 4 (Last Use) by way of a RowHash match scan. Spool 3 and Spool 4 are joined using a merge join, with a join condition of ("Job_Code = Job_Code"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.D.Dept_Name, TFACT.E.Last_Name, TFACT.E.First_Name). The size of Spool 1 is estimated with low confidence to be 26,000 rows (3,822,000 bytes). The estimated time for this step is 0.08 seconds. 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. EXPLAIN Terminology (cont.) Most EXPLAIN text is easy to understand. The following additional definitions may help:  ... we do an ABORT test … Caused by an ABORT or ROLLBACK statement.  ... We execute these steps in parallel. The following indented steps are executed in parallel. 3) We lock DBC.ArchiveLoggingObjsTbl for read on a RowHash, we lock DBC.TVM for write on a RowHash, we lock DBC.TVFields for write on a RowHash, we lock DBC.Indexes for write on a RowHash, we lock DBC.DBase for read on a RowHash, and we lock DBC.AccessRights for write on a RowHash. 4) We execute the following steps in parallel. 1) We do a single-AMP ABORT test from DBC.ArchiveLoggingObjsTbl by way of the primary index. 2) We do a single-AMP ABORT test from DBC.DBase by way of the unique primary index. 3) We do a single-AMP ABORT test from DBC.TVM by way of the unique primary index. 4) We do an INSERT into DBC.TVFields (no lock required). : : 7) We do an INSERT into DBC.Indexes (no lock required). 8) We do an INSERT into DBC.TVM (no lock required). 9) We INSERT default rights to DBC.AccessRights for TFACT.Orders. EXPLAIN Terminology (cont.)  ... which is redistributed by hash code to all AMPs (dbname.tablename.colname) … Redistributing data (in SPOOL) in preparation for a join. Teradata 12.0 includes the column name.  ... which is duplicated on all AMPs … Duplicating data (in SPOOL) from the smaller table in preparation for a join.  ... (one_amp) or (group_amps) or (all_amps) … Indicates one AMP, a subset of AMPs, or all of the AMPs will participate.  ... ("NOT (table_name.column_name IS NULL)") … Feature where optimizer realizes that the column being joined to is NOT NULL or has referential integrity.  ... eliminating duplicate rows ... Duplicate rows only exist in spool files, not set tables. EXPLAIN Terminology (cont.) 4) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (allamps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (726,754 bytes). The estimated time for this step is 0.02 seconds. 5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan. Spool 2 and TFACT.E are joined using a single partition hash join, with a join condition of ("TFACT.E.Dept_Number = Dept_Number"). The result goes into Spool 1 (group_amps), which is redistributed by the hash code of (TFACT.E.First_Name, TFACT.E.Last_Name, TFACT.E.Employee_Number, TFACT.D.Dept_Name) to all AMPs. The size of Spool 1 is estimated with low confidence to be 26,000 rows (3,614,000 bytes). The estimated time for this step is 0.09 seconds. 6) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with a condition of ("NOT (TFACT.D.Dept_Mgr_Number IS NULL)") into Spool 3 (all_amps), which is redistributed by the hash code of (TFACT.D.Dept_Mgr_Number) to all AMPs. Then we do a SORT to order Spool 3 by row hash. The size of Spool 3 is estimated with high confidence to be 1,403 rows (51,911 bytes). The estimated time for this step is 0.01 seconds. 7) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of a RowHash match scan, which is joined to TFACT.E by way of a RowHash match scan with no residual conditions. Spool 3 and TFACT.E are joined using a merge join, with a join condition of ("TFACT.E.Employee_Number = Dept_Mgr_Number"). The result goes into Spool 1 (group_amps), which is redistributed by the hash code of (TFACT.E.First_Name, TFACT.E.Last_Name, TFACT.E.Employee_Number, TFACT.D.Dept_Name) to all AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 eliminating duplicate rows. The size of Spool 1 is estimated with low confidence to be 27,403 rows (3,809,017 bytes). The estimated time for this step is 0.06 seconds. EXPLAIN Terminology (cont.)  ... we do a BMSMS (bit map set manipulation step) … Doing a NUSI Bit Map operation. : 3) We do a BMSMS (bit map set manipulation) step that builds a bit map for TFACT.Employee by way of index # 4 "TFACT.E.Job_Code = 3500" which is placed in Spool 2. The estimated time for this step is 0.01 seconds. 4) We do an all-AMPs RETRIEVE step from TFACT.E by way of index # 8 TFACT.E.Dept_Number = 1310" and the bit map in Spool 2 (Last Use) with a residual condition of ("TFACT.E.Job_Code = 3500") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 60 rows (4620 bytes). The estimated time for this step is 0.02 seconds. 5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.03 seconds. Note: Statistics were collected on the NUSIs Job_Code and Dept_Number. Synchronized Scanning In the case of multiple users that access the same table at the same time, the system can do a synchronized scan (sync scan) on the table. 112747 1766 100766 3001 Frankel Allan 034982 2212 106363 3005 Bench John 310229 2231 108222 3100 Palmer Carson 209181 1235 108221 3001 Smith Buster 123881 2433 101433 3007 Walton Sam 223431 2500 105200 3101 Brooks Steve 221015 1019 108222 3199 Woods Tiger 121332 2281 101281 3007 Walton John 118314 2100 101100 3002 Ramon Anne 104631 1279 100279 3002 Roberts Julie 210110 1201 101222 3003 Douglas Michael 210001 1205 105432 3022 Morgan Joe 100076 1011 104321 3021 Anderson Sparky 100045 1012 101231 3087 Michelson Phil 319116 1219 121871 3025 Crawford Cindy : : : : : : : : : : : : Query 1 Begins Query 2 Begins Query 3 Begins EXPLAIN SELECT * FROM daily_sales ORDER BY 1; Synchronized Scanning (cont.) : 3) We do an all-AMPs RETRIEVE step from TFACT.daily_sales by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.daily_sales.Item_id). The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with high confidence to be 76,685 rows (2,530,605 bytes). The estimated time for this step is 0.09 seconds. : The EXPLAIN facility may express “confidence” for a retrieve from a table. Some of the phrases used are: . . . with high confidence . . . − Restricting conditions exist on index(es) or column(s) that have collected statistics. . . . with low confidence . . . − Restricting conditions exist on index(es) having no statistics, but estimates can be based upon a sampling of the index(es). − Restricting conditions exist on index(es) or column(s) that have collected statistics but are “AND-ed” together with conditions on non-indexed columns. − Restricting conditions exist on index(es) or column(s) that have collected statistics but are “OR-ed” together with other conditions. . . . with no confidence . . . − Conditions outside the above. Understanding Row and Time Estimates The following are “confidence” phrases for a join: . . . with index join confidence . . . − A join condition via a primary index. . . . with high confidence . . . − One input relation has high confidence and the other has high or index join confidence. . . . with low confidence . . . − One input relation has low confidence and the other has low, high, or join index confidence. . . . with no confidence . . . − One input relation has no confidence. − Statistics do not exist for either join field. Understanding Row and Time Estimates Understanding Row and Time Estimates (cont.) 5) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (726,754 bytes). The estimated time for this step is 0.02 seconds. 2) We do an all-AMPs RETRIEVE step from TFACT.J by way of an all-rows scan with no residual conditions into Spool 3 (all_amps), which is duplicated on all AMPs. Then we do a SORT to order Spool 3 by the hash code of (TFACT.J.Job_Code). The size of Spool 3 is estimated with high confidence to be 12,166 rows (450,142 bytes). The estimated time for this step is 0.01 seconds. 6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan with a condition of ("NOT (TFACT.E.JobCode IS NULL)"). Spool 2 and TFACT.E are joined using a single partition hash join, with a join with a join condition of ("TFACT.E.Dept_Number = Dept_Number"). The result goes into Spool 4 (all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 4 by the hash code of (TFACT.E.Job_Code). The size of Spool 4 is estimated with low confidence to be 26,000 rows (1,690,000 bytes). The estimated time for this step is 0.04 seconds. 7) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of a RowHash match scan, which is joined to Spool 4 (Last Use) by way of a RowHash match scan. Spool 3 and Spool 4 are joined using a merge join, with a join condition of ("Job_Code = Job_Code"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.D.Dept_Name, TFACT.E.Last_Name, TFACT.E.First_Name). The size of Spool 1 is estimated with low confidence to be 26,000 rows (3,822,000 bytes). The estimated time for this step is 0.08 seconds. 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. Row estimates: May be estimated using random samples, statistics or indexes Are assigned a confidence level - high, low or none Affect timing estimates - more rows, more time needed Timings: Used to determine the ‘lowest cost’ plan Total cost generated if all processing steps have assigned cost Not intended to predict wall-clock time, useful for comparisons Miscellaneous Notes: Estimates too large to display show 3 asterisks (**). The accuracy of the time estimate depends upon the accuracy of the row estimate. Query Cost Estimates • Low and no confidence may indicate a need to collect statistics on indexes or columns involved in restricting conditions. • You may otherwise consider a closer examination of the conditions in the query for possible changes that may improve the confidence. • Collecting statistics or altering the conditions has no real impact unless it influences the optimizer to pick a better plan. Understanding Row and Time Estimates EXPLAIN of Create Table EXPLAIN CREATE TABLE Orders (order_id INTEGER NOT NULL ,order_date DATE FORMAT 'yyyy-mm-dd' ,cust_id INTEGER) UNIQUE PRIMARY INDEX (order_id); 1) First, we lock TFACT.Orders for exclusive use. : 4) We execute the following steps in parallel. 1) We do a single-AMP ABORT test from DBC.ArchiveLoggingObjsTbl by way of the primary index. 2) We do a single-AMP ABORT test from DBC.DBase by way of the unique primary index. 3) We do a single-AMP ABORT test from DBC.TVM by way of the unique primary index. 4) We do an INSERT into DBC.TVFields (no lock required). 5) We do an INSERT into DBC.TVFields (no lock required). 6) We do an INSERT into DBC.TVFields (no lock required). 7) We do an INSERT into DBC.Indexes (no lock required). 8) We do an INSERT into DBC.TVM (no lock required). 9) We INSERT default rights to DBC.AccessRights for TFACT.Orders. 5) We create the table header. 6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> No rows are returned to the user as the result of statement 1. Unique Primary INDEX Request (UPI) EXPLAIN SELECT FROM Employee WHERE Employee_Number = 1104066; Simplest and most efficient type of access. Spool is not used. 1) First, we do a single-AMP RETRIEVE step from TFACT.Employee by way of the unique primary index "TFACT.Employee.Employee_Number = 1104066" with no residual conditions. The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. Residual condition does not help the query. No change to plan or time estimate. UPI Request With Residual Condition EXPLAIN SELECT FROM Employee WHERE Employee_Number = 1104066 AND Dept_Number = 1404; 1) First, we do a single-AMP RETRIEVE step from TFACT.Employee by way of the unique primary index "TFACT.Employee.Employee_Number = 1104066" with a residual condition of ("TFACT.Employee.Dept_Number = 1404"). The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. Full Table Scan EXPLAIN SELECT FROM Employee WHERE Emp_Mgr_Number = 104043 AND Job_Code = 3405; 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.Employee. 2) Next, we lock TFACT.Employee for read. 3) We do an all-AMPs RETRIEVE step from TFACT.Employee by way of an allrows scan with a condition of ("(TFACT.Employee.Emp_Mgr_Number = 104043) AND (TFACT.Employee.Job_Code = 3405)") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 8 rows (616 bytes). The estimated time for this step is 0.02 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. Aggregations EXPLAIN SELECT dept_number, SUM(salary_amount) FROM Employee GROUP BY 1; 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.employee. 2) Next, we lock TFACT.employee for read. 3) We do an all-AMPs SUM step to aggregate from TFACT.employee by way of an all-rows scan with no residual conditions, grouping by field1 (TFACT.employee.Dept_Number). Aggregate Intermediate Results are computed globally, then placed in Spool 3. The size of Spool 3 is estimated with high confidence to be 1,403 rows (51,911 bytes). The estimated time for this step is 0.06 seconds. 4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of an allrows scan into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1,403 rows (57,523 bytes). The estimated time for this step is 0.02 seconds. 5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.08 seconds. INSERT/SELECT is the process of SELECTing data FROM one table and using it as input to be inserted into another table. Two different optimizations can occur: 1.) If the PI of the source AND destination tables are identical, an AMP local operation is used. 2.) If the target table is empty, a.) Transient Journaling is reduced b.) 127 KB block transfers are used If both conditions are satisfied, both optimizations are used. Optimized INSERT/SELECT Optimized INSERT/SELECT Example 1) First, we lock a distinct TFACT."pseudo table" for write on a RowHash to prevent global deadlock for TFACT.Employee_copy. 2) Next, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.Employee. 3) We lock TFACT.Employee_copy for write, and we lock TFACT.Employee for read. 4) We do an all-AMPs MERGE into TFACT.Employee_copy from TFACT.Employee. The size is estimated with no confidence to be 25,382 rows. The estimated time for this step is 2.29 seconds. 5) We spoil the parser's dictionary cache for the table. 6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> No rows are returned to the user as the result of statement 1. EXPLAIN INSERT INTO Employee_copy SELECT FROM Employee; If the target table has a different Primary Index, a standard insert SELECT process must be used. A BYNET operation will be used to relocate the SELECTed rows onto the target AMPs. This will require: a.) Single row inserts (vs. 127 KB blocks) b.) Transient journal entries for each row INSERT/SELECT With Different PI’s Non-Optimized INSERT/SELECT Example CREATE SET TABLE TFACT.Employee_CharPI (Employee_Number CHAR(10), Location_Number INTEGER, : Salary_Amount DECIMAL(10,2)) UNIQUE PRIMARY INDEX ( Employee_Number ); CREATE SET TABLE TFACT.Employee (Employee_Number INTEGER, Location_Number INTEGER, : Salary_Amount DECIMAL(10,2)) UNIQUE PRIMARY INDEX ( Employee_Number ); Non-Optimized INSERT/SELECT Example (cont.) EXPLAIN INSERT INTO Employee_CharPI SELECT FROM Employee; : 4) We do an all-AMPs RETRIEVE step from TFACT.Employee by way of an all-rows scan with no residual conditions into Spool 1 (all_amps), which is redistributed by the hash code of (TFACT.Employee.Employee_Number (CHAR(10), CHARACTER SET LATIN, NOT CASESPECIFIC, FORMAT '-(10)9')(CHAR(10), CHARACTER SET LATIN, NOT CASESPECIFIC, NAMED Employee_Number, FORMAT 'X(10)', NULL)) to all AMPs. Then we do a SORT to order Spool 1 by row hash. The size of Spool 1 is estimated with high confidence to be 26,000 rows (1,950,000 bytes). The estimated time for this step is 0.06 seconds. 5) We do an all-AMPs MERGE into TFACT.Employee_CharPI from Spool 1 (Last Use). The size is estimated with high confidence to be 26,000 rows. The estimated time for this step is 1.38 seconds. 6) We spoil the parser's dictionary cache for the table. 7) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> No rows are returned to the user as the result of statement 1. Unexpected Full Table Scan EXPLAIN SELECT FROM Employee_CharPI WHERE employee_number = 1104066 ; 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.Employee_CharPI. 2) Next, we lock TFACT.Employee_CharPI for read. 3) We do an all-AMPs RETRIEVE step from TFACT.Employee_CharPI by way of an all-rows scan with a condition of ("(TFACT.Employee_CharPI.Employee_Number (FLOAT, FORMAT'-9.99999999999999E-999'))= 1.10406600000000E 006") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 2 rows (166 bytes). The estimated time for this step is 0.02 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. Correct use of Primary INDEX EXPLAIN SELECT FROM Employee_CharPI WHERE employee_number = '1104066' ; 1) First, we do a single-AMP RETRIEVE step from TFACT.Employee_CharPI by way of the unique primary index "TFACT.Employee_CharPI.Employee_Number = '1104066 '" with no residual conditions. The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. EXPLAIN EXEC Dept_List (1404); Explaining Macros CREATE MACRO Dept_List (dept_no INTEGER) AS (SELECT FROM Employee WHERE dept_number = :dept_no;); 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.Employee. 2) Next, we lock TFACT.Employee for read. 3) We do an all-AMPs RETRIEVE step from TFACT.Employee by way of an allrows scan with a condition of ("TFACT.Employee.Dept_Number = 1404") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 40 rows (3,080 bytes). The estimated time for this step is 0.02 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. Explaining Macros (cont.) EXPLAIN USING (dept_no INTEGER) EXECUTE Dept_List (:dept_no); 1) First, we lock a distinct TFACT."pseudo table" for read on a RowHash to prevent global deadlock for TFACT.Employee. 2) Next, we lock TFACT.Employee for read. 3) We do an all-AMPs RETRIEVE step from TFACT.Employee by way of an allrows scan with a condition of ("TFACT.Employee.Dept_Number = :dept_no") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 19 rows (1,463 bytes). The estimated time for this step is 0.02 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. COLLECTED STATISTICS can help the Optimizer make better decisions using actual row counts and data distribution information. Collect Statistics on:  Non-unique indexes  Non-index join columns  Primary Index of small tables Collect Statistics considerations:  Requires a full table scan  Must be kept current  May be unnecessary for very large tables How to Influence the Optimizer • Proper index choices at physical design time • Add secondary, join, or hash indexes where helpful • Use equality-based join conditions • Experiment using EXPLAIN Other Factors To Help Optimizer • EXPLAIN is a tool to help you plan query resources • Teradata uses a cost-based optimizer • Adding Secondary, Join, or Hash Indexes gives optimizer more choices • Collecting Statistics allows better plan estimates • Most mature optimizer for mixed workload environments in the industry Summary Explaining the EXPLAIN – Part 2 Joe Ramon • What is the EXPLAIN facility? • The Optimizer and Collecting Statistics • Joins • PPI Tables • Join Indexes • Summary • Teradata 12.0 Enhancements Agenda • The EXPLAIN facility provides an "English" translation of the plan the SQL Optimizer develops to service a request. • May be used on any SQL statement, except EXPLAIN itself. • Look for key words AND phrases • Execution time AND row count estimates depend on: − Are statistics collected? • Actual execution time depends on: − Is DBS processing other requests? − Is channel or network busy? What is EXPLAIN? Teradata uses a “cost-based optimizer”. • The Optimizer evaluates the “costs” of all reasonable execution plans and the best choice is used. • Parallelism is automatic and unconditional • What does the “optimizer” optimize? − Access Path (Use index, table scan, dynamic bitmap, etc.) − Join Method (How tables are joined - merge join, product join, hash join, nested join) − Join Geography (How rows are relocated - redistribute, duplicate, AMP local, etc.) − Join Order (Sequence of table joins) Teradata Optimizer Optimizer Statistics • Statistics basically tell the Optimizer how many rows/value there are. • The Optimizer uses statistics to plan the best way to access data. – Stale statistics may mislead the Optimizer into poor decisions. • Helpful in accessing a column or index with uneven value distribution. – NUSI Bit Mapping is much more likely to be considered if there are collected statistics. • Statistics remain valid across a reconfiguration of the system. • COLLECT/DROP STATISTICS places an access lock on the data table and a row-hash write lock on DBC.TVFields or DBC.Indexes. – DBC.TVFields – holds statistics collected for single column or single column index – DBC.Indexes – holds statistics collected for multi-column or multicolumn index Statistics Data – What is Collected? Before 12.0, the statistics have a frequency distribution of 100 intervals. Starting with Teradata 12.0, statistics for a column or index reside in a frequency distribution of 200 intervals (internal histogram). Each interval represents about .5% (or 1%) of the table’s rows or high bias values. Summary Section (Interval #0) – Table Level Information • Represents domain across entire table • Most frequent value for the column or index – modal value • # of rows with the most frequent value • # of values not equal to the most frequent value – non-modal values • # of rows not equal to the most frequent value • # of NULLs • Minimum value 200 (or 100) Intervals Summary Section (Interval #0) Statistics Data – What is Collected? (cont.) If statistics are collected, the histogram will have 200 intervals. Intervals – Range Level Information – represents ranges within the domain • Each range has approximately the same number of rows • Maximum or highest value • Most frequent value – value that occurs mostly frequently in the range – modal value • Number of rows with the most frequent value • Number of other values not equal to the most frequent – non-modal values • Number of rows not equal to the most frequent 200 Intervals Summary Section (Interval #0) Statistics Data – What is Collected? (cont.) • In order to reduce the statistical impact of values that occur frequently, values that occur frequently are treated specially. – If a specific value in a column occurs in more than .25% (1/400) rows, that value is considered a “Loner Value”. – A “Loner” value and its statistics are kept in a High Bias interval. – Up to 2 Loner values (and statistics) may be stored in 1 High Bias interval. • This effectively helps to “even out” the variance of the remaining intervals. • Tools such as Teradata Statistics Wizard can be used to display the statistical information in the intervals. High Bias Intervals (contain 1 or 2 Loner Values) 198 Intervals Summary Section (Interval #0) Statistics Example Assume a table has 200,000 rows and statistics are collected on col1. Each interval will represent about 1000 rows. 197 more Intervals Summary Section 1125 1375 ~ 1000 rows ~ 1000 rows ~ 1000 rows Max Value 1605 SQL Statement Optimizer assumes SELECT FROM tabx WHERE col1 = 1200; 400 SELECT FROM tabx WHERE col1 = 1075; 180 SELECT FROM tabx WHERE col1 = 1492; 7 SELECT FROM tabx WHERE col1 = 1300; 10 SELECT FROM tabx WHERE col1 BETWEEN 1150 AND 1250; 700 SELECT * FROM tabx WHERE col1 BETWEEN 1150 AND 1550; 1350 ~ Interval #1 Maximum Value - 1125 Most Frequent Value - 1075 Most Frequent Rows - 180 Other Values - 41 Other Rows - 820 Interval #2 Maximum Value - 1375 Most Frequent Value - 1200 Most Frequent Rows - 400 Other Values - 60 Other Rows - 600 Interval #3 Maximum Value - 1605 Most Frequent Value - 1490 Most Frequent Rows - 300 Other Values - 100 Other Rows - 700 ~ Interval

0

Summary Section • Merge Joins usually result from INNER JOIN operations • Joins the rows of two tables based on an equality condition of one or more COLUMNs • Is optimized when the join COLUMN(s) are the primary INDEX COLUMN(s) of both tables • If not, some redistribution or duplication of rows is necessary Rule of all Joins Two rows must be on the same AMP to be joined Merge Joins Merge Joins AND the Primary INDEX A.) The join COLUMN is a Primary INDEX of both tables. SELECT .... FROM T1, T2 WHERE T1.A = T2.A; No Redistribution Needed A B C PI A B C PI B.) The join COLUMN is a Primary INDEX of one of the tables. SELECT .... FROM T1, T2 WHERE T1.A = T2.B; Redistribute T2 based on hash of COLUMN B. A B C PI A B C PI A B C PI Spool file C.) The join COLUMN is a Primary INDEX of neither table. SELECT .... FROM T1, T2 WHERE T1.B = T2.C; A B C PI A B C PI A B C PI A B C PI (Duplication is also a possibility.) Spool file Spool file Redistribute both tables. * Case A - Join COLUMN is Primary INDEX of both tables No movement necessary - best case Case B - Join COLUMN is Primary INDEX of one table Hash redistribute the non-Primary INDEX table or Duplicate the 'smaller' table on all AMPs then locally build AND sort the non-PI table into join COLUMN hash sequence Case C - Join COLUMN is Primary INDEX of neither table Hash redistribute both tables or Duplicate the 'smaller' table on all AMPs then locally build AND sort both tables into join COLUMN hash sequence Join costs increase with the number of rows which must be moved AND sorted A join plan for the same query will vary over time as the table demographics change Merge Join Categories Show an employee’s name AND their manager's name. Merge Join and Redistribution EXPLAIN SELECT M.Last_Name, M.First_Name, E.Last_Name, E.First_Name FROM Employee M INNER JOIN Employee E ON M.Employee_Number = E.Emp_Mgr_Number; : (Locking step) 2) Next, we lock TFACT.E for read. 3) We do an all-AMPs RETRIEVE step from TFACT.E by way of an all-rows scan with a condition of ("NOT (TFACT.E.Emp_Mgr_Number IS NULL)") into Spool 2 (all_amps), which is redistributed by hash code of (TFACT.E.Emp_Mgr_Number) to all AMPs. Then we do a SORT to order Spool 2 by row hash. The size of Spool 2 is estimated with high confidence to be 27,000 rows (1,215,000 bytes). The estimated time for this step is 0.05 seconds. 4) We do an all-AMPs JOIN step from TFACT.M by way of a RowHash match scan with no residual conditions, which is joined to Spool 2 (Last Use) by way of a RowHash match scan. TFACT.M and Spool 2 are joined using a merge join, with a join condition of ("TFACT.M.Employee_Number = Emp_Mgr_Number"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.M.Last_Name, TFACT.E.Last_Name). The size of Spool 1 is estimated with low confidence to be 27,000 rows (2,079,000 bytes). The estimated time for this step is 0.05 seconds. : Hash Join With Duplication Show all employees and their department names. EXPLAIN SELECT Last_Name, First_Name, Dept_Name FROM Employee E INNER JOIN Department D ON E.Dept_Number = D.Dept_Number; : (Locking steps) 4) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (726,754 bytes). The estimated time for this step is 0.02 seconds. 5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan. Spool 2 and TFACT.E are joined using a single partition hash_join, with a join condition of ("TFACT.E.Dept_Number = Dept_Number"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.E.Last_Name, TFACT.E.First_Name). The size of Spool 1 is estimated with low confidence to be 27,000 rows (1,863,000 bytes). The estimated time for this step is 0.06 seconds. : Exclusion Merge Join Show all employees who have no assigned phone. EXPLAIN SELECT employee_number FROM Employee EXCEPT SELECT employee_number FROM Emp_Phone; 4) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from TFACT.Employee by way of an all-rows scan with no residual conditions into Spool 1 (all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the hash code of (TFACT.Employee.Employee_Number) the sort key in spool field1 eliminating duplicate rows. The size of Spool 1 is estimated with high confidence to be 27,000 rows (675,000 bytes). The estimated time for this step is 0.03 seconds. 2) We do an all-AMPs RETRIEVE step from TFACT.Emp_Phone by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 2 by the hash code of (TFACT.Emp_Phone.Employee_Number) the sort key in spool field1 eliminating duplicate rows. The size of Spool 2 is estimated with high confidence to be 26,000 rows (650,000 bytes). The estimated time for this step is 0.04 seconds. 5) We do an all-AMPs JOIN step from Spool 1 (Last Use) by way of an all-rows scan, which is joined to Spool 2 (Last Use) by way of an all-rows scan. Spool 1 and Spool 2 are joined using an exclusion merge join, with a join condition of ("Field_1 = Field_1"). The result goes into Spool 3 (group_amps), which is built locally on the AMPs. The size of Spool 3 is estimated with low confidence to be 1,000 rows (25,000 bytes). The estimated time for this step is 0.02 seconds. Product Joins EXPLAIN SELECT dept_name, last_name, first_name FROM Employee E INNER JOIN Department D ON (E.dept_number = D.dept_number OR E.employee_number = D.dept_mgr_number); 4) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (805,322 bytes). The estimated time for this step is 0.02 seconds. 5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan with no residual conditions. Spool 2 and TFACT.E are joined using a product join, with a join condition of ("(TFACT.E.Dept_Number = Dept_Number) OR (TFACT.E.Employee_Number = Dept_Mgr_Number)"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 28,403 rows (1,959,807 bytes). The estimated time for this step is 1.58 seconds. 6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 1.60 seconds. Cartesian Product Join EXPLAIN SELECT E.employee_number ,E.last_name ,E.first_name ,D.dept_name ,P.phone_number FROM Employee E, Department D, Emp_Phone P; Cartesian Product Join 5) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from TFACT.D by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 19,642 rows (648,186 bytes). The estimated time for this step is 0.02 seconds. : 6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.E by way of an all-rows scan with no residual conditions. Spool 2 and TFACT.E are joined using a product join, with a join condition of ("(1=1)"). The result goes into Spool 4 (all_amps), which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 4 is estimated with high confidence to be 37,881,000 rows (2,462,265,000 bytes). The estimated time for this step is 30.07 seconds. 7) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of an all-rows scan, which is joined to Spool 4 (Last Use) by way of an all-rows scan. Spool 3 and Spool 4 are joined using a product join, with a join condition of ("(1=1)"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 1 is estimated with high confidence to be 1,969,812,000,000 rows (151,675,524,000,000 bytes). The estimated time for this step is 491 hours and 59 minutes. 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 491 hours and 59 minutes. EXPLAIN Terminology for PPI tables "a single partition of" or "n partitions of" • Indicates that an AMP or AMPs only need to access a single partition or n partitions of a table – indicates partition elimination occurred. • Partition elimination can occur for SELECTs, UPDATE, and DELETEs. – For a DELETE, Optimizer recognizes partitions in which all rows are deleted. – Rows in such partitions are deleted without using the transient journal. "SORT to partition Spool m by rowkey" • The spool is to be sorted by rowkey (partition and hash). • Partitioning the spool file in this way allows for a faster join with the partitioned table. "a rowkey-based" • Indicates an equality join on the rowkey. • In this case, there are equality constraints on the partitioning columns and primary index columns. "enhanced by dynamic partition …" • Indicates a join condition (V2R5.1) where dynamic partition elimination has been used. EXPLAIN Terminology for PPI tables Access of Table with PPI vs. NPPI EXPLAIN SELECT FROM Claim_PPI WHERE c_claimdate BETWEEN DATE '2008-01-01' AND DATE '2008-01-31'; 3) We do an all-AMPs RETRIEVE step from a single partition of DS.Claim_PPI with a condition of ("(DS.Claim_PPI.c_claimdate <= DATE '2008-01-31') AND (DS.Claim_PPI.c_claimdate >= DATE '2008-01-01')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 21,100 rows. The estimated time for this step is 0.44 seconds. EXPLAIN SELECT FROM Claim_NPPI WHERE c_claimdate BETWEEN DATE '2008-01-01' AND DATE '2008-01-31'; 3) We do an all-AMPs RETRIEVE step from DS.Claim_NPPI by way of an all-rows scan with a condition of ("(DS.Claim_NPPI.c_claimdate <= DATE '2008-01-31') AND (DS.Claim_NPPI.c_claimdate >= DATE '2008-01-01')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 21,100 rows. The estimated time for this step is 49.10 seconds. Access via Primary Index EXPLAIN SELECT FROM Claim_PPI WHERE c_claimid = 260221; 1) First, we do a single-AMP RETRIEVE step from all partitions of DS.Claim_PPI by way of the primary index "DS.Claim_PPI.c_claimid = 260221" with a residual condition of ("DS.Claim_PPI.c_claimid = 260221") into Spool 1 (one-amp), which is built locally on that AMP. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 1 row. The estimated time for this step is 0.09 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.09 seconds. EXPLAIN SELECT FROM Claim_NPPI WHERE c_claimid = 260221; 1) First, we do a single-AMP RETRIEVE step from DS.Claim_NPPI by way of the unique primary index "DS.Claim_NPPI.c_claimid = 260221" with no residual conditions. The estimated time for this step is 0.01 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.01 seconds. Place a USI on Claim Number If the partitioning columns are not part of the Primary Index, the Primary Index cannot be unique (e.g., Claim_Date is not part of the PI). To maintain uniqueness on the Primary Index, you can create a USI on the PI (e.g., Claim_Number). This allows faster access to specific claim rows. CREATE UNIQUE INDEX (c_claimid) ON Claim_PPI ; 1) First, we do a two-AMP RETRIEVE step from DS.Claim_PPI by way of unique index # 4 "DS.Claim_PPI.c_claimid = 260221" with no residual conditions. The estimated time for this step is 0.01 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.01 seconds. EXPLAIN SELECT FROM Claim_PPI WHERE c_claimid = 260221; Place a NUSI on Claim Number You can create a NUSI on the PI (e.g., Claim_Number) in order to allow faster access to specific claim rows. This is a special NUSI case where only one-AMP is accessed. CREATE INDEX (c_claimid) ON Claim_PPI ; EXPLAIN SELECT FROM Claim_PPI WHERE c_claimid = 260221; 1) First, we do a single-AMP RETRIEVE step from DS.Claim_PPI by way of index # 4 "DS.Claim_PPI.c_claimid = 260221" with no residual conditions into Spool 1 (group_amps), which is built locally on that AMP. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 1 row. The estimated time for this step is 0.01 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.01 seconds. Access with Partition Elimination EXPLAIN SELECT FROM Sales_PPI WHERE sales_date BETWEEN '2008-01-05' AND '2008-04-15' ; 3) We do an all-AMPs RETRIEVE step from 4 partitions of TFACT.Sales_PPI with a condition of ("(TFACT.Sales_PPI.sales_date >= DATE '2008-01-01') AND (TFACT.Sales_PPI.sales_date <= DATE '2008-04-15')") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 450 rows (57,600 bytes). The estimated time for this step is 0.03 seconds. EXPLAIN SELECT FROM Sales_NPPI WHERE sales_date BETWEEN '2008-01-01' AND '2008-04-15' ; 3 We do an all-AMPs RETRIEVE step from TFACT.Sales_NPPI by way of an all-rows scan with a condition of ("(TFACT.Sales_NPPI.sales_date >= DATE '2008-01-01') AND (TFACT.Sales_NPPI.sales_date <= DATE '2008-04-15')") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 450 rows (57,600 bytes). The estimated time for this step is 0.42 seconds. Delete of Partition Data Note: The ranges of dates includes 3 complete partitions and a portion of a partition. EXPLAIN DELETE FROM Sales_PPI WHERE sales_date BETWEEN '2001-01-01' AND '2001-04-15' ; 3) We do an all-AMPs DELETE from a single partition of TFACT.Sales_PPI with a condition of ("(TFACT.Sales_PPI.sales_date >= DATE '2001-01-01') AND (TFACT.Sales_PPI.sales_date <= DATE '2001-04-15')"). 4) We do an all-AMPs DELETE of 3 partitions from TFACT.Sales_PPI with a condition of ("(TFACT.Sales_PPI.sales_date >= DATE '2001-01-01') AND (TFACT.Sales_PPI.sales_date <= DATE '2001-04-15')"). The size is estimated with high confidence to be 41,383 rows. The estimated time for this step is 2.67 seconds. Join Indexes A Join Index is an optional index which may be created by the user. The basic types of Join Indexes will be described first. Multi-table Join Index – Pre-join multiple tables; stores and maintains the result from joining two or more tables. Single-table Join Index – Distribute the rows of a single table on the hash value of a foreign key value. Aggregate Join Index (AJI) – Aggregate (SUM or COUNT) one or more columns of a single table or multiple tables into a summary table. • If possible, the optimizer will use a Join Index rather than access the tables directly. − For known queries, this typically will result in much better performance. • A Single-Table Join Index can eliminate data distribution. • A Multi-Table Join Index pre-joins the data and can eliminate join processing. • An Aggregate Join Index can eliminate aggregation processing. • Provides the optimizer with additional options and the optimizer may use the join index if it “covers” the query. • Join Indexes are automatically updated as the table rows are updated. − A Join Index may not be accessed directly. Benefits of Join Indexes Join Index Example - Customer and Order Tables CREATE SET TABLE Customer ( c_custid INTEGER NOT NULL, c_lname VARCHAR(15), c_fname VARCHAR(10), c_address VARCHAR(50), c_city VARCHAR(20), c_state CHAR(2), c_zipcode INTEGER) UNIQUE PRIMARY INDEX ( c_custid ); CREATE SET TABLE Orders ( o_orderid INTEGER NOT NULL, o_custid INTEGER NOT NULL, o_orderstatus CHAR(1), o_totalprice DECIMAL(9,2) NOT NULL, o_orderdate DATE FORMAT 'YYYY-MM-DD' NOT NULL, o_orderpriority SMALLINT, o_clerk CHAR(16), o_shippriority SMALLINT, o_comment VARCHAR(79)) UNIQUE PRIMARY INDEX ( o_orderid ); Creating a Join Index – Multiple Tables CREATE JOIN INDEX Cust_Ord_JI AS SELECT (c_custid, c_lname), (o_orderid, o_orderstatus, o_orderdate) FROM Customer C INNER JOIN Orders O ON c_custid = o_custid PRIMARY INDEX (c_custid); How many valid customers have assigned orders on May 16, 2004? SELECT o_orderdate ,COUNT (c_custid) FROM Customer INNER JOIN Orders ON c_custid= o_custid WHERE o_orderdate = '2004-05-16' GROUP BY 1; Does a Join Index Help? Does a Join Index Help? This EXPLAIN is without a Join Index. 4) We do an all-AMPs RETRIEVE step from TFACT.Orders by way of an all-rows scan with a condition of ("TFACT.Orders.o_orderdate = DATE '2004-05-16'") into Spool 4 (all_amps), which is redistributed by the hash code of (TFACT.Orders.o_custid) to all AMPs. Then we do a SORT to order Spool 4 by row hash. The size of Spool 4 is estimated with high confidence to be 31 rows (651 bytes). The estimated time for this step is 0.02 seconds. 5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a RowHash match scan, which is joined to TFACT.Customer by way of a RowHash match scan with no residual conditions. Spool 4 and TFACT.Customer are joined using a merge join, with a join condition of ("TFACT.Customer.c_custid = o_custid"). The result goes into Spool 3 (all_amps), which is built locally on the AMPs. The size of Spool 3 is estimated with low confidence to be 31 rows (713 bytes). The estimated time for this step is 0.02 seconds. : 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.09 seconds. Does a Join Index Help? This EXPLAIN is with a Join Index. 3) We do an all-AMPs SUM step to aggregate from TFACT.CUST_ORD_JI by way of an all-rows scan with a condition of ("TFACT.CUST_ORD_JI.o_orderdate = DATE '2004-05-16'"), grouping by field1 ( TFACT.CUST_ORD_JI.o_orderdate). Aggregate Intermediate Results are computed globally, then placed in Spool 4. The size of Spool 4 is estimated with high confidence to be 1 row (25 bytes). The estimated time for this step is 0.03 seconds. 4) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of an all-rows scan into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1 row (29 bytes). The estimated time for this step is 0.00 seconds. 5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds. Join Index – Single Table • The Single Table Join Index is useful for resolving joins on large tables without having to redistribute the joined rows across the AMPs. • In some cases, this may perform better than building a multitable join index on the same columns. CREATE JOIN INDEX Orders_JI AS SELECT (o_custid), (o_orderid, o_orderstatus, o_totalprice, o_orderdate) FROM Orders PRIMARY INDEX (o_custid); Does a Join Index Help? Name the valid customers and their addresses who have open orders on May 16, 2004? SELECT c_custid, c_lname, c_address, o_orderdate FROM Customer C INNER JOIN Orders O ON c_custid = o_custid WHERE o_orderstatus = 'O' AND o_orderdate = '2004-05-16' ORDER BY 1; Does a Join Index Help? This EXPLAIN is without a single table Join Index. 4) We do an all-AMPs RETRIEVE step from TFACT.O by way of an all-rows scan with a condition of ("(TFACT.O.o_orderdate = DATE '2004-05-16') AND (TFACT.O.o_orderstatus = 'O')") into Spool 2 (all_amps), which is redistributed by the hash code of (TFACT.O.o_custid) to all AMPs. Then we do a SORT to order Spool 2 by row hash. The size of Spool 2 is estimated with low confidence to be 23 rows (483 bytes). The estimated time for this step is 0.02 seconds. 5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a RowHash match scan, which is joined to TFACT.C by way of a RowHash match scan with no residual conditions. Spool 2 and TFACT.C are joined using a merge join, with a join condition of ("TFACT.C.c_custid = o_custid"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.C.c_custid). The size of Spool 1 is estimated with low confidence to be 23 rows (1,242 bytes). The estimated time for this step is 0.02 seconds. 6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.05 seconds. Does a Join Index Help? This EXPLAIN is with a single table Join Index. 4) We do an all-AMPs JOIN step from TFACT.C by way of a RowHash match scan with no residual conditions, which is joined to TFACT.ORDERS_JI by way of a RowHash match scan with a condition of ("(TFACT.ORDERS_JI.o_orderdate = DATE '2004-05-16') AND (TFACT.ORDERS_JI.o_orderstatus = 'O')"). TFACT.C and TFACT.ORDERS_JI are joined using a merge join, with a join condition of ("TFACT.C.c_custid = TFACT.ORDERS_JI.o_custid"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1 (TFACT.C.c_custid). The size of Spool 1 is estimated with low confidence to be 23 rows (1,242 bytes). The estimated time for this step is 0.03 seconds. 5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.03 seconds. Note: The join index eliminated the need to redistribute the Orders table into spool. EXPLAIN Terminology (cont.)  ... Joined using a row id join … Indicates a join back condition with a join index. : 3) We do a single-AMP RETRIEVE step from TFACT.Orders_GI by way of the primary index "TFACT.Orders_GI.o_custid = 1500" with no residual conditions into Spool 2 (group_amps), which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 2 by the sort key in spool field1. The size of Spool 2 is estimated with high confidence to be 14 rows. The estimated time for this step is 0.00 seconds. 4) We do a group-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to TFACT.Orders. Spool 2 and TFACT.Orders are joined using a row id join, with a join condition of ("Field_1 = TFACT.Orders.RowID"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 4 rows. The estimated time for this step is 0.08 seconds. : • Use EXPLAIN: − to get costs of different approaches − to find unexpected product joins − for best approach for using join indexes − ensure that sync scanning is enabled • Collect statistics: − On non-unique indexes − On non-index join columns − On small tables − to get better costs − to get better plans Summary • Cost enhancements: will help TDWM achieve the goals of providing timely and reliable capacity planning for sales purposes through the new Query Estimator tool . The Insert, Update, Delete, Merge Delete, Merge Update steps now have their estimated running times. • Detailed enhancements: will allow workload definition classification rules to be set based on the estimated spool usage of DML statements and also help users to debug and refine their queries with redistributed columns, grouping/sorting columns printed. Teradata 12.0 changes • Add hashed/redistributed columns. • Add sorted columns. • Add grouped by columns. • Add view names. • Add spool size estimates. Info Enhancements (12.0) Potential problems & work around • The info printed may be very huge, if users do not want to see hashed / sorted / grouped by columns, they have 2 choices: • Run this command: Diagnostic NoDetailedExplain on for session; Or • Turn on the internal DBS Control Flag

DisableDetailedExplain = TRUE Informational enhancements (12.0)

ANPopov / Rep2

Teradata #2

0