Open EmmyMiao87 opened 3 years ago
Show table statistics
If table_name is filled in, the information of a single table will be displayed; if not, the information of all tables in the entire library will be displayed.
show table stats {db_name.table_name };
table_name | row_count | data_size |
---|---|---|
store | 100000 | 1MB |
Show column statistics
show column stats table_name;
column_name | num_distinct_value | min | max | num_nulls | avg_col_len | max_col_len |
---|---|---|---|---|---|---|
s_store_sk | 10 | 1 | 100 | 0 | 4 | 4 |
s_store_id | -1 | -1 | -1 | -1 | 4 | 4 |
In the first stage, only consider using statements to manually trigger collection tasks. The user triggers the collection task through ANALYZE statement
, and collects the statistical information of the specified table or column according to the provided parameters. The collected statistics are stored in Doris.
ANALYZE [ table_and_columns [, ...] ]
table_and_columns is:
table_name [ ( column_name [, ...] ) ]
The statistical values that can be injected manually include: row_count, data_size etc。
Manually inject statistics into the table
ALTER TABLE table_name set stats ("key" = "value");
For example: manually set the number of rows in the table store to 10000
ALTER TABLE store set stats ("row_count" = "10000");
After the update, you can see that the row count of the table store has been updated to 10000 by show table stats store . |
table_name | row_count | data_size |
---|---|---|---|
store | 10000 | 1MB |
Manually inject column statistics
ALTER TABLE table_name MODIFY COLUMN column_name set stats ("key" = "value");
For example: manually set the num_distinct_value of the column s_store_sk of the table store to 5.
ALTER TABLE store modify column s_store_sk set stats ("num_distinct_value" = "5");
After the update, you can see that the num_distinct_value value of the column s_store_id of the table store is updated to 5.
column_name | num_distinct_value | min | max | num_nulls | avg_col_len | max_col_len |
---|---|---|---|---|---|---|
s_store_sk | 10 | 1 | 100 | 0 | 4 | 4 |
s_store_id | 5 | -1 | -1 | -1 | 4 | 4 |
I have some questions about detailed design:
1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.
2 Does the current design have good scalability? For example, I want to add a new column statistic, named avg_char_len
, how to make it a lightweight operation. It seems that we need to do a schema change in current design.
1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.
In the first stage, we only collect these column statistics listed above. Is there anything you want to add?
2 Does the current design have good scalability? For example, I want to add a new column statistic, named
avg_char_len
, how to make it a lightweight operation. It seems that we need to do a schema change in current design.
Although the column statistical information listed here is in the form of a table structure, it is not actually a real table. So there is no need for schema change
I am interested in the CBO project in the OSPP,can anyone tell me where can I find the code that persist stats when manually inject stats, I only found update TableStats
in memory.
Background
Doris's current statistics are only inaccurate table rows and table size. Due to the lack of some more accurate column statistics, the Doris query plan cannot evaluate and select a better plan. Also the current statistical information is not a complete structure.
I propose to add a statistical information collection module in the query layer. Mainly responsible for the collection, storage, use and other functions of statistical information, so as to assist the optimizer to obtain better query planning.
Requirements