apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.79k stars 3.29k forks source link

[Proposal] Collect richer statistical information #6370

Open EmmyMiao87 opened 3 years ago

EmmyMiao87 commented 3 years ago

Background

Doris's current statistics are only inaccurate table rows and table size. Due to the lack of some more accurate column statistics, the Doris query plan cannot evaluate and select a better plan. Also the current statistical information is not a complete structure.

I propose to add a statistical information collection module in the query layer. Mainly responsible for the collection, storage, use and other functions of statistical information, so as to assist the optimizer to obtain better query planning.

Requirements

  1. Statistics include table stat (row count, data size), column stat (ndv, min, max, null_nums etc)
  2. Show statistics
  3. Collect statistics (sample, all)
  4. Storage statistics (memory and persistence)
  5. Up-to-date or out-to-date
  6. Compatible with old statistical logic
EmmyMiao87 commented 3 years ago

Show statistics

EmmyMiao87 commented 3 years ago

Collect Statistics

grammar

In the first stage, only consider using statements to manually trigger collection tasks. The user triggers the collection task through ANALYZE statement, and collects the statistical information of the specified table or column according to the provided parameters. The collected statistics are stored in Doris.

ANALYZE [ table_and_columns [, ...] ]

table_and_columns is:
    table_name [ ( column_name [, ...] ) ]
EmmyMiao87 commented 3 years ago

Manually inject statistics

grammar

The statistical values that can be injected manually include: row_count, data_size etc。

  1. Manually inject statistics into the table

    ALTER TABLE table_name set stats ("key" = "value");

    For example: manually set the number of rows in the table store to 10000

    ALTER TABLE store set stats ("row_count" = "10000");
    After the update, you can see that the row count of the table store has been updated to 10000 by show table stats store. table_name row_count data_size
    store 10000 1MB
  2. Manually inject column statistics

    ALTER TABLE table_name MODIFY COLUMN column_name set stats ("key" = "value");

    For example: manually set the num_distinct_value of the column s_store_sk of the table store to 5.

    ALTER TABLE store modify column s_store_sk set stats ("num_distinct_value" = "5");

    After the update, you can see that the num_distinct_value value of the column s_store_id of the table store is updated to 5.

    column_name num_distinct_value min max num_nulls avg_col_len max_col_len
    s_store_sk 10 1 100 0 4 4
    s_store_id 5 -1 -1 -1 4 4
wangbo commented 3 years ago

I have some questions about detailed design: 1 Currently, how much column statistics we need to collect in total? In addition to the column listed above. 2 Does the current design have good scalability? For example, I want to add a new column statistic, named avg_char_len, how to make it a lightweight operation. It seems that we need to do a schema change in current design.

EmmyMiao87 commented 3 years ago

1 Currently, how much column statistics we need to collect in total? In addition to the column listed above.

In the first stage, we only collect these column statistics listed above. Is there anything you want to add?

2 Does the current design have good scalability? For example, I want to add a new column statistic, named avg_char_len, how to make it a lightweight operation. It seems that we need to do a schema change in current design.

Although the column statistical information listed here is in the form of a table structure, it is not actually a real table. So there is no need for schema change

EmmyMiao87 commented 3 years ago

抽样库 https://datasketches.apache.org/

waruto210 commented 2 years ago

I am interested in the CBO project in the OSPP,can anyone tell me where can I find the code that persist stats when manually inject stats, I only found update TableStats in memory.