dolthub / dolt

Dolt – Git for Data
Apache License 2.0

[statspro] Bootstrap database statistics once on startup #8036

Closed max-hoffman closed 3 months ago

max-hoffman commented 3 months ago

Load database statistics once on SQL engine startup. If auto refresh is enabled, the bootstrap is skipped. Bootstrapping is on by default and can be turned off:

    dolt sql -q "set @@PERSIST.dolt_stats_bootstrap_enabled = 0;"

(Running the command above against non-empty tables will still bootstrap statistics once.)
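
For illustration, the startup rule described above can be sketched as a small predicate. This is a minimal sketch, not Dolt's actual code: the function name `should_bootstrap` and its parameters are hypothetical, and the defaults only mirror what this description states (bootstrap on by default, skipped when auto refresh is enabled).

```python
def should_bootstrap(bootstrap_enabled: bool = True,
                     auto_refresh_enabled: bool = False) -> bool:
    """Return True if statistics should be loaded once at engine startup.

    Hypothetical helper: models only the rule stated in this PR description.
    """
    # Auto refresh takes over statistics maintenance, so no bootstrap is needed.
    if auto_refresh_enabled:
        return False
    # Otherwise bootstrap unless the user has switched it off.
    return bootstrap_enabled


if __name__ == "__main__":
    print(should_bootstrap())                           # default settings
    print(should_bootstrap(auto_refresh_enabled=True))  # auto refresh on
    print(should_bootstrap(bootstrap_enabled=False))    # bootstrap disabled
```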

This includes a small change to the way we encode column types for stats. We previously split on a comma (","), but enums and other types can contain commas, so we now use a line break ("\n"). Stats written by older versions will fail to load with the newer version.
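
To show why the separator matters, here is a small sketch of the round trip with each delimiter. The column type list and the `encode`/`decode` helpers are made up for illustration and are not Dolt's serialization code; the point is only that a type string containing a comma is split apart by a comma delimiter but survives a line-break delimiter.

```python
# Hypothetical column types; the enum definition itself contains commas.
column_types = ["int", "enum('a','b,c')", "varchar(20)"]


def encode(types, sep):
    """Join the type strings into one stored value using the given separator."""
    return sep.join(types)


def decode(encoded, sep):
    """Split a stored value back into individual type strings."""
    return encoded.split(sep)


# Comma separator: the enum is broken into several bogus "types".
comma_round_trip = decode(encode(column_types, ","), ",")

# Line-break separator: each type comes back intact.
newline_round_trip = decode(encode(column_types, "\n"), "\n")

if __name__ == "__main__":
    print(comma_round_trip)
    print(newline_round_trip)
```

The same mismatch suggests why stats written with the old encoding cannot simply be read by the new one: the stored value would be split on the wrong delimiter.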

coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
a362bdc ok 5937457
version total_tests
a362bdc 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
2fb4d54 ok 5937457
version total_tests
2fb4d54 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
b9910d3 ok 5937457
version total_tests
b9910d3 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
2849138 ok 5937457
version total_tests
2849138 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
4cde46a ok 5937457
version total_tests
4cde46a 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
2a0f836 ok 5937457
version total_tests
2a0f836 5937457
correctness_percentage
100.0
max-hoffman commented 3 months ago

I benchmarked the startup cost for this, and it seems like a similar penalty to rebuilding a journal index.

Testing on a 50-million-row database, startup takes 1.5 minutes without stats, about 3 minutes with previously bootstrapped stats, and 20 minutes on the first bootstrap.

coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
298b5f9 ok 5937457
version total_tests
298b5f9 5937457
correctness_percentage
100.0
coffeegoddd commented 3 months ago

@max-hoffman DOLT

comparing_percentages
100.000000 to 100.000000
version result total
486897d ok 5937457
version total_tests
486897d 5937457
correctness_percentage
100.0