cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.03k stars 623 forks source link

Modified pg_column_stats initialization #1352

Open poojanilangekar opened 6 years ago

poojanilangekar commented 6 years ago

This PR modifies the ColumnStatsCatalog constructor to use a predefined schema instead of using DDL. Additionally, it creates the pg_column_stats table to a per database basis. Initially this was not done because of the dependencies of the StatsStorage on the view of "Global Stats".

Additionally, it changes the AnalyzeStatsForAllTables to AnalyzeStatsForAllTablesWithDatabaseOid. Now that we maintain ColumnStats on a per database basis, it makes sense to also collect the stats on a per database basis.

apavlo commented 6 years ago

@poojanilangekar Nice!

poojanilangekar commented 6 years ago

@mengranwo @GustavoAngulo We need to merge this in ASAP as its blocking the performance benchmarks. Please take a look at it when you get the time.

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.01%) to 77.565% when pulling 7dfbc59fe4c79f5ebb32efd746313f8e09ba1f5d on poojanilangekar:column_stats_fix into 5686479c5f33087031e1b68b7832245c7886b712 on cmu-db:master.

GustavoAngulo commented 6 years ago

I think it might be good to keep AnalyzeStatsForAllTables for now, my group in the class this semester noticed that we don't support the default ANALYZE behavior like Postgres does, where not specifying a table analyzes all tables. We might want to support this in the future, so that function/logic might come in handy.

As for the the schema, I'm not sure which inconsistency you're referring to @poojanilangekar ? @chenboy do you see an inconsistency?

poojanilangekar commented 6 years ago

@GustavoAngulo I have modified the function to AnalyzeStatsForAllTablesWithDatabaseOid. I checked the Postgres documentation, the ANALYZE command is run on a per database basis. So I think this should solve the problem.

Regarding the inconsistency, the declaration creates two SKEY indexes and the table has no primary key. While the header file states that the table should contain only one index with the primary key.

chenboy commented 6 years ago

The ColumnStatsCatalog use both indexes.

https://github.com/cmu-db/peloton/blob/d68ab719fdd9499f23ac5887f907ed07ce4d2644/src/catalog/column_stats_catalog.cpp#L135

https://github.com/cmu-db/peloton/blob/d68ab719fdd9499f23ac5887f907ed07ce4d2644/src/catalog/column_stats_catalog.cpp#L186

So I think we should stick to the declaration, not the header.

poojanilangekar commented 6 years ago

@GustavoAngulo @camellyx Can you please review these changes?

tli2 commented 6 years ago

@pervazea Can you take a look since these other people are out of town?

apavlo commented 6 years ago

This is an important PR that we should merge. We don't want to lose track of this.