XtremeCurling / nextbus2pg

Save nextbus real-time vehicle location data in a postgres database
MIT License

Indexing all columns in `vehicle_location` takes up too much space #5

Open · XtremeCurling opened this issue 6 years ago

XtremeCurling commented 6 years ago

Currently, on the `prepare_data_for_analysis` branch, I index all 5 columns (those remaining after resolving #4) in `vehicle_location`.
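
For reference, the indexes in question look something like the sketch below. The column names here are placeholders (the actual 5 columns are whatever remains after #4), but the shape is the same: one single-column B-tree index per column.

```sql
-- Hypothetical sketch: one B-tree index per remaining column.
-- Column names are placeholders; substitute the actual 5 columns.
CREATE INDEX ON vehicle_location (vehicle_id);
CREATE INDEX ON vehicle_location (route_tag);
CREATE INDEX ON vehicle_location (direction_tag);
CREATE INDEX ON vehicle_location (report_time);
CREATE INDEX ON vehicle_location (heading);
```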

However, these 5 indexes together end up taking more storage space than the table itself. That is problematic in general, and especially so here: with the data I've archived, the unindexed table alone already occupies more than 50% of my storage allocation on the AWS Free Tier.
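
For anyone reproducing this, the table-vs.-index footprint can be checked directly with standard Postgres system functions (nothing project-specific here):

```sql
-- Total size of the heap vs. all of its indexes combined.
SELECT pg_size_pretty(pg_table_size('vehicle_location'))   AS table_size,
       pg_size_pretty(pg_indexes_size('vehicle_location')) AS total_index_size;

-- Per-index breakdown, largest first.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'vehicle_location'
ORDER BY pg_relation_size(indexrelid) DESC;
```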

So, I need to look into dropping some of these indexes. In particular, I need to write sample queries representative of the most common actions I expect to perform during the analysis phase, and test their performance under a number of different "index regimes" (i.e. combinations of indexes being present vs. absent).

I have a separate RDS instance (one that exceeds the Free Tier limits) with all 5 indexes already created. On it, I can selectively ignore subsets of these indexes and test performance under each regime, then decide which combination of indexes gives the best "bang for its buck" in query performance for the kinds of queries I'll be running during the analysis phase (one way to do this is sketched below).
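
One way to "ignore" an index per-test without permanently losing it is to drop it inside a transaction and roll back, since DDL is transactional in Postgres. A sketch (the index name and predicate are hypothetical); note that `DROP INDEX` holds an exclusive lock on the table for the duration of the transaction, which is fine on a dedicated test instance:

```sql
BEGIN;

-- Hypothetical index name; use the real one from pg_indexes.
DROP INDEX vehicle_location_route_tag_idx;

-- Run the candidate analysis query and record timing/plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM vehicle_location
WHERE route_tag = '5';          -- placeholder predicate

-- Undo the drop: the index is back, no rebuild needed.
ROLLBACK;
```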

XtremeCurling commented 6 years ago

Note: when running complex queries, Common Table Expressions (CTEs) (i.e. `WITH` clauses in Postgres) use up much more disk space than equivalent simple subqueries. This is most likely because, in PostgreSQL 11 and earlier, a `WITH` query acts as an optimization fence: its result set is always materialized (spilling to temporary files on disk once it exceeds `work_mem`), whereas a subquery can be inlined into the outer query's plan.

I discovered this when running a query to obtain the average distance, time, and speed of vehicles by route-direction (e.g. "San Francisco Muni 5 Inbound"). Due to low remaining storage space, I can't run this query on the full dataset, so I've been using `LIMIT` to test how many rows I can include in the query before running out of storage.

With CTEs, on a DB instance with about 12 GB of free space remaining, that limit is about 15 million rows (out of 86 million total): the query just about exhausts those 12 GB of storage.
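
The space being "used up" here is almost certainly temporary-file spill from materializing the CTE. It can be observed with a standard statistics view (cumulative since the last stats reset); `EXPLAIN (ANALYZE, BUFFERS)` also reports per-node temp reads/writes:

```sql
-- Cumulative temp files / bytes written by queries in this database.
SELECT temp_files,
       pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
```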

However, when I rewrote the query with simple subqueries (i.e. `FROM ...` instead of `WITH ... AS`), much less storage was consumed: the same 15m-row query used only about 4 GB, i.e. 30-40% of what the CTE version used. Run time was similar (7m45s with CTEs vs. 8m0s with subqueries).
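
Structurally, the rewrite looks like this (heavily simplified; the columns and aggregate are placeholders, not the actual analysis query):

```sql
-- CTE version: in PostgreSQL 11 and earlier, "locs" is fully
-- materialized (to disk, at this scale) before the outer aggregate runs.
WITH locs AS (
    SELECT route_tag, direction_tag, speed_km_hr  -- placeholder columns
    FROM vehicle_location
    LIMIT 15000000
)
SELECT route_tag, direction_tag, avg(speed_km_hr)
FROM locs
GROUP BY route_tag, direction_tag;

-- Subquery version: same result, but the planner can stream rows
-- straight into the aggregate instead of materializing them first.
SELECT route_tag, direction_tag, avg(speed_km_hr)
FROM (
    SELECT route_tag, direction_tag, speed_km_hr
    FROM vehicle_location
    LIMIT 15000000
) AS locs
GROUP BY route_tag, direction_tag;
```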

So, while storage space is this limited, I should write my analysis queries with simple subqueries rather than CTEs. That's a little unfortunate, since CTEs are more readable.
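
Worth noting for the future: PostgreSQL 12 added a way to keep a CTE's readability without the forced materialization, so if/when the instance is upgraded, the CTE style should become viable again (same placeholder columns as above):

```sql
-- PostgreSQL 12+: explicitly ask the planner to inline the CTE.
WITH locs AS NOT MATERIALIZED (
    SELECT route_tag, direction_tag, speed_km_hr
    FROM vehicle_location
)
SELECT route_tag, direction_tag, avg(speed_km_hr)
FROM locs
GROUP BY route_tag, direction_tag;
```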