Add indexes needed by common queries

vegetabill commented 4 years ago

This mostly only applies to postgres since we don't really care about this for local dev using sqlite.

For two commonly run queries, we probably want to add indexes to the DB to make them more performant:

/user/answers/question/:questionNumber looks up the most recent 3 answers by user and question number, so it likely needs an ordered index on created_at along with user id and question number
/user/progress will need to know the most recent answer, regardless of question number which would be similar to the above index but without question number

Use EXPLAIN to make sure those queries are optimized and will use those indexes.

Sequelize indexes doc

Hausekey commented 4 years ago

Hi there, I'd like to tackle this issue.

alodahl commented 4 years ago

Thanks @Hausekey ! Please go through our contributing.md, then you can begin.

Hausekey commented 4 years ago

Hi Alina and Bill,

I've followed through the contributing.md, and have been familiarizing myself with this project; I had a question about how to test the code with Postgres? I believe I add the indexes code to orm.js, and I'd like to run some EXPLAIN queries, and report my findings. I see a dev-db.sqlite database, but haven't found any postgres ones. It's my first time working with Postgres 😅 In the meantime, I can play around with the sqlite database.

vegetabill commented 4 years ago

@Hausekey if you set DB_URL=postgres://localhost/your-db-name in the .env file or as an environment variable the app should connect to postgres and then you can proceed. HTH!

Hausekey commented 4 years ago

Thanks @vegetabill , I got it to work! I found out how to generate dummy data for both users and user_answers, so I can run some EXPLAIN and do some benchmarks.

When you wrote "so it likely needs an ordered index on created_at along with user id and question number", were you thinking of creating multi-column indexes or would single-column suffice?

vegetabill commented 4 years ago

Yes @Hausekey I was thinking of a multi-column index. Intuitively, you'd think the DB would just combined indexes but IIRC it needs one for all of the WHERE or ORDER BY columns in order to make sure it uses an index fully. Obviously, our data isn't so big but it's more so we get snappy performance on the free DB tier. I think each important, commonly run query needs its own index in our case since the where clauses are different.

I think Sequelize should facilitate multi-column indexes but if you have any questions, lmk!

Hausekey commented 4 years ago

I see, to get the best performance on obtaining most recent progress, multi-column indexes are the way to go.

Below are some findings from the Postgres database that I wanted to report. I currently have two indexes on user_answers: created_at,user_id,question_number and created_at,user_id

To ensure that the indexes will be be used, what I've done was generate about 116 random users, and 198,000 rows of user answers, varying in different times and question numbers, thanks to the open source generatedata script. I picked out a random user_id and question number (e.g scudrural and 14)

To get the most recent 3 answers by user and question number, my query was:

EXPLAIN ANALYZE SELECT * FROM "user_answers" WHERE 
user_id = 'scudrural' AND question_number='14' ORDER BY created_at DESC limit 3;

Output:

Limit (cost=0.42..371.52 rows=3 width=34) (actual time=0.072..0.131 rows=3 loops=1) -> Index Scan Backward using user_answers_created_at_user_id_question_number on user_answers (cost=0.42..6803.88 rows=55 width=34) (actual time=0.071..0.129 rows=3 loops=1) " Index Cond: (((user_id)::text = 'scudrural'::text) AND (question_number = 14))" Planning Time: 0.095 ms Execution Time: 0.144 ms

For most recent answer, the query was:

EXPLAIN ANALYZE SELECT * FROM "user_answers" WHERE 
user_id = 'scudrural' ORDER BY created_at DESC limit 1;

Output:

-> Index Scan Backward using user_answers_created_at_user_id on user_answers (cost=0.42..10396.70 rows=1713 width=34) (actual time=0.029..0.029 rows=1 loops=1) Limit (cost=0.42..6.49 rows=1 width=34) (actual time=0.029..0.030 rows=1 loops=1) " Index Cond: ((user_id)::text = 'scudrural'::text)" Planning Time: 0.128 ms Execution Time: 0.047 ms

For comparison, output without the indexes, rerunning the 2nd query:

EXPLAIN ANALYZE SELECT * FROM "user_answers" WHERE 
user_id = 'scudrural' ORDER BY created_at DESC limit 1;

Execution Time: 27.244 ms Planning Time: 0.072 ms Rows Removed by Filter: 198291 " Filter: ((user_id)::text = 'scudrural'::text)" -> Seq Scan on user_answers (cost=0.00..4496.64 rows=1713 width=34) (actual time=0.029..26.625 rows=1601 loops=1) Sort Method: top-N heapsort Memory: 25kB Sort Key: created_at DESC -> Sort (cost=4505.20..4509.49 rows=1713 width=34) (actual time=27.226..27.227 rows=1 loops=1) Limit (cost=4505.20..4505.20 rows=1 width=34) (actual time=27.227..27.229 rows=1 loops=1)

I'm still learning how to read the output of the EXPLAIN ANALYZE query, but it looks like it's using the indexes! Please let me know if there are questions or comments. When it looks good, I'll go ahead and make a pull request.

vegetabill commented 4 years ago

That looks great @Hausekey !

Please go ahead and make a PR when you get a chance.

Techtonica / keyboard-shortcuts-practice

Add indexes needed by common queries #136