DataSystemsLab / recdb-postgresql

RecDB is a recommendation engine built entirely inside PostgreSQL
357 stars 43 forks source link

Should the DB columns be indexed? I noticed none of the example code was. #7

Closed Skylion007 closed 8 years ago

Skylion007 commented 8 years ago

So, I am trying to experiment on using RecDB to deal with a MASSIVE database with about 170 million entries and have had some rather slow query times, as you can imagine. I was wondering if indexing userID or itemID columns of my database would provide any noticeable speed advantages since none of the examples scripts appeared to create indexes on the DB. Also for a DB of a 170 million reviews what would you expect the runtime for creating the APRIORI recommender? Hours? Days? Weeks?

Any information would be extremely helpful.

Sarwat commented 8 years ago

Have you tried creating a recommender apriori. This will reduce the response time tremendously. Creating a recommender on a 1 million ratings takes about ~200 seconds on a machine with 3.6 Ghz Quad-Core processor, 16 GB RAM, 500 GB storage, and running Ubuntu Linux 12.04. Make sure to run the experiments on a machine with large memory for 170 million ratings. Otherwise, PostgreSQL will run out of memory.

Skylion007 commented 8 years ago

I see. Are you aware of which settings in PostgreSQL I can use to optimize my DB for the certain eventuality where we will have to rely heavily on PostgreSQL tmp files? One of the real attractions of RecDB is the ability to complete queries, all be it rather slowly, by writing data to disk meaning the massive amounts RAM would be unnecessary for non-time sensitive recommendation queries. Maybe I'll try to create the recommenders APRIORI using a high ram EC2 instance and then copy the DB to run the recommenders locally. Any other suggestions? Thank you so much for your informative and speedy reply.

Sarwat commented 8 years ago

You need to set up the PostgreSQL buffer size to be as large as possible. Once the recommender is created, the recommendation queries are not supposed to take too much memory or time to run. Plus, you can totally pre-compute the recommendation queries answer and store them in a different table for your application to access them directly (and faster).