After running some course updates on the data-rearchitecture instance for a heavy course (Pharmacology_at_Bar_IlanUniversity-_2024), I noticed that updating the caches for every article course takes ~ 40 minutes. According to production, Pharmacology at Bar Ilan University - 2024 course has 63.4K articles courses. With that number of articles courses, I think it’s reasonable for the ArticlesCourses.update_all_caches_from_timeslices(@course.articles_courses) step to take a lot of time. While updating caches for a specific article course is not a heavy task, we need to retrieve all its timeslices, and with 63.4K articles courses, this means (at least) 63.4k queries.
In order to try to speed up the cache update for articles courses I'm trying the following strategy:
Instead of updating every article course cache at any single course update, I retrieve the articles that have at least one timeslice that was updated after the last course update. That means, we only update caches for articles courses that have any new revision. If no timeslice for that article was updated after the last course update, that means that there were no new revisions in the last revision ingestion, so we don't need to re-calculate the cache for that article. For heavy courses with thousands of articles, this should reduce the amount of updated caches considerably. Notice that while the course has many articles, most of the articles have nothing more than a handful of revisions associated with it, so most of the time there is no point in updating its cache.
A new index is added to the article_course_timeslices table to retrieve the articles that require a cache update quickly.
view_count is no longer calculated for articles courses when updating its caches. If this strategy speeds up the update process, then we can revisit how to calculate view_count.
Open questions and concerns
As I'm not entirely sure this is going to work, I commented some lines of code (didn't remove anything). After testing this in my instance with heavy courses, I'll clean the code.
What this PR does
After running some course updates on the data-rearchitecture instance for a heavy course (Pharmacology_at_Bar_IlanUniversity-_2024), I noticed that updating the caches for every article course takes ~ 40 minutes. According to production, Pharmacology at Bar Ilan University - 2024 course has 63.4K articles courses. With that number of articles courses, I think it’s reasonable for the
ArticlesCourses.update_all_caches_from_timeslices(@course.articles_courses)
step to take a lot of time. While updating caches for a specific article course is not a heavy task, we need to retrieve all its timeslices, and with 63.4K articles courses, this means (at least) 63.4k queries.In order to try to speed up the cache update for articles courses I'm trying the following strategy:
view_count
is no longer calculated for articles courses when updating its caches. If this strategy speeds up the update process, then we can revisit how to calculateview_count
.Open questions and concerns
As I'm not entirely sure this is going to work, I commented some lines of code (didn't remove anything). After testing this in my instance with heavy courses, I'll clean the code.