Current Behavior:
At the moment, we store all the raw revision data in our Revisions table and query historical revision data in different parts of the app, particularly to calculate course stats caches. These are some of the heaviest queries we currently run.
Desired Behavior:
We want to store stats for each course by time period (for example, week-by-week stats), and only use revision data during the update for a given time period, so that we can remove the Revisions table altogether. This will dramatically reduce the storage requirements of the system and remove one of the major database performance bottlenecks.
Tasks:
Non-exhaustive list of things to do:
[x] Prepare deployment
[x] Create timeslices models
[ ] Add indexes for models
[x] Implement cache updates for ArticleCourseWikiTimeslices and ArticlesCourses.
[x] Implement cache updates for CourseUserWikiTimeslices and CoursesUsers.
[ ] Implement cache updates for CourseWikiTimeslices and Courses.
[ ] Replace the logic in ArticlesCourses.update_from_course. Currently, this takes care of removing all the ArticlesCourses that do not correspond to course revisions (due to updates on the course dates or tracked wikis). It also adds new ArticlesCourses. This should be modified to calculate new ArticlesCourses based on revisions in RAM. Changes to course dates or tracked wikis should trigger recalculation of existing timeslices or calculation of new ones. Think about queueing tasks that recalculate timeslices when the dates/tracked wikis are updated (similar to UpdateCourseWorker.schedule_edits).
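As a rough illustration of the date-change case, here is a plain-Ruby sketch (the helper names are hypothetical, not existing Dashboard code) of deciding which weekly timeslices to create or delete when course dates move:

```ruby
require 'date'

# Hypothetical sketch: one timeslice per 7-day period, aligned to the
# course start date. Not actual Dashboard code.
def timeslice_starts(start_date, end_date)
  (start_date..end_date).step(7).to_a
end

# Given old and new course date ranges, work out which timeslice periods
# must be created and which are no longer covered and should be deleted.
def timeslice_changes(old_start, old_end, new_start, new_end)
  old_slices = timeslice_starts(old_start, old_end)
  new_slices = timeslice_starts(new_start, new_end)
  {
    to_create: new_slices - old_slices, # periods newly covered by the course
    to_delete: old_slices - new_slices  # periods the course no longer spans
  }
end
```

For example, extending a four-week course by two weeks yields two new timeslice periods to create and none to delete; that delta is what a queued recalculation worker would act on.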
[ ] Replace the logic in DuplicateArticleDeleter.resolve_duplicates. Instead of removing "limbo revisions", we should try to identify whether any timeslice needs to be recalculated.
[ ] Replace the logic derived from ArticleStatusManager.update_article_status_for_course(@course). This ends up removing "limbo revisions" through the ModifiedRevisionsManager class. Related to the point above.
[ ] Create a TimesliceManager class that takes care of the logic for creating new or updating existing timeslice records. Supposing the Revision data was imported and we have it in RAM, the criteria are as follows:
For every entity (course_wiki, course_user, article_course) that was updated, find the last existing timeslice record for it. If no timeslice is found or the timeslice is finished (end - start range is more than 1 day), create a new record for it with the new data. If the timeslice is still current, then update it.
For every entity not updated (retrieve all the entities for the course and calculate the difference against the updated ones), find the last existing timeslice for it. If no timeslice exists (for example, a new course user was added recently but hasn't made any edits yet), then create a new empty timeslice (from the beginning of the course to now). If a timeslice is found, then extend it (increase the end field).
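The criteria above can be sketched in plain Ruby. This is only an illustration of the proposed create-or-update decision, under the stated assumptions; `Timeslice` here is a hypothetical value object, not the actual ActiveRecord model, and the method names are invented:

```ruby
require 'date'

# Hypothetical stand-in for a timeslice record.
Timeslice = Struct.new(:start, :end_time, :data, keyword_init: true)

class TimesliceManager
  def initialize(course_start)
    @course_start = course_start
    @timeslices = Hash.new { |h, k| h[k] = [] } # entity key => timeslices
  end

  # Entity (course_wiki, course_user, article_course) that had new
  # revision data in this update.
  def record_update(entity, now, data)
    last = @timeslices[entity].last
    if last.nil? || finished?(last)
      # No timeslice yet, or the last one already spans a full period:
      # create a new record with the new data.
      @timeslices[entity] << Timeslice.new(start: now, end_time: now, data: data)
    else
      # Still current: update it in place.
      last.end_time = now
      last.data = data
    end
  end

  # Entity with no new revision data in this update.
  def record_no_update(entity, now)
    last = @timeslices[entity].last
    if last.nil?
      # e.g. a newly added course user with no edits yet: create an
      # empty timeslice from the beginning of the course to now.
      @timeslices[entity] << Timeslice.new(start: @course_start, end_time: now, data: {})
    else
      last.end_time = now # just extend the existing timeslice
    end
  end

  def timeslices_for(entity)
    @timeslices[entity]
  end

  private

  # Per the criteria above: a timeslice is "finished" once its
  # end - start range exceeds one day.
  def finished?(timeslice)
    (timeslice.end_time - timeslice.start) > 1
  end
end
```

The real implementation would persist through the timeslice models instead of an in-memory hash, but the branching should mirror this sketch.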
[ ] Replace all the places where revision data is used by hitting the API instead. Read this.
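For instance, per-period revision data can be fetched on demand rather than read from the Revisions table. The sketch below builds a MediaWiki Action API query for a user's contributions in a time window; the `usercontribs` list and its parameters are real API features, but this helper itself is hypothetical and the exact queries the Dashboard needs would differ per call site:

```ruby
require 'uri'

# Hypothetical helper: build a MediaWiki Action API URL that fetches a
# user's revisions within a time period, instead of querying stored
# Revision rows.
def contribs_query_url(wiki_base, user:, start_time:, end_time:)
  params = {
    action: 'query',
    list: 'usercontribs',
    ucuser: user,
    ucstart: start_time, # newer boundary (API returns newest first by default)
    ucend: end_time,     # older boundary
    uclimit: 'max',
    format: 'json'
  }
  "#{wiki_base}/w/api.php?#{URI.encode_www_form(params)}"
end

url = contribs_query_url('https://en.wikipedia.org',
                         user: 'ExampleUser',
                         start_time: '2024-01-08T00:00:00Z',
                         end_time: '2024-01-01T00:00:00Z')
```

The response would then feed a single timeslice update and be discarded, so nothing needs to be stored long-term.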