This PR creates CourseWikiTimeslice, ArticleCourseTimeslice and CourseUserWikiTimeslice models and migrations.
The new timeslice layer represents discrete units of time for which course statistics have been calculated. The current course stats model calculates absolute stats, i.e. stats for the entire duration of a course, from beginning to end. The new timeslice layer would allow us to calculate relative stats, i.e. stats scoped to the time range of a given timeslice. The three new timeslice models are based on the existing Course, ArticlesCourses, and CoursesUsers models.
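To illustrate the difference between absolute and relative stats, here is a minimal plain-Ruby sketch. It is not code from this PR: the Revision struct, its field names, the one-day TIMESLICE_DURATION and both helper methods are assumptions made for the example.

```ruby
# Hypothetical code-level constant: one day per timeslice, in seconds.
TIMESLICE_DURATION = 24 * 60 * 60

# Illustrative raw revision data; the field names are assumptions.
Revision = Struct.new(:date, :characters)

# Returns the start of the fixed-length timeslice that contains `time`,
# counting slices from `course_start`.
def timeslice_start(course_start, time)
  index = ((time - course_start) / TIMESLICE_DURATION).floor
  course_start + index * TIMESLICE_DURATION
end

# Relative stats: character sums keyed by timeslice start.
def character_sum_by_timeslice(course_start, revisions)
  revisions.each_with_object(Hash.new(0)) do |rev, sums|
    sums[timeslice_start(course_start, rev.date)] += rev.characters
  end
end

course_start = Time.utc(2024, 1, 1)
revisions = [
  Revision.new(Time.utc(2024, 1, 1, 10), 120),
  Revision.new(Time.utc(2024, 1, 1, 23), 30),
  Revision.new(Time.utc(2024, 1, 3, 8), 50)
]

per_slice = character_sum_by_timeslice(course_start, revisions)
# The absolute stat is recovered by summing the per-timeslice values.
absolute = per_slice.values.sum
```

Additive stats such as character_sum can be rebuilt from timeslices this way, which is why they appear under "Records in" in the sections that follow.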
CourseWikiTimeslice
Records in:
course_wiki_id
start: a timestamp indicating the beginning of the timeslice. Likely an invariant field.
end: a timestamp indicating the end of the timeslice. It could be partial if the timeslice record gets updated frequently. Once end is set to start + timeslice_duration, it will likely remain unchanged (unless the timeslice is empty, i.e. no revisions or uploads were created during its time range, and we want to reuse that record to store data from the next timeslot).
last_mw_rev_id: last revision id ingested. Not trivial to determine because CourseUserTimeslice model is not aggregated by wiki.
character_sum: currently calculated from course user records, which is a problem because here we want to aggregate by wiki and we don’t have course user data aggregated by wiki. I think it could be calculated from raw revision data instead.
references_count: currently calculated from course user records, which is a problem because here we want to aggregate by wiki and we don’t have course user data aggregated by wiki. I think it could be calculated from raw revision data instead.
revision_count: calculated from raw revision data ✅
upload_count: calculated from raw uploads data ✅
uploads_in_use_count: calculated from raw uploads data ✅
upload_usages_count: calculated from raw uploads data ✅
Records out:
view_sum: calculated from the view_count article course attribute, which we don’t want to include as a timeslice attribute. It should remain as a course attribute.
user_count: it won’t sum up as a useful number. It should remain as a course attribute.
trained_count: represents the count of students who don’t have overdue assigned training modules. It’s not cumulative, so it won’t sum up to a useful number. It should remain as a course attribute.
recent_revision_count: it counts revisions from the last 7 days. It doesn’t make sense in a timeslice context. It should remain as a course user attribute, and it could be calculated from the revision_count timeslice attribute.
article_count: it won’t sum up as a useful number. It should remain as a course attribute.
new_article_count: calculated from article course data. I don’t think it makes sense as a timeslice attribute. It should remain as a course attribute.
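As mentioned for recent_revision_count, a rolling 7-day count could be derived from the revision_count timeslice attribute. The following is a sketch under assumptions, not the PR's implementation: the Timeslice struct and the recent_revision_count helper are hypothetical, and timeslices are assumed to be one day long.

```ruby
# Illustrative stand-in for a timeslice record; only the fields used here.
Timeslice = Struct.new(:start, :end, :revision_count)

SECONDS_PER_DAY = 24 * 60 * 60

# Sums revision_count over timeslices that start within the last `days` days.
# The result is only as precise as the timeslice boundaries allow.
def recent_revision_count(timeslices, now:, days: 7)
  cutoff = now - days * SECONDS_PER_DAY
  timeslices.select { |ts| ts.start >= cutoff && ts.start < now }
            .sum(&:revision_count)
end

first_day = Time.utc(2024, 1, 1)
counts = [5, 2, 0, 1, 4, 0, 3, 0, 2] # one entry per day, Jan 1 to Jan 9
timeslices = counts.each_with_index.map do |count, i|
  start = first_day + i * SECONDS_PER_DAY
  Timeslice.new(start, start + SECONDS_PER_DAY, count)
end

recent = recent_revision_count(timeslices, now: Time.utc(2024, 1, 10))
total = timeslices.sum(&:revision_count)
```

Here only the timeslices starting on or after January 3 fall inside the 7-day window, so the recent count is smaller than the total.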
ArticleCourseTimeslice
Records in:
article_course_id: if we consider the idea of removing ArticleCourse model, then having article_id and course_id as individual properties could be better.
start: a timestamp indicating the beginning of the timeslice. Likely an invariant field.
end: a timestamp indicating the end of the timeslice. It could be partial if the timeslice record gets updated frequently. Once end is set to start + timeslice_duration, it will likely remain unchanged (unless the timeslice is empty, i.e. no revisions or uploads were created during its time range, and we want to reuse that record to store data from the next timeslot).
last_mw_rev_id: last revision id ingested. Useful in this case because ArticleCourseTimeslice is implicitly aggregated by wiki.
character_sum: calculated from raw revision data ✅
references_count: calculated from raw revision data ✅
user_ids: calculated from raw revision data ✅
Records out:
view_count: calculated from the average_views article attribute. average_views is the daily average of views over the last 50 days (retrieved from the Wikimedia pageviews API). view_count is estimated by multiplying average_views by the number of days from the first revision to that article until today. I don’t think it makes sense to keep this field as a timeslice attribute. It should remain as an article course attribute (we may need to store the first revision date for the article course somewhere).
new_article: It’s an invariant field. It should remain as an article course attribute.
Notes: ArticleCourseTimeslice is implicitly aggregated by wiki, as an article belongs to a single wiki.
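The view_count estimate described above (average_views multiplied by the number of days since the first revision) can be sketched as follows. The method name, its keyword arguments, and the choice to count whole days and round the result are assumptions for illustration; the existing implementation may differ in those details.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Estimates total views as daily average views times the number of
# whole days elapsed since the article's first revision.
def estimated_view_count(average_views:, first_revision:, today:)
  days = ((today - first_revision) / SECONDS_PER_DAY).floor
  (average_views * days).round
end

views = estimated_view_count(
  average_views: 12.5,
  first_revision: Time.utc(2024, 1, 1),
  today: Time.utc(2024, 1, 21)
)
```

Because the estimate depends on today's date rather than on revisions within a given range, it does not split naturally across timeslices.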
CourseUserWikiTimeslice
Records in:
course_user_id: if we consider the idea of removing CourseUser model, then having user_id and course_id as individual properties could be better.
wiki_id
start: a timestamp indicating the beginning of the timeslice. Likely an invariant field.
end: a timestamp indicating the end of the timeslice. It could be partial if the timeslice record gets updated frequently. Once end is set to start + timeslice_duration, it will likely remain unchanged (unless the timeslice is empty, i.e. no revisions or uploads were created during its time range, and we want to reuse that record to store data from the next timeslot).
last_mw_rev_id: last revision id ingested. Useful in this case because CourseUserTimeslice is aggregated by wiki.
total_uploads: calculated from raw uploads data ✅
character_sum: calculated from raw revision data ✅
character_sum_us: calculated from raw revision data ✅
character_sum_draft: calculated from raw revision data ✅
references_count: calculated from raw revision data ✅
revision_count: calculated from raw revision data ✅
Records out:
recent_revisions: it counts revisions from the last 7 days. It doesn’t make sense in a timeslice context. It should remain as a course user attribute, and it could be calculated from the revision_count timeslice attribute.
assigned_article_title: It looks like a course user could be assigned more than one article, but we only keep the first here. It’s not clear to me how this changes over time, but it’s probably not worth having as a timeslice field, as it doesn’t depend on revisions/uploads. It should remain as a course user attribute.
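As a rough illustration of how the CourseUserWikiTimeslice fields could be filled from raw revision data, the sketch below groups revisions by user, wiki and timeslice start. It is plain Ruby rather than the PR's models; the Revision struct, the one-day TIMESLICE_DURATION and the aggregate_by_user_and_wiki helper are hypothetical.

```ruby
# Hypothetical code-level constant: one day per timeslice, in seconds.
TIMESLICE_DURATION = 24 * 60 * 60

# Illustrative raw revision data; the field names are assumptions.
Revision = Struct.new(:user_id, :wiki_id, :date, :characters)

# Groups revisions by [user_id, wiki_id, timeslice start] and computes
# revision_count and character_sum for each group.
def aggregate_by_user_and_wiki(course_start, revisions)
  groups = revisions.group_by do |rev|
    index = ((rev.date - course_start) / TIMESLICE_DURATION).floor
    [rev.user_id, rev.wiki_id, course_start + index * TIMESLICE_DURATION]
  end
  groups.transform_values do |revs|
    { revision_count: revs.size, character_sum: revs.sum(&:characters) }
  end
end

course_start = Time.utc(2024, 1, 1)
revisions = [
  Revision.new(1, 1, Time.utc(2024, 1, 1, 9), 100),
  Revision.new(1, 1, Time.utc(2024, 1, 1, 18), 50),
  Revision.new(1, 2, Time.utc(2024, 1, 1, 12), 20), # same user, other wiki
  Revision.new(2, 1, Time.utc(2024, 1, 2, 7), 75)
]

rows = aggregate_by_user_and_wiki(course_start, revisions)
```

Each key of the resulting hash corresponds to one would-be CourseUserWikiTimeslice record, so the same user editing two wikis on the same day yields two rows.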
Open questions and concerns
CourseWikiTimeslice has a single course_wiki_id from CoursesWikis model (instead of using course_id and wiki_id fields).
Check singular/plural name conventions on ArticleCourseTimeslice and article_course_id
Check singular/plural name conventions on CourseUserTimeslice and course_user_id
CourseUserWikiTimeslice is aggregated by wiki through a wiki_id field.
Foreign keys?
Having a timeslice_duration field is considered unnecessary because (at least for now) we want a single timeslice duration value set at code level, and we don’t expect the timeslice duration to change.
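A code-level duration, as opposed to a per-record field, could look like the sketch below. The constant name, its one-day value and the timeslice_ranges helper are assumptions for illustration only.

```ruby
# Hypothetical single duration shared by all timeslices, in seconds.
TIMESLICE_DURATION = 24 * 60 * 60

# Lists the [start, end] pairs that cover a course, where every
# end is start + TIMESLICE_DURATION.
def timeslice_ranges(course_start, course_end)
  ranges = []
  start = course_start
  while start < course_end
    ranges << [start, start + TIMESLICE_DURATION]
    start += TIMESLICE_DURATION
  end
  ranges
end

ranges = timeslice_ranges(Time.utc(2024, 1, 1), Time.utc(2024, 1, 3, 12))
```

With a fixed duration, end can always be derived from start, so storing the duration on each record would add no information.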