Open ragesoss opened 3 years ago
Hey @ragesoss, of the approaches you mentioned, I think the 3rd one is the best, but given the latency issue, the 2nd one is also workable. I have some questions about the 2nd approach:
average_page_views
)average_page_views
? Once daily? Also, does the pageviews API not store data for sandbox articles, or for articles that are not in mainspace?
The pageviews API does provide data for non-mainspace pages (for example, https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=User:Ragesoss/sandbox), but I think we don't want to store any data for non-mainspace pages.
average_page_views and earliest_edit. (This is a pretty big issue!)
Also, I didn't follow what you meant when you said:
The Dashboard calculates its average daily views based on data since it hit mainspace
I thought average views were calculated from the last 50 days of views, so at some point the calculation must have included views from when the article was still in a sandbox. https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/9f531b858559804173136eb8254ad47775ee09d3/lib/wiki_pageviews.rb#L41-L57
What I wrote in the description is not very clear, but the API only returns data for non-zero entries. Here's a better link to illustrate what I meant: https://pageviews.wmcloud.org/?project=sv.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=Jugend_i_Sverige
You can see that pageview data only starts from the day it was moved to mainspace.
I believe this is because the pageview data is based on URLs rather than Page records. So if you wanted pageview data from the sandbox period, you'd need to query for the name it had while it was still in sandbox.
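To illustrate the point about URLs, here is a minimal sketch of building a per-article query against the public Wikimedia Pageviews REST API. The method name `pageviews_url` and the choice of `all-access` / `user` are assumptions for this example, not the Dashboard's actual code; the title passed in has to be whatever name the page had during the period you want data for.

```ruby
require 'erb'
require 'date'

# Hypothetical helper: builds a per-article, daily pageviews query URL.
# `title` must be the exact page name for the period of interest,
# including any namespace prefix (e.g. a user sandbox path).
def pageviews_url(project, title, start_date, end_date)
  base = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article'
  # Slashes and colons in the title must be percent-encoded.
  encoded_title = ERB::Util.url_encode(title.tr(' ', '_'))
  range = [start_date, end_date].map { |d| d.strftime('%Y%m%d00') }.join('/')
  "#{base}/#{project}/all-access/user/#{encoded_title}/daily/#{range}"
end

puts pageviews_url('en.wikipedia', 'User:Ragesoss/sandbox',
                   Date.new(2020, 10, 1), Date.new(2021, 2, 1))
```

Querying the mainspace title and the sandbox title gives two separate series, which is why data for the mainspace title only starts on the move date.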
OK, then there will always be issues with articles that come from sandboxes. So, don't you think we should store the time when an article gets moved to mainspace?
That's... a good point I hadn't thought of. I guess storing some kind of start date will be necessary.
Yup. So, is there a place where we update the namespace of the article when it leaves the sandbox? Or somewhere else this start date could be found?
No, but I could incorporate that into the bigger project I'm working on to monitor sandbox progress for assigned topics. It might be best to leave this issue until then.
What is happening?
See https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&oldid=21178355#Pageviews_compared_to_pageviews
The affected course worked on this article: https://sv.wikipedia.org/wiki/Jugend_i_Sverige
It was created in a sandbox in October 2020, but wasn't moved to mainspace until February 2021. The Dashboard calculates its average daily views based on data since it hit mainspace, but incorrectly extrapolates that back to October for showing the total pageviews. https://pageviews.toolforge.org/?project=sv.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-90&pages=Jugend_i_Sverige
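A rough worked example of the overestimate, using made-up numbers (the 20 views/day average and the exact dates are assumptions for illustration, not figures from the affected course):

```ruby
require 'date'

# Estimated total = daily average * number of days in the window.
def estimated_views(average_views, from, to)
  (average_views * (to - from).to_i).round
end

average_views  = 20.0                  # hypothetical average since mainspace
first_edit     = Date.new(2020, 10, 1) # hypothetical first sandbox edit
mainspace_date = Date.new(2021, 2, 1)  # hypothetical move to mainspace
today          = Date.new(2021, 3, 3)

# Extrapolating back to the first edit covers 153 days.
puts estimated_views(average_views, first_edit, today)     # 3060
# Counting only the days the article was in mainspace covers 30 days.
puts estimated_views(average_views, mainspace_date, today) # 600
```

With these numbers the extrapolated total is about five times what the article could actually have received in mainspace.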
Expected behavior
For a course like that one, the Dashboard should show pageview data that is very close to the exact actual pageviews, without dramatic increases due to bad extrapolation.
Additional context
We switched to estimated pageviews, extrapolated from an article's average pageviews, because the old method of fetching complete pageview stats and continually updating them wasn't sustainable. (These stats were previously stored and updated on Revision records, and updating them all would take days or weeks and was failure-prone.)
One possible approach to solving this would be to add an earliest_views column to Article, calculate average views for each article based on data stretching back to the earliest available pageview data, and take that earliest views data into account when calculating the pageviews for an ArticlesCourses record (using the latest of either first edit or earliest views for determining the view count).
Another possible approach would be to store averages on a per-ArticlesCourses basis rather than a per-Article basis, so that each ArticlesCourses record stores an average pageviews value based on data going back to the date of the first course edit.
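A minimal sketch of the earliest_views idea, assuming earliest_views is stored as a date and may be missing for some articles. The method and keyword names here are hypothetical, not existing Dashboard code:

```ruby
require 'date'

# Counts views from whichever is later: the first course edit or the
# first day with pageview data (earliest_views may be nil if unknown).
def estimated_course_views(average_views:, earliest_views:, first_course_edit:, as_of:)
  start = [first_course_edit, earliest_views].compact.max
  days = (as_of - start).to_i
  return 0 unless days.positive?

  (average_views * days).round
end

# Article edited in a sandbox from October, in mainspace from February:
puts estimated_course_views(average_views: 20.0,
                            earliest_views: Date.new(2021, 2, 1),
                            first_course_edit: Date.new(2020, 10, 1),
                            as_of: Date.new(2021, 3, 3))
```

For an article that was already in mainspace before the course started, the first course edit is the later date, so the estimate stays the same as it is now.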
Another possible approach would be to directly calculate and store the cumulative pageviews for each ArticlesCourses record as part of the course update process (doing so, at most, once per day). This would be the most accurate, but might introduce a lot of additional latency to the update queues.
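A rough sketch of the cumulative option, assuming the daily data has already been fetched into a hash of date to view count. Both method names and the hash shape are hypothetical:

```ruby
require 'date'

# Sums actual daily views on or after the first course edit.
# Days with no entry (such as the sandbox period) contribute nothing.
def cumulative_views(daily_views, first_course_edit)
  daily_views.sum { |date, views| date >= first_course_edit ? views : 0 }
end

# Guard so a record is recalculated at most once per calendar day.
def refresh_due?(last_refreshed_on, today)
  last_refreshed_on.nil? || last_refreshed_on < today
end

daily_views = {
  Date.new(2021, 2, 1) => 12,
  Date.new(2021, 2, 2) => 8,
  Date.new(2021, 2, 3) => 5
}
puts cumulative_views(daily_views, Date.new(2020, 10, 1)) # 25
```

The sum itself is cheap; the extra latency would come from fetching the daily data for every article in a course on each refresh.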