WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License

Pageview estimates may be off by a lot for newly-created articles that start in sandboxes #4370

Open ragesoss opened 3 years ago

ragesoss commented 3 years ago

What is happening?

See https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&oldid=21178355#Pageviews_compared_to_pageviews

The affected course worked on this article: https://sv.wikipedia.org/wiki/Jugend_i_Sverige

It was created in a sandbox in October 2020, but wasn't moved to mainspace until February 2021. The Dashboard calculates its average daily views based on data since it hit mainspace, but incorrectly extrapolates that back to October for showing the total pageviews. https://pageviews.toolforge.org/?project=sv.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-90&pages=Jugend_i_Sverige
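To make the size of the error concrete, here is a minimal sketch of the arithmetic. The exact dates and the daily average are made up for illustration, and `extrapolated_views` is a hypothetical helper, not a Dashboard method.

```ruby
require 'date'

# Estimated total views: a daily average multiplied by the number of days
# in the period it is extrapolated over.
def extrapolated_views(average_views, from:, to:)
  days = (to - from).to_i
  (average_views * days).round
end

# Hypothetical dates and average, for illustration only.
first_edit      = Date.new(2020, 10, 15) # first course edit, still in a sandbox
mainspace_start = Date.new(2021, 2, 15)  # article moved to mainspace
today           = Date.new(2021, 3, 17)
average_views   = 40.0                   # daily average measured since the move

overestimate = extrapolated_views(average_views, from: first_edit, to: today)
closer       = extrapolated_views(average_views, from: mainspace_start, to: today)
puts "extrapolated to first edit: #{overestimate}, since mainspace: #{closer}"
```

With these made-up numbers the first figure covers 153 days and the second only 30, so the extrapolated total is roughly five times too high.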

Expected behavior

For a course like that one, the Dashboard should show pageview data that is very close to the exact actual pageviews, without dramatic increases due to bad extrapolation.

Additional context

We switched to estimated pageviews, extrapolated from an article's average pageviews, because the old method of fetching complete pageview stats and continually updating them wasn't sustainable. (These stats were previously stored and updated on Revision records, and updating them all would take days or weeks and was failure-prone.)

One possible approach to solving this would be to add an earliest_views column to Article, calculate average views for each article based on data stretching back to the earliest available pageview data, and take that earliest views data into account when calculating the pageviews for an ArticlesCourses record (using the latest of either first edit or earliest views for determining the view count).
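A rough sketch of this first approach, assuming `earliest_views` holds the first date with any pageview data for the article. The method name and keyword arguments are hypothetical; in the Dashboard these values would come from the Article and ArticlesCourses records.

```ruby
require 'date'

# earliest_views is the hypothetical new Article column: the first date with
# any pageview data for the article's current title. It may be nil.
def estimated_view_count(average_views:, earliest_views:, first_edit:, today:)
  # Count views from whichever is later: the first course edit or the
  # earliest available pageview data.
  start = [first_edit, earliest_views].compact.max
  days = [(today - start).to_i, 0].max
  (average_views * days).round
end
```

For an article that spent months in a sandbox, `earliest_views` is later than the first edit, so the sandbox period no longer contributes to the estimate.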

Another possible approach would be to store averages on a per-ArticlesCourses basis rather than a per-Article basis, so that each ArticlesCourses record stores an average pageviews value based on data going back to the date of the first course edit.
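The second approach could look something like the sketch below. It assumes daily pageview data is available as a hash of date to view count, with days that have no views simply absent, and it treats those absent days as zero. `average_views_since` is a hypothetical name.

```ruby
require 'date'

# Average daily views over the whole period since the first course edit.
# daily_views maps Date => views; days missing from the hash count as zero.
def average_views_since(daily_views, first_edit:, today:)
  days = (today - first_edit).to_i
  return 0.0 unless days.positive?

  total = daily_views.sum do |date, views|
    date >= first_edit && date < today ? views : 0
  end
  total.to_f / days
end
```

Because the sandbox days are averaged in as zeros, multiplying this average by the days since the first edit gives back the actual total for that period rather than an inflated one.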

Another possible approach would be to directly calculate and store the cumulative pageviews for each ArticlesCourses record as part of the course update process (doing so, at most, once per day). This would be the most accurate, but might introduce a lot of additional latency to the update queues.
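A minimal sketch of the third approach, assuming each record remembers the last day already counted so that an update only adds the days since then. The method and its keyword arguments are hypothetical, and fetching `daily_views` from the pageviews API is left out.

```ruby
require 'date'

# Returns [new_view_count, updated_through]. updated_through is the last day
# already included in view_count, or nil if views were never counted.
def add_new_views(view_count:, updated_through:, first_edit:, daily_views:, today:)
  last_complete_day = today - 1
  from = updated_through ? updated_through + 1 : first_edit
  # Nothing new to add: this record was already updated today.
  return [view_count, updated_through] if from > last_complete_day

  new_views = daily_views.sum do |date, views|
    (from..last_complete_day).cover?(date) ? views : 0
  end
  [view_count + new_views, last_complete_day]
end
```

Running it a second time on the same day returns the record's values unchanged, which is one way to get the "at most once per day" behavior.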

vaidehi44 commented 1 year ago

Hey @ragesoss, of the approaches you mentioned, I feel the 3rd one is the best, but given the latency issue, the 2nd one could also work. I have some questions about the 2nd approach:

  1. It will require adding a new column to ArticlesCourses, right? (like average_page_views)
  2. How frequently will we update average_page_views? Once daily?
  3. It will use the WikiPageViews API to calculate the average, right?

vaidehi44 commented 1 year ago

Also, does the pageviews API not store data for sandbox articles, or for articles that are not in mainspace?

ragesoss commented 1 year ago

The pageviews API does provide data for non-mainspace pages (for example, https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=User:Ragesoss/sandbox), but I think we don't want to store any data for non-mainspace pages.

  1. Yes, to work efficiently I think it would probably require two new columns, average_page_views and earliest_edit.
  2. I think once per week would be fine here. (Updating the out-of-date averages should be part of the course update process.)
  3. Yes
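The weekly refresh in answer 2 could be decided with a check along these lines during the course update. This is only a sketch: it assumes a per-record "last updated" date, which is not one of the two columns named above, and `average_views_stale?` is a hypothetical name.

```ruby
require 'date'

# True when the stored average has never been calculated or is at least
# interval_days old. last_updated is a hypothetical per-record date.
def average_views_stale?(last_updated, today:, interval_days: 7)
  last_updated.nil? || (today - last_updated) >= interval_days
end
```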

(This is a pretty big issue!)

vaidehi44 commented 1 year ago

Also, I don't follow what you meant when you said

The Dashboard calculates its average daily views based on data since it hit mainspace

I thought average views are calculated from the last 50 days of views, so at some point the calculation must have included views from when the article was in a sandbox. https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/9f531b858559804173136eb8254ad47775ee09d3/lib/wiki_pageviews.rb#L41-L57

ragesoss commented 1 year ago

What I wrote in the description is not very clear, but the API only returns data for non-zero entries. Here's a better link to illustrate what I meant: https://pageviews.wmcloud.org/?project=sv.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=Jugend_i_Sverige

You can see that pageview data only starts from the day it was moved to mainspace.

I believe this is because the pageview data is based on URLs rather than Page records. So if you wanted pageview data from the sandbox period, you'd need to query for the name it had while it was still in sandbox.
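As an illustration of title-based lookups, the sketch below builds request URLs for the Wikimedia REST API's per-article pageviews endpoint, as I understand its layout. `pageviews_url` is a hypothetical helper, and the sandbox title in the usage lines is invented, since the article's actual sandbox name isn't given here.

```ruby
require 'date'
require 'uri'

# Builds a per-article pageviews URL. The title is part of the path, so a
# page that was moved has separate data under each title it has had.
def pageviews_url(project, title, start_date, end_date)
  encoded_title = URI.encode_www_form_component(title.tr(' ', '_'))
  'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/' \
    "#{project}/all-access/user/#{encoded_title}/daily/" \
    "#{start_date.strftime('%Y%m%d')}/#{end_date.strftime('%Y%m%d')}"
end

from = Date.new(2020, 10, 1)
to   = Date.new(2021, 3, 1)
puts pageviews_url('sv.wikipedia.org', 'Jugend i Sverige', from, to)
# Hypothetical sandbox title, for illustration only.
puts pageviews_url('sv.wikipedia.org', 'User:Example/Jugend i Sverige', from, to)
```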

vaidehi44 commented 1 year ago

OK... then there will always be issues with articles that come from sandboxes. So, don't you think we should try to store the time when an article is moved to mainspace?

ragesoss commented 1 year ago

That's... a good point I hadn't thought of. I guess storing some kind of start date will be necessary.

vaidehi44 commented 1 year ago

Yup. So, is there a place where we update the namespace of the article when it leaves the sandbox? Or somewhere else this start date could be found?

ragesoss commented 1 year ago

No, but I could incorporate that into the bigger project I'm working on to monitor sandbox progress for assigned topics. It might be best to leave this issue until then.