WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License

[Data rearchitecture] Do not count system revisions #5949

Closed gabina closed 2 months ago

gabina commented 2 months ago

What this PR does

This PR updates the timeslice behavior to exclude revisions whose system field is set to true. While debugging course stats discrepancies for Latinx Past Wiki Scholars between production and the data-rearchitecture instance (which has access to the production db), I discovered that the production version excludes system revisions from metrics while the timeslice version does not. This PR also modifies the specs to verify that system revisions are now discarded.

Open questions and concerns

I tested the changes locally using the Latinx Past Wiki Scholars course and got the same results as in production. However, I noticed something important. The query that we run against the replica to retrieve revisions takes a tags parameter, which is determined on the dashboard side based on the oauth_ids env variable. The system value is defined in the revisions query as follows:

case when ct.ct_tag_id IS NULL then 'false' else 'true' end as system

The tags parameter is used in a LEFT JOIN:

LEFT JOIN change_tag_def ctd
      ON ctd.ctd_name IN ($tags)
LEFT JOIN change_tag ct
    ON ct.ct_rev_id = c.rev_id
    AND ct.ct_tag_id = ctd.ctd_id

This means that the revision query results (in particular, the system value) depend on the tags parameter. This matters when comparing course stats across instances: if their oauth_ids env variables are set to different values, the stats will differ.
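To illustrate the dependence described above, here is a minimal in-memory sketch of the two LEFT JOINs for a single revision. The table contents, the ctd_id values, the 'OAuth CID: N' tag names, and the system_rows helper are all assumptions made for illustration; they are not taken from the dashboard code or the replica.

```ruby
# Toy stand-ins for the replica tables; ids and tag names are assumptions.
TAG_DEFS = [
  { ctd_id: 10, ctd_name: 'OAuth CID: 218' },
  { ctd_id: 11, ctd_name: 'OAuth CID: 4978' }
].freeze

# change_tag rows: which tag ids are attached to which revisions.
CHANGE_TAGS = [
  { ct_rev_id: 1_239_145_084, ct_tag_id: 11 }
].freeze

# Hypothetical helper that mimics the two LEFT JOINs for one revision id.
# Returns one hash per result row, with `system` computed like the CASE
# expression in the query.
def system_rows(rev_id, tags)
  defs = TAG_DEFS.select { |ctd| tags.include?(ctd[:ctd_name]) }
  # A LEFT JOIN with no match still yields one row, with NULL columns.
  defs = [nil] if defs.empty?
  defs.map do |ctd|
    tagged = CHANGE_TAGS.any? do |row|
      ctd && row[:ct_rev_id] == rev_id && row[:ct_tag_id] == ctd[:ctd_id]
    end
    { mw_rev_id: rev_id, system: tagged ? 'true' : 'false' }
  end
end

# Same revision, different tags parameter, different system value.
p system_rows(1_239_145_084, ['OAuth CID: 218'])
p system_rows(1_239_145_084, ['OAuth CID: 4978'])
```

With these toy tables, the first call reports system as 'false' and the second as 'true' for the same revision, which is why two instances configured with different oauth_ids can disagree on stats.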

Finally, if oauth_ids is set to multiple values (for example, '218,306,542,4978'), the raw revision data contains duplicated mw_rev_id records, because the LEFT JOIN on change_tag_def yields one row per matching tag definition. For example, when we retrieve rev 1239145084 for en.wikipedia with oauth_ids set to 218,306,542,4978, we get the revision 4 times, and only one copy has system set to true. That copy comes from the Tag: dashboard.wikiedu.org [2.3], which corresponds to OAuth CID 4978 according to the tag definitions.


If we use multiple values for oauth_ids in production, we should guarantee that we always keep the copy of the revision that has system set to true (if any).

[
  [
    "77560761",
    {
      "article" => {
        "mw_page_id" => "77560761",
        "title" => "Agritodeguerra",
        "namespace" => "2",
        "wiki_id" => 1
      },
      "revisions" => [
        {
          "mw_rev_id" => "1239145084",
          "date" => Wed, 07 Aug 2024 15:54:10 +0000,
          "characters" => "136",
          "mw_page_id" => "77560761",
          "username" => "Agritodeguerra",
          "new_article" => "true",
          "system" => "false",
          "wiki_id" => 1
        },
        {
          "mw_rev_id" => "1239145084",
          "date" => Wed, 07 Aug 2024 15:54:10 +0000,
          "characters" => "136",
          "mw_page_id" => "77560761",
          "username" => "Agritodeguerra",
          "new_article" => "true",
          "system" => "false",
          "wiki_id" => 1
        },
        {
          "mw_rev_id" => "1239145084",
          "date" => Wed, 07 Aug 2024 15:54:10 +0000,
          "characters" => "136",
          "mw_page_id" => "77560761",
          "username" => "Agritodeguerra",
          "new_article" => "true",
          "system" => "true",
          "wiki_id" => 1
        },
        {
          "mw_rev_id" => "1239145084",
          "date" => Wed, 07 Aug 2024 15:54:10 +0000,
          "characters" => "136",
          "mw_page_id" => "77560761",
          "username" => "Agritodeguerra",
          "new_article" => "true",
          "system" => "false",
          "wiki_id" => 1
        }
      ]
    }
  ]
]
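
One way to keep the system copy when a revision is duplicated, as suggested above, is to group the raw rows by wiki and revision id and prefer the copy flagged as system. This is only a sketch: the dedupe_revisions helper is hypothetical and is not part of this PR, and it assumes revisions are hashes shaped like the raw data above, with "system" stored as the string "true" or "false".

```ruby
# Hypothetical helper: collapses rows that share wiki_id and mw_rev_id into
# a single row, keeping the copy with system == "true" when one exists.
def dedupe_revisions(revisions)
  revisions
    .group_by { |rev| [rev['wiki_id'], rev['mw_rev_id']] }
    .map do |_key, copies|
      copies.find { |rev| rev['system'] == 'true' } || copies.first
    end
end

# Four copies of the same revision, as in the raw data above (trimmed to
# the fields this sketch needs).
raw = %w[false false true false].map do |flag|
  { 'mw_rev_id' => '1239145084', 'wiki_id' => 1, 'system' => flag }
end

deduped = dedupe_revisions(raw)
# Dropping system revisions after deduplication leaves nothing to count
# for this revision.
countable = deduped.reject { |rev| rev['system'] == 'true' }
p deduped.size
p countable.size
```

Deduplicating before filtering matters here: rejecting system rows first would leave three non-system copies of the same revision in the metrics.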
gabina commented 2 months ago

Failing specs are not related to these changes:

  1) article finder performs searches and returns results
     Failure/Error: expect(error.level).not_to eq('SEVERE'), error.message

       http://127.0.0.1:41553/assets/javascripts/app_assets_javascripts_components_app_jsx.js 48683:12 "Error: " Error: JSONP request to https://en.wikipedia.org/w/api.php?action=query&format=json&origin=*&prop=revisions&rvprop=userid%7Cids%7Ctimestamp&titles=Snapchat%20dysmorphia failed
           at jsonpScript.onerror (http://127.0.0.1:41553/assets/javascripts/vendors.js:61876:16)
     # ./spec/rails_helper.rb:118:in `block (4 levels) in <top (required)>'
     # ./spec/rails_helper.rb:109:in `each'
     # ./spec/rails_helper.rb:109:in `block (3 levels) in <top (required)>'
     # ./spec/rails_helper.rb:108:in `block (2 levels) in <top (required)>'

  2) LiftWingApi#get_revision_data fetches json from api.wikimedia.org for wikipedia
     Failure/Error: expect(subject0.dig('829840085', 'wp10').to_f).to eq(29.15228958136511656)

       expected: 29.152289581365117
            got: 29.15228958136513

       (compared using ==)
     # ./spec/lib/lift_wing_api_spec.rb:43:in `block (4 levels) in <top (required)>'
     # ./spec/lib/lift_wing_api_spec.rb:35:in `block (3 levels) in <top (required)>'