gabina closed 2 months ago
Failing specs are not related to these changes:
1) article finder performs searches and returns results
Failure/Error: expect(error.level).not_to eq('SEVERE'), error.message
http://127.0.0.1:41553/assets/javascripts/app_assets_javascripts_components_app_jsx.js 48683:12 "Error: " Error: JSONP request to https://en.wikipedia.org/w/api.php?action=query&format=json&origin=*&prop=revisions&rvprop=userid%7Cids%7Ctimestamp&titles=Snapchat%20dysmorphia failed
at jsonpScript.onerror (http://127.0.0.1:41553/assets/javascripts/vendors.js:61876:16)
# ./spec/rails_helper.rb:118:in `block (4 levels) in <top (required)>'
# ./spec/rails_helper.rb:109:in `each'
# ./spec/rails_helper.rb:109:in `block (3 levels) in <top (required)>'
# ./spec/rails_helper.rb:108:in `block (2 levels) in <top (required)>'
2) LiftWingApi#get_revision_data fetches json from api.wikimedia.org for wikipedia
Failure/Error: expect(subject0.dig('829840085', 'wp10').to_f).to eq(29.15228958136511656)
expected: 29.152289581365117
got: 29.15228958136513
(compared using ==)
# ./spec/lib/lift_wing_api_spec.rb:43:in `block (4 levels) in <top (required)>'
# ./spec/lib/lift_wing_api_spec.rb:35:in `block (3 levels) in <top (required)>'
What this PR does
This PR updates the timeslices behavior to exclude revisions where the `system` field is set to `true`. After debugging course stats discrepancies between Latinx Past Wiki Scholars in production and in the data-rearchitecture instance, which has access to the production db, I discovered that the production version excludes system revisions from metrics while the timeslice version does not. This PR also modifies the specs to guarantee that system revisions are now discarded.
Open questions and concerns
I was able to test the changes locally using the Latinx Past Wiki Scholars course and I got the same results as in production. However, I noticed something important. The revision query that we run against the replica to retrieve revisions takes a `tags` parameter, which is determined on the dashboard side based on the `oauth_ids` env variable. The `system` value is defined in the revisions query as follows:
`case when ct.ct_tag_id IS NULL then 'false' else 'true' end as system`
The `tags` parameter is used in a LEFT JOIN.
This means that the revision query results (particularly the `system` value) depend on the `tags` parameter. This is important to take into account when comparing course stats from different instances: if they have the `oauth_ids` env variable set to different values, then stats will differ.
Last but not least, if `oauth_ids` is set to multiple values (for example, '218,306,542,4978'), then the raw revision data has duplicated `mw_rev_id` records. For example, when we retrieve rev 1239145084 for en.wikipedia with `oauth_ids` set to 218,306,542,4978, we get the revision 4 times, while only one copy has `system` set to `true`. This is due to the `Tag: dashboard.wikiedu.org [2.3]` tag, which is related to OAuth CID 4978 according to the tag definitions.
If we use multiple values for `oauth_ids` in production, then we should guarantee that we always keep the revision version that has `system` set to `true` (if any).
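One way to keep the `system` copy when duplicates appear could look like the sketch below. `dedupe_revisions` is a hypothetical helper, and the row shape (string keys, string `'true'`/`'false'` flags) is an assumption; the sample rows are made up to imitate the duplicated output described above, and their order is arbitrary.

```ruby
# Hypothetical helper: collapse duplicated rows to one per mw_rev_id,
# preferring the copy whose system flag is 'true' when there is one.
def dedupe_revisions(rows)
  rows.group_by { |row| row['mw_rev_id'] }.map do |_rev_id, copies|
    copies.find { |row| row['system'].to_s == 'true' } || copies.first
  end
end

# Made-up rows: one revision returned four times, plus an unrelated one.
rows = [
  { 'mw_rev_id' => 1239145084, 'system' => 'false' },
  { 'mw_rev_id' => 1239145084, 'system' => 'false' },
  { 'mw_rev_id' => 1239145084, 'system' => 'true' },
  { 'mw_rev_id' => 1239145084, 'system' => 'false' },
  { 'mw_rev_id' => 42, 'system' => 'false' }
]

deduped = dedupe_revisions(rows)
deduped.size                       # => 2
deduped.first['system']            # => "true"
```

Deduplicating before the `system` filter runs would mean a revision tagged by any of the configured OAuth consumers is treated as a system revision, regardless of which duplicate the database happens to return first.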