WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License

Recent Wikidata revisions cannot be imported #5802

Closed ragesoss closed 1 month ago

ragesoss commented 1 month ago

What is happening?

The revision IDs on Wikidata recently passed 2.14 billion, and can no longer be represented by a signed 4-byte integer.

The Dashboard (at least, in production) uses a 4-byte signed integer for the Revisions table for mw_rev_id, so attempting to import recent Wikidata revisions will cause a course update to error out.
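To make the limit concrete, here is a minimal Ruby sketch of the range check. The constant and method names are hypothetical and not taken from the Dashboard codebase; the only fact relied on is that a signed 4-byte integer tops out at 2**31 - 1.

```ruby
# Largest value a signed 4-byte (32-bit) integer column can hold.
INT32_MAX = 2**31 - 1 # 2_147_483_647

# Hypothetical helper: true if a revision ID still fits in a signed 4-byte column.
def fits_signed_int32?(rev_id)
  rev_id.between?(-2**31, INT32_MAX)
end

fits_signed_int32?(2_147_483_647) # => true
fits_signed_int32?(2_147_483_648) # => false
```

Any mw_rev_id above INT32_MAX would be rejected or mangled by such a column, which matches the import failure described above.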

We'll need to update the Revisions table... although production may be too large to do this.

ragesoss commented 1 month ago

@gabina this is related to your project, and I'd like to discuss it with you. I'm afraid to try to update the Revisions table schema in Programs & Events Dashboard production, but maybe we could delete everything in the Revisions table first, then do a migration to change the schema. What I'd like your view on is whether you can foresee any problems if we delete all the contents of the Revisions table (and then start re-importing revisions for courses that are still going on).

gabina commented 1 month ago

wow interesting problem

What I'd like your view on is whether you can foresee any problems if we delete all the contents of the Revisions table (and then start re-importing revisions for courses that are still going on).

Well... if I'm not mistaken, we don't have DB restrictions like primary/foreign keys (only some unique constraints), so removing all the historical Revisions records shouldn't be a big deal from that point of view. At first glance, I would say that historical data that depends on the Revisions table would stop working. For example, the Revisions table that you see here will be empty, I think (because it's for an ended course that won't have any Revisions data once the data in the table is removed). So I think that all the points that I listed in this doc (except the cache-related ones, which would persist in the other tables) would be affected, at least for already-ended courses. The good thing is that we were going to solve that problem anyway, because sooner rather than later we were going to get rid of the Revisions table. I think there would always be the possibility of running a manual update for those already-ended courses if a user requests it (but doing so for many courses at once would saturate the system, I guess).

I think the best way to get a better answer to this question is to try to simulate the process. I can try to do that tomorrow if that makes sense. I could download some courses locally, explore them, remove their Revisions rows to see what happens, and then try a manual update, or run the usual update for ongoing courses, to be sure that works smoothly. It won't be perfect, but I think it's worth it?

gabina commented 1 month ago

My two cents on the general problem.

ragesoss commented 1 month ago

Thanks! It would be great if you can do the simulation you have in mind tomorrow. I'm not worried about (for example) the revisions table on the Student view. I'll review the other endpoints that do rely on the Revisions table, and I'll also prepare and test a migration to make new Wikidata revisions work. I could also make a duplicate of the Wiki Education production database and see how long it takes to a) migrate without dropping the revisions, and b) drop the revisions. Both of those will give us an idea of how best to proceed for each production server.

ragesoss commented 1 month ago

It looks like Revision.connection.truncate('revisions') is very efficient at clearing the table, and the lack of foreign key relationships at the database level means there shouldn't be any problem with that step. I tested this on our staging server (which only had a relatively small number of revisions).

We also have surprisingly few places that point to Revision rows even just on the Rails side of the database behavior. The only case of revision_id in the schema is on the Alerts table. We have several alert types that will store a Revision ID... although even in those cases, I don't think we actually use that anywhere that will break. The generic alert mailer will include a diff URL if alert.revision is present, but deleting the revision without removing the alert.revision_id value will just result in alert.revision => nil, and won't introduce any errors.
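The nil-safe behavior described above can be illustrated with plain Ruby objects. This is only a sketch: `Revision`, `Alert`, `REVISIONS`, and `diff_url_for` are stand-ins invented for illustration, and the URL format is a placeholder rather than the Dashboard's actual mailer logic.

```ruby
# Stand-in for the real model; a hypothetical simplification.
Revision = Struct.new(:id, :mw_rev_id)

# Simulated revisions table, keyed by primary key. Empty, as after a truncate.
REVISIONS = {}

Alert = Struct.new(:revision_id) do
  # Mimics an association lookup that yields nil when the row is gone.
  def revision
    REVISIONS[revision_id]
  end
end

# Placeholder diff URL builder: returns nil instead of raising when the
# revision row no longer exists.
def diff_url_for(alert)
  revision = alert.revision
  return nil if revision.nil?

  "https://www.wikidata.org/w/index.php?diff=#{revision.mw_rev_id}"
end

stale_alert = Alert.new(42)
diff_url_for(stale_alert) # no revision row, so no URL and no error
```

The point of the sketch is that a dangling revision_id degrades to "no diff link" as long as the caller checks for nil before using the revision.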

I am pretty confident that new updates on any still-current course will just result in re-importing all the revisions. So it will go very slowly while it catches up with all the courses that are still undergoing updates, but no manual updates should be required.

I haven't found any other things that will be disrupted except the revisions tables on the Students tab of Course pages, but I'm not confident that I haven't missed something. But, at least, I think this won't badly break the site.

ragesoss commented 1 month ago

The one piece of somewhat important data that we won't be able to recover is the ithenticate_id field. This is used to indicate that a revision got flagged as potential plagiarism, and the ID is used to access the Turnitin (aka iThenticate) report about matching text content.

This is not important data on Programs & Events Dashboard.

It's data I'd like to keep for Wiki Education Dashboard, so I'll write a script to collect all the mw_rev_ids and ithenticate_ids from production to save offline.
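A backup along those lines might look roughly like the following sketch. It uses hard-coded hashes in place of real rows from the Revisions table, and the method names `flagged_pairs` and `to_csv_text` are hypothetical; the actual script would need to read from the production database instead.

```ruby
# Hypothetical rows standing in for records from the revisions table.
rows = [
  { mw_rev_id: 1001, ithenticate_id: nil },
  { mw_rev_id: 1002, ithenticate_id: 555 },
  { mw_rev_id: 1003, ithenticate_id: 556 }
]

# Keep only flagged revisions and reduce each to an
# [mw_rev_id, ithenticate_id] pair.
def flagged_pairs(rows)
  rows.reject { |row| row[:ithenticate_id].nil? }
      .map { |row| [row[:mw_rev_id], row[:ithenticate_id]] }
end

# Render the pairs as CSV text. Both fields are integers, so no quoting
# or escaping is needed.
def to_csv_text(pairs)
  (['mw_rev_id,ithenticate_id'] + pairs.map { |pair| pair.join(',') }).join("\n")
end

puts to_csv_text(flagged_pairs(rows))
```

Only rows with an ithenticate_id are kept, since those are the ones that cannot be rebuilt by re-importing revisions.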

gabina commented 1 month ago

I downloaded a course locally this morning, and then deleted the Revisions rows for it. I wrote a bit about the experience here, but I did not find anything unexpected. Re-running an update for an ended course would fix the missing things if someone needs it.

The one piece of somewhat important data that we won't be able to recover is the ithenticate_id field. This is used to indicate that a revision got flagged as potential plagiarism, and the ID is used to access the Turnitin (aka iThenticate) report about matching text content.

Yes, I agree with that. Plagiarism data is ingested in a process separate from the course stats update: the tool we use "gets the most recent instances of plagiarism", and then we match that data against existing revisions in the table. I didn't find more details about what "most recent instances" means, so I don't know whether there is an easy way to get previous "instances of plagiarism".

gabina commented 1 month ago

I am pretty confident that new updates on any still-current course will just result in re-importing all the revisions. So it will go very slowly while it catches up with all the courses that are still undergoing updates, but no manual updates should be required.

Yes, looking at the logic in the RevisionImporter code, I think that even for non-full updates, all the revisions would be re-imported if there are no revisions for the course users. I even simulated that situation for a course locally, and results are the same when calling UpdateCourseStats for full or non-full updates. If I'm not mistaken, manual updates should only be required for already ended courses (in case someone is interested in them).
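The behavior observed here can be summarized with a small sketch. It is an assumption about the general shape of the logic, not a copy of RevisionImporter: the method name `import_start` and its parameters are hypothetical, and the real code may decide the starting point differently.

```ruby
require 'date'

# Hypothetical simplification: choose the date from which revisions are
# fetched. A full update, or a course with no stored revisions, starts
# again from the course start date.
def import_start(course_start, latest_stored_revision_date, full_update: false)
  return course_start if full_update || latest_stored_revision_date.nil?

  latest_stored_revision_date
end

course_start = Date.new(2024, 1, 15)

# After the table is emptied there is no stored revision, so full and
# non-full updates pick the same starting point.
import_start(course_start, nil)
import_start(course_start, nil, full_update: true)
```

Under this assumption, emptying the table makes every ongoing course behave as if it were being imported for the first time, which is why no manual update would be needed for them.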