WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License
385 stars 600 forks source link

[WIP] Migrate Training Update Process to Sidekiq #5754

Open ujjwalpathaak opened 3 months ago

ujjwalpathaak commented 3 months ago

fixes #4712 This PR aims to migrate the training update process to Sidekiq. So that there are free resources available on the main thread.

ujjwalpathaak commented 3 months ago

Hello @ragesoss, I have tried to move the process to Sidekiq it is working fine. LMKWYT if this seems feasible, I'll continue with this approach to add other features.

ujjwalpathaak commented 2 months ago

Hey, I tried a few things to identify if there is a Sidekiq job already running? I am not able to figure out a clean way to do it.

You suggested to handle this on the database level but if I add a boolean column to every training content, will it be a nice way to do things? As needs_update only updates those courses which have this bool set to true, whereas libraries and modules gets updated on each ./reload request?

ujjwalpathaak commented 2 months ago

Few of the options I tried-

  1. Adding "redis" as a gem

    • using redis.set('training_update_in_progress', true) as a check if an older update is already running? This would need cleanup code in case of unexpected server termination as training_update_in_progress is set false after loading of slides is done by sidekiq.
  2. Get all jobs from the default queue and check if "TrainingBaseWorker" job exists.

    • This is not consistent ad it takes time for TrainingBaseWorker to show up int he default queue.
  3. We are using the gem 'sidekiq-unique-jobs' as sidekiq_options lock: :until_executed, it has a few options Link to SidekiqUniqueJobs documentation to solve these conflicts

    • but the problem is that it would not give feedback to the user on client that another job is running but will notify this conflict on server side
ragesoss commented 2 months ago

I don't know the best approach here, but one that might work would be:

Include a column for update status on TrainingLibrary and TrainingModule records, but make it able to handle more details than just a boolean state. (It might or might not make sense to include more than one column.) The idea would be to be able to keep track of whether it was not scheduled for an update, scheduled for an update that hasn't started yet, or started an update that has not successfully completed yet (with a timestamp for when it started). So the update process would update that status in the database as soon as it starts, and then we could use some logic like 'if it's been more than X amount of time since the start of an update, check whether an update process is running and if not, it probably errored'.

Then, except for that case where an update failed but the status still shows it being updated, we wouldn't need to check Redis to show state to the user.

ujjwalpathaak commented 2 months ago

Hi @ragesoss, So I tried the approach you suggested, let me know if you suggest any changes in it. I have added a few tasks in the PR desc that i think are still needed to close this PR, I'll add more as i encounter them.

Here is the working ( not written tests for them till now but it should work fine after i write tests and do minor changes accordingly )

For errors like InvalidWikiContentError & NoMatchingWikiPagesFound - all the rows in the table will have the error as this is not slug specific. For errors like DuplicateSlugError - only the rows which have the errored_slug would contain the message.

After hitting on /reload_trainings a timer (currently for 15min) would run up with all the rows in training_libraries changes their update_status to 1 ( scheduled ).. as we will start to fetch content from wikimedia for a slug it will change state to 2 ( stared updating ) then after calling .inflate on that slug if no errors occur it will again default to 0 showing update complete

if errors occurs it is handled by not setting state of 2 to 0 so we could catch if error occured. After 15min the timer would check for all rows if a row has 2 as update_status then it would render the error stored in update_error of that row to the user

we can also check if a thread of this 15min timer is currently running to see if a update is already running?

I still have a few blockers as to how to deal with yaml content should i entierly exclude it from the new workflow or just add conditions in between and make it work & how to deal with ModuleNotFound error as to where to store if this occurs.

I'll work on them and add new commits as I figure out the issues.

ujjwalpathaak commented 2 months ago

@ragesoss I had a doubt. I was getting a DuplicateSlugError error for "editing-basics" slug while loading it from wiki_pages. what is the expected behaviour here? should the update_process skip it and go ahead or it should show the error to user and terminal the process?

ragesoss commented 2 months ago

That typically happens because you have done training updates in both wiki_education true and false modes, and there is one module with that slug that gets pulled from the wiki (in false mode) and another from the .yml files (in true mode). Delete all the TrainingModule and TrainingSlide records and you should be able to proceed.

ujjwalpathaak commented 2 months ago

I wanted to ask how should these erros in prod behave? if an error is raised should it stop the workflow or that slug should be skipped and the workflow should continue.

ragesoss commented 2 months ago

Hopefully such errors won't occur in production, but skipping that slug and continuing would be nice

ujjwalpathaak commented 2 months ago

Hey @ragesoss, For errors like NoMatchingWikiPageFound, NoModuleFound and more... I cannot target a single row, currenlty I am solving this problem using a tacky solution but I am not finding that implementation nice.

I was wondering if it is feasible to make a new table to record the errors that are raised during the background update process so we dont need to iterate on all the rows in all 3 tables to do any operations.

Table Name - training_update_status
------ 1. row ------
content_class: TrainingLibrary
update_status: 0 // not_scheduled
error_message: NULL

------ 2. row ------
content_class: TrainingModule
update_status: 1 // scheduled
error_message: NULL

------ 3. row ------
content_class: TrainingSlide
update_status: 2 // error_occured
error_message: could not get links from '#{@wiki_base_page}'

This would reduce the complexity of the workflow + reduce iterations over the dashboard

ragesoss commented 2 months ago

I'd rather not add a table just to hold this information. If the idea would be to limit the data to just one entry per record type, perhaps a Setting record could be used to store the status.

ujjwalpathaak commented 2 months ago

Okay I'll use setting records to store the data, and I'll start to write tests to test the functionality

ujjwalpathaak commented 2 months ago

Hey @ragesoss I have been trying to tweek a lot of things here and there, but still there are a few loose ends to tie up. If you could can you have a look and maybe guide me what all should I keep in mind ahead from here in terms of testing and expected behaviour?

ujjwalpathaak commented 2 months ago

Hey @ragesoss, Most of the functionality is now working for me locally, but there is one issue I need some assistance with.

I am adding a new initializer - WikiEduDashboard/config/initializers/training_update_process_cleanup.rb

require_dependency "#{Rails.root}/lib/data_cycle/batch_update_logging"
require_dependency "#{Rails.root}/lib/training/training_base"

include BatchUpdateLogging

delete_pid_file(:training)
TrainingBase.new.update_process_state(0)
TrainingBase.new.clear_error_messages

This is to basically clean any leftovers from the last run ( in development ), this is working locally for me but when I push this to github I get this error -

Run mkdir tmp -m 777
  mkdir tmp -m 777
  cp config/application.example.yml config/application.yml
  cp config/database.example.yml config/database.yml
  bin/rails db:migrate RAILS_ENV=test
  shell: /usr/bin/bash -e {0}
  env:
    RAILS_ENV: test
    DATABASE_PORT: 3306
rails aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Table 'dashboard_testing.settings' doesn't exist

am I missing a step or doing something wrong?

ragesoss commented 2 months ago

I don't see that error in the CI log. Where during the build is it happening?

ujjwalpathaak commented 2 months ago

Commit - https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/5754/commits/e00b7e0bb168459450b08d875a1ecaa5d184c145 Test Logs for the above commit - https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/5754/checks?check_run_id=24057578970

Started to after this commit