DistributedProofreaders / dproofreaders

Distributed Proofreaders is a web application intended to ease the process of converting public domain books into e-texts.
https://www.pgdp.net
GNU General Public License v2.0

Standardize cronjob scripts and surface errors more easily #1233

Open cpeel opened 2 weeks ago

cpeel commented 2 weeks ago

We currently have a collection of cronjob scripts in crontab/ that we tell admins to add to a user's crontab (see SETUP/dp.cron) for various background processing activities.

Each of these is its own special snowflake that outputs different information. The only way to determine whether one of them has failed is to look at the cron email and parse the results. Moreover, the email is the only record that they ran and of their results. All of the scripts run through Apache, which requires us to set a long TimeOut value (upwards of 30 to 60 minutes) to ensure these long-running scripts complete.

We need to create a standard way to run these jobs such that:

* We are notified via email if the script fails, but not if it succeeds
* The scripts are run outside of Apache (note that this may result in some file permissions issues to work through)

It probably makes sense to keep using cron to manage which job runs at what time -- why recreate that wheel? However, we could have a wrapper script that runs a named job such that there's one entrypoint for these background activities.

cpeel commented 2 weeks ago

One possible solution would be to create a new BackgroundJob class that individual cron job activities would inherit from. This parent class would handle logging the start and end times and detecting failures.

The child classes would output any status messages to a logging file somewhere -- not stdout. Upon failure the child class would throw an exception with a useful error message.
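
As a rough sketch of the shape this could take (everything here -- the class names, the go()/work() split, and the log file location -- is illustrative, not existing code):

```php
<?php

// A possible shape for the parent class. All names here are assumptions.
abstract class BackgroundJob
{
    public ?DateTime $start_time = null;
    public ?DateTime $end_time = null;

    // Child classes put their actual work here and throw an Exception
    // with a useful message on failure.
    abstract protected function work(): void;

    // The single entrypoint the runner calls; records timing and lets
    // any failure propagate up for the runner to report.
    final public function go(): void
    {
        $this->start_time = new DateTime();
        try {
            $this->work();
        } finally {
            $this->end_time = new DateTime();
            // record start/end time and success/failure, e.g. in job_logs
        }
    }

    // Status messages go to a log file, not stdout.
    protected function log(string $message): void
    {
        $path = sys_get_temp_dir() . '/' . static::class . '.log';  // placeholder location
        file_put_contents($path, date('c') . " $message\n", FILE_APPEND);
    }
}

// An example child class.
class ExampleCleanupJob extends BackgroundJob
{
    protected function work(): void
    {
        $this->log("starting cleanup");
        // ... do the actual work, throwing an Exception on failure ...
        $this->log("cleanup finished");
    }
}
```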

There would then be a single PHP script that accepts a class name as its first parameter. The script would check that the specified class exists and inherits from BackgroundJob, and if so instantiate it and run its go() function (or some such). If the function succeeds (no exceptions), it outputs nothing. If it fails, it outputs information to stdout, which the cronjob would then see and email to the user.

The entrypoint script would be called from the crontab.
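
The entrypoint could be a small runner along these lines (run_background_job.php is a hypothetical filename; it just implements the check-instantiate-run-report flow described above):

```php
<?php
// Hypothetical entrypoint, e.g. run_background_job.php, invoked as:
//   php run_background_job.php SomeJobClassName
// It prints nothing on success, so cron only sends email when a job fails.

// require_once 'base.inc';  // or whatever pulls in the site code and job classes

if ($argc < 2) {
    fwrite(STDERR, "Usage: php {$argv[0]} <job-class-name>\n");
    exit(1);
}

$class = $argv[1];

if (!class_exists($class) || !is_subclass_of($class, BackgroundJob::class)) {
    fwrite(STDERR, "$class is not a BackgroundJob subclass\n");
    exit(1);
}

try {
    $job = new $class();
    $job->go();
    // Success: no output, no cron email.
} catch (Throwable $e) {
    // Failure: this goes into the cron email.
    echo "$class failed: " . $e->getMessage() . "\n";
    exit(1);
}
```

The crontab entries would then collapse to one line per job, e.g. `0 2 * * * php /path/to/run_background_job.php ExampleCleanupJob`, and cron only sends mail when the script prints something.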

jmdyck commented 2 weeks ago

> Moreover, the email is the only record that they ran and of their results.

automodify.php is one exception to this, as the results of each run are saved to a file in d/stats/automodify_logs/

Also, automodify.php and take_tally_snapshots.php add entries to the job_logs table to record that they ran. (There used to be a few other jobs that added entries to that table, but not since June 2005.)

> * We are notified via email if the script fails, but not if it succeeds

Currently, I'd rather receive an empty email on success (for jobs that run daily or less often), because a lack of email could be for various reasons (cron failure, email failure, job is still running/is stuck), and I don't want those to look the same as success. However, I'm guessing that the new framework would include an easy way (dashboard?) to see the status/logs of recent cron jobs, so that would probably take care of my concern.

(It occurs to me that I'd probably want such a dashboard to highlight jobs that should have run but didn't, which is an interesting problem.)
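
One low-tech way to spot that (purely a sketch -- the job_logs column names and the use of plain PDO are assumptions, not the real schema or DB layer) would be to keep a map of each job's expected interval and flag any job whose latest log entry is older than that:

```php
<?php
// Sketch only: assumes job_logs has (job_name, start_time) columns and uses
// plain PDO with placeholder credentials; the real code would use the site's
// DB layer and actual schema.
$expected_gap_hours = [
    'ExampleCleanupJob'     => 25,          // daily job, with an hour of slack
    'ExampleWeeklySnapshot' => 7 * 24 + 1,  // weekly job
];

$db = new PDO('mysql:host=localhost;dbname=dp', 'user', 'pass');
$stmt = $db->prepare('SELECT MAX(start_time) FROM job_logs WHERE job_name = ?');

foreach ($expected_gap_hours as $job_name => $max_hours) {
    $stmt->execute([$job_name]);
    $last_run = $stmt->fetchColumn();
    if (empty($last_run) || time() - strtotime($last_run) > $max_hours * 3600) {
        echo "$job_name appears overdue (last run: " . ($last_run ?: 'never') . ")\n";
    }
}
```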

> * The scripts are run outside of Apache (note that this may result in some file permissions issues to work through)

I found an email from Charlz (from 2003) that says "due to the way PHP was installed on texts01 we cannot run php as a cron directly". So that was presumably the reason for cron jobs going through the web server.

But somehow I also remember file permission problems (e.g., a file created by a non-Apache invocation of PHP couldn't be read by Apache processes). Maybe that wasn't DP.

Another problem might be that a direct invocation of PHP doesn't have all the environment that mod_php sets up (or doesn't have it the way our code expects). So the wrapper might need to fake some of that.
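
If that turns out to matter, the wrapper could pre-populate the $_SERVER keys our code expects before loading anything else. Which keys (and values) actually need faking is an assumption to be checked against the codebase, but the idea is roughly:

```php
<?php
// Sketch: under CLI, $_SERVER lacks most of the keys mod_php provides, so a
// wrapper could fill in defaults for the ones the code relies on before
// loading it. Which keys/values actually matter here is an assumption.
$_SERVER += [
    'HTTP_HOST'     => 'www.pgdp.net',
    'SERVER_NAME'   => 'www.pgdp.net',
    'REQUEST_URI'   => '/crontab/' . basename(__FILE__),
    'REMOTE_ADDR'   => '127.0.0.1',
    'DOCUMENT_ROOT' => '/path/to/dproofreaders',  // placeholder path
];
```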

cpeel commented 1 week ago

@jmdyck - my current prototype leans into using the job_logs table for all background jobs. That table currently grows without bound, which seems neither ideal nor necessary -- and it will only get worse if we start logging more into it.

Seems reasonable to start pruning that table of records older than some period of time. I'm thinking 13 months (a full year with some extra). Thoughts?
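
For what it's worth, the pruning itself could be a one-line periodic delete (assuming a timestamp column to key off of -- start_time below is a placeholder -- and ignoring the site's DB layer):

```php
<?php
// Sketch: prune job_logs entries older than 13 months.
// Assumes a start_time timestamp column; the real column name may differ.
$db = new PDO('mysql:host=localhost;dbname=dp', 'user', 'pass');
$db->exec('DELETE FROM job_logs WHERE start_time < NOW() - INTERVAL 13 MONTH');
```

That delete could itself run as one of the new background jobs.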

jmdyck commented 1 week ago

I suppose it depends on what we envision the data being used for. E.g., if someone wondered "Has the duration of task T changed over time?", then they might want multiple years of log data. That seems pretty unlikely though.

In the normal run of things, it seems like we'd typically only be interested in a couple days of log data (e.g., to investigate a recent anomaly).

So 13 months of past data sounds okay, but I could also imagine 2 months or 2 weeks sounding equally reasonable. Maybe we should wait until we have experience with how we use the new job infrastructure.

Ultimately, I'm guessing data retention policy is a GM decision.