DistributedProofreaders / dproofreaders

Distributed Proofreaders is a web application intended to ease the process of converting public domain books into e-texts.
https://www.pgdp.net
GNU General Public License v2.0
46 stars 28 forks source link

Make past_tallies a sparse table #1199

Closed cpeel closed 3 weeks ago

cpeel commented 2 months ago

Background

Background pulled from #1190

The past_tallies table stores historical round, user, and team statistics (current_tallies stores the current day's statistics) including the number of pages done that day (tally_delta). Once a day the site takes a snapshot of current_tallies, calculates the delta since the last snapshot, and adds a row in the past_tallies table for every [tally_type, tally_name, holder_id] tuple.

After a user has done a page in a round, they have an entry for that round in current_tallies and the site adds a new record in past_tallies for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.

At pgdp.net, 24 years of this has caught up with us. The past_tallies table is 50GB in size and contains 598,611,362 rows. Of those 594,322,594 are records where tally_delta=0 -- 99%.

This PR

This PR converts past_tallies to a sparse table wherein we do not store entries where tally_delta==0. It includes an upgrade script to delete those records.

We use past_tallies in a few primary ways:

  1. To get the sum of pages done over a specific time span eg: last week, last month
    • A sparse table format where we do not store rows where tally_delta == 0 isn't a problem for this one. We just sum(tally_delta) over a time range anyway -- we don't need the zeros.
  2. To get the number of pages done at a specific time (usually "yesterday")
    • This one we could calculate from a sparse past_tallies pretty easily.
  3. To get the number of pages done every day over some span for a table or a chart
    • Without some sort of secondary table that records all of the snapshot times we can't accurately generate the data because we don't know what days are in-between the records where tally_delta <> 0. past_tallies itself isn't set up to query against timestamp and there's nothing that guarantees that every snapshot will have at least one non-zero tally_delta. (This might be practically true on pgdp.net but isn't guaranteed to be true for all sites.)
  4. In concert with current_tallies to determine how many pages they did between yesterday and today to add new entries into `past_tallies
    • We can programmatically determine this from a sparse past_tallies but we add a whole lot of computation for every snapshot in ways that just don't make sense.

This PR introduces 1 new tables:

And adds additional columns to another:

past_tallies is now touched in only the following places:

Testing

Testing/validating this is going to be a bit tricky because it involves changing data. I've created a separate database on TEST that is a copy of the current one. On this copy I've run the upgrade scripts and the sandbox below is using it. There is a crontab that refreshes the sparse DB after UTC midnight so they will only have at most a day of drift.

The sandbox has a nightly cronjob to take snapshots of the database but it is currently disabled. What this means is that the database will drift throughout a day and catch up at midnight UTC.

I suggest we test this in two phases:

  1. Validate that the overall approach is sound, the non-snapshot-taking code works, and that the sparse DB sandbox matches the main TEST sandbox. This is going to be confirming that the user's "yesterday" stats are correct and that the stats (graphs and tables) match across the rounds.
  2. After we feel good with the code and the sparse format functionality I'll disable the sync and enable the snapshot crontab and we can validate that the nightly snapshot code is working like we want.

Sandbox: https://www.pgdp.org/~cpeel/c.branch/revamp-past-tallies-sandbox/

Current past_tallies sizes on TEST

# Main DB
mysql> select count(*) from past_tallies;
+----------+
| count(*) |
+----------+
|  2304294 |
+----------+
1 row in set (1.69 sec)

# Sparse DB
mysql> select count(*) from past_tallies;
+----------+
| count(*) |
+----------+
|     9027 |
+----------+
1 row in set (0.01 sec)
cpeel commented 2 months ago

I've created this testing code for a much more automated testing approach. The goal is to check that the data gathering functions in the current code against the main TEST database matches the new code against the sparse DB:

<?php
$relPath = "./pinc/";
include_once($relPath."base.inc");

include($relPath."udb_user.php");
echo "DB: $db_name\n";

foreach (get_all_current_tallyboards() as $tallyboard) {
    if ($tallyboard->holder_type != 'S') {
        continue;
    }
    echo "tally_name: {$tallyboard->tally_name}, holder_type: {$tallyboard->holder_type}\n";
    $time_of_latest_snapshot = $tallyboard->get_time_of_latest_snapshot();
    echo "    get_time_of_latest_snapshot: $time_of_latest_snapshot\n";
    echo "    get_info_from_latest_snapshot:\n";
    dump_array($tallyboard->get_info_from_latest_snapshot(1));
    echo "    get_info_re_largest_delta:\n";
    dump_array($tallyboard->get_info_re_largest_delta(1));
    $timestamps = [
        [0, $time_of_latest_snapshot + 100],
        [1430438400, $time_of_latest_snapshot + 100],
        [1639958400, $time_of_latest_snapshot + 100],
        [1430438400, 1639958400],
    ];
    echo "    get_delta_sum:\n";
    foreach ($timestamps as [$start, $end]) {
        echo "        $start-$end: " . $tallyboard->get_delta_sum(1, $start, $end) . "\n";
    }
    echo "    get_deltas:\n";
    dump_array($tallyboard->get_deltas(1, 0));
}

function dump_array($array)
{
    echo preg_replace("/^/m", "        ", json_encode($array, JSON_PRETTY_PRINT | JSON_NUMERIC_CHECK));
    echo "\n";
}

I ran this against both and did a diff between them:

cpeel@ip-172-31-12-95:~/public_html/c.branch/revamp-past-tallies-sandbox$ php test-Tallyboard.php  > /tmp/sparse.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ php test-Tallyboard.php > /tmp/full.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ diff /tmp/full.txt /tmp/sparse.txt
1c1
< DB: dp_utf8test
---
> DB: dp_project_shrink
cpeel commented 1 month ago

I've rebased the branch to fix the merge conflict.

I've updated the cronjob on TEST so that just before every UTC midnight the sandbox DB gets updated from the main TEST DB and the migration scripts run on it. Then just past midnight the nightly snapshot happens for the snapshot DB.

cpeel commented 1 month ago

These are great feedback Michael, thanks. I'll work through them tonight.

cpeel commented 1 month ago

Latest commit addresses the good feedback from @jmdyck.

The aforementioned validation script still shows no functional differences between the functions in master and those in this sandbox.

cpeel commented 1 month ago

My intent is to get this merged in before June 8th and rolled out in the June 12th deployment.