cpeel closed this 3 weeks ago
I've created this testing code for a much more automated testing approach. The goal is to check that the data-gathering functions in the current code, run against the main TEST database, produce the same results as the new code run against the sparse DB:
```php
<?php
$relPath = "./pinc/";
include_once($relPath."base.inc");
include($relPath."udb_user.php");

echo "DB: $db_name\n";

foreach (get_all_current_tallyboards() as $tallyboard) {
    if ($tallyboard->holder_type != 'S') {
        continue;
    }
    echo "tally_name: {$tallyboard->tally_name}, holder_type: {$tallyboard->holder_type}\n";

    $time_of_latest_snapshot = $tallyboard->get_time_of_latest_snapshot();
    echo " get_time_of_latest_snapshot: $time_of_latest_snapshot\n";

    echo " get_info_from_latest_snapshot:\n";
    dump_array($tallyboard->get_info_from_latest_snapshot(1));

    echo " get_info_re_largest_delta:\n";
    dump_array($tallyboard->get_info_re_largest_delta(1));

    $timestamps = [
        [0, $time_of_latest_snapshot + 100],
        [1430438400, $time_of_latest_snapshot + 100],
        [1639958400, $time_of_latest_snapshot + 100],
        [1430438400, 1639958400],
    ];
    echo " get_delta_sum:\n";
    foreach ($timestamps as [$start, $end]) {
        echo " $start-$end: " . $tallyboard->get_delta_sum(1, $start, $end) . "\n";
    }

    echo " get_deltas:\n";
    dump_array($tallyboard->get_deltas(1, 0));
}

function dump_array($array)
{
    echo preg_replace("/^/m", " ", json_encode($array, JSON_PRETTY_PRINT | JSON_NUMERIC_CHECK));
    echo "\n";
}
```
I ran this against both databases and diffed the output:

```
cpeel@ip-172-31-12-95:~/public_html/c.branch/revamp-past-tallies-sandbox$ php test-Tallyboard.php > /tmp/sparse.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ php test-Tallyboard.php > /tmp/full.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ diff /tmp/full.txt /tmp/sparse.txt
1c1
< DB: dp_utf8test
---
> DB: dp_project_shrink
```

The only difference is the reported database name.
I've rebased the branch to fix the merge conflict.
I've updated the cronjob on TEST so that just before every UTC midnight the sandbox DB gets updated from the main TEST DB and the migration scripts run on it. Then just past midnight the nightly snapshot happens for the snapshot DB.
This is great feedback, Michael, thanks. I'll work through it tonight.
Latest commit addresses the good feedback from @jmdyck.
The aforementioned validation script still shows no functional differences between the functions in master and those in this sandbox.
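The core reason the two databases are expected to produce identical results can be shown with a minimal sketch. This is Python for illustration only (the real functions are PHP; the names and the half-open time-range semantics here are simplifying assumptions): zero-delta rows contribute nothing to a range sum, so deleting them cannot change a `get_delta_sum()`-style result.

```python
def delta_sum(rows, start, end):
    """Sum tally_delta over a time range (half-open [start, end) here).

    Zero rows add nothing to the sum, so a sparse table (with the
    zero-delta rows deleted) gives the same answer as the dense one.
    """
    return sum(delta for timestamp, delta in rows if start <= timestamp < end)

# A dense series and its sparse equivalent (zero-delta rows removed)
dense = [(1, 0), (2, 5), (3, 0), (4, 2)]
sparse = [(t, d) for t, d in dense if d != 0]

assert delta_sum(dense, 0, 10) == delta_sum(sparse, 0, 10) == 7
```

The same argument applies to any aggregate that ignores zeros, which is why the diff above is empty apart from the database name.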
My intent is to get this merged in before June 8th and rolled out in the June 12th deployment.
## Background

Background pulled from #1190

The `past_tallies` table stores historical round, user, and team statistics (`current_tallies` stores the current day's statistics), including the number of pages done that day (`tally_delta`). Once a day the site takes a snapshot of `current_tallies`, calculates the delta since the last snapshot, and adds a row to the `past_tallies` table for every `[tally_type, tally_name, holder_id]` tuple.

After a user has done a page in a round, they have an entry for that round in `current_tallies`, and the site adds a new record in `past_tallies` for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.

At pgdp.net, 24 years of this has caught up with us. The `past_tallies` table is 50GB in size and contains 598,611,362 rows. Of those, 594,322,594 are records where `tally_delta` = 0 -- 99%.
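To make the growth pattern concrete, here is a small sketch of the snapshot logic (Python with made-up names, purely illustrative; the actual implementation is PHP against MySQL). A dense snapshot appends a row for every tuple, while the sparse scheme skips the zeros:

```python
def take_snapshot(current_tallies, last_snapshot, past_tallies, timestamp,
                  sparse=False):
    """Append one past_tallies row per (tally_type, tally_name, holder_id)
    key, computing tally_delta against the previous snapshot.

    With sparse=True, zero-delta rows are simply not stored.
    """
    for key, tally in current_tallies.items():
        delta = tally - last_snapshot.get(key, 0)
        if sparse and delta == 0:
            continue  # the sparse table omits the row entirely
        past_tallies.append({"key": key, "timestamp": timestamp,
                             "tally_delta": delta})
    return dict(current_tallies)  # becomes the baseline for the next snapshot
```

With the dense behavior, a user who touched a round once keeps generating a zero row every single day forever; with `sparse=True`, they generate rows only on days they were active.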
## This PR

This PR converts `past_tallies` to a sparse table wherein we do not store entries where `tally_delta` == 0. It includes an upgrade script to delete those records.

We use `past_tallies` in a few primary ways:

- `tally_delta` == 0 isn't a problem for this one. We just `sum(tally_delta)` over a time range anyway -- we don't need the zeros.
- `past_tallies` pretty easily.
- `tally_delta` <> 0. `past_tallies` itself isn't set up to query against `timestamp`, and there's nothing that guarantees that every snapshot will have at least one non-zero `tally_delta`. (This might be practically true on pgdp.net but isn't guaranteed to be true for all sites.)
- `current_tallies` to determine how many pages they did between yesterday and today to add new entries into `past_tallies`, but we add a whole lot of computation for every snapshot in ways that just don't make sense.

This PR introduces one new table:
- `tally_snapshot_times` stores a timestamp of every snapshot. That's it. This one is useful for recreating charts and tables that might be filled with zeros.

And it adds additional columns to another:

- `current_tallies` now stores the most recent snapshot as well. This allows us to take a snapshot very quickly and then populate `past_tallies`. It is also used as a very fast (index-based) lookup for "yesterday's" metrics, which we do very often for rounds and users. And finally, it is also a fast lookup for the most recent snapshot timestamp.

`past_tallies` is now touched in only the following places:
- `TallyBoard->take_snapshot()` -- updated to only insert records where `tally_delta` != 0.
- `TallyBoard->get_info_re_largest_delta()` -- unchanged; we still return info about the largest `tally_delta`. It already handles the case where the tally_holder does not exist in the table.
- `TallyBoard->get_deltas()` -- we initialize all possible snapshot times with zeros and fill in any non-zero entries.
- `TallyBoard->get_delta_sum()` -- unchanged.
- `get_site_tally_grouped()` -- we initialize the snapshot times to zeros and then fill in the non-zero values from `past_tallies`.
## Testing

Testing/validating this is going to be a bit tricky because it involves changing data. I've created a separate database on TEST that is a copy of the current one. On this copy I've run the upgrade scripts, and the sandbox below is using it. There is a crontab that refreshes the sparse DB after UTC midnight, so they will only have at most a day of drift.

The sandbox has a nightly cronjob to take snapshots of the database, but it is currently disabled. What this means is that the database will drift throughout the day and catch up at midnight UTC.
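For reference, the zero-filling described for `TallyBoard->get_deltas()` and `get_site_tally_grouped()` (start from every known snapshot time, then overlay the stored non-zero deltas) can be sketched like this. Again a Python illustration with assumed data shapes, not the actual PHP:

```python
def zero_filled_deltas(snapshot_times, sparse_deltas):
    """Rebuild a dense delta series from the sparse table.

    snapshot_times: every snapshot timestamp (from tally_snapshot_times)
    sparse_deltas:  timestamp -> non-zero tally_delta rows from past_tallies
    """
    deltas = {t: 0 for t in snapshot_times}  # every snapshot starts at zero
    deltas.update(sparse_deltas)             # overlay the stored non-zero deltas
    return deltas

# Three snapshots, only one of which had activity
assert zero_filled_deltas([100, 200, 300], {200: 7}) == {100: 0, 200: 7, 300: 0}
```

This is why `tally_snapshot_times` matters: without a record of every snapshot, the zeros could not be reconstructed for charts and tables.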
I suggest we test this in two phases:
Sandbox: https://www.pgdp.org/~cpeel/c.branch/revamp-past-tallies-sandbox/
Current `past_tallies` sizes on TEST: