DistributedProofreaders / dproofreaders

Distributed Proofreaders is a web application intended to ease the process of converting public domain books into e-texts.
https://www.pgdp.net
GNU General Public License v2.0
46 stars 28 forks source link

Change past_tallies to a sparse format #1190

Closed cpeel closed 3 weeks ago

cpeel commented 2 months ago

The past_tallies table stores historical round, user, and team statistics (current_tallies stores the current day's statistics) including the number of pages done that day (tally_delta). Once a day the site takes a snapshot of current_tallies, calculates the delta since the last snapshot, and adds a row in the past_tallies table for every [tally_type, tally_name, holder_id] tuple.

After a user has done a page in a round, they have an entry for that round in current_tallies and the site adds a new record in past_tallies for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.

At pgdp.net, 24 years of this has caught up with us. The past_tallies table is 50GB in size and contains 598,611,362 rows. Of those 594,322,594 are records where tally_delta=0 -- 99%.

We need to convert the table to a sparse format where we only store rows where tally_delta <> 0 and adjust all of the code accordingly. This will drastically reduce the size of the table and improve our DB backup times considerably.

cpeel commented 2 months ago

We use past_tallies in a few primary ways

  1. To get the sum of pages done over a specific time span eg: last week, last month
    • A sparse table format where we do not store rows where tally_delta == 0 isn't a problem for this one. We just sum(tally_delta) over a time range anyway -- we don't need the zeros.
  2. To get the number of pages done at a specific time (usually "yesterday")
    • This one we could calculate from a sparse past_tallies pretty easily.
  3. To get the number of pages done every day over some span for a table or a chart
    • Without some sort of secondary table that records all of the snapshot times we can't accurately generate the data because we don't know what days are in-between the records where tally_delta <> 0. past_tallies itself isn't set up to query against timestamp and there's nothing that guarantees that every snapshot will have at least one non-zero tally_delta. (This might be practically true on pgdp.net but isn't guaranteed to be true for all sites.)
  4. In concert with current_tallies to determine how many pages they did between yesterday and today to add new entries into `past_tallies
    • We can programmatically determine this from a sparse past_tallies but we add a whole lot of computation for every snapshot in ways that just don't make sense.

I propose introducing 2 new tables:

CREATE TABLE `latest_snapshot_tallies` (
  `tally_name` char(2) NOT NULL default '',
  `holder_type` char(1) NOT NULL default '',
  `holder_id` int(6) unsigned NOT NULL default '0',
  `timestamp` int(10) unsigned NOT NULL default '0',
  `tally_delta` int(8) NOT NULL default '0',
  `tally_value` int(8) NOT NULL default '0',
  PRIMARY KEY (`tally_name`,`holder_type`,`holder_id`)
);

CREATE TABLE `tally_snapshot_times` (
  `timestamp` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`timestamp`)
);

latest_snapshot_tallies stores the most recent snapshot -- including tally_delta=0 entries -- for every holder. There is only one entry for every holder however, the timestamp changes at every snapshot. This table can be used with current_tallies to calculate new past_tallies records. It can also be used as a very fast (index-based) lookup for "yesterday's" metrics which we do very often for rounds and users. And finally it is also a fast lookup for the most recent snapshot timestamp.

tally_snapshot_times just stores a timestamp of every snapshot. That's it. This one is useful for recreating charts and tables that might be filled with zeros.

I've got a theoretically working (read: coded but untested) version of the above coded up and it looks like this will work for all of the main code use-cases as well as noncvs use-cases.

cpeel commented 3 weeks ago

Resolved with https://github.com/DistributedProofreaders/dproofreaders/pull/1199

cpeel commented 1 week ago

Follow-up: this decreased the number of rows in the past_tallies table at pgpd.net:

Before: 598,611,362 rows
Now:      4,320,420 rows

And after an optimize, the size of the tablespace file:

Before: 53,464MB (53GB)
After:     293MB