Closed cpeel closed 3 weeks ago
We use past_tallies
in a few primary ways
tally_delta
== 0 isn't a problem for this one. We just sum(tally_delta)
over a time range anyway -- we don't need the zeros.past_tallies
pretty easily.tally_delta
<> 0. past_tallies
itself isn't set up to query against timestamp
and there's nothing that guarantees that every snapshot will have at least one non-zero tally_delta
. (This might be practically true on pgdp.net but isn't guaranteed to be true for all sites.)current_tallies
to determine how many pages they did between yesterday and today to add new entries into `past_tallies
past_tallies
but we add a whole lot of computation for every snapshot in ways that just don't make sense.I propose introducing 2 new tables:
CREATE TABLE `latest_snapshot_tallies` (
`tally_name` char(2) NOT NULL default '',
`holder_type` char(1) NOT NULL default '',
`holder_id` int(6) unsigned NOT NULL default '0',
`timestamp` int(10) unsigned NOT NULL default '0',
`tally_delta` int(8) NOT NULL default '0',
`tally_value` int(8) NOT NULL default '0',
PRIMARY KEY (`tally_name`,`holder_type`,`holder_id`)
);
CREATE TABLE `tally_snapshot_times` (
`timestamp` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`timestamp`)
);
latest_snapshot_tallies
stores the most recent snapshot -- including tally_delta
=0 entries -- for every holder. There is only one entry for every holder however, the timestamp changes at every snapshot. This table can be used with current_tallies
to calculate new past_tallies
records. It can also be used as a very fast (index-based) lookup for "yesterday's" metrics which we do very often for rounds and users. And finally it is also a fast lookup for the most recent snapshot timestamp.
tally_snapshot_times
just stores a timestamp of every snapshot. That's it. This one is useful for recreating charts and tables that might be filled with zeros.
I've got a theoretically working (read: coded but untested) version of the above coded up and it looks like this will work for all of the main code use-cases as well as noncvs use-cases.
Follow-up: this decreased the number of rows in the past_tallies
table at pgpd.net:
Before: 598,611,362 rows
Now: 4,320,420 rows
And after an optimize
, the size of the tablespace file:
Before: 53,464MB (53GB)
After: 293MB
The
past_tallies
table stores historical round, user, and team statistics (current_tallies
stores the current day's statistics) including the number of pages done that day (tally_delta
). Once a day the site takes a snapshot ofcurrent_tallies
, calculates the delta since the last snapshot, and adds a row in thepast_tallies
table for every[tally_type, tally_name, holder_id]
tuple.After a user has done a page in a round, they have an entry for that round in
current_tallies
and the site adds a new record inpast_tallies
for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.At pgdp.net, 24 years of this has caught up with us. The
past_tallies
table is 50GB in size and contains 598,611,362 rows. Of those 594,322,594 are records wheretally_delta
=0 -- 99%.We need to convert the table to a sparse format where we only store rows where
tally_delta
<> 0 and adjust all of the code accordingly. This will drastically reduce the size of the table and improve our DB backup times considerably.