Yelp / doloop

Task loop for keeping things updated
Other
10 stars 10 forks source link

include locked rows in delayed, new, updated, etc. #5

Closed coyotemarin closed 13 years ago

coyotemarin commented 13 years ago

We currently don't include locked rows in the delayed stats returned by doloop.stats() because we're trying to avoid doing anything more than an index scan. However, in most cases, locked rows are going to be a small minority of all the rows in the doloop table, so it seems like we could do an index scan for the unlocked rows, and then walk through all the locked rows.

coyotemarin commented 13 years ago

After EXPLAINing some queries, it looks like the cleanest way to do this is:

SELECT COUNT(*)
     FROM `...`
     WHERE (`lock_until` IS NULL
           AND `last_updated` <= UNIX_TIMESTAMP() - ...) OR
           (`lock_until` IS NOT NULL
            AND `last_updated` <= UNIX_TIMESTAMP() - ...)

which looks pretty silly. But works, and is probably more portable than index hints.

coyotemarin commented 13 years ago

After thinking about various things that can go wrong with a loop table, and trying to design monitoring tools that can catch them, I think we should probably worry less about which queries are more efficient, and more about warning users when IDs have gone a long time since being updated. Under the current system, we could get() and lock a bad ID (one that causes a crash before we can call did()) over and over again, and we wouldn't notice, because we wouldn't included locked/bumped rows in update stats.

In practice, only a small fraction of rows should be locked, so it should be fine to use an index scan on most of the table, and then scan over all locked rows, using a "silly" query like the one above.

coyotemarin commented 13 years ago

Been messing with various permutations of these queries on a table with about 30 million rows, on a wimpy database server (my 4-year-old MacBook Pro).

It looks like there's no cheap way to do a COUNT(*); it kind of looks like I can get some speedup by using the index in some cases, but actually MySQL uses (lock_until, last_updated) as a covering index no matter what I do.

So I really might as well roll these all into one monstrous covering-index-scanning query.

coyotemarin commented 13 years ago

Checked in a new stats() function after trying a few variations. It turns out that SUM(IF(...)) is kind of expensive, so it's better to keep the min/max ID queries and locked/bumped queries separate.