Optimization for queries that have no end date

JDGrimes commented 7 years ago

We introduced a basic caching API in #6, and it works fine for queries that extend over a fixed period, from date one to date two. However, the more common use-case will generally be open-ended queries, that extend from some past point to the present.

While working on the widgets (#1), we have optimized the queries that it uses so that for queries measured in anything longer than seconds, there will be a particular start date that is reused over a period of time. That is, there will still be a fixed start date for the query, that will remain the same for several minutes, hours, days, weeks, or months, depending on how long of a period of time the query is for.

So we have queries that are anchored at the start end, meaning that only the end date is dynamic. The end date is basically rolling, remaining always at the current second, while the start date will jump in discrete steps, points that we can anchor a cache to.

The problem is that our caching currently always takes into account both the start and end dates for the query. It doesn't treat queries which have no specified end date as special (in fact, the present date is just filled in). On the positive side, this means that the cache does not need to ever be invalidated, because the queries are always anchored on both ends, and the log history is unlikely to change (and we couldn't really detect if it did anyway). On the other hand, it also means that after the query is cached it is unlikely to be reused, because even if the start date is the same, the end date will have moved on, even if only by a few seconds.

So what we need to do to optimize for this is to provide a way for open-ended, to-the-present, queries, to be cached only based on their start date.

JDGrimes commented 7 years ago

There are two ways that we could then go about keeping the cache up-to-date as new transactions take place.

The simplest thing, perhaps, would be to just invalidate the cache after every points transaction. However, this would have a downside in that the cache would have a fairly short lifecycle, and would be invalidated unnecessarily at times, since not every transaction would affect who the top users were or how many points they had.

However, we can't just take into account transactions affecting only the top users either, because of course another user who is earning points could then become one of the top users. (And conversely, if one of them looses points, they may drop out, and so we'd need to fill their place with somebody else.)

This means that ideally it would be better if we could check if a transaction would affect who was among the top users, and only invalidate the cache then. Or, better yet, to just update the cache right there.

The problem, of course, is that we can't really check that without rerunning the query again anyway, which defeats the purpose.

JDGrimes commented 7 years ago

So what I've been contemplating is that we could introduce a special kind of block that is utilized only in this kind of query that ends in the present.

Right now, we don't fill in the points for a block until the block's end date has passed. But in this case what we could do is fill in the block data as the transactions take place, so that it would represent a sort of running tab.

We'd still have to invalidate the cache, but instead of having to query the raw points logs when the time came, we could use the active block in our query, which would be faster.

JDGrimes commented 7 years ago

I just thought of a problem with any scheme like that though. It would require us to take all of the query args into account, and check if a particular transaction matched a particular query. That could get pretty complex, especially when there are IN and LIKE conditions.

We might eventually find that it would be worth it, but I'm not inclined to build it at this point. Although, I suppose we could easily just take in some of the more common cases, like when a particular points type is requested, etc.

JDGrimes commented 7 years ago

Whatever we do though, cache invalidation is likely going to require that we be able to get a list of all of the cached queries. As it currently stands, we have no way of doing that.

I suppose though, that an alternative would be to use a key for the cache, so that when the key was reset the cache would be stored under a different key. That wouldn't actually invalidate the old cache, just mean that we'd no longer use it, because a different key would be used. However, that would mean that we couldn't invalidate just certain query caches based on the query args, we'd have to invalidate them all at once.

JDGrimes commented 7 years ago

So I guess that we really do need an API for managing the query caches, that would allow us to clear the caches based on query args, or really just to actually clear them at all. Because as we said, right now we have to way of knowing what query caches even exist.

JDGrimes commented 7 years ago

However, at present we only need this for open-ended queries, it doesn't have to handle all of them.

WordPoints / top-users-in-period

Optimization for queries that have no end date #7