doctrine / dbal

Doctrine Database Abstraction Layer
https://www.doctrine-project.org/projects/dbal.html
MIT License
9.44k stars 1.33k forks source link

Cache key matters #2391

Open Seldaek opened 8 years ago

Seldaek commented 8 years ago

This is a post-mortem of an issue I hit yesterday on Packagist. I did a lot of reading of the Doctrine cache code in the process of understanding exactly what happened and why, and here are some thoughts/notes that hopefully can lead to improvements. It's a bit lengthy, sorry :)

The failure

A few days ago I moved the Doctrine cache to use the PredisCache driver instead of ApcCache. Yesterday the https://packagist.org/explore/ page was suddenly returning a 500. I saw that it failed with an out of memory error in the Predis connection, so that narrowed it down quite a bit to this query: https://github.com/composer/packagist/blob/f457090/src/Packagist/WebBundle/Entity/VersionRepository.php#L94-106

As we had an out of memory failure while decoding a redis response I checked for large keys and figured out that the [new_releases][1] key was taking 60MB. That seemed odd as it only caches 10 records but I thought fair enough let's see what's in there. I wiped the key and the site got back on track.

Invalid Assumptions

As you can see in the query, we run it with ->setParameter('now', date('Y-m-d H:i:s')); to get results that are older than the current time. Since I want to perform the query only every 15minutes I set the cache like this ->useResultCache(true, 900, 'new_releases') which the way I understood it means run that query and cache it for 900 seconds in that cache key. As I did not include the parameters in the cache key, I assumed that it would work.

That assumption is incorrect due to the fact doctrine creates an internal "real cache key" at https://github.com/doctrine/dbal/blob/f0eaaf/lib/Doctrine/DBAL/Cache/QueryCacheProfile.php#L101 which includes the query parameters. This is used together with the cache key, and means that my cache never worked as it re-creates it new every time the now param changes (which is every second..).

That was a first surprise, but fair enough you can write that down to a misunderstanding on my part, although I would also say that it's quite unintuitive that if I specify a cache key it's not actually taken for what it is. If I rely on automatic cache key generation using the params makes sense I guess. I realize this is hard to fix without BC break though but it'd be great to have a way to really set a cache key + real cache key.

Compound Problems

The above explains that I was caching for nothing, fair enough. However there is an additional issue that really caused this to go to hell. The way the cache is then stored is also very unintuitive, and I would argue quite buggy. As you see in https://github.com/doctrine/dbal/blob/3210835/lib/Doctrine/DBAL/Cache/ResultCacheStatement.php#L108-L114 what happens is, the cache key is fetched, and it contains an array of real cache keys and their values. Then the real cache key and result are added to it, and it's written again in the cache.

This causes two problems:

As a summary/TL;DR, here are a few ideas:

Thanks for reading.

Ocramius commented 8 years ago

Fix the TTL issue by using a compound of cache key + real cache key as the key sent to the cache backend. No more bundling of real cache keys in one big array (I didn't really understand why it was done that way.. might be a valid reason I can't think of).

I really don't know about the reasons behind it, but I suspect changing the code will break some tests, and there we'll discover why.

Allow users to specify a real cache key somehow, or just drop that concept entirely and use my cache key when I give you one without adding the parameters info to it.

Very error-prone. The caching system in ORM seems to work reliably, we'll just need to push it down into DBAL too, I suppose

Discourage specifying cache keys, as it's actually harmful right now, and it doesn't do anything useful really.

Fully support this approach: it is a detail, and the user should really avoid specifying the key. If a user needs to interact with the cache, he can retrieve the QueryCacheProfile and then operate on the generated key instead.

jenschude commented 8 years ago

Btw when you would use a NOW() SQL function in the query. The query would not change anymore with every request.

Ocramius commented 8 years ago

@jayS-de I would say that such a query is not cachable anyway, as one of the parameters relies on global state

jenschude commented 8 years ago

It is. Cause it's by intention that the result should only be updated every 15 minutes. But it's a bad idea in SQL to use parameters where the DBMS supports you with internal functions. Cause a DBMS can also be optimized to usw query cache. But the query cache only works for the same queries. Which would be the case with NOW()

jenschude commented 8 years ago

@Ocramius when changing the query to the following:

$qb->select('v')
            ->from('Packagist\WebBundle\Entity\Version', 'v')
            ->where('v.development = false')
            ->andWhere('v.releasedAt < NOW()')
            ->orderBy('v.releasedAt', 'DESC')
            ->setMaxResults($count);

@Seldaek would never have run into this issue, cause he would not have needed any parameter which had been used for the cache-key calculation. As well as the DBMS would have the possibility, when setup for result caching, to cache this query.

And as you also can see now you don't have a global state anymore. :)

maglnet commented 8 years ago

Just as a side note: imho CURRENT_TIMESTAMP is preferred over NOW(), because it's part of the SQL-92 standard. btw: @jayS-de nice pic :)

guilhermeblanco commented 8 years ago

@Seldaek I quickly read your issue, and I promise to re-read and provide a more throughly comment later. We store the parameters together with the query because IN (:param) can be expanded into multiple IN (?, ?, ?), depending of the provided parameters. Sounds simple, but remember that a single parameter binding also may be reused multiple times in the same query too (like WHERE u.email = :email OR u.username = :email).

Seldaek commented 8 years ago

@maglnet @jayS-de if you don't wanna be helpful, don't feel like you have to answer :) Both CURRENT_TIMESTAMP and NOW() can not be used within doctrine (at least without defining new query type stuff, which I am not gonna waste my time on when using date() works fine..).

@Ocramius @guilhermeblanco I still think if you specify a key it should just take it without overriding it with this real cache key, because otherwise AFAIK there is no way to use the cache key to "suppress" one of the query's parameters like I was trying to do with the date. I didn't figure out a way to override the key as you say by using the QueryCacheProfile, but maybe I missed smth.

Anyway I fixed it by handling caching myself now as you can see in https://github.com/composer/packagist/blob/29017b6d041619de1435864796c665ac7912efcf/src/Packagist/WebBundle/Entity/VersionRepository.php#L102

Not an issue for me anymore, I definitely learned my lesson, but for others to avoid WTFs I think the cache key param should be deprecated if it can't be fixed or at least the docs should clearly say that specifying it is discouraged and that it does not have any actual effect on caching it's just used for grouping real cache keys together, which lets you blast the cache by name I guess, but also means that cache operations get less efficient for that one cache key as all real cache keys are stored together within it.

jenschude commented 8 years ago

@Seldaek Okay have to admit, I didn't tried the snippet with doctrine. Tested it now and I see what was the problem. As Now is mysql specific, the doctrine query builder doesn't allows it to be used.

$qb->select('v')
            ->from('Packagist\WebBundle\Entity\Version', 'v')
            ->where('v.development = false')
            ->andWhere('v.releasedAt < CURRENT_TIMESTAMP()')
            ->orderBy('v.releasedAt', 'DESC')
            ->setMaxResults($count);

with CURRENTTIMESTAMP as suggested by magInet at least Doctrine don't complains and creates a valid SQL statement. `SELECT p0.name AS name0, p0.version AS version1, p0.description AS description_2 FROM packageversion p0 WHERE p0.development = 0 AND p0.releasedAt < CURRENTTIMESTAMP ORDER BY p0.releasedAt DESC LIMIT 10`

See also DQL Functions

Seldaek commented 8 years ago

Ah, I tried CURRENT_TIMESTAMP without ().. but anyway it's fixed.

jenschude commented 8 years ago

Yep. Only improvement would be, that you could use the doctrine result cache again.