doctrine / orm

Doctrine Object Relational Mapper (ORM)
https://www.doctrine-project.org/projects/orm.html
MIT License

DDC-892: Caches can potentially return false positives due to use of MD5 hash codes as keys. A classic. #5417

Closed: doctrinebot closed this issue 8 years ago

doctrinebot commented 13 years ago

Jira issue originally created by user dalvarez:

I was just checking the execution path for efficiency and bumped into this block of code in /Doctrine/ORM/AbstractQuery.php, starting on line 560:

protected function _getResultCacheId()
{
    if ($this->_resultCacheId) {
        return $this->_resultCacheId;
    } else {
        $sql = $this->getSql();
        ksort($this->_hints);
        return md5(implode(";", (array)$sql) . var_export($this->_params, true) .
            var_export($this->_hints, true)."&hydrationMode=".$this->_hydrationMode);
    }
}

Of course, MD5 hashes can never be guaranteed to be unique, so this block around the _doExecute() call, starting on line 508, can return false positives and thus break the user's assumptions about the ORM's behavior:

    // Check result cache
    if ($this->_useResultCache && $cacheDriver = $this->getResultCacheDriver()) {
        $id = $this->_getResultCacheId();
        $cached = $this->_expireResultCache ? false : $cacheDriver->fetch($id);

        if ($cached === false) {
            // Cache miss.
            $stmt = $this->_doExecute();

            $result = $this->_em->getHydrator($this->_hydrationMode)->hydrateAll(
                    $stmt, $this->_resultSetMapping, $this->_hints
                    );

            $cacheDriver->save($id, $result, $this->_resultCacheTTL);

            return $result;
        } else {
            // Cache hit.
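            // (If two different queries ever produce the same MD5 id, this
            // returns the other query's cached result: exactly the false
            // positive described above.)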
            return $cached;
        }
    }

It is not likely to happen, but possible. Rely on it a couple of million times in a production system, and one day, eventually, it will fail and hell will break loose.

I think it is unacceptable to give up correctness in order to save memory. If your program does not work, it is simply broken, and there is no point in tuning it until you fix it functionally.

The only safe solution would be to use a non-lossy hashing (i.e. compression) method, if any hashing at all. Plain query text would do fine too.

Unless you expect the user to anticipate false positives, i.e.
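One way to read the "plain query text" suggestion, as a sketch only (this is not the actual Doctrine code, and the key-length objection raised below still applies), would be to return the full, unhashed string from _getResultCacheId():

    protected function _getResultCacheId()
    {
        if ($this->_resultCacheId) {
            return $this->_resultCacheId;
        }

        ksort($this->_hints);

        // Sketch: return the complete, unhashed key so that two different
        // queries can never map to the same cache entry.
        return implode(";", (array) $this->getSql())
            . var_export($this->_params, true)
            . var_export($this->_hints, true)
            . "&hydrationMode=" . $this->_hydrationMode;
    }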

doctrinebot commented 13 years ago

Comment created by dalvarez:

Has anyone taken note of this one?

As it seems, this critical bug has been sitting here for a month, untouched, and apparently even made it into the final release.

I would offer to fix it myself, but since the fix would consist of removing just one or two MD5 hash calls, and it would probably have to be validated by a core developer anyway, there is not much technical effort I could actually save you.

Just a reminder, thanks for your time.

doctrinebot commented 13 years ago

Comment created by @beberlei:

The problem is rather that cache keys have a limited length depending on the cache driver. I know that Memcache limits keys to 255 characters; I have to look at the others.

Plain text is not a solution, but we could combine a range of several md5 or sha1 hashes to reduce the risk of false positives even further.
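As an illustrative sketch (not the change that was eventually made), concatenating two built-in digests bounds the key length while lowering the collision probability:

    // Sketch only: two independent digests of the same input, concatenated.
    // The key is always 40 + 32 = 72 hex characters, well under a 255-char
    // limit, and a simultaneous collision in both sha1 and md5 is far less
    // likely than a collision in md5 alone.
    $cacheKey = sha1($input) . md5($input);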

doctrinebot commented 13 years ago

Comment created by dalvarez:

Hey Ben,

thanks for taking care of this.

I see the point about the length limitation. Still, I believe it is a bad thing to subject correctness to probability considerations. Merely making the failure more unlikely would only postpone a definitive solution. Staying with a solution that relies on hash values being unique means gambling that software systems will not malfunction randomly for no apparent reason, which is not a desirable position.

I have figured out an alternative approach that IMO solves the length-limit issue while, to my knowledge, remaining correct. This is my first shot at it, but right now I do not see why this, or something like it, should not be feasible.

You could keep indexing by MD5 hash value (just to guarantee an upper bound on the key length and stay within the drivers' limits); I think MD5 does that part quite well. But instead of storing the value directly, store a map that maps each original full-length key resolving to that MD5 hash to its individual value. In all but a few very rare cases this map will have exactly one entry, so in almost all cases what is retrieved from the cache is just the value plus the original key, enclosed in a map wrapper object. Needless to say, the map can be optimized for the single-entry case; it could even be a list, with only a few bytes of overhead. This solution would be almost as fast for reads as the most naïve cache implementation, but would still work when different keys resolve to the same hash value.

The details are easy:

Fetching: Fetch the cache entry (the map) by the key's MD5 hash value. Then look the correct value up in the map using the full-length key. Remember, this is not an expensive operation; the map almost always contains exactly one value.

Creating: Fetch the cache entry (the map) by the key's MD5 hash value. If there is no cache entry at all, create a new map with a single entry and store it in the cache. If there is an entry, get the map, add the new entry to it using the full-length key and the value, and store the map back into the cache under the key's MD5 hash value. Adding a value to the map also replaces any existing entry for that full-length key.

Deleting: Fetch the cache entry (the map) by the key's MD5 hash value. Then use the full-length key to delete the corresponding value from the map. If the map is empty after removing the entry (and it almost always will be), delete the whole cache entry under the key's MD5 hash value.

This technique is always correct and adds almost no overhead for reads. For writes and deletes you have to fetch first, so there is some overhead, but right now I cannot come up with anything faster that is not broken.

An optimization would be to additionally maintain a direct full-key mapping for those cache providers that support arbitrary-length keys. That would give slightly better write performance, since there would be no map to read when writing or deleting, just a direct cache operation.
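A rough sketch of the map-per-hash ("separate chaining") idea described above, written against a hypothetical backend exposing fetch()/save()/delete(); the class and method names are illustrative, not the actual Doctrine cache API:

    class ChainedKeyCache
    {
        private $driver; // hypothetical backend with fetch()/save()/delete()

        public function __construct($driver)
        {
            $this->driver = $driver;
        }

        public function fetch($fullKey)
        {
            // Look up the bucket (map) by the short MD5 key, then select the
            // entry by the full-length key. The bucket almost always holds
            // exactly one entry.
            $bucket = $this->driver->fetch(md5($fullKey));

            if ($bucket === false || ! array_key_exists($fullKey, $bucket)) {
                return false; // cache miss
            }

            return $bucket[$fullKey];
        }

        public function save($fullKey, $value, $ttl = 0)
        {
            $hash   = md5($fullKey);
            $bucket = $this->driver->fetch($hash);

            if ($bucket === false) {
                $bucket = array();
            }

            // Also replaces any existing value stored under this full key.
            $bucket[$fullKey] = $value;

            return $this->driver->save($hash, $bucket, $ttl);
        }

        public function delete($fullKey)
        {
            $hash   = md5($fullKey);
            $bucket = $this->driver->fetch($hash);

            if ($bucket === false) {
                return true; // nothing to delete
            }

            unset($bucket[$fullKey]);

            // If the bucket is now empty (the usual case), drop the whole entry.
            if (empty($bucket)) {
                return $this->driver->delete($hash);
            }

            return $this->driver->save($hash, $bucket);
        }
    }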

doctrinebot commented 13 years ago

Comment created by @beberlei:

The unhashed key could potentially be very large, which could lead to problems with the value size limit of the caches. Also, the unhashed key relies on print_r(), and thus on whitespace formatting, which I wouldn't exactly call reliable.

We need to find a better solution to fix this :-)

doctrinebot commented 13 years ago

Comment created by dalvarez:

As for the size limit on cached values, I think that is an independent problem: any value size limit applies regardless of whether you implement unique keys. It already applies, e.g., to query results, for which no upper size bound can be guaranteed in general. Consequently, a strategy for handling such a limit is required anyway.

The naïve solution for handling the value size limit would be simply not to cache any value whose total size exceeds the limit of the respective cache. Since the value size limit should be reasonably large (memcached, e.g., has a default limit of 1 MB) and the limitation is inherent to the cache, this should be perfectly acceptable: it is the cache's own core functionality that imposes the limit, not Doctrine. I believe the naïve solution is on the spot here, and I would recommend strictly opting against further measures (like splitting cache entries into buckets) to work around the problem. Working around this limit should be implemented in the respective cache in the first place, not in Doctrine, as it concerns the cache's core functionality. IMO the ORM should not duplicate cache functionality but use it properly.
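A sketch of that naïve check (the serialize() call and the hard-coded 1 MB limit are assumptions for illustration, not actual Doctrine behaviour):

    // Sketch only: skip caching any value whose serialized size exceeds
    // the backend's value limit (memcached's default is 1 MB = 1048576 bytes).
    $payload = serialize($result);

    if (strlen($payload) <= 1048576) {
        $cacheDriver->save($id, $result, $this->_resultCacheTTL);
    }
    // Otherwise do not cache; the query still returns $result as usual.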

But after all, this is an unrelated problem, and it does not affect the point of this bug in any way.

As for the reliance on print_r(): if it were non-deterministic (producing different output formattings for identical objects), it would be broken (possibly by design). That said, I do not believe print_r() is non-deterministic, though its exact workings may not be formally specified. The formatting of the returned string should neither introduce ambiguity into the representations of otherwise distinct values nor produce different output for identical objects. As long as those conditions hold, you should be safe using it to generate cache keys; if not, use something that does fit. This is merely an implementation issue anyway. You could also strip whitespace (provided this does not affect uniqueness) to eliminate the formatting factor.

If the cache-size efficiency of stored values is a concern, it could be worthwhile to support compressing cache values (losslessly, of course) as an option.

doctrinebot commented 13 years ago

Comment created by @beberlei:

OK, the common recommendation for avoiding cache collisions is to use a number of different hashes and combine them, or to use a very large hash.

So the best solution here is probably to switch to a cache key that combines "sha1 md5 crc32" of the input and, if ext/hash is installed, to use something much stronger like sha256.
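A minimal sketch of that idea (illustrative only; the variable names are made up, and this is not the change that was eventually committed):

    // Sketch: use a stronger digest when ext/hash is available,
    // otherwise fall back to chaining the always-present built-ins.
    if (function_exists('hash')) {
        $cacheKey = hash('sha256', $input);   // 64 hex characters
    } else {
        $cacheKey = sha1($input) . md5($input) . sprintf('%08x', crc32($input));
    }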

doctrinebot commented 13 years ago

Comment created by dalvarez:

That "recommendation" you are referring to might be just the common practice of using a combination of various hash values to increase the accuracy of ensuring data integrity and authenticity (e.g. by providing various different hash values for downloaded files). IMO this practice is feasible only for purposes like those, and can not be relied on if what you want to do is obtain correct results on cache lookups. While a change in hash value means that the file has definitely been modified (compared to the original) it does not imply e. g. that various modified files that have the same hash value have actually the same modified content.

What you are proposing would fit situations where a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter) is appropriate; a Bloom filter is basically such a combination of several hashes. It can perfectly well be used to determine whether there is a high chance that a specific value is in the cache, and such information is useful for optimizing disk accesses or costly requests to remote caches. Bloom filters are used, e.g., in the Squid caching HTTP proxy, but certainly not for keying, rather for cache digests (read: "is this key likely to be in the cache?"). That is not the case with the Doctrine 2 caches; these had better be 100% reliable.
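For contrast, here is a toy Bloom-filter-style membership check (purely illustrative; nothing like this exists in Doctrine). This is the setting where combining hashes is legitimate, because false positives are part of the contract:

    class TinyBloomFilter
    {
        private $bits;
        private $size;

        public function __construct($size = 8192)
        {
            $this->size = $size;
            $this->bits = array_fill(0, $size, false);
        }

        private function positions($key)
        {
            // Two cheap, independent bit positions derived from md5 and crc32.
            return array(
                hexdec(substr(md5($key), 0, 8)) % $this->size,
                abs(crc32($key)) % $this->size,
            );
        }

        public function add($key)
        {
            foreach ($this->positions($key) as $p) {
                $this->bits[$p] = true;
            }
        }

        // May return true for a key that was never added (a false positive),
        // but never returns false for a key that was added.
        public function probablyContains($key)
        {
            foreach ($this->positions($key) as $p) {
                if (! $this->bits[$p]) {
                    return false;
                }
            }

            return true;
        }
    }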

Increasing the number or size of hash values will just decrease the probability of a failure, but will not ultimately fix the problem (unless the hash values include all of the original information, which would defeat the purpose). And the fact that something is unlikely does not mean it will necessarily take a long time to happen; it could just as well break an application tomorrow.

Seriously, I do not know of any theoretically backed recommendation that tolerates wrong results when correctness is expected. There are a couple of approximation techniques appropriate in situations where false positives are anticipated, but I believe no such technique would be an appropriate solution for your situation, simply because no one expects the programming contract to imply having to anticipate false positives.

But of course, several hashes would still be better than one, though that would still be worth a bug report. Personally, I would always opt for correctness. Do as you like.

doctrinebot commented 13 years ago

Comment created by @beberlei:

Implemented separate chaining to avoid hash key collisions.

doctrinebot commented 13 years ago

Issue was closed with resolution "Fixed"